Convert Blogger Site to Hugo


It is time to have a place to post my ideas again. Most of my old articles and posts have been sitting in an archive of an old Blogger site for nearly a decade now. I want something modern to host my site, so I am taking the opportunity to convert all of it to Hugo, with an automated deployment pipeline that hosts the site on Cloudflare.

Never one to pass up the chance to code up a solution to a problem, I spent a lot of time figuring out how to script the migration. These are the steps I took to extract the post content from the XML export in the Blogger backup file. The first two steps could probably have been done with just the yq tool, but it was faster for me to use jq since I’m already very familiar with it.

Converting Blogger to Hugo

First, we need to convert the XML to JSON, which decodes the HTML escapes and lets us use jq for the next step.

yq . -o=json the-big-v-blog.xml > the-big-v-blog.json

Next, extract the fields we want: the dates, the link (to make the slug), the title, and the content. The +@ and +content keys in the filter are how yq represents XML attributes and element text in its JSON output.

jq '.feed.entry[] | select(.category|.["+@term"] == "http://schemas.google.com/blogger/2008/kind#post") | {p: .published, u: .updated, t: (.title|.["+content"]), c: (.content|.["+content"]), l: (.link[] |select(.["+@rel"] == "alternate") | .["+@href"]) }' the-big-v-blog.json > the-big-v-posts.js

This outputs a stream of JSON objects, one per article, that looks like this:

{
  "p": "2007-03-19T12:02:00.000-04:00",
  "u": "2007-03-19T12:14:30.116-04:00",
  "t": "Keeping Track of FreeBSD Kernels Configurations",
  "c": "I keep track of my kernel configurations in [...]"
  "l": "http://vivek.khera.org/the-big-v/2007/03/i-keep-track-of-my-kernel.html"
},
 ... more entries ...

The quickest way to fix it up for the next step is to edit the file manually. We will use the link from the original post to create the slug, and turn the whole thing into a JavaScript module exporting an array named p by adding export const p = [ at the top and closing the array at the bottom with ];.

In your editor of choice (I like Emacs), search and replace inside the l JSON key: delete the http://... prefix up through the last / and drop the trailing .html, leaving just the final path segment as the slug.

The result looks like this:

export const p = [
{
  "p": "2007-03-19T12:02:00.000-04:00",
  "u": "2007-03-19T12:14:30.116-04:00",
  "t": "Keeping Track of FreeBSD Kernels Configurations",
  "c": "I keep track of my kernel configurations in [...]"
  "l": "i-keep-track-of-my-kernel"
},
 ... more entries ...
];
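
Hand-editing was quickest for me, but the slug rewrite could also be done in the splitting script shown below instead. A rough sketch of such a helper (makeSlug is my own hypothetical name, not part of the workflow above):

function makeSlug(url) {
    // keep only the last path segment of the original Blogger URL and
    // drop its .html suffix, e.g.
    // "http://vivek.khera.org/the-big-v/2007/03/i-keep-track-of-my-kernel.html"
    //   -> "i-keep-track-of-my-kernel"
    const last = new URL(url).pathname.split("/").pop() ?? "";
    return last.replace(/\.html$/, "");
}

With that helper, the l values from the jq output could stay as full URLs and the file name in the script would be computed as makeSlug(bp.l) + ".md".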

I then wrote a small JavaScript program, run with Deno, to split the entries into files using the slug as the file name. It also creates the necessary front matter metadata to preserve the original publication date.

import { p } from "./the-big-v-posts.js";

const category = `historical`;

// Write one markdown file per post, named after its slug, with Hugo front
// matter that preserves the original publication and last-modified dates.
function writepost(bp) {
    const fn = bp.l + ".md";

    const post = `---\ntitle: ${bp.t}\ndate: ${bp.p}\nlastmod: ${bp.u}\ncategories:\n - ${category}\n---\n\n${bp.c}\n`;

    Deno.writeTextFileSync(fn, post);
}

p.forEach(writepost);
Run the script with Deno to split the posts into files:

deno run -A split-posts.js
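
For the first sample entry above, the generated i-keep-track-of-my-kernel.md should come out looking roughly like this:

---
title: Keeping Track of FreeBSD Kernels Configurations
date: 2007-03-19T12:02:00.000-04:00
lastmod: 2007-03-19T12:14:30.116-04:00
categories:
 - historical
---

I keep track of my kernel configurations in [...]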

An extra step I took was to edit the resulting files to convert the HTML styling such as bold or italic into markdown. I did not use that in many posts, so I manually did this rather than automating it.
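
Had there been more of them, a couple of regular-expression replacements in the same script could have covered the common cases. A rough sketch (htmlToMd is my own hypothetical helper and only handles simple paired tags):

function htmlToMd(text) {
    // naive conversion: assumes bold/italic tags come in simple open/close pairs
    return text
        .replace(/<\/?(b|strong)>/g, "**")
        .replace(/<\/?(i|em)>/g, "*");
}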

Converting Google Sites to Hugo

I also had some pages in a Google Sites website. To export out of Google Sites, the first step is to use Google Takeout to get the pages. The trick here is to first make a new folder on Google Drive, then copy the Sites file into it; that copy will export as a bunch of HTML files.

The flaw with this plan is that every page has about 180 kB of HTML cruft in it, and finding your desired content in that is difficult. Luckily, the pandoc program can strip all of that out, leave us with just the content text, and conveniently convert it to markdown. For each file you want to preserve, run this command:

pandoc -t gfm-raw_html ishouldhavepatentedthat.html|grep -v ^:::

The core content and publish date are easily found; I copied the content into a new Hugo document and put the date into its front matter section.

Since I only wanted to keep a handful of the documents, I did not automate it any further. You may wonder why not just copy/paste from the live Google Sites pages? Because they rewrite every HREF link.
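
Had there been more pages to keep, a small Deno script could have driven pandoc over every exported HTML file; a rough sketch (it assumes pandoc is on the PATH and the exported files sit in the current directory):

// convert every exported .html file to markdown, filtering out the
// ::: div markers the same way the grep above does
for await (const entry of Deno.readDir(".")) {
    if (!entry.name.endsWith(".html")) continue;
    const { stdout } = await new Deno.Command("pandoc", {
        args: ["-t", "gfm-raw_html", entry.name],
    }).output();
    const md = new TextDecoder()
        .decode(stdout)
        .split("\n")
        .filter((line) => !line.startsWith(":::"))
        .join("\n");
    Deno.writeTextFileSync(entry.name.replace(/\.html$/, ".md"), md);
}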

Creating the Hugo Site

Following the above steps, I am left with a bunch of markdown files. From here I followed an online tutorial to create a Hugo web site. All of the converted articles went into the content/posts folder of the Hugo site; I moved a couple of them into the content/pages directory to fill out the home page and some other content, like my résumé.

Once the site was built out in Hugo with my chosen theme, I uploaded it to a new private repository on GitHub. I use the Cloudflare feature where it tracks the GitHub repository and rebuilds my site on their hosting infrastructure any time I push an update. The entire site renders as a set of static pages, so it scales trivially on their CDN.

I also found a really nice extension for Visual Studio Code called Front Matter, which gives me a fairly clean UI to create and manage posts and their metadata. The GitHub integration in VS Code lets me seamlessly update my blog by clicking a few buttons to push the changes to GitHub.

Next step: start posting!