Managing This Blog — 3

Part 1: Managing This Blog — 1

Part 2: Managing This Blog — 2

I was thoroughly discouraged by the geek factor of static site generators. I decided to create and support my own site.

I am no HTML or web guru. I do not play one on TV. My related skills are nominal. I consider my shell scripting skills above average but not wizard or guru class. I do not know Perl or Python. Going my own way would be an uphill journey.

By chance I ran across a blog site that was formatted much like I had envisioned and not like the standard WordPress template blog. Or what is now the common generic “mobile compatible” nonsense. I copied the top page, one blog article, and the CSS file.

I downloaded an XML dump of my wordpress.com blog.

So far so good. Little did I know what lay ahead.

My first stumbling step was converting the XML file into individual HTML files. After a few days of sweat equity I had a decent shell script to provide the conversion. The base conversion left much to be desired.

I massaged the script to look for file names in my original local published directory to ensure converted XML files had similar base names. That helped me know I had a 1:1 correspondence with the conversions.
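
To give a flavor of the approach, the heart of the split looks something like this. This is only a sketch that assumes the standard WordPress WXR export format, not my actual script, and the real work of pulling out titles and content takes more sed and awk massaging:

  # Illustrative sketch: split a WordPress WXR export into one file per post.
  # Assumes each <item> element opens and closes on its own line.
  awk '
    /<item>/   { in_item = 1; n++; out = sprintf("item-%04d.xml", n) }
    in_item    { print > out }
    /<\/item>/ { in_item = 0; close(out) }
  ' wordpress-export.xml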

The next phase was writing a series of small scripts to convert the individual files into a web site. Later I merged the collection of small scripts into a single script.

I spent much of my time running one-liners to parse information, moving those one-liners into the scripts, then testing. Over and over. At times the process was concurrently exhausting and fulfilling.

The cost was about three weeks of tinkering, weeping, and gnashing of teeth. I learned that XML and HTML are not well suited to the standard scripting tools of grep, awk, and sed. Yet somehow I managed to write scripts with those foundations to provide what I wanted.

In the end I am happy with the results. I have automation. The web site is a simple static layout. The site uses only flat files with no database backend. There are no dependencies on third parties, such as Disqus, or Google dependencies for fonts, icons, JavaScript APIs, etc. I write draft articles in text with simple HTML tagging. I do not deal with Markdown, reStructuredText, AsciiDoc or other markup fad languages. No conversions back to HTML to prove I am a geek. There is no JavaScript. I have an RSS feed.

I am able to write draft articles using a text editor until ready to post. I have an ability to schedule posts. I am able to maintain a local copy of the web site as my master and backup. I use ssh and rsync to push to the hosting site. In the beginning of this project I did not think I would fare so well.

The design is simple. I use a single directory $BLOG_DIR to house my text articles and support files. Scripts are stored in the same place I store all custom scripts, /usr/local/bin.

I assign a four digit sequence as part of each article file name.
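
A draft file name looks something like this, with a made-up title for illustration:

  0042 A Made Up Article Title.txt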

In the $BLOG_DIR directory are two important subdirectories, Pending and Published. When I am ready to post new articles I move the files into the Pending directory. When I move the files I add simple meta data at the top of the article text:

PostDate: YYYY-MM-DD

Categories: XXXXXX

Tags: YYYYYY, ZZZZZZZ

My script validates this meta data. I run the script manually or as a daily cron job. When the PostDate in the meta data does not match the current date, nothing happens. When the dates coincide, the script converts the file to full HTML using a base template file. This meta data mechanism allows me to schedule articles for posting. The meta data also allows me to use categories and tags, which are used to create the respective indexes.
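
The scheduling check itself is simple. As a rough sketch only, with names that are illustrative rather than copied from my script:

  # Illustrative sketch of the scheduling check, not the actual script.
  today=$(date +%Y-%m-%d)
  for draft in "$BLOG_DIR"/Pending/*.txt; do
    [ -e "$draft" ] || continue
    postdate=$(sed -n 's/^PostDate: //p' "$draft")
    # Do nothing unless the PostDate matches today.
    [ "$postdate" = "$today" ] || continue
    # Otherwise convert the draft to HTML using the base template (not shown).
  done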

The script verifies the sequence number and halts when there is a problem. The script updates links for Previous and Next articles. The script updates the site index as well as affected Category and Tag indexes. When completed the plain text base article is moved from the Pending directory to the Published directory.
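
The sequence check is just a comparison of the four digit number in the new file name against the highest number already published. Again, only a sketch with illustrative names:

  # Illustrative sketch: halt unless the new sequence number is exactly
  # one greater than the highest published sequence number.
  new_seq=$(basename "$draft" | cut -c1-4)
  last_seq=$(find "$BLOG_DIR/Published" -type f -printf '%f\n' |
             cut -c1-4 | sort -n | tail -n 1)
  expected=$(printf '%04d' $((10#$last_seq + 1)))  # 10# avoids octal surprises
  if [ "$new_seq" != "$expected" ]; then
    echo "Sequence problem: expected $expected, found $new_seq" >&2
    exit 1
  fi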

The web site directory tree is stored at $WEB_SITE_DIR. This is where the final HTML file is stored. The directory tree mimics the directory tree of the online host. I use an rsync wrapper script to sync the files. In my drafts I use upper case and spaces in the file names. When converted to HTML the file names are converted to all lower case and dashes are substituted for the spaces to provide acceptable web server file names.
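
That name mangling is little more than a couple of tr calls. Illustrative only:

  # Illustrative: "0042 A Made Up Article Title.txt" becomes "0042-a-made-up-article-title.html"
  html_name=$(basename "$draft" .txt | tr '[:upper:]' '[:lower:]' | tr ' ' '-').html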

For quality assurance I view and test the HTML files locally. While I can open the static files in $WEB_SITE_DIR with a web browser, that is not the same as how they will appear online. An example is the RSS feed XML file, which displays differently in a browser as a static file than when served. So a cron job on my local server syncs the files in $WEB_SITE_DIR to a virtual host directory and adjusts file permissions for the web server.
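
That sync is a small cron driven script along these lines. The virtual host path and web server user are placeholders rather than my actual configuration:

  # Illustrative sketch of the local server sync, run from cron.
  # The virtual host path and web server user are placeholders.
  rsync -a --delete "$WEB_SITE_DIR"/ /srv/www/blog/
  chown -R www-data:www-data /srv/www/blog
  find /srv/www/blog -type f -exec chmod 644 {} +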

I use a YYYY/MM directory structure to store both the Published articles and web site files. Thus, in addition to the article sequence numbers, this directory structure lets me know at a glance when I posted an article without needing a web browser to view the site index.
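
In other words, the two trees follow the same pattern:

  $BLOG_DIR/Published/YYYY/MM/NNNN Article Title.txt
  $WEB_SITE_DIR/YYYY/MM/nnnn-article-title.html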

The script validates links using the linkchecker program. When run from a cron job, the script emails me a reminder of the event along with any link errors or other problems.
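
From cron that amounts to something like the following. The linkchecker invocation, log location, and mail address are illustrative only:

  # Illustrative: check links and mail myself the results.
  linkchecker "$WEB_SITE_DIR/index.html" > /tmp/linkcheck.log 2>&1
  mail -s "Blog link check" me@example.com < /tmp/linkcheck.log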

The script supports testing a queued file. This is handy because I can preview new articles in a web browser without actually moving files from Pending to Published. I wrote another script to restore the files to the previous condition, as though nothing had been tested or published.

The script supports variables to link check all site files or just the modified files. Similarly with running tidy to validate the HTML. The script can regenerate all indexes or just the ones affected by the new article.
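
These switches are nothing more than simple flags near the top of the script. The names here are examples, not my actual variables:

  # Illustrative flags; the names are examples only.
  CHECK_ALL_LINKS=no      # yes: run linkchecker against the whole site
  TIDY_ALL_FILES=no       # yes: run tidy against every HTML file
  REBUILD_ALL_INDEXES=no  # yes: regenerate every index, not just affected ones

  if [ "$TIDY_ALL_FILES" = "yes" ]; then
    find "$WEB_SITE_DIR" -name '*.html' -exec tidy -q -e {} \;
  else
    tidy -q -e "$new_html_file"  # $new_html_file is a placeholder
  fi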

There are other details but hopefully this basic description suffices.

The new blog site is not fancy but more than functional. Being static, all pages load fast. For now I avoid a search engine by creating indexes for all categories and tags. The main blog page is an index too. Perhaps one day I’ll look into a full-fledged search engine, which requires backend dynamic scripting. Not really high on my priority list though.

Several times since going live I have tweaked the script. Little things that are hard to anticipate. More robustness and error checking. As I am the sole user of the scripts, I have not tested for corner case issues. At the moment I run my script on the assumption of posting only one article per day. I have not tested posting more than one per day. I have not thought about or tested supporting more than one blog.

The nice part is that all of this is automated. I have tested to the point that now I move a draft article to the Pending directory and let the cron jobs do the remainder of the work, which includes syncing the local master files to the host server. My script wrapper to push online runs as a single one-liner in a terminal or from a cron job. Fortunately, the host provider supports SSH keys. I had to resolve some ssh-agent hurdles to get the syncing script to run unattended.
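
The wrapper itself is little more than rsync over ssh. The host, user, and remote path here are placeholders, and the key handling is simplified:

  # Illustrative push wrapper; the host, user, and remote path are placeholders.
  # From cron the key must be usable non-interactively, either a dedicated
  # passphrase-free key or a pre-loaded ssh-agent environment.
  rsync -az --delete -e ssh "$WEB_SITE_DIR"/ user@example.com:public_html/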

In all, only four shell scripts:

  • Convert a WordPress XML export file to individual HTML files.
  • Convert a text draft article to HTML and update local HTML source files.
  • Restore the local HTML files to a previous condition.
  • Sync local source files online.

Low geek factor.

Not bad.

I am somewhat surprised by my effort. I had not expected to accomplish this much. I am no longer stressed about using wordpress.com. My administrative overhead is much less with my own system. For the most part now I just write. That said, this was a challenging project for three weeks. I do not want to go through another project like this for a long while. Time to enjoy the remainder of the summer.

Posted: Category: Usability Tagged: General, Migrate

Next: Slow Boiling Frogs

Previous: Managing This Blog — 2