It’s no secret that I’m a big fan of Textile, and much prefer Textile over Markdown. Yet I find myself working in Markdown all the time, because while you can find any number of helpful Markdown tools, the app selection for Textile is shit.
The syntax for the two is similar, but different in key points, and using both in different places means it’s easy for a tired brain to start typing in one in an environment that supports the other. For this reason I’ve decided to standardise on Markdown as much as possible, and that includes converting my WordPress-based blog from Textile to Markdown.
My first course of action is, of course, to search for a ready-made solution. A completely automated all-in-one conversion tool, just install a WordPress plugin and go!
I can find nothing suitable, or at least nothing suitable that has been updated in the last couple of years. I’m very wary of WordPress plugins that aren’t kept up-to-date; more than once such plugins have brought my network down hard. I’d rather not deal with the hassle.
On to the next solution.
The Swiss army knife
I’m highly allergic to PowerPoint and Word gives me hives, so I’m already intimately familiar with pandoc for dealing with documents in all sorts of formats. Pandoc will happily eat Textile and shit Markdown, so there’s the big job sorted.
This isn’t a very neat solution, mind you, as I have no automated way of getting the Textile out of WordPress and the Markdown back in. On the other hand, since I intentionally trashed the majority of my blog a few years back, I don’t have too many posts to edit. In other words: Fuck it, I’ll do that bit by hand.
So I copy out the contents of a couple of posts, paste them into text files, and shove them through pandoc. Success! Well, almost. There’s a couple of snags.
First is that I’m getting setext-style headers. I have to force atx-style headers.
Second is that my chosen Markdown implementation, Jetpack’s Markdown module, outputs a line break at every line break in the source. Standard Markdown only generates a line break at a blank line or a line ending in a double space. I have to force no wrapping in the output.
Third is that I’m getting a weird table syntax that pandoc loves, but is not at all supported by Markdown Extra. I have to disable every table extension I don’t want to make sure I get the one syntax I can use.
Fourth and definitely worst, pandoc is forcing smart punctuation on me, and no amount of double-checking the spelling of --no-tex-ligatures will fix it. This is a pretty big problem, as it breaks any WordPress shortcode with attributes. Fuck sorting that out by hand. I’m going to need a serious replacement tool.
I’m not familiar with any suitable replacement tools, so I ask cute boy his recommendation. His immediate response is sed. I grab the binaries and dependencies, open up the manual …
… and realise I have been handed a chainsaw for sharpening a pencil. Oh dear.
Since the manual fails at explaining even the basic invocation and even more so at providing some simple examples a poor newbie can work with, it’s time to go searching, testing, swearing, and searching some more.
An entirely unreasonable amount of time and Coke Zero later, I’ve managed to get a basic search and replace to work. I’m slowly drowning in temporary files that aren’t removed automatically, but it works, and I can move on to the task of matching the fancy Unicode punctuation riddling my files.
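For anyone else staring at the manual, here is roughly what a basic sed search and replace looks like. This is a minimal sketch, not the exact commands I ran: the helper name `fix_basics` and the sample input are made up for illustration, and it reads from standard input rather than editing files in place.

```shell
# Hypothetical helper: two simple substitutions applied to whatever is piped in.
# s/PATTERN/REPLACEMENT/ replaces the first match on each line;
# adding a trailing g would replace every match on the line.
fix_basics() {
  sed -e 's/^h1\. /# /' -e 's/^- /* /'
}

# Made-up sample: a leftover Textile header and two dash-style list items.
printf 'h1. Hello\n- first\n- second\n' | fix_basics
```

Piping text through sed like this writes the result to standard output and leaves the source untouched, which makes it a handy way to try out a rule before letting it loose on real files.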
Further searching on this specific topic finally leads me to the UTF-8 encoding table and Unicode characters, and finally I can get this goddamned fucking shit to work properly.
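The trick, sketched below, is that each fancy punctuation mark is stored as a sequence of three bytes in UTF-8, and sed can match those bytes one by one. This is an illustration under a few assumptions: it needs GNU sed (the `\xHH` escapes are a GNU extension), the helper name `straighten_double_quotes` is made up, and the gallery shortcode line is an invented example rather than one from my posts.

```shell
# Show the raw UTF-8 bytes of U+2019 (right single quotation mark).
# The octal escapes \342\200\231 are the same three bytes as hex e2 80 99.
printf '\342\200\231' | od -An -tx1

# Hypothetical helper: turn curly double quotes back into straight ones.
# Assumes GNU sed for the \xHH escapes; LC_ALL=C makes sed treat the
# input as plain bytes instead of multibyte characters.
straighten_double_quotes() {
  LC_ALL=C sed -e 's/\xe2\x80\x9c/"/g' -e 's/\xe2\x80\x9d/"/g'
}

# Invented shortcode line with curly quotes around the attribute value
# (\342\200\234 is U+201C, \342\200\235 is U+201D).
printf '[gallery ids=\342\200\234729,732\342\200\235]\n' | straighten_double_quotes
```

The first command is just a sanity check that the bytes are what the encoding table says they are; the second shows a smart-quoted attribute coming out with plain quotes again.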
I’m almost there, but there’s still work to be done.
Pandoc’s conversion is not perfect. It randomly pukes out link syntax where there is no link, deletes a space here and there, and escapes underscores — even underscores in links. There’s no obvious pattern to this, so there’s only one solution left: Manual editing. A nice Markdown friendly editor does the job: MarkdownPad or MdCharm does it for me, but as already mentioned, there’s no shortage of good Markdown editors out there.
The final problem comes in the form of undocumented behaviour in Jetpack’s Markdown. The big one for me is that it doesn’t support tables without a header row. At all. Tables that don’t warrant a header row (or would be better with a header column) have to get one, because making shit up after a few hours of beating pandoc with a shovel and suddenly learning sed is the easiest task ever.
So here’s the final workflow:
- Copy contents of post(s) to text files on local machine.
- Run batch script on text files to convert to Markdown and do some cleanups with sed.
- Do manual cleanup in a Markdown friendly editor.
- Copy Markdown output and paste back into each post.
- Pat self on head, because throat is sore from all the swearing, and I damn well deserve a pat.
REM Convert all *.txt files (containing Textile) in current folder to *.md (containing Markdown)
FOR %%i IN (*.txt) DO pandoc -f textile -t markdown-simple_tables-multiline_tables-grid_tables --atx-headers --no-wrap -o "%%~ni.md" "%%i"
REM Do all the replacements in the 'replacements' file to all *.md files
sed -i -f replacements *.md
REM Delete all the temporary files sed leaves behind
del sed*

# Left single quotation mark
s/\xe2\x80\x98/'/g
# Right single quotation mark
s/\xe2\x80\x99/'/g
# Left double quotation mark
s/\xe2\x80\x9c/"/g
# Right double quotation mark
s/\xe2\x80\x9d/"/g
# Horizontal ellipsis
s/\xe2\x80\xa6/.../g
# Em dash
s/\xe2\x80\x94/--/g
# <hr> to ---
s/<hr>/---\n/g
# Prettify lists
s/^- /* /g
# Fix escaped dollar signs
s/\\\$/$/g
# Fix un-converted headers
s/^h1\. /# /g
# Fix stupid underscore fix
s/attachment\\_/attachment_/g
So that was fun, and I probably have to do it again for a couple of other sites in my network, since the Textile plugin was activated globally for … a few years. I’m already looking forward to it.