Formatting Markdown to find parser bugs
While writing articles and notes for this site, I've been frustrated by problems like broken links and unterminated styles only showing up after I preview them in a browser. I've also grown accustomed to the format-on-save systems of programming languages like Zig and Go.
There are a lot of programs that already solve this problem, including Markdown linters, formatters, and LSP servers (like rumdl, mado, and Marksman). These support a much broader array of Markdown features than I use and come with their opinions baked in, which you have to tune manually. Instead, I wanted something with zero configuration and tailored to the kind of Markdown I write.
So, instead of only generating HTML, I made the Markdown parser in lift, my site generator, produce Markdown in addition to the two other kinds of files it already produced (HTML for site content and JSON for search indexing). This is called source-to-source translation, and it's how many (or most?) language formatters tend to work.
          ┌────────────────────┐
          │  site-in/**/*.md   │
          └──────────┬─────────┘
               ┌─────▼─────┐
               │   lift    │
               └─────┬─────┘
      ┌──────────────┼──────────────┐
┌─────▼────┐   ┌─────▼────┐   ┌─────▼─────┐
│   HTML   │   │   JSON   │   │ Markdown  │
└──────────┘   └──────────┘   └───────────┘
As I developed this, I discovered a lot of parser bugs. Whenever the formatter unexpectedly introduced a change in a file that I already expected to be well formatted, it indicated there was a bug somewhere. Once the bugs were fixed, I reformatted all Markdown files on my site and integrated the formatter into my editors so it runs whenever I save a Markdown file.
Parsing Markdown
Markdown is annoying to parse because it's exceedingly permissive and the process is expected to be infallible.
The specification came out of a series of ad-hoc regular expression match-and-replace rules, leading to some unintuitive edge cases.
When I wrote the parser for my website, I intentionally limited the scope of what was acceptable to keep the code simple, especially since I don't have a strong background in text parsing. (My shining moment in parser engineering was writing a series of LPeg parsers for various macOS performance tool outputs: heap, vmmap, and zprint.)
Unfortunately, that meant I couldn't use John Gruber's tests or the CommonMark test cases to qualify my work.
Instead, I wrote some unit tests and spot checked the HTML from my more complicated notes.
By the end, I thought I nailed it and had a pretty fast and deterministic parser that could be well-integrated into my site generator.
I knew there were test gaps, but I wasn't thrilled with Zig's built-in testing output, so I didn't want to spend much time on them. (It's sometimes useful to have a stack trace on failures, but most of the time I would rather keep the output constrained to what's explicitly printed by the test, and the assertion of course. It'd be nice if this behavior were configurable, but I suspect I need to write my own test runner; I tried using std_options but it didn't work for me.)
It was only when I started working on the formatter and linter that I noticed a whole crop of new issues, including a few that were actively impacting notes on my site.
Parser bugs
Here are a few of the bugs I fixed while working on the formatter:
- Any _ or * inside of a link's display text would not close at the end of the line, continuing the style until HTML's implicit rules turned it off at the end of a paragraph.
- Nested lists had all kinds of issues, especially with mixed ordered and unordered lists. The parser didn't notice they were actually nested and would leave one or the other missing closing or opening tags.
- Blockquotes were handled completely incorrectly and could never continue across multiple lines. Any subsequent blockquote line would just introduce a new blockquote with no ending tag. For instance, this Markdown:

      > First.
      > Second.

  Would turn into this HTML:

      <blockquote><p>First.</p>
      <blockquote><p>Second.</p>
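To make the blockquote fix concrete, here's a minimal sketch of the corrected behavior. It's written in JavaScript rather than lift's Zig, and renderBlockquotes is a hypothetical helper, not lift's actual parser. It assumes consecutive lines starting with > belong to one blockquote, and that the closing tag is emitted once the run ends.

```javascript
// Hypothetical sketch: group consecutive ">" lines into a single blockquote.
// This is not lift's parser; it only handles blockquote and plain paragraph lines.
function renderBlockquotes(markdown) {
  const out = []
  let quoted = [] // text of the blockquote lines collected so far

  // Emit the pending blockquote (if any) with both opening and closing tags.
  const flush = () => {
    if (quoted.length > 0) {
      out.push(`<blockquote><p>${quoted.join("\n")}</p></blockquote>`)
      quoted = []
    }
  }

  for (const line of markdown.split("\n")) {
    const match = /^>\s?(.*)$/.exec(line)
    if (match) {
      // A continuation line extends the current blockquote instead of opening a new one.
      quoted.push(match[1])
    } else {
      flush()
      if (line.trim() !== "") out.push(`<p>${line}</p>`)
    }
  }
  flush() // close a blockquote that runs to the end of the input
  return out.join("\n")
}

console.log(renderBlockquotes("> First.\n> Second."))
```

With this grouping, the two-line example produces one blockquote element that is properly closed, instead of two unterminated ones.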
I also had to add a few affordances to the tokenizer and lexer to carry through more context from the source file.
The main example of this is "loose" vs. "tight" lists: whether there's a blank line between list items or not.
I wanted some lists to stay tight and dense because the text in them was so short, like the notes page.
But usually, if there's a paragraph in a list item, I want there to be spacing around it to more clearly set it apart.
I added a blank field to the list item tokens that indicated whether it was preceded by a blank line.
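As a rough illustration of that idea, here's a sketch in JavaScript (not lift's Zig). Only the blank field comes from the description above; tokenizeListItems and formatListItems are hypothetical names, and the sketch only understands top-level "- " items.

```javascript
// Hypothetical sketch of carrying "preceded by a blank line" through list item tokens.
// Only the `blank` field is taken from the text above; everything else is illustrative.
function tokenizeListItems(markdown) {
  const tokens = []
  let sawBlank = false
  for (const line of markdown.split("\n")) {
    if (line.trim() === "") {
      sawBlank = true
      continue
    }
    const match = /^- (.*)$/.exec(line)
    if (match) {
      // The first item is never marked, so a leading blank line is not reproduced.
      tokens.push({ text: match[1], blank: sawBlank && tokens.length > 0 })
    }
    sawBlank = false
  }
  return tokens
}

// Re-emit the list, restoring a blank line before items that originally had one.
function formatListItems(tokens) {
  return tokens
    .map((token) => (token.blank ? "\n" : "") + "- " + token.text)
    .join("\n")
}

const tight = "- One\n- Two"
const loose = "- One\n\n- Two"
console.log(formatListItems(tokenizeListItems(tight)) === tight) // true
console.log(formatListItems(tokenizeListItems(loose)) === loose) // true
```

Because the flag lives on each item rather than on the list as a whole, a formatter built this way can leave a tight list tight and a loose list loose without guessing.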
Testing
Originally, lift's unit tests used a custom "expect-lexed" helper that made sure an input Markdown produced a sequence of lexemes, either in the block or inline parsers. This worked pretty well for regression tests, but it wasn't the most convenient to iterate on. Despite Zig's lovely syntax for literals, the lexemes were still a lot of typing and line noise.
With the formatter though, I could write a new kind of quasi-snapshot test that looks like this:
test "formatMindown: spacing and indentation of nested paragraphs" {
    try expectIdempotentFormat(
        \\- Nested.
        \\
        \\ - Multiline.
        \\ Paragraph.
        \\
        \\ Original level.
        \\
        \\ More paragraph.
        \\
    );
}
expectIdempotentFormat ensures that running the formatter on the input string introduces no changes to the text.
This is way less test code than checking lexemes manually and kind of fun to write.
In the future, I'll be adding an integration test to ensure there are no changes to any of the site's input files after running the formatter.
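The check itself is simple enough to sketch. This version is in JavaScript rather than Zig, and the format function is a toy stand-in I made up for illustration (it is not lift's formatter). The body of expectIdempotentFormat is likewise a guess at the behavior described above: format the input and fail if anything changed.

```javascript
// Toy stand-in for a formatter (an assumption, not lift's): strip trailing
// whitespace, collapse runs of blank lines, and end the text with one newline.
function format(text) {
  const lines = text.split("\n").map((line) => line.trimEnd())
  const out = []
  for (const line of lines) {
    const previousBlank = out.length > 0 && out[out.length - 1] === ""
    if (line === "" && (previousBlank || out.length === 0)) continue
    out.push(line)
  }
  while (out.length > 0 && out[out.length - 1] === "") out.pop()
  return out.join("\n") + "\n"
}

// Guess at what an idempotence check does: already-formatted input must come
// back byte-for-byte identical, otherwise the formatter or parser has a bug.
function expectIdempotentFormat(input) {
  const output = format(input)
  if (output !== input) {
    throw new Error(
      `formatter changed its input:\n${JSON.stringify(input)}\n${JSON.stringify(output)}`
    )
  }
}

// Passes silently: this text is already in the toy formatter's canonical form.
expectIdempotentFormat("- Nested.\n\n  More paragraph.\n")
```

The same comparison extends to an integration test: run the formatter over every input file and fail if any output differs from what's on disk.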
Editor integration
I'm currently using a few different editors, so I had to set each of them up separately. (In practice, all settings but Nova's are generated by a single module in my Nix configuration that's enabled if I'm on my personal machine. I've translated them to their outputs here.)
The formatter runs as a subcommand of the lift program I use to generate the site and takes the file to read as its positional argument, or - to use stdin.
Helix: This was the easiest and just needed a new entry in the languages.toml file:

    name = "markdown"
    auto-format = true
    formatter = { command = "lift", args = [ "format", "-", "..." ] }

It doesn't present errors when I have unused definitions, but it's good enough. I think I would need to write a Language Server Protocol server if I wanted that support.
Nova: I already maintain a plugin for publishing this site easily, so I extended that with a format-on-save option. First, I added an "issue matcher" regular expression in the extension.json:

    "issueMatchers": {
      "mattwidmann.lift": {
        "pattern": {
          "regexp": "^lift: error: [^:]+:(\\d+):(\\d+): (.*)\\s*$",
          "line": 1,
          "column": 2,
          "message": 3,
          "severity": "error"
        }
      }
    }

That can be used to report errors from the lift format command:

    let parser = new IssueParser("mattwidmann.lift")
    let p = new Process(liftPath, {
      args: ["format", "-", "..."],
      cwd: nova.workspace.path,
    })
    let lines = []
    // Diagnostics arrive on stderr; feed each line to the issue matcher.
    p.onStderr((line) => {
      parser.pushLine(line.trim())
    })
    // The formatted document arrives on stdout.
    p.onStdout((line) => {
      lines.push(line)
    })
    p.onDidExit((code) => {
      if (code === 0) {
        const formattedContent = lines.join("")
        editor.edit((edit) => {
          // Only touch the buffer when the formatter changed something.
          if (formattedContent !== content) {
            edit.replace(textRange, formattedContent)
          }
        })
      }
      // Return the issues (e.g. to a Promise).
    })
    p.start()
    // Write the buffer's text to the formatter's stdin, then close it.
    const writer = p.stdin.getWriter()
    writer.ready.then(() => {
      writer.write(content)
      writer.close()
    })

And parser.issues are appended to an IssueCollection to present them in the UI.

Sublime Text: Sublime Text's plugin ecosystem favors language- or tool-specific plugins as much as possible, which feels very fractured and overlapping. Luckily, there's a semi-abandoned-but-generic sublime-fmt plugin that can run any tool on save, similar to Helix's built-in formatting support. I configured it with a Fmt.sublime-settings:

    {
      "rules": [
        {
          "selector": "text.html.markdown",
          "cmd": ["lift", "format", "-", "..."],
          "format_on_save": true,
          "merge_type": "diff"
        }
      ]
    }

To get errors from the formatter printed inline, I had to create a "build system" I called Lift.sublime-build:

    {
      "selector": "text.html.markdown",
      "cmd": ["lift", "format", "$file", "..."],
      "line_regex": "^lift: error: [^:]+:(\\d+):(\\d+): (.*)\\s*$"
    }
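To see what the error pattern shared by the Nova and Sublime Text settings captures, here's a small JavaScript sketch. The regular expression is the one from those settings with the JSON escaping removed. The parseIssue helper and the sample error line are hypothetical: I made up the path and message to fit the shape the pattern implies.

```javascript
// The pattern from the editor settings, written as a JavaScript regex literal.
const issuePattern = /^lift: error: [^:]+:(\d+):(\d+): (.*)\s*$/

// Parse one line of formatter output into { line, column, message }, or null
// when the line is not an error report.
function parseIssue(text) {
  const match = issuePattern.exec(text)
  if (!match) return null
  return { line: Number(match[1]), column: Number(match[2]), message: match[3] }
}

// Hypothetical error line; the path and message are made up for illustration.
console.log(parseIssue("lift: error: notes/example.md:12:5: unused link definition"))
console.log(parseIssue("some other output")) // null
```

The first capture group is the line number, the second is the column, and the third is the message, which matches the "line": 1, "column": 2, "message": 3 mapping in the Nova issue matcher.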
Now, no matter where I end up tweaking the site's content, the Markdown will remain tidy.