Formatting Markdown to find parser bugs

8 June 2026

While writing articles and notes for this site, I've been frustrated by problems like broken links and unterminated styles only showing up after I preview them in a browser. I've also grown accustomed to the format-on-save systems of programming languages like Zig and Go.

There are a lot of programs that already solve this problem, including Markdown linters, formatters, and LSP servers (like rumdl, mado, and Marksman). These support a much broader array of Markdown features than I use and come with their opinions baked in, which you have to tune manually. Instead, I wanted something with zero configuration and tailored to the kind of Markdown I write.

So I made the Markdown parser in lift, my site generator, produce Markdown in addition to the two other kinds of files it already produced (HTML for site content and JSON for search indexing). This is called source-to-source translation, and it's how many (or most?) language formatters tend to work.

          ┌────────────────────┐
          │  site-in/**/*.md   │
          └──────────┬─────────┘
               ┌─────▼─────┐
               │   lift    │
               └─────┬─────┘
      ┌──────────────┼──────────────┐
┌─────▼────┐   ┌─────▼────┐   ┌─────▼─────┐
│   HTML   │   │   JSON   │   │ Markdown  │
└──────────┘   └──────────┘   └───────────┘
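
As a rough illustration of the diagram (not lift's actual code, which is written in Zig and isn't shown here), the sketch below parses a tiny Markdown subset into a list of blocks and then renders that same list twice: once as HTML and once back into Markdown. The function names, the block shape, and the supported subset (headings and paragraphs only, no HTML escaping) are all assumptions made for the example.

```javascript
// Toy block parser: handles only "# heading" chunks and plain paragraphs.
// This is NOT lift's parser, just a stand-in to show the shape of the idea.
function parse(source) {
  const blocks = []
  for (const chunk of source.split(/\n{2,}/)) {
    // Collapse internal whitespace so the renderers decide the layout.
    const text = chunk.trim().replace(/\s+/g, " ")
    if (text === "") continue
    const heading = /^(#{1,6}) +(.*)$/.exec(text)
    if (heading) {
      blocks.push({ kind: "heading", level: heading[1].length, text: heading[2] })
    } else {
      blocks.push({ kind: "paragraph", text })
    }
  }
  return blocks
}

// Two renderers over the same blocks: one for the site, one for the formatter.
function renderHtml(blocks) {
  return blocks.map((b) =>
    b.kind === "heading" ? `<h${b.level}>${b.text}</h${b.level}>` : `<p>${b.text}</p>`
  ).join("\n") + "\n"
}

function renderMarkdown(blocks) {
  return blocks.map((b) =>
    b.kind === "heading" ? `${"#".repeat(b.level)} ${b.text}` : b.text
  ).join("\n\n") + "\n"
}

const messy = "#   Title\n\n\n\nSome   text\nwrapped oddly.\n"
console.log(renderMarkdown(parse(messy)))
```

Because the Markdown renderer only sees what the parser understood, anything the parser gets wrong shows up as an unexpected change in the formatted output.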

As I developed this, I discovered a lot of parser bugs. Whenever the formatter unexpectedly introduced a change in a file that I expected to be well formatted already, it indicated there was a bug somewhere. Once the bugs were fixed, I reformatted all the Markdown files on my site and integrated the formatter into my editors so it runs whenever I save a Markdown file going forward.

Parsing Markdown

Markdown is annoying to parse because it's exceedingly permissive and the process is expected to be infallible. The specification came out of a series of ad-hoc regular expression match-and-replace rules, leading to some unintuitive edge cases. When I wrote the parser for my website, I intentionally limited the scope of what was acceptable to keep the code simple, especially since I don't have a strong background in text parsing. (My shining moment in parser engineering was writing a series of LPeg parsers for various macOS performance tool outputs: heap, vmmap, and zprint.)

Unfortunately, that meant I couldn't use John Gruber's tests or the CommonMark test cases to qualify my work. Instead, I wrote some unit tests and spot checked the HTML from my more complicated notes. By the end, I thought I'd nailed it and had a pretty fast and deterministic parser that could be well-integrated into my site generator. I knew there were test gaps, but I wasn't thrilled with Zig's built-in testing output, so I didn't want to spend much time on them. (It's sometimes useful to have a stack trace on failures, but most of the time I would rather just keep the output constrained to what's explicitly printed by the test, and the assertion of course. It'd be nice if this behavior was configurable, but I suspect I need to write my own test runner. I tried using std_options but it didn't work for me.)

It was only when I started working on the formatter and linter that I noticed a whole crop of new issues, including a few that were actively impacting notes on my site.

Parser bugs

Here are a few of the bugs I fixed while working on the formatter:

Any _ or * inside of a link's display text would not close at the end of the line, continuing the style until HTML's implicit rules turned it off at the end of a paragraph.
Nested lists had all kinds of issues, especially with mixed ordered and unordered lists. The parser didn't notice these were actually nested and would leave one list or the other missing its opening or closing tags.
Blockquotes were handled completely incorrectly, never actually allowing them to be continued to multiple lines. Any subsequent blockquote line would just introduce a new blockquote with no ending tag. For instance, this Markdown:
```
    > First.
    > Second.
```
Would turn into this HTML:
```
    <blockquote><p>First.</p>
    <blockquote><p>Second.</p>
```
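
One way to get the continuation behavior is to treat consecutive quoted lines as a single block. The sketch below does that for the example above; it is an assumption about the general shape of the fix rather than lift's actual code, and the block shape and function names are hypothetical. It follows the CommonMark reading, where adjacent quoted lines form one paragraph inside one blockquote.

```javascript
// Group consecutive "> " lines into a single blockquote block.
// Hypothetical block shape; lift's real lexemes are not shown in the article.
function groupBlocks(source) {
  const blocks = []
  let quote = null // lines of the blockquote currently being collected
  for (const line of source.split("\n")) {
    const match = /^\s*> ?(.*)$/.exec(line)
    if (match) {
      if (quote === null) {
        quote = []
        blocks.push({ kind: "blockquote", lines: quote })
      }
      quote.push(match[1])
    } else {
      quote = null // any non-quote line (including a blank one) ends the blockquote
      if (line.trim() !== "") blocks.push({ kind: "paragraph", text: line.trim() })
    }
  }
  return blocks
}

// Every blockquote gets exactly one opening and one closing tag.
function toHtml(blocks) {
  return blocks.map((b) =>
    b.kind === "blockquote"
      ? `<blockquote><p>${b.lines.join("\n")}</p></blockquote>`
      : `<p>${b.text}</p>`
  ).join("\n")
}

console.log(toHtml(groupBlocks("> First.\n> Second.")))
```

For simplicity, each non-quoted line becomes its own paragraph here, which a real parser would not do.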

I also had to add a few affordances to the tokenizer and lexer to carry more context through from the source file. The main example of this is "loose" vs. "tight" lists: whether or not there's a blank line between list items. I wanted some lists to stay tight and dense because the text in them is so short, like the ones on the notes page. But usually, if there's a paragraph in a list item, I want spacing around it to set it apart more clearly. I added a blank field to the list item tokens that indicates whether the item was preceded by a blank line.
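
A minimal sketch of that idea, assuming a flat list of "- " items: the tokenizer records a blank flag on each item, and the formatter uses it to decide whether to emit a blank line before the item. Only the blank field name comes from the description above; the token shape and function names are hypothetical.

```javascript
// Tokenize "- item" lines, recording whether a blank line preceded each item.
// The `blank` field name comes from the article; everything else is hypothetical.
function tokenizeList(source) {
  const tokens = []
  let sawBlank = false
  for (const line of source.split("\n")) {
    if (line.trim() === "") {
      sawBlank = true
      continue
    }
    const item = /^- (.*)$/.exec(line)
    if (item) {
      // The first item has nothing before it, so it is never marked.
      tokens.push({ text: item[1], blank: sawBlank && tokens.length > 0 })
    }
    sawBlank = false
  }
  return tokens
}

// Re-emit the list, restoring a single blank line only where the source had one.
function formatList(tokens) {
  return tokens.map((t) => (t.blank ? "\n" : "") + `- ${t.text}`).join("\n") + "\n"
}

console.log(formatList(tokenizeList("- a\n- b\n")))    // stays tight
console.log(formatList(tokenizeList("- a\n\n\n- b\n"))) // stays loose, extra blank line dropped
```

Without the flag, the formatter would have to pick one spacing for every list and would rewrite the lists that used the other.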

Testing

Originally, lift's unit tests used a custom "expect-lexed" helper that made sure an input Markdown produced a sequence of lexemes, either in the block or inline parsers. This worked pretty well for regression tests, but it wasn't the most convenient to iterate on. Despite Zig's lovely syntax for literals, the lexemes were still a lot of typing and line noise.

With the formatter though, I could write a new kind of quasi-snapshot test that looks like this:

test "formatMindown: spacing and indentation of nested paragraphs" {
    try expectIdempotentFormat(
        \\- Nested.
        \\
        \\    - Multiline.
        \\    Paragraph.
        \\
        \\    Original level.
        \\
        \\    More paragraph.
        \\
    );
}

expectIdempotentFormat ensures that running the formatter on the input string introduces no changes to the text. This is way less test code than checking lexemes manually and kind of fun to write.
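
The helper's body isn't shown above, so here is a sketch of the same check in JavaScript rather than Zig. The format function is a hypothetical stand-in that only strips trailing whitespace and collapses runs of blank lines; the real helper presumably calls lift's formatter instead.

```javascript
// Stand-in formatter (hypothetical): strips trailing whitespace and collapses
// runs of blank lines. lift's real formatter is not shown in the article.
function format(source) {
  const lines = source.split("\n").map((line) => line.trimEnd())
  const out = []
  for (const line of lines) {
    if (line === "" && out.length > 0 && out[out.length - 1] === "") continue
    out.push(line)
  }
  return out.join("\n")
}

// Fails unless `input` is already in formatted form, i.e. format(input) === input.
function expectIdempotentFormat(input) {
  const output = format(input)
  if (output !== input) {
    throw new Error(`formatter changed its input:\n--- input\n${input}\n--- output\n${output}`)
  }
}

// Passes: this input is already formatted according to the stand-in rules.
expectIdempotentFormat("- Nested.\n\n    More paragraph.\n")
```

The appeal of this style is that each test case is just the expected output, which doubles as the input.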

In the future, I'll be adding an integration test to ensure there are no changes to any of the site's input files after running the formatter.

Editor integration

I'm currently using a few different editors, so I had to set each of them up separately. (In practice, all settings but Nova's are generated by a single module in my Nix configuration that's enabled if I'm on my personal machine. I've translated them to their outputs here.) The formatter runs as a subcommand of the lift program I use to generate the site and takes the file to read as its positional argument, or - to use stdin.

A screenshot of Nova with an editor in the left pane and a preview in the right pane showing an error for an unused link definition. — Nova shows a gutter icon and the error message on hover.

Helix: This was the easiest and just needed a new entry in the languages.toml file:
```
    [[language]]
    name = "markdown"
    auto-format = true
    formatter = { command = "lift", args = [ "format", "-", "..." ] }
```
It doesn't present errors when I have unused definitions, but it's good enough. I think I would need to write a Language Server Protocol server if I wanted that support.

Nova: I already maintain a plugin for publishing this site easily, so I extended that with a format-on-save option. First, I added an "issue matcher" regular expression in the extension.json:

    "issueMatchers": {
        "mattwidmann.lift": {
            "pattern": {
                "regexp": "^lift: error: [^:]+:(\\d+):(\\d+): (.*)\\s*$",
                "line": 1,
                "column": 2,
                "message": 3,
                "severity": "error"
            }
        }
    }

That can be used to report errors from the lift format command:

    let parser = new IssueParser("mattwidmann.lift")
    let p = new Process(liftPath, {
      args: ["format", "-", "..."],
      cwd: nova.workspace.path,
    })
    let lines = []
    // Errors arrive on stderr; the issue matcher turns them into issues.
    p.onStderr((line) => { parser.pushLine(line.trim()) })
    // The formatted Markdown arrives on stdout.
    p.onStdout((line) => { lines.push(line) })
    p.onDidExit((code) => {
      if (code === 0) {
        const formattedContent = lines.join("")
        editor.edit((edit) => {
          // Only touch the buffer when the formatter changed something.
          if (formattedContent !== content) {
            edit.replace(textRange, formattedContent)
          }
        })
      }
      // Return the issues (e.g. to a Promise).
    })
    p.start()
    // Feed the document to the formatter over stdin.
    const writer = p.stdin.getWriter()
    writer.ready.then(() => {
      writer.write(content)
      writer.close()
    })

And parser.issues are appended to an IssueCollection to present them in the UI.
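
To see what the issue matcher pattern captures, it can be tried in plain JavaScript outside of Nova. The sample error line below is hypothetical: its shape is inferred from the pattern itself, and the exact wording of lift's messages is an assumption.

```javascript
// Same pattern as the issue matcher, written as a JavaScript regex literal
// (so the backslashes are not doubled as they are in JSON).
const issuePattern = /^lift: error: [^:]+:(\d+):(\d+): (.*)\s*$/

// Hypothetical error line; the exact wording of lift's messages is assumed.
const sample = "lift: error: notes/example.md:12:3: unused link definition"

// Returns the fields the matcher maps to "line", "column", and "message",
// or null when the line is not a lift error.
function parseIssue(line) {
  const match = issuePattern.exec(line)
  if (match === null) return null
  return { line: Number(match[1]), column: Number(match[2]), message: match[3] }
}

console.log(parseIssue(sample))
```

Note that [^:]+ skips over the file path without capturing it, so the pattern assumes paths that contain no colons.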

Sublime Text: Sublime Text's plugin ecosystem favors language- or tool-specific plugins as much as possible, which feels very fractured and overlapping. Luckily, there's a semi-abandoned-but-generic sublime-fmt plugin that can run any tool on save, similar to Helix's built-in formatting support. I configured it with a Fmt.sublime-settings:
```
    {
        "rules": [
            {
                "selector": "text.html.markdown",
                "cmd": ["lift", "format", "-", "..."],
                "format_on_save": true,
                "merge_type": "diff"
            }
        ]
    }
```
To get errors from the formatter printed inline, I had to create a "build system" I called Lift.sublime-build:
```
    {
        "selector": "text.html.markdown",
        "cmd": ["lift", "format", "$file", "..."],
        "line_regex": "^lift: error: [^:]+:(\\d+):(\\d+): (.*)\\s*$"
    }
```

A screenshot of Sublime Text showing this article being edited with an error displayed for an unused link definition. — Sublime Text build systems only present errors after building, but displaying the text inline is helpful.

Now, no matter where I end up tweaking the site's content, the Markdown will remain tidy.