
    Formatting Markdown to find parser bugs

    While writing articles and notes for this site, I've been frustrated by problems like broken links and unterminated styles only showing up after I preview them in a browser. I've also grown accustomed to the format-on-save systems of programming languages like Zig and Go.

    There are a lot of programs that already solve this problem, including Markdown linters, formatters, and LSP servers (like rumdl, mado, and Marksman). These support a much broader array of Markdown features than I use and come with their opinions baked in, which you have to tune manually. Instead, I wanted something with zero configuration and tailored to the kind of Markdown I write.

    So, rather than only generating HTML, I made the Markdown parser in lift, my site generator, produce Markdown in addition to the two other kinds of files it already produced (HTML for site content and JSON for search indexing). This is called source-to-source translation, and it's how many (or most?) language formatters tend to work.

              ┌────────────────────┐
              │  site-in/**/*.md   │
              └──────────┬─────────┘
                   ┌─────▼─────┐
                   │   lift    │
                   └─────┬─────┘
          ┌──────────────┼──────────────┐
    ┌─────▼────┐   ┌─────▼────┐   ┌─────▼─────┐
    │   HTML   │   │   JSON   │   │ Markdown  │
    └──────────┘   └──────────┘   └───────────┘
    

    As I developed this, I discovered a lot of parser bugs. Whenever the formatter unexpectedly introduced a change in files that I already expected to be formatted well, it indicated there was a bug somewhere. Once the bugs were fixed, I reformatted all Markdown files on my site and integrated it into my editors so it runs whenever I save Markdown files going forward.

    Parsing Markdown

    Markdown is annoying to parse because it's exceedingly permissive and the process is expected to be infallible. The specification came out of a series of ad-hoc regular expression match-and-replace rules, leading to some unintuitive edge cases. When I wrote the parser for my website, I intentionally limited the scope of what was acceptable to keep the code simple, especially since I don't have a strong background in text parsing. (My shining moment in parser engineering was writing a series of LPeg parsers for various macOS performance tool outputs: heap, vmmap, zprint.)

    Unfortunately, that meant I couldn't use John Gruber's tests or the CommonMark test cases to qualify my work. Instead, I wrote some unit tests and spot checked the HTML from my more complicated notes. By the end, I thought I'd nailed it and had a pretty fast and deterministic parser that could be well-integrated into my site generator. I knew there were test gaps, but I wasn't thrilled with Zig's built-in testing output, so I didn't want to spend much time on them. (It's sometimes useful to have a stack trace on failures, but most of the time I would rather keep the output constrained to what's explicitly printed by the test, and the assertion of course. It'd be nice if this behavior were configurable, but I suspect I need to write my own test runner; I tried using std_options but it didn't work for me.)

    It was only when I started working on the formatter and linter that I noticed a whole crop of new issues, including a few that were actively impacting notes on my site.

    Parser bugs

    Here are a few of the bugs I fixed while working on the formatter:

    I also had to add a few affordances to the tokenizer and lexer to carry more context through from the source file. The main example is "loose" vs. "tight" lists: whether or not there's a blank line between list items. I wanted some lists to stay tight and dense because the text in them is so short, like the ones on the notes page. But usually, if there's a paragraph in a list item, I want spacing around it to set it apart more clearly. I added a blank field to the list item tokens that indicates whether the item was preceded by a blank line.
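
    To make the blank field concrete, here is a minimal sketch of how a formatter could use it to keep tight lists tight and loose lists loose. The ListItem struct and renderList function are hypothetical names for illustration; lift's real token types aren't shown here and are likely richer.

```zig
const std = @import("std");

// Hypothetical token shape, assumed for illustration only.
const ListItem = struct {
    text: []const u8,
    // True when a blank line preceded this item in the source file.
    blank: bool,
};

// Writes the items into `buf` as a flat bullet list, re-emitting the blank
// lines recorded on each token. Returns the written slice, or
// error.NoSpaceLeft if `buf` is too small.
fn renderList(buf: []u8, items: []const ListItem) ![]const u8 {
    var len: usize = 0;
    for (items, 0..) |item, i| {
        // A blank line before the very first item is never emitted.
        const prefix: []const u8 = if (item.blank and i > 0) "\n- " else "- ";
        const needed = prefix.len + item.text.len + 1;
        if (len + needed > buf.len) return error.NoSpaceLeft;

        @memcpy(buf[len..][0..prefix.len], prefix);
        len += prefix.len;
        @memcpy(buf[len..][0..item.text.len], item.text);
        len += item.text.len;
        buf[len] = '\n';
        len += 1;
    }
    return buf[0..len];
}

test "blank flags decide whether a list renders tight or loose" {
    var buf: [64]u8 = undefined;

    const tight = [_]ListItem{
        .{ .text = "one", .blank = false },
        .{ .text = "two", .blank = false },
    };
    try std.testing.expectEqualStrings("- one\n- two\n", try renderList(&buf, &tight));

    const loose = [_]ListItem{
        .{ .text = "one", .blank = false },
        .{ .text = "two", .blank = true },
    };
    try std.testing.expectEqualStrings("- one\n\n- two\n", try renderList(&buf, &loose));
}
```

    Storing the flag on each item, rather than once per list, lets the output mirror the source line for line instead of forcing a single spacing policy on the whole list.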

    Testing

    Originally, lift's unit tests used a custom "expect-lexed" helper that made sure an input Markdown produced a sequence of lexemes, either in the block or inline parsers. This worked pretty well for regression tests, but it wasn't the most convenient to iterate on. Despite Zig's lovely syntax for literals, the lexemes were still a lot of typing and line noise.

    With the formatter though, I could write a new kind of quasi-snapshot test that looks like this:

    test "formatMindown: spacing and indentation of nested paragraphs" {
        try expectIdempotentFormat(
            \\- Nested.
            \\
            \\    - Multiline.
            \\    Paragraph.
            \\
            \\    Original level.
            \\
            \\    More paragraph.
            \\
        );
    }
    

    expectIdempotentFormat ensures that running the formatter on the input string introduces no changes to the text. This is way less test code than checking lexemes manually and kind of fun to write.
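
    A helper like this can be quite small. The sketch below is an assumption about its shape, not lift's actual code: formatAlloc is a hypothetical stand-in formatter that only strips trailing spaces and tabs from each line, where the real formatter would go through the full parser.

```zig
const std = @import("std");

// Hypothetical stand-in for the real formatter: it only strips trailing
// spaces and tabs from every line. The caller owns the returned slice.
fn formatAlloc(allocator: std.mem.Allocator, input: []const u8) ![]u8 {
    // The output is never longer than the input, so one scratch buffer suffices.
    const scratch = try allocator.alloc(u8, input.len);
    defer allocator.free(scratch);

    var len: usize = 0;
    var first = true;
    var lines = std.mem.splitScalar(u8, input, '\n');
    while (lines.next()) |line| {
        // Re-insert the newline that the splitter consumed.
        if (!first) {
            scratch[len] = '\n';
            len += 1;
        }
        first = false;

        // Find the end of the line without its trailing whitespace.
        var end = line.len;
        while (end > 0 and (line[end - 1] == ' ' or line[end - 1] == '\t')) end -= 1;

        @memcpy(scratch[len..][0..end], line[0..end]);
        len += end;
    }
    return allocator.dupe(u8, scratch[0..len]);
}

// Passes only when formatting `input` returns it byte-for-byte unchanged.
fn expectIdempotentFormat(input: []const u8) !void {
    const allocator = std.testing.allocator;
    const output = try formatAlloc(allocator, input);
    defer allocator.free(output);
    try std.testing.expectEqualStrings(input, output);
}

test "already formatted input is left alone" {
    try expectIdempotentFormat(
        \\- Tight.
        \\- List.
        \\
    );
}
```

    Because std.testing.expectEqualStrings reports where two strings differ on failure, a parser bug tends to show up as a readable diff of the Markdown itself rather than as a mismatched list of lexemes.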

    In the future, I'll be adding an integration test to ensure there are no changes to any of the site's input files after running the formatter.

    Editor integration

    I'm currently using a few different editors, so I had to set each of them up separately. (In practice, all settings but Nova's are generated by a single module in my Nix configuration that's enabled if I'm on my personal machine; I've translated them to their outputs here.) The formatter runs as a subcommand of the lift program I use to generate the site and takes the file to read as its positional argument, or - to use stdin.

    Now, no matter where I end up tweaking the site's content, the Markdown will remain tidy.