xigoi

Why I created a markup language

What is a markup language?

(If you know the answer, you can skip this section.)

Definition. A markup language is a computer language whose primary purpose is to create documents, describing their structure and contents.
Example. Examples of markup languages include HTML (Hyper Text Markup Language), LaTeX, Markdown, RST (ReStructured Text), AsciiDoc, roff and many others.

So why yet another one?

It's clear that there are many markup languages around, and trying to create one that's the best for everyone would just fulfill xkcd 927 (Standards). However, I wanted a markup language for writing static websites, school notes and other documents, and I wasn't satisfied with any of the existing solutions, so I decided to create something that's useful for me. What exactly are the problems I was trying to solve? Let's see:

HTML

LaTeX

Markdown

RST

This list doesn't cover every markup language, but I think the idea is clear.

What do I want from a markup language?

These are the design goals for xidoc, a markup language that I set out to create. (By the way, this article is written in xidoc.) Incidentally, the same principles could also apply when creating a programming language, but let's leave that for another time.

How to go about it?

To make the language consistent, there is no better place to look for inspiration than the old programming language Lisp. It has pretty much only one syntactic element: a function/command call expressed as space-separated things in parentheses, where the first thing is the command name and the rest is the arguments. To adapt this syntactic paradigm to markup, I had to make just two modifications: using square brackets instead of parentheses as delimiters and semicolons instead of spaces as argument separators. This is because spaces and parentheses are common in text, so there would be too much escaping.
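To make the bracket-and-semicolon idea concrete, here is a minimal sketch of a recursive parser for this kind of syntax. It is not xidoc's actual parser; the names Node, parseCommand, parseNodes, parse and describe are made up for illustration, and the sketch assumes no escaping and trims only the spaces in front of each argument.

```nim
import std/strutils

type
  NodeKind = enum
    nkText, nkCommand
  Node = ref object
    case kind: NodeKind
    of nkText:
      text: string
    of nkCommand:
      name: string
      args: seq[seq[Node]]  # every argument is itself a list of nodes

# Forward declaration, because the two parsing procs call each other.
proc parseNodes(src: string, pos: var int, stops: set[char]): seq[Node]

proc parseCommand(src: string, pos: var int): Node =
  ## Parses "[name arg; arg; ...]"; `pos` must point at the opening bracket.
  inc pos  # skip '['
  let nameStart = pos
  while pos < src.len and src[pos] notin {' ', ';', '[', ']'}:
    inc pos
  result = Node(kind: nkCommand, name: src[nameStart ..< pos])
  while pos < src.len and src[pos] != ']':
    while pos < src.len and src[pos] == ' ':
      inc pos  # drop spaces in front of an argument
    result.args.add parseNodes(src, pos, {';', ']'})
    if pos < src.len and src[pos] == ';':
      inc pos  # skip the argument separator
  if pos < src.len:
    inc pos  # skip ']'

proc parseNodes(src: string, pos: var int, stops: set[char]): seq[Node] =
  ## Collects text and nested commands until a stop character or end of input.
  var text = ""
  while pos < src.len and src[pos] notin stops:
    if src[pos] == '[':
      if text.len > 0:
        result.add Node(kind: nkText, text: text)
        text = ""
      result.add parseCommand(src, pos)
    else:
      text.add src[pos]
      inc pos
  if text.len > 0:
    result.add Node(kind: nkText, text: text)

proc parse(src: string): seq[Node] =
  ## Top level: nothing is special here except an opening bracket.
  var pos = 0
  result = parseNodes(src, pos, {})

proc describe(nodes: seq[Node]): string =
  ## Debug rendering: commands become "(name: arg | arg)".
  for node in nodes:
    case node.kind
    of nkText:
      result.add node.text
    of nkCommand:
      var parts: seq[string]
      for arg in node.args:
        parts.add describe(arg)
      result.add "(" & node.name & ": " & parts.join(" | ") & ")"

when isMainModule:
  echo describe(parse("[color red; some [it nested] text]"))
  # prints: (color: red | some (it: nested) text)
```

Because a command call is the only syntactic construct, two short mutually recursive procs are enough to cover arbitrary nesting.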

This is actually not the whole story. Originally I wanted to have a LaTeX-like syntax, but it turned out that the Lisp version would require fewer special characters and be easier to parse. Consider the following example: This is the it[original] syntax. It looks just as simple as the current version, but what if we didn't want a space between “the” and “original”? There would need to be a special “no-op” character so that it doesn't look like we're invoking the command theit. And how would one invoke a command without parameters? That's another special syntactic element. The Lisp version solves both of these issues.

When it comes to power, there's really nothing better to do than implement a command for every useful piece of functionality. Obviously, a good programming language and a good layout of the codebase make that much easier (more on that later). And since it's not possible to implement everything, it's important to provide “escape hatches” that allow the user to directly generate code in the target language. Even that can be made easy. For example, this is how you can generate a div with a given id and class when compiling to HTML: [<div> .my-class; #my-id; Hello there!]

It turns out that the combination of simple syntax and powerful commands also makes the language ergonomic, so that goal doesn't even need to be taken care of separately! It also helps that square brackets and semicolons are easy to type on a standard keyboard. But obviously, the language can only achieve so much and you also need a good editor, which is why I use Neovim.

A part of making the language powerful is also templating. Since everyone has different needs, xidoc provides the ability to define new commands and to include files in other files.

Now let's talk about the implementation, which will also explain how I achieved the remaining two goals.

Why did I make it in Nim?

For those who aren't aware, Nim is a compiled, statically typed programming language whose motto is “Efficient, Expressive, Elegant”, and it achieves all of these goals perfectly. It compiles to native code, using C as an intermediate representation. This allows xidoc to be speedy, even though the implementation is not particularly efficient. But why not just code it in a more mainstream language like C++ or Java? The answer is simple: these languages are a pain to code in. Nim's metaprogramming facilities allow me to do things that I couldn't even dream of in other languages. For example, when I want to parse xidoc, I can simply import the NPeg library and write code like this:

const xidocParser = peg("text", output: XidocNodes):
  textChars <- >+xidoc.textChar:
    output.add XidocNode(kind: xnkString, str: $1)
  whitespace <- +Space:
    output.add XidocNode(kind: xnkWhitespace)
  command <- '[' * >*xidoc.commandChar * >xidoc.unparsedText * ']':
    output.add XidocNode(kind: xnkCommand, name: $1, arg: $2)
  chunk <- command | textChars | whitespace
  text <- *chunk * !1

Yes, that's it. I don't have to implement a parser manually. No long, unreadable chains of strtok, if (s[++i] == ';') and who knows what else. I just write the grammar and a parser will be generated for me, most likely more efficient than I'd be able to write myself. But it doesn't end there. I need to implement a lot of commands for xidoc and most of them are similar in their basic logic, which would mean a lot of repetitive code. However, Nim allows me to create custom syntax for defining commands that does everything behind the scenes. For example, this is how the [color] command is defined:

command "color", (color: expand, text: render), rendered:
  case doc.target
  of tHtml:
    htg.span(style = &"color:{color}", text)
  of tLatex:
    doc.addToHead.incl "\\usepackage[svgnames]{xcolor}"
    "\\textcolor{$1}{$2}" % [color, text]

It couldn't be simpler. Except, what are those expand and render? That leads us to another topic…
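For comparison, here is roughly what the two branches of that definition boil down to once the command syntax and the document state are stripped away. This is a standalone sketch, not code from xidoc: renderColor and the Target enum are hypothetical names, the HTML is built by hand instead of through htg, and the LaTeX branch leaves out the step that registers the xcolor package.

```nim
import std/[strformat, strutils]

type Target = enum
  tHtml, tLatex

proc renderColor(target: Target, color, text: string): string =
  ## Returns the target-language markup for colored text.
  ## Both arguments are assumed to be already expanded and escaped.
  case target
  of tHtml:
    &"<span style=\"color:{color}\">{text}</span>"
  of tLatex:
    "\\textcolor{$1}{$2}" % [color, text]

when isMainModule:
  echo renderColor(tHtml, "red", "hi")   # prints: <span style="color:red">hi</span>
  echo renderColor(tLatex, "red", "hi")  # prints: \textcolor{red}{hi}
```

The sketch has to assume its arguments arrive in the right form, which is exactly what the annotations in the real definition declare.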

What problems did I encounter?

Expand and render

A necessary part of expressing some text in a markup language is escaping special characters. As already mentioned, xidoc has three special characters ([, ;, ]), and it also has a way to escape them, but what I want to talk about is escaping characters in the target language.

Obviously, this is something the markup language should take care of. Nobody wants to write &lt; and &amp; all over the place, especially when doing complicated things like embedding code snippets or writing LaTeX equations.

Anyway, what's the big deal? I just make it so that when there's literal text, special characters inside it will be escaped, right?

Well, not quite. Sometimes you don't want text to be escaped, such as when passing it to an internal function that does some transformations on it before escaping it. And since commands can be arbitrarily nested inside other commands, you need to track which text has already been escaped and which commands require what kind of text. Then, text will be escaped whenever it's passed to a command that expects escaped text, but the argument is marked as unescaped.
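The bookkeeping described above can be sketched by tagging every piece of text with its escaping state. This is only an illustration of the idea for an HTML target, not how xidoc implements it; Text, TextKind, asEscaped, italic and upper are hypothetical names.

```nim
import std/strutils

type
  TextKind = enum
    tkRaw      # not yet escaped for the target language
    tkEscaped  # already safe to emit as HTML
  Text = object
    kind: TextKind
    content: string

proc escapeHtml(s: string): string =
  ## Escapes the characters that are special in HTML text.
  s.multiReplace(("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"))

proc asEscaped(t: Text): string =
  ## Escapes on demand, and only if that has not happened yet.
  case t.kind
  of tkRaw: escapeHtml(t.content)
  of tkEscaped: t.content

proc italic(arg: Text): Text =
  ## A command that expects escaped text and produces escaped markup.
  Text(kind: tkEscaped, content: "<i>" & arg.asEscaped & "</i>")

proc upper(arg: Text): Text =
  ## A command that transforms raw text, so it has to run before escaping.
  doAssert arg.kind == tkRaw, "upper needs unescaped text"
  Text(kind: tkRaw, content: arg.content.toUpperAscii)

when isMainModule:
  let raw = Text(kind: tkRaw, content: "a < b")
  echo italic(upper(raw)).content   # prints: <i>A &lt; B</i>
  echo italic(italic(raw)).content  # prints: <i><i>a &lt; b</i></i>
```

The second call shows why the tag matters: the inner result is already marked as escaped, so the outer command does not escape the markup a second time.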

This is why I came up with these pseudo-“type signatures” on commands. However, I'm not quite happy with the current system. It often requires different versions of the same command for escaped and unescaped text, which means that the user will have to be aware of this implementation detail. If anyone has an idea how to deal with this situation, please let me know.

JavaScript libraries

I want xidoc to have plenty of useful features, which include LaTeX math rendering and syntax highlighting. These features would be hard to implement myself, but fortunately, there are great libraries for them: KaTeX and Prism. There's just one problem: these libraries are made in JavaScript. So how could I integrate them into a project made in Nim?

One option would be to make use of the fact that Nim can compile to JavaScript. This is a great feature of the language, and it's what enables me to have a limited version of xidoc available on the web, but I don't want JavaScript to be the primary target because that would throw away the performance benefits and require users to have something like Node.js installed.

Luckily, I found out that there is this awesome C library called Duktape which implements a small, embeddable JavaScript interpreter. It's easy to interoperate with C libraries in Nim, so it took me just about an hour to get it working. There were a few issues with it, though. By default, it just crashes whenever there's an uncaught error in the JavaScript code, so I had to monkey-patch the code to make it at least produce an error message; trying to get it to raise a Nim exception that could be caught in the wrapping code would presumably be a futile effort. Also, it only supports the ES5 version of JavaScript. I originally tried to use highlight.js rather than Prism, but I just couldn't get it to work, even after somehow transpiling it with Babel (which, by the way, is much harder than it looks, since Babel doesn't even try to work out of the box). So this is why I turned to Prism, which is written in ES5-compatible JavaScript and ended up being a better fit for the project anyway.

What is the current state of xidoc?

I don't really feel like updating this article every time I update xidoc, so if you want to find out, head to the xidoc website.

Is anyone else going to use it?

To be honest, I don't really care. I primarily made it for myself and I'm happy with it just for the productivity improvements it brings me. The reason why I made it open source is that xidoc might also be useful to other people and it costs me almost nothing to make it available for everyone. And I believe this sentiment is shared by a big part of the free software community.

So if this article convinced you to start using xidoc, Nim, or any other software mentioned here, that's great. If not, I hope you had an interesting read anyway, and sorry for wasting your time if that's not the case. Peace!