Why I created a markup language

So why yet another one?

It's clear that there are many markup languages around, and trying to create one that's the best for everyone would just fulfill xkcd 927 (Standards). However, I wanted to use a markup language for writing static websites, school notes and other documents. I wasn't satisfied with any of the existing solutions, so I decided to create something that's useful for me. What exactly are the problems I was trying to solve? Let's see:

HTML

HTML is unnecessarily verbose. Every tag name is written twice, once to open and once to close, which creates “stutter” when reading.
HTML has no templating capabilities. If you have a complex structure that repeats itself, there's no choice but to have multiple copies of it in your code.
HTML is actually three languages in a trench coat — HTML, CSS and JavaScript. These languages have completely different syntax and interact in weird ways.

LaTeX

LaTeX can't be used to create webpages, so it doesn't fit my use case. There are LaTeX-to-HTML converters, but their output tends to be messy.
LaTeX is inconsistent and at times cryptic. Some things are commands, other things are environments, with no clear distinction between the two. Some commands such as \large go inside the curly braces ({\large text}) rather than taking a braced argument (\textbf{text}) like most other commands.

Markdown

Markdown is not standardized. Every Markdown implementation has different syntax and semantics, so every time some tool claims to “use Markdown”, one needs to learn a new dialect and remember the differences from the hundreds of other dialects. Some dialects like CommonMark try to solve this, but that's just xkcd 927 (Standards) again.
Markdown is not powerful enough. I want to have fine control over my documents, so a language that aims to be simple is not the right fit.

RST

The syntax of RST is complex, ugly and full of inconsistencies.
There doesn't really seem to be an official implementation.

This list doesn't cover every markup language, but I think the idea is clear.

What do I want from a markup language?

Consistency. The language's syntax should be as simple as possible, with few things to remember. It should be easy to infer how something is done without reading the documentation.
Power. The language should allow doing everything that could be desired from a markup language.
Ergonomics. The language should create as little friction as possible, so it's possible to convert thought to text quickly.
Templating. The language should have constructs for eliminating code repetition, which makes code easier to read and reason about.
Speed. The language should compile in an instant, so one doesn't have to wait when recompiling frequently.
Clear error messages. When something goes wrong, it should be clear what it is.

These are the design goals for xidoc, a markup language that I set out to create. (By the way, this article is written in xidoc.) Incidentally, the same principles could also apply when creating a programming language, but let's leave that for another time.

How to go about it?

To make the language consistent, there is no better place to look for inspiration than the old programming language Lisp. It has pretty much only one syntactic element: a function/command call expressed as space-separated things in parentheses, where the first thing is the command name and the rest is the arguments. To adapt this syntactic paradigm to markup, I had to make just two modifications: using square brackets instead of parentheses as delimiters and semicolons instead of spaces as argument separators. This is because spaces and parentheses are common in text, so there would be too much escaping.

This is actually not the whole story. Originally I wanted to have a LaTeX-like syntax, but it turned out that the Lisp version would require fewer special characters and be easier to parse. Consider the following example: This is the it[original] syntax. It looks just as simple as the current version, but what if we didn't want a space between “the” and “original”? There would need to be a special “no-op” character so that it doesn't look like we're invoking the command theit. And how would one invoke a command without parameters? That's another special syntactic element. The Lisp version solves both of these issues.
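To make the bracket-and-semicolon idea concrete, here is a minimal sketch in Nim of how a single command could be taken apart. It is not xidoc's actual parser; the names splitArgs and parseCommand are hypothetical, and the sketch assumes that the command name ends at the first space and that there is no escaping.

```nim
import std/strutils

proc splitArgs(body: string): seq[string] =
  ## Splits an argument string on semicolons that are not nested
  ## inside square brackets. (Hypothetical helper; no escape handling.)
  var depth = 0
  var current = ""
  for ch in body:
    case ch
    of '[':
      inc depth
      current.add ch
    of ']':
      dec depth
      current.add ch
    of ';':
      if depth == 0:
        result.add current.strip()
        current = ""
      else:
        current.add ch
    else:
      current.add ch
  result.add current.strip()

proc parseCommand(src: string): tuple[name: string, args: seq[string]] =
  ## Parses one "[name arg1; arg2]" command. Assumes the name ends at
  ## the first space, which is a simplification for this sketch.
  if src.len < 2 or src[0] != '[' or src[^1] != ']':
    raise newException(ValueError, "expected a command in square brackets")
  let inner = src[1 .. ^2]
  let sep = inner.find(' ')
  if sep < 0:
    result.name = inner          # command without arguments
  else:
    result.name = inner[0 ..< sep]
    result.args = splitArgs(inner[sep + 1 .. ^1])

let cmd = parseCommand("[color red; some [it nested; text]]")
echo cmd.name   # color
echo cmd.args   # @["red", "some [it nested; text]"]
```

Because only top-level semicolons split arguments, a nested command stays intact inside its argument and can be parsed recursively later.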

When it comes to power, there's really nothing better to do than implement a command for every useful functionality. Obviously a good programming language and a good layout of the codebase makes that much easier — more on that later. And since it's not possible to implement everything, it's important to provide “escape hatches” that allow the user to directly generate code in the target language. And even that can be made easy — for example, this is how you can generate a div with a given id and class when compiling to HTML: [<div> .my-class; #my-id; Hello there!]
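As a rough illustration of what such an escape hatch has to do, the following Nim sketch turns a tag name and a list of already-split arguments into an HTML element. The proc name htmlTag and the rule that arguments starting with . or # become classes and the id are assumptions read off the example above, not xidoc's actual code, and the sketch performs no escaping.

```nim
import std/strutils

proc htmlTag(tag: string, args: seq[string]): string =
  ## Builds an HTML element: ".x" adds a class, "#x" sets the id,
  ## and every other argument becomes content. (Hypothetical helper.)
  var classes: seq[string]
  var elemId = ""
  var content: seq[string]
  for arg in args:
    if arg.startsWith("."):
      classes.add arg[1 .. ^1]
    elif arg.startsWith("#"):
      elemId = arg[1 .. ^1]
    else:
      content.add arg
  result = "<" & tag
  if classes.len > 0:
    result.add " class=\"" & classes.join(" ") & "\""
  if elemId.len > 0:
    result.add " id=\"" & elemId & "\""
  # Remaining arguments are simply concatenated in this sketch.
  result.add ">" & content.join("") & "</" & tag & ">"

echo htmlTag("div", @[".my-class", "#my-id", "Hello there!"])
# <div class="my-class" id="my-id">Hello there!</div>
```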

It turns out that the combination of simple syntax and powerful commands also makes the language ergonomic, so that goal doesn't even need to be taken care of separately! It also helps that square brackets and semicolons are easy to type on a standard keyboard. But obviously, the language can only achieve so much and you also need a good editor, which is why I use Neovim.

A part of making the language powerful is also templating. Since everyone has different needs, xidoc provides the ability to define new commands and to include files in other files.

Now let's talk about the implementation, which will also explain how I achieved the remaining two goals.

Why did I make it in Nim?

For those who aren't aware, Nim is a compiled, statically typed programming language whose motto is “Efficient, Expressive, Elegant”, and it achieves all of these goals perfectly. It compiles to native code, using C as an intermediate representation. This allows xidoc to be speedy, even though the implementation is not particularly efficient. But why not just code it in a more mainstream language like C++ or Java? The answer is simple: these languages are a pain to code in. Nim's metaprogramming facilities allow me to do things that I couldn't even dream of in other languages. For example, when I want to parse xidoc, I can simply import the NPeg library and write code like this:

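import npeg

# Grammar for xidoc source; parsing starts at the "text" rule and
# each rule's action appends a node to `output`.
const xidocParser = peg("text", output: XidocNodes):
  # One or more plain text characters, captured as $1.
  textChars <- >+xidoc.textChar:
    output.add XidocNode(kind: xnkString, str: $1)
  # A run of whitespace becomes a single whitespace node.
  whitespace <- +Space:
    output.add XidocNode(kind: xnkWhitespace)
  # '[' name argument-text ']'; $1 is the name, $2 the unparsed argument.
  command <- '[' * >*xidoc.commandChar * >xidoc.unparsedText * ']':
    output.add XidocNode(kind: xnkCommand, name: $1, arg: $2)
  chunk <- command | textChars | whitespace
  # Zero or more chunks, then end of input (!1: no character follows).
  text <- *chunk * !1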

Yes, that's it. I don't have to implement a parser manually. No long, unreadable chains of strtok, if (s[++i] == ';') and who knows what else. I just write the grammar and a parser will be generated for me, most likely more efficient than I'd be able to write myself. But it doesn't end there. I need to implement a lot of commands for xidoc and most of them are similar in their basic logic, which would mean a lot of repetitive code. However, Nim allows me to create custom syntax for defining commands that does everything behind the scenes. For example, this is how the [color] command is defined:

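command "color", (color: expand, text: render), rendered:
  # Emit target-specific markup depending on the output format.
  case doc.target
  of tHtml:
    htg.span(style = &"color:{color}", text)
  of tLatex:
    # xcolor's svgnames option makes the SVG color names available in LaTeX.
    doc.addToHead.incl "\\usepackage[svgnames]{xcolor}"
    "\\textcolor{$1}{$2}" % [color, text]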

It couldn't be simpler. Except, what are those expand and render? That leads us to another topic…

What problems did I encounter?

Expand and render

A necessary part of expressing some text in a markup language is escaping special characters. As already mentioned, xidoc has three special characters ([, ;, ]), and it also has a way to escape them, but what I want to talk about is escaping characters in the target language.

Obviously, this is something the markup language should take care of. Nobody wants to write &lt; and &amp; all over the place, especially when doing complicated things like embedding code snippets or writing LaTeX equations.

Anyway, what's the big deal? I just make it so that when there's literal text, special characters inside it will be escaped, right?

Well, not quite. Sometimes you don't want text to be escaped, such as when passing it to an internal function that does some transformations on it before escaping it. And since commands can be arbitrarily nested inside other commands, you need to track which text has already been escaped and which commands require what kind of text. Then, text will be escaped whenever it's passed to a command that expects escaped text, but the argument is marked as unescaped.
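One way to picture this bookkeeping is a string that carries an "escaped" flag. The sketch below is in Nim with hypothetical names (Markup, rawText, toEscaped, reverseCmd, boldCmd). It is not how xidoc implements expand and render, only an illustration of escaping on demand when an unescaped argument reaches a command that expects escaped text.

```nim
type
  Markup = object
    text: string
    escaped: bool   # true once HTML special characters are escaped

proc escapeHtml(s: string): string =
  ## Escapes the characters that are special in HTML text.
  for ch in s:
    case ch
    of '&': result.add "&amp;"
    of '<': result.add "&lt;"
    of '>': result.add "&gt;"
    else: result.add ch

proc rawText(s: string): Markup =
  ## Wraps literal source text, which starts out unescaped.
  Markup(text: s, escaped: false)

proc toEscaped(m: Markup): string =
  ## Escapes on demand, and never twice.
  if m.escaped: m.text else: escapeHtml(m.text)

proc reverseCmd(arg: Markup): Markup =
  ## Hypothetical command that needs raw text: reversing escaped text
  ## would mangle entities such as "&lt;".
  doAssert(not arg.escaped, "reverseCmd expects unescaped text")
  var reversedText = ""
  for i in countdown(arg.text.high, 0):
    reversedText.add arg.text[i]
  result = Markup(text: reversedText, escaped: false)

proc boldCmd(arg: Markup): Markup =
  ## Hypothetical command that expects escaped text and emits HTML.
  Markup(text: "<b>" & toEscaped(arg) & "</b>", escaped: true)

echo toEscaped(boldCmd(reverseCmd(rawText("a<b"))))   # <b>b&lt;a</b>
```

The inner command works on the raw "a<b", and escaping happens exactly once, at the point where the text is handed to a command that produces HTML.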

This is why I came up with these pseudo-“type signatures” on commands. However, I'm not quite happy with the current system. It often requires different versions of the same command for escaped and unescaped text, which means that the user will have to be aware of this implementation detail. If anyone has an idea how to deal with this situation, please let me know.

JavaScript libraries

I want xidoc to have plenty of useful features, including LaTeX math rendering and syntax highlighting. These features would be hard to implement myself, but fortunately, there are great libraries for them: KaTeX and Prism. There's just one problem: these libraries are written in JavaScript. So how could I integrate them into a project made in Nim?

One option would be to make use of the fact that Nim can compile to JavaScript. This is a great feature of the language, and it's what enables me to have a limited version of xidoc available on the web, but I don't want JavaScript to be the primary target because that would throw away the performance benefits and require users to have something like Node.js installed.

Luckily, I found out that there is this awesome C library called Duktape which implements a small, embeddable JavaScript interpreter. It's easy to interoperate with C libraries in Nim, so it took me just about an hour to get it working. There were a few issues with it, though. By default, it just crashes whenever there's an uncaught error in the JavaScript code, so I had to monkey-patch the code to make it at least produce an error message; trying to get it to raise a Nim exception that could be caught in the wrapping code would presumably be a futile effort. Also, it only supports the ES5 version of JavaScript. I originally tried to use highlight.js rather than Prism, but I just couldn't get it to work, even after somehow transpiling it with Babel (which, by the way, is much harder than it looks, since Babel doesn't even try to work out of the box). So this is why I turned to Prism, which is written in ES5-compatible JavaScript and ended up being a better fit for the project anyway.
