The different parts of a template

I have been thinking on and off about how to make Marpa parse a template language. I tried many different ways to create a tokenizer for this parser, but until this week I couldn't get anything reliable. This week I started with the tokenizer from the example code in Jeffrey Kegler's blog post about how to develop a parser iteratively. This got me on the right track to create my parser. Let's start with 'specifying' the language.

The template language consists of two different parts. We'll call them literals and tags. The literal part will become part of the output. The tags are instructions on how to combine the literal parts. To keep this simple we only try to differentiate between those two parts. An example would be:

[% IF title %]<h1>[% title %]</h1>[% END %]

To create a solution to a problem I sometimes try to make a simpler version work first. Here a simple solution would be to create a small program that takes a string and creates a list of those two parts.

A first pass over this example with a simple parser produces this:

tag     => "[% IF title %]"
literal => "<h1>"
tag     => "[% title %]"
literal => "</h1>"
tag     => "[% END %]"

The following code creates this output.

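my $test = "[% IF title %]<h1>[% title %]</h1>[% END %]";

while (length $test) {
    # Literal: everything up to the next "[%" or the end of the string.
    if ($test =~ s/\A(?!\[%)(.+?)(?=\[%|\z)//xms) {
        print qq{literal => "$1"\n};
    }
    # Tag: from "[%" through the nearest "%]".
    elsif ($test =~ s/\A(\[%.*?%\])//xms) {
        print qq{tag     => "$1"\n};
    }
    else { die "Unterminated tag at: $test\n" }
}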

This code (especially the regexes) allows me to build the parser, because I now know how to get the literal bits from the template without having to parse the HTML that's hidden inside.
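Since the goal is a list of parts rather than printed lines, the same idea can be wrapped in a function. This is only a minimal sketch: the name split_template and the [type, text] pair layout are my own assumptions for illustration, not something Marpa requires. It dies on a "[%" that is never closed, so the loop cannot spin forever.

```perl
use strict;
use warnings;

# Hypothetical helper (name and return shape are assumptions):
# split a template into [type, text] pairs instead of printing them.
sub split_template {
    my ($template) = @_;
    my @parts;

    while (length $template) {
        # Literal: anything not starting with "[%", up to the next "[%"
        # or the end of the string.
        if ($template =~ s/\A(?!\[%)(.+?)(?=\[%|\z)//xms) {
            push @parts, [ literal => $1 ];
        }
        # Tag: from "[%" through the nearest "%]".
        elsif ($template =~ s/\A(\[%.*?%\])//xms) {
            push @parts, [ tag => $1 ];
        }
        else {
            die "Unterminated tag at: $template\n";
        }
    }

    return @parts;
}

my @parts = split_template("[% IF title %]<h1>[% title %]</h1>[% END %]");
printf qq{%-7s => "%s"\n}, @$_ for @parts;
```

Returning pairs keeps the type and the text together, which should make it easy to hand each part to the next stage one at a time.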

Now that I know how to take this string apart, I can move on and take the tags apart, which is where the parser will shine.


My name is Peter Stuifzand. You're reading my personal website.