I released an HTML parser

I wrote an HTML parser in C++. It's based on the HTML5 specification. The code can be downloaded with git:

git clone http://code.peterstuifzand.nl/git/htmlparser.git/



I use this HTML parser in an internal search engine. So far it has parsed everything the spider could find.

If you'd like to use this parser, start with the Parser class. This class expects an Emitter object, which can be a ListEmitter or your own subclass of Emitter. I wrote a simple Emitter that finds all <a href=""> tags and inserts them into a MySQL table.
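To give an idea of what such a link-finding Emitter could look like, here is a minimal sketch. It is not the code from the repository. The Tag struct with its `name` and `attributes` members is a hypothetical stand-in, because the real Tag class may expose this information differently, and the Emitter base class is a stand-in that only borrows the callback names from the TagRemover example in this post. The name LinkCollector is made up, and the links go into a vector instead of a MySQL table to keep the example self-contained.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the parser's Tag type. The real class in the
// repository may store the tag name and attributes differently.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attributes;
};

// Stand-in for the Emitter base class, using the callback names from the
// TagRemover example in this post.
class Emitter {
    public:
        virtual ~Emitter() {}
        virtual void emit_char(char c) = 0;
        virtual void emit_multichar(std::string s) = 0;
        virtual void emit_tag(const Tag& tag) = 0;
        virtual void emit_comment(std::string comment) = 0;
};

// Collects the href of every <a> tag. Text and comments are ignored.
class LinkCollector : public Emitter {
    public:
        std::vector<std::string> links;

        virtual void emit_char(char) {}
        virtual void emit_multichar(std::string) {}
        virtual void emit_comment(std::string) {}

        virtual void emit_tag(const Tag& tag) {
            if (tag.name != "a") return;
            std::map<std::string, std::string>::const_iterator it =
                tag.attributes.find("href");
            // Skip anchors without an href, or with an empty one.
            if (it != tag.attributes.end() && !it->second.empty()) {
                links.push_back(it->second);
            }
        }
};
```

After parsing, `links` holds the collected URLs in document order. Writing them to a database would be a separate step.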

Another example of an Emitter that you could write is a tag remover.

#include <iostream>
#include <string>

class TagRemover : public Emitter {
    public:
        // Pass text through unchanged.
        virtual void emit_char(char c) {
            std::cout << c;
        }
        virtual void emit_multichar(std::string s) {
            std::cout << s;
        }
        // Drop tags and comments.
        virtual void emit_tag(const Tag& tag) {}
        virtual void emit_comment(std::string comment) {}
};

This prints all the characters in an HTML page (including whitespace) and leaves out the tags and comments. Removing consecutive whitespace is left as an exercise for the reader.
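One possible take on that exercise is sketched below. It makes a few assumptions: the Emitter base class and the empty Tag struct are stand-ins that mirror the callbacks used by TagRemover, not the real classes from the repository, and the name CollapsingTagRemover is made up. The sketch collects the text in a string instead of printing it, which makes the result easy to check.

```cpp
#include <cctype>
#include <string>

// Empty stand-in for the parser's Tag type, which this sketch never inspects.
struct Tag {};

// Stand-in for the Emitter base class with the callbacks used by TagRemover.
class Emitter {
    public:
        virtual ~Emitter() {}
        virtual void emit_char(char c) = 0;
        virtual void emit_multichar(std::string s) = 0;
        virtual void emit_tag(const Tag& tag) = 0;
        virtual void emit_comment(std::string comment) = 0;
};

// Like TagRemover, but keeps a single space for each run of whitespace
// and drops leading and trailing whitespace.
class CollapsingTagRemover : public Emitter {
    public:
        std::string text;

        CollapsingTagRemover() : pending_space_(false) {}

        virtual void emit_char(char c) {
            if (std::isspace(static_cast<unsigned char>(c))) {
                // Remember the whitespace, but only emit it if more text follows.
                pending_space_ = true;
                return;
            }
            if (pending_space_ && !text.empty()) {
                text += ' ';
            }
            pending_space_ = false;
            text += c;
        }

        virtual void emit_multichar(std::string s) {
            for (std::string::size_type i = 0; i < s.size(); ++i) {
                emit_char(s[i]);
            }
        }

        virtual void emit_tag(const Tag&) {}
        virtual void emit_comment(std::string) {}

    private:
        bool pending_space_;
};
```

Tags are not treated as separators here, so the text of two adjacent elements with no whitespace between them runs together. Setting the pending-space flag in emit_tag would be one way to change that.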


My name is Peter Stuifzand. You're reading my personal website.