I released an HTML parser

I created an HTML parser written in C++. It's based on the HTML5 specification. The code can be downloaded with git:

git clone http://code.peterstuifzand.nl/git/htmlparser.git/

or browsed on the web:

http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi

I use this HTML parser in an internal search engine. So far it has parsed everything the spider could find.

If you'd like to use this parser, start with the Parser class. This class expects an Emitter object, which can be a ListEmitter or a subclass of Emitter. I wrote a simple Emitter that finds all <a href=""> tags and inserts them into a MySQL table.
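A link-finding Emitter along those lines might look roughly like the sketch below. The Tag and Emitter definitions in it are stand-ins so the example compiles on its own. The name and attributes members of Tag, the LinkCollector class and the collect_demo_links function are assumptions for illustration, not the parser's actual API, and the MySQL insert is replaced by a plain vector.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins so the sketch compiles on its own. The real Tag
// and Emitter come from the htmlparser sources and may look different.
struct Tag {
    std::string name;                               // assumed: tag name, e.g. "a"
    std::map<std::string, std::string> attributes;  // assumed: attribute name -> value
};

class Emitter {
    public:
        virtual ~Emitter() {}
        virtual void emit_char(char c) = 0;
        virtual void emit_multichar(std::string s) = 0;
        virtual void emit_tag(const Tag& tag) = 0;
        virtual void emit_comment(std::string comment) = 0;
};

// Collects the href value of every <a> tag that has one.
class LinkCollector : public Emitter {
    public:
        // Text and comments are of no interest here.
        virtual void emit_char(char) {}
        virtual void emit_multichar(std::string) {}
        virtual void emit_comment(std::string) {}

        virtual void emit_tag(const Tag& tag) {
            if (tag.name != "a") return;
            std::map<std::string, std::string>::const_iterator it =
                tag.attributes.find("href");
            if (it != tag.attributes.end()) links.push_back(it->second);
        }

        std::vector<std::string> links;
};

// Feeds two hand-made tags to the collector, the way a parser would.
std::vector<std::string> collect_demo_links() {
    LinkCollector collector;

    Tag anchor;
    anchor.name = "a";
    anchor.attributes["href"] = "http://example.com/";

    Tag paragraph;
    paragraph.name = "p";

    collector.emit_tag(anchor);
    collector.emit_tag(paragraph);

    // Only the <a> tag contributes a link.
    assert(collector.links.size() == 1);
    return collector.links;
}
```

With the real parser, you would pass the collector to a Parser instead of calling emit_tag by hand, then write the collected links to the database after parsing.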

Another example of an Emitter that you could write is a tag remover.

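#include <iostream>
#include <string>
// Emitter and Tag come from the parser's own headers.

// Prints the text of a page and drops all tags and comments.
class TagRemover : public Emitter {
    public:
        // Plain text is passed through to standard output.
        virtual void emit_char(char c) {
            std::cout << c;
        }
        virtual void emit_multichar(std::string s) {
            std::cout << s;
        }

        // Tags and comments are ignored, which removes them from the output.
        virtual void emit_tag(const Tag& tag) {}
        virtual void emit_comment(std::string comment) {}
};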

This prints every character of text in an HTML page, including whitespace. Removing consecutive whitespace is left as an exercise for the reader.
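For readers who want a starting point for that exercise, here is one possible sketch. It assumes the same four Emitter methods as above and declares empty stand-ins for Tag and Emitter so that it compiles on its own. The WhitespaceCollapser class, its std::ostream constructor argument and the collapse_demo function are made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-ins; the real Tag and Emitter live in the parser sources.
struct Tag {};

class Emitter {
    public:
        virtual ~Emitter() {}
        virtual void emit_char(char c) = 0;
        virtual void emit_multichar(std::string s) = 0;
        virtual void emit_tag(const Tag& tag) = 0;
        virtual void emit_comment(std::string comment) = 0;
};

// Like TagRemover, but squeezes each run of whitespace into one space
// and drops whitespace at the very start and end of the text.
class WhitespaceCollapser : public Emitter {
    public:
        explicit WhitespaceCollapser(std::ostream& out)
            : out_(out), pending_space_(false), seen_text_(false) {}

        virtual void emit_char(char c) {
            if (std::isspace(static_cast<unsigned char>(c))) {
                // Remember the whitespace, but do not print it yet.
                pending_space_ = true;
                return;
            }
            // Print one space for the whole run, unless nothing came before it.
            if (pending_space_ && seen_text_) out_ << ' ';
            pending_space_ = false;
            seen_text_ = true;
            out_ << c;
        }

        virtual void emit_multichar(std::string s) {
            for (std::string::size_type i = 0; i < s.size(); ++i) emit_char(s[i]);
        }

        virtual void emit_tag(const Tag&) {}
        virtual void emit_comment(std::string) {}

    private:
        std::ostream& out_;
        bool pending_space_;  // whitespace seen since the last printed character
        bool seen_text_;      // at least one non-whitespace character printed
};

// Feeds some text by hand, the way a parser would, and returns the result.
std::string collapse_demo() {
    std::ostringstream out;
    WhitespaceCollapser emitter(out);
    emitter.emit_multichar("  Hello,\n\t");
    emitter.emit_char(' ');
    emitter.emit_multichar("world  ");
    assert(out.str() == "Hello, world");
    return out.str();
}
```

Because tags are still ignored, text from adjacent elements runs together when the markup has no whitespace between them. Setting pending_space_ in emit_tag would be one way to treat tags as word breaks.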

Welcome

My name is Peter Stuifzand. You're reading my personal website.

Profiles

Peter Stuifzand
peter@peterstuifzand.nl
Zwolle, The Netherlands