Peter Stuifzand

I released an HTML parser

I created an HTML parser written in C++. It’s based on the HTML5 specification. The code can be downloaded with git:

git clone http://code.peterstuifzand.nl/git/htmlparser.git/

or browsed online:

http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi

I use this HTML parser in an internal search engine. It has parsed everything the spider could find.

If you would like to use this parser, look for the Parser class. This class expects an Emitter object, which can be a ListEmitter or a subclass of Emitter. I wrote a simple Emitter that finds all <a href=""> tags and inserts them into a MySQL table.
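To give an idea of what such a link-finding Emitter might look like, here is a rough sketch that collects href values in a vector instead of writing them to MySQL. The Tag struct and the Emitter base class below are hypothetical stand-ins so that the example compiles on its own. The real classes in the repository may well look different, in particular the way a Tag exposes its name and attributes, so check the headers before copying this.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: the real Tag class in htmlparser may differ.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attributes;
};

// Hypothetical stand-in for the library's Emitter base class, limited to
// the four callbacks used by the TagRemover example in this post.
class Emitter {
    public:
        virtual ~Emitter() {}
        virtual void emit_char(char c) = 0;
        virtual void emit_multichar(std::string s) = 0;
        virtual void emit_tag(const Tag& tag) = 0;
        virtual void emit_comment(std::string comment) = 0;
};

// Remembers the href of every <a> tag and ignores everything else.
class LinkCollector : public Emitter {
    public:
        virtual void emit_char(char) {}
        virtual void emit_multichar(std::string) {}
        virtual void emit_comment(std::string) {}

        virtual void emit_tag(const Tag& tag) {
            if (tag.name != "a") return;
            std::map<std::string, std::string>::const_iterator it =
                tag.attributes.find("href");
            // Skip anchors without an href or with an empty one.
            if (it != tag.attributes.end() && !it->second.empty()) {
                links_.push_back(it->second);
            }
        }

        const std::vector<std::string>& links() const { return links_; }

    private:
        std::vector<std::string> links_;
};
```

After parsing, links() holds the collected URLs in document order, and writing them to a database can happen in a separate step.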

Another example of an Emitter that you could write is a tag remover.

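#include <iostream>
#include <string>
// Also include the htmlparser header that declares Emitter and Tag.

// Prints the text content of a page and drops all tags and comments.
class TagRemover : public Emitter {
    public:
        // Character data is passed through to standard output unchanged.
        virtual void emit_char(char c) {
            std::cout << c;
        }
        virtual void emit_multichar(std::string s) {
            std::cout << s;
        }

        // Tags and comments are ignored, which removes them from the output.
        virtual void emit_tag(const Tag& tag) {}
        virtual void emit_comment(std::string comment) {}
};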

This will show all characters in an HTML page (including whitespace). Removing consecutive whitespace is left as an exercise for the reader.
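For readers who want a starting point for that exercise, one possible approach is a small helper that remembers whether whitespace was seen and emits a single space before the next non-whitespace character. WhitespaceCollapser is a made-up name and not part of the parser. This sketch collects the text in a string instead of printing it, and it also drops leading and trailing whitespace, which may or may not be what you want.

```cpp
#include <cctype>
#include <string>

// Standalone helper (not part of htmlparser): collapses every run of
// whitespace into one space and drops leading and trailing whitespace.
class WhitespaceCollapser {
    public:
        void feed(char c) {
            if (std::isspace(static_cast<unsigned char>(c))) {
                // Only remember whitespace once some text has been seen.
                pending_space_ = !text_.empty();
                return;
            }
            if (pending_space_) {
                text_ += ' ';
                pending_space_ = false;
            }
            text_ += c;
        }

        void feed(const std::string& s) {
            for (char c : s) {
                feed(c);
            }
        }

        // The collapsed text collected so far.
        const std::string& text() const { return text_; }

    private:
        std::string text_;
        bool pending_space_ = false;
};
```

A TagRemover could hold one of these as a member, call feed(c) from emit_char and feed(s) from emit_multichar, and print text() once parsing is done.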

© 2023 Peter Stuifzand