I wrote an HTML parser in C++. It’s based on the HTML5 specification. The code can be downloaded with git:
git clone http://code.peterstuifzand.nl/git/htmlparser.git/
or browsed online:
http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi
I use this HTML parser in an internal search engine. It parsed everything the spider could find.
If you'd like to use this parser, look for the Parser class. This class expects an Emitter object, which can be a ListEmitter or a subclass of Emitter. I wrote a simple Emitter that finds all <a href=""> tags and inserts them into a MySQL table.
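To give an idea of what such a link-finding Emitter might look like, here is a minimal sketch. It is not the code I run: the Tag and Emitter declarations below are hypothetical stand-ins so the example compiles on its own (the real classes in the repository may expose tag names and attributes differently), and the links are collected in a vector instead of being written to MySQL.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins so the sketch compiles on its own.
// The real Tag and Emitter live in the parser sources and may differ.
struct Tag {
    std::string name;                               // assumed: lowercase tag name
    std::map<std::string, std::string> attributes;  // assumed: attribute name -> value
};

class Emitter {
public:
    virtual ~Emitter() {}
    virtual void emit_char(char c) = 0;
    virtual void emit_multichar(std::string s) = 0;
    virtual void emit_tag(const Tag& tag) = 0;
    virtual void emit_comment(std::string comment) = 0;
};

// Collects the href value of every <a> tag it is handed.
class LinkCollector : public Emitter {
public:
    // Text and comments are irrelevant for link extraction.
    virtual void emit_char(char) {}
    virtual void emit_multichar(std::string) {}
    virtual void emit_comment(std::string) {}

    virtual void emit_tag(const Tag& tag) {
        if (tag.name != "a") return;
        auto it = tag.attributes.find("href");
        // Skip anchors without an href, or with an empty one.
        if (it != tag.attributes.end() && !it->second.empty()) {
            links.push_back(it->second);
        }
    }

    std::vector<std::string> links;  // hrefs in document order
};
```

After parsing, the links member holds the hrefs in the order they were seen, and writing them to a database becomes a separate step.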
Another Emitter that you could write is a tag remover:
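    #include <iostream>
    #include <string>

    // Emitter and Tag come from the parser's own headers.
    // Prints the text content of a page and drops all tags and comments.
    class TagRemover : public Emitter {
    public:
        // Plain characters are written straight to standard output.
        virtual void emit_char(char c) {
            std::cout << c;
        }
        virtual void emit_multichar(std::string s) {
            std::cout << s;
        }
        // Tags and comments are ignored, so they never reach the output.
        virtual void emit_tag(const Tag& tag) {}
        virtual void emit_comment(std::string comment) {}
    };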
This prints all the text in an HTML page (including whitespace) and leaves out the tags and comments. Removing consecutive whitespace is left as an exercise for the reader.