I just released a small C++ utility library. At the moment it contains three modules: a URL class, a config file reader, and a testing framework.
The code can be found at http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi
or use git to clone the project:
git clone http://code.peterstuifzand.nl/git/cpputil.git
The URL class is used in a search engine that I am currently writing. The code doesn’t do much more than I need at the moment. I want to make this code better.
The config file reader reads a config file and returns a std::map. A config file could look like:
url.protocol=http
The key and value are separated by an equals sign. You can also include comments by putting a # as the first character of a line.
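To make the format concrete, here is a minimal sketch of a reader for it. This is not the library's actual code: the function name read_config is hypothetical, and I am assuming the map holds string keys and string values.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not the cpputil API: reads key=value lines from a
// stream into a map. Assumes both keys and values are plain strings.
std::map<std::string, std::string> read_config(std::istream& in) {
    std::map<std::string, std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        // Skip blank lines and comment lines that start with '#'.
        if (line.empty() || line[0] == '#') {
            continue;
        }
        // Split on the first '='; lines without one are ignored in this sketch.
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) {
            continue;
        }
        values[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return values;
}
```

Feeding it the example line above would give a map in which the key url.protocol maps to http. Whitespace around the equals sign is kept as-is here; a real reader might want to trim it.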
The testing framework is based on the Test Anything Protocol (TAP). This is the text-based protocol used by Perl. The nice thing is that you can use tools that are normally used to test Perl code, for example:
$ prove --exec '' ./test_url
This command runs prove on a test program written in C++; the empty --exec argument makes prove run the file directly instead of handing it to Perl. At the moment this has the following output:
./test_url....ok
All tests successful.
Files=1, Tests=37, 0 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
Result: PASS
As you can see, I have written 37 tests for the URL class.
I will try to release a C++ HTML parser in the next few days. This parser is based on the specification of the HTML5 parser. It is not complete (not all states are implemented) and it only works with a string containing the whole HTML document.
The states that are not implemented are mainly related to the parsing of the doctype. Currently the doctype is skipped when found. I have already used the parser to parse 7007 unique HTML files without problems. I haven’t run it on pages from the wider web, and at the moment I’m not planning to.
I have to say that parsing HTML this way is really nice. The code for finding all URLs is only four lines (not including class and method declarations, curly braces, and index generation):
if (tag.get_name() == "a") {
    for (std::vector<Attr>::const_iterator it = tag.get_attrs().begin(); it != tag.get_attrs().end(); ++it) {
        if (it->get_name() == "href") {
            std::string url = base.relative(URL(it->get_value())).to_string();
            // ...
        }
    }
}
This code uses the URL class that I mentioned before. It could still be a lot smaller by improving the Tag class.
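One possible improvement, as a rough sketch: give Tag a lookup method so the caller doesn't have to loop over the attributes. The Attr and Tag classes below are stand-ins that only model the getters used above, and find_attr is a hypothetical name, not something the parser has today.

```cpp
#include <string>
#include <vector>

// Stand-in for the parser's Attr class; only the getters used in the post.
class Attr {
public:
    Attr(const std::string& name, const std::string& value)
        : name_(name), value_(value) {}
    const std::string& get_name() const { return name_; }
    const std::string& get_value() const { return value_; }

private:
    std::string name_;
    std::string value_;
};

// Stand-in for the parser's Tag class, plus one hypothetical addition.
class Tag {
public:
    Tag(const std::string& name, const std::vector<Attr>& attrs)
        : name_(name), attrs_(attrs) {}
    const std::string& get_name() const { return name_; }
    const std::vector<Attr>& get_attrs() const { return attrs_; }

    // Hypothetical: pointer to the value of the first attribute with this
    // name, or a null pointer when the tag has no such attribute.
    const std::string* find_attr(const std::string& name) const {
        for (std::vector<Attr>::const_iterator it = attrs_.begin();
             it != attrs_.end(); ++it) {
            if (it->get_name() == name) {
                return &it->get_value();
            }
        }
        return 0;
    }

private:
    std::string name_;
    std::vector<Attr> attrs_;
};
```

With something like this, finding a link comes down to checking that tag.get_name() is "a" and that tag.find_attr("href") returns a non-null pointer, which can then be passed to the URL class as before.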