The writings of Peter Stuifzand

Archive for August 2008

Do you know Ack? It's a grep-like program that uses Perl regular expressions instead of the usual POSIX ones. You can find it on CPAN.

The following Ack call checks your Perl code for a missing space after a control statement keyword.

ack --perl '(?<!\w)(if|while|elsif|return)\('

If you use Vim you can also use the following piece of vimscript in your .vimrc file:

highlight WHITE_ON_RED ctermfg=white ctermbg=red

function! BadNonInvocations ()
    2match WHITE_ON_RED /\w\@<!\(if\|elsif\|while\|return\|for\)(/
endfunction
call BadNonInvocations()

I created a small Perl program to convert relative dates to absolute dates in the format that I use for my calendar. My current calendar file looks like this.

2008
    08
        2008-08-12
        2008-08-13
    09
        2008-09-10
        ...

If I want to add a date and I don't know the actual numbers, I can use the following program to convert the date. It will also respect the whitespace in front of the text.

#!/usr/bin/perl -w
use v5.10;

use strict;
use warnings;

use Date::Manip;

Date_Init('Language=Dutch');

my $inp = <>;

if (my ($ws, $date) = $inp =~ m/^(\s*)(.+)$/) {
    say $ws . UnixDate($date, "%Y-%m-%d");
}
else {
    print $inp;
}

I use Date::Manip to parse the date. It understands natural-language dates like thursday. I added the Date_Init call so it also parses Dutch day names like donderdag. I use Perl 5.10 because I can, and because it has some nice features, such as say.

To use it, put this script somewhere in your path; I called it refdate.pl. Then, in Vim, type a date on a line of its own and type !!refdate.pl<Enter>. This filters the current line through the program, which replaces the date with its YYYY-MM-DD form.

In 2005 I wrote a small post about the programs that I use on a regular basis. I will try and find out what has changed and what stayed the same.

I still use:

Programs that I don't use anymore:

Programs that I use now:

This isn't a complete list of all the programs that I use. There are more, but those are more on the backend of things; these are all user programs. I could also add Apache or Perl to this list, but they are not really useful here.

I created an HTML parser written in C++. It's based on the HTML5 specification. The code can be downloaded with git:

git clone http://code.peterstuifzand.nl/git/htmlparser.git/

or:

http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi

I use this HTML parser in an internal search engine. It parsed everything the spider could find.

If you'd like to use this parser, look for the Parser class. This class expects an Emitter object. This can be a ListEmitter or a subclass of Emitter. I wrote a simple Emitter that finds all <a href=""> tags and inserts them into a MySQL table.

Another example of an Emitter that you could write is a tag remover.

class TagRemover : public Emitter {
    public:
        virtual void emit_char(char c) {
            std::cout << c;
        }
        virtual void emit_multichar(std::string s) {
            std::cout << s;
        }

        virtual void emit_tag(const Tag& tag) {}
        virtual void emit_comment(std::string comment) {}
};

This will show all characters in an HTML page (including whitespace). Removing consecutive whitespace is left as an exercise for the reader.

I just released a small C++ utility library. At the moment it contains three modules: a URL class, a config file reader, and a testing framework.
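One possible approach to that exercise, sketched as a standalone function rather than as part of an Emitter. The name collapse_whitespace is made up for this example and is not part of the parser; it assumes that every run of whitespace should become a single space.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of the parser: collapses every run of
// whitespace characters (spaces, tabs, newlines) into a single space.
std::string collapse_whitespace(const std::string& in) {
    std::string out;
    bool in_space = false;  // true while we are inside a run of whitespace
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        // std::isspace needs a value representable as unsigned char.
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (std::isspace(c)) {
            if (!in_space) {
                out += ' ';  // emit one space for the whole run
            }
            in_space = true;
        }
        else {
            out += in[i];
            in_space = false;
        }
    }
    return out;
}
```

A TagRemover could collect its output in a std::string instead of writing straight to std::cout, and pass that string through this function before printing it.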

The code can be found at http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi

or use git to clone the project:

git clone http://code.peterstuifzand.nl/git/cpputil.git

The URL class is used in a search engine that I am currently writing. The code doesn't do much more than I need at the moment. I want to make this code better.

The config file reader reads a config file and returns a std::map. A config file could look like:

url.protocol=http

The key and value are separated by an equals sign. You can also include comments by putting a # as the first character of a line.

The testing framework is based on the Test Anything Protocol. This is the text-based protocol used by Perl. The nice thing is that you can use tools that are normally used to test Perl code, for example:
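To make the format concrete, here is a minimal sketch of a reader for it. The function names read_config and read_config_string are assumptions for this example, not the library's interface. The sketch splits each line on the first equals sign, skips empty lines and comment lines, ignores lines without an equals sign, and does not trim whitespace.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the library's reader; the real interface may differ.
std::map<std::string, std::string> read_config(std::istream& in) {
    std::map<std::string, std::string> config;
    std::string line;
    while (std::getline(in, line)) {
        // Skip empty lines and comment lines that start with '#'.
        if (line.empty() || line[0] == '#') {
            continue;
        }
        // Split on the first '='; lines without one are ignored in this sketch.
        std::string::size_type pos = line.find('=');
        if (pos == std::string::npos) {
            continue;
        }
        config[line.substr(0, pos)] = line.substr(pos + 1);
    }
    return config;
}

// Convenience wrapper for config text that is already held in a string.
std::map<std::string, std::string> read_config_string(const std::string& text) {
    std::istringstream in(text);
    return read_config(in);
}
```

Given the example line above, the returned map would hold the value http under the key url.protocol.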

$ prove --exec '' ./test_url

This command will execute the prove command on a test written in C++. At the moment this has the following output:

./test_url....ok     
All tests successful.
Files=1, Tests=37,  0 wallclock secs ( 0.01 usr +  0.00 sys =  0.01 CPU)
Result: PASS

As you can see I have written 37 tests for the URL class.
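For readers who have not seen the Test Anything Protocol: a test program prints a plan line such as 1..2, followed by one "ok" or "not ok" line per test. The sketch below shows a minimal producer of that output. The class TapWriter and its methods are made up for this example and are not the interface of the testing framework in the library.

```cpp
#include <iostream>
#include <ostream>
#include <string>

// Hypothetical minimal TAP producer; names are assumptions for this sketch.
class TapWriter {
    public:
        // Writes to std::cout unless another stream is given.
        explicit TapWriter(std::ostream& out = std::cout)
            : out_(out), count_(0), failed_(0) {}

        // Print the plan line, e.g. "1..2", announcing how many tests follow.
        void plan(int tests) {
            out_ << "1.." << tests << "\n";
        }

        // Print one result line: "ok N - name" or "not ok N - name".
        bool ok(bool passed, const std::string& name) {
            ++count_;
            if (!passed) {
                ++failed_;
                out_ << "not ";
            }
            out_ << "ok " << count_ << " - " << name << "\n";
            return passed;
        }

        // Number of failed tests so far.
        int failed() const { return failed_; }

    private:
        std::ostream& out_;
        int count_;
        int failed_;
};
```

A test program would call plan once and ok for every check. Because the output is plain TAP, a harness such as prove can read it the same way it reads the output of a Perl test.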

I will try to release a C++ HTML parser in the next few days. This parser is based on the specification of the HTML5 parser. It is not complete (not all states are implemented) and it only works with a string containing the whole HTML document.

The states that are not implemented are mainly related to the parsing of the doctype. Currently the doctype is skipped when found. I have already used the parser to parse 7007 unique HTML files without problems. I haven't tried it on pages from the wider web, and at the moment I'm not planning to.

I have to say that parsing HTML this way is really nice. The code for finding all URLs is only four lines (not including class and method declarations, curly braces, and index generation):

if (tag.get_name() == "a") {
    for (std::vector<Attr>::const_iterator it = tag.get_attrs().begin(); it != tag.get_attrs().end(); ++it) {
        if (it->get_name() == "href") {
            std::string url = base.relative(URL(it->get_value())).to_string(); 
            // ...
        }
    }
}

This code uses the URL class that I mentioned before. This code could still be a lot smaller by improving the Tag class.
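One way the Tag class could be improved, sketched under assumptions: the Attr and Tag classes below are minimal stand-ins that only mirror the accessors used in the snippet above, and find_attr and add_attr are hypothetical methods that do not exist in the parser.

```cpp
#include <string>
#include <vector>

// Minimal stand-in for the parser's Attr class; the real one may differ.
class Attr {
    public:
        Attr(const std::string& name, const std::string& value)
            : name_(name), value_(value) {}
        const std::string& get_name() const { return name_; }
        const std::string& get_value() const { return value_; }
    private:
        std::string name_;
        std::string value_;
};

// Minimal stand-in for the parser's Tag class; the real one may differ.
class Tag {
    public:
        explicit Tag(const std::string& name) : name_(name) {}
        void add_attr(const Attr& attr) { attrs_.push_back(attr); }
        const std::string& get_name() const { return name_; }
        const std::vector<Attr>& get_attrs() const { return attrs_; }

        // Hypothetical addition: return a pointer to the value of the first
        // attribute with the given name, or 0 if the tag has no such attribute.
        const std::string* find_attr(const std::string& name) const {
            for (std::vector<Attr>::const_iterator it = attrs_.begin(); it != attrs_.end(); ++it) {
                if (it->get_name() == name) {
                    return &it->get_value();
                }
            }
            return 0;
        }
    private:
        std::string name_;
        std::vector<Attr> attrs_;
};
```

With a method like this, the loop over the attributes would move into the Tag class, and the emitter would only need to check the tag name and ask for the href value.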
