Posted August 30, 2008
Do you know Ack? It's a grep-like program that uses Perl regular expressions instead of the normal POSIX ones. You can find it on CPAN.
The following ack call checks your Perl code for a missing space after a control-statement keyword.
ack --perl '(?<!\w)(if|while|elsif|return)\('
If you use Vim you can also use the following piece of vimscript in your .vimrc file:
highlight WHITE_ON_RED ctermfg=white ctermbg=red
function! BadNonInvocations ()
2match WHITE_ON_RED /\w\@<!\(if\|elsif\|while\|return\|for\)(/
endfunction
call BadNonInvocations()
Posted August 18, 2008
I created a small Perl program to convert relative dates to absolute dates in
the format that I use for my calendar. My current calendar file looks like this.
2008
08
2008-08-12
2008-08-13
09
2008-09-10
...
If I want to add a date and I don't know the actual numbers, I can use the
following program to convert the date. It will also respect the whitespace in
front of the text.
#!/usr/bin/perl -w
use v5.10;
use strict;
use warnings;
use Date::Manip;
Date_Init('Language=Dutch');
my $inp = <>;
if (my ($ws, $date) = $inp =~ m/^(\s*)(.+)$/) {
say $ws . UnixDate($date, "%Y-%m-%d");
}
else {
print $inp;
}
I use Date::Manip for parsing the date. It understands human-style dates like
Thursday. I added Date_Init so that it also parses Dutch day names like
donderdag. I use Perl 5.10 because I can; it has some nice features, such as say.
To use it, put the script somewhere in your PATH; I called it refdate.pl. In
Vim, type a date on a line and then type !!refdate.pl<Enter>. This filters the
current line through the program, which rewrites the date in YYYY-MM-DD format.
Posted August 8, 2008
In 2005 I wrote a small post about the programs that I use on a regular
basis. I will try to find out what has changed and what has stayed the same.
I still use:
Programs that I don't use anymore:
Programs that I use now:
This isn't a complete list of all the programs that I use. There are more
programs that I use, but those are more on the backend of things. These are all
user programs. I could also add Apache or Perl to this list, but they are not
really useful here.
Posted August 6, 2008
I created an HTML parser written in C++. It's based on the HTML5 specification.
The code can be downloaded with git:
git clone http://code.peterstuifzand.nl/git/htmlparser.git/
or:
http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi
I use this HTML parser in an internal search engine. It parsed everything the
spider could find.
If you would like to use this parser, look for the Parser class. This class
expects an Emitter object. This can be a ListEmitter or a subclass of
Emitter. I wrote a simple Emitter that finds all <a href=""> tags and
inserts them into a MySQL table.
Another example of an Emitter that you could write is a tag remover.
class TagRemover : public Emitter {
public:
virtual void emit_char(char c) {
std::cout << c;
}
virtual void emit_multichar(std::string s) {
std::cout << s;
}
virtual void emit_tag(const Tag& tag) {}
virtual void emit_comment(std::string comment) {}
};
This will show all characters in an HTML page (including whitespace). Removing
consecutive whitespace is left as an exercise for the reader.
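For readers who want a head start on that exercise, here is a minimal sketch of
one way to collapse whitespace. The function name collapse_whitespace is
hypothetical and not part of the htmlparser code; it works on a plain
std::string, so it makes no assumptions about the Emitter interface.

```cpp
#include <cctype>
#include <string>

// Collapse every run of whitespace characters into a single space.
// Hypothetical helper for illustration; not part of the htmlparser code.
std::string collapse_whitespace(const std::string& text) {
    std::string result;
    bool in_whitespace = false;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // Cast to unsigned char: std::isspace is undefined for negative values.
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) {
            // Emit one space for the first whitespace character of a run only.
            if (!in_whitespace) {
                result += ' ';
            }
            in_whitespace = true;
        } else {
            result += text[i];
            in_whitespace = false;
        }
    }
    return result;
}
```

The same flag could probably be kept as a member of TagRemover and checked in
emit_char, so that the output is collapsed while the parser is still running.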
Posted August 5, 2008
I just released a small C++ utility library. At the moment it contains three
modules: a URL class, a config file reader, and a testing framework.
The code can be found at http://code.peterstuifzand.nl/cgi-bin/gitweb.cgi
or use git to clone the project:
git clone http://code.peterstuifzand.nl/git/cpputil.git
The URL class is used in a search engine that I am currently writing. The code
doesn't do much more than I need at the moment. I want to make this code better.
The config file reader reads a config file and returns a std::map. A config
file could look like:
url.protocol=http
The key and value are separated by an equals sign. You can also include comments by
putting a # as the first character of a line.
The testing framework is based on the Test Anything Protocol (TAP). This is the
text-based protocol used by Perl's testing tools. The nice thing is that you can
use tools that are normally used to test Perl code, for example:
$ prove --exec '' ./test_url
This runs prove on a test program written in C++. The empty --exec argument
tells prove to run the test file directly instead of through perl. At the
moment this has the following output:
./test_url....ok
All tests successful.
Files=1, Tests=37, 0 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
Result: PASS
As you can see I have written 37 tests for the URL class.
I will try to release a C++ HTML parser in the next few days. This parser is
based on the specification of the HTML5 parser. It is not complete (not all
states are implemented) and it only works with a string containing the whole
HTML document.
The states that are not implemented are mainly related to the parsing of the
doctype. Currently the doctype is skipped when found. I have already used the
parser to parse 7007 unique HTML files without problems. I haven't tried it on
pages from the wider web, and at the moment I'm not planning to.
I have to say that parsing HTML this way is really nice. The code for finding
all URLs is only four lines (not including class and method declarations,
curly braces, and index generation):
if (tag.get_name() == "a") {
for (std::vector<Attr>::const_iterator it = tag.get_attrs().begin(); it != tag.get_attrs().end(); ++it) {
if (it->get_name() == "href") {
std::string url = base.relative(URL(it->get_value())).to_string();
// ...
}
}
}
This code uses the URL class that I mentioned before. This code could still be
a lot smaller by improving the Tag class.