The writings of Peter Stuifzand

Archive for April 2011

In the article about why we don't use more data structures in web applications I showed an example of PHP code. I copied the code below. The example shows how we start to write web applications. I will rewrite the code so it says one thing: insert a person into the database.

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    db_person_insert($db, $name, $address, $city);
?>

Two things happen in this example. First we get the variables from the $_POST variable. And then we insert the values into the database. First I'll add the 'code' of the db_person_insert function as well.

<?php
function db_person_insert($db, $name, $address, $city) {
    // some SQL code...
}
?>

The function accepts four parameters: one database connection, a name, an address and a city. The last three arguments are related because they talk about the same person. The code however doesn't show us that. Let's make a change so it does.

<?php
function db_person_insert($db, $person) {
    // some SQL code...
    // use $person['name'], $person['address'] and $person['city']
}
?>

The interface of the function is much cleaner now and doesn't have to change when we change what data we want to know about people. In the previous article I showed what happened when we added the phone number. Now only the implementation of the function needs to change.

Let's go back to the first example. This code has to change to work with the new db_person_insert function.

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    db_person_insert($db, 
        array('name' => $name, 'address' => $address, 'city' => $city)
    );
?>

This code is already better, because it shows that there is only one argument that is used as a whole. But there is another change we can make: the three variables at the top refer to the same person. Why don't we have one variable representing that person?

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    $person = array(    
      'name'    => $name,
      'address' => $address,
      'city'    => $city
    );
    db_person_insert($db, $person);
?>

The structure of the program becomes a bit clearer again. Let's take it one step further and create a function that gets the values from the post variable.

<?php
    $person = person_from_postarray($_POST);
    db_person_insert($db, $person);

    function person_from_postarray($arr) {
        $name    = $arr['name'];
        $address = $arr['street'];
        $city    = $arr['city'];
        $person = array(    
          'name'    => $name,
          'address' => $address,
          'city'    => $city
        );
        return $person;
    }
?>

This still isn't the best version of the code, because there are still a few pieces of duplication. The person_from_postarray function is a bit sad, because it only makes a copy of particular fields from $arr to the new array $person. So let's rename the function.

function person_copy($arr) {
    $name    = $arr['name'];
    $address = $arr['street'];
    $city    = $arr['city'];
    $person = array(    
      'name'    => $name,
      'address' => $address,
      'city'    => $city
    );
    return $person;
}

This piece of code is more general than the previous version, because it can be used for more problems, while it still makes sense. Now we make the change to make it more like a function that copies a person. The first step.

function person_copy($arr) {
    $person = array();
    $person['name'] = $arr['name'];
    $person['address'] = $arr['street'];
    $person['city'] = $arr['city'];
    return $person;
}

This makes the structure very obvious, but there is one field that's problematic: the street or address field. The nice thing about where we're going with this is that we can make our lives simpler by making the fields of the person the same everywhere. So one thing we could do would be to change the street field to the address field, or the other way around. The solution depends on what you can change. Let's say we name the field 'address' and change the web interface.

function person_copy($arr) {
    $person = array();
    foreach (array('name', 'address', 'city') as $field) {
        $person[$field] = $arr[$field];
    }
    return $person;
}

Done! This function now creates a new person based on values from another array. By representing a person as one thing, we can simplify the code that works with it.
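To see what this buys us, here is a minimal sketch of the same idea in Python rather than PHP. The name PERSON_FIELDS and the sample form values are made up for illustration. Adding the phone number from the previous article now means extending one list, while the function itself stays untouched.

```python
# Hypothetical field list: the one place that defines what a person is.
PERSON_FIELDS = ["name", "address", "city"]


def person_copy(arr, fields=PERSON_FIELDS):
    """Return a new dict holding only the person fields found in arr."""
    return {field: arr[field] for field in fields}


# Stand-in for the submitted form; extra keys such as 'submit' are ignored.
post = {"name": "Alice", "address": "Main Street 1", "city": "Utrecht",
        "phone": "555-0100", "submit": "Save"}

person = person_copy(post)
person_with_phone = person_copy(post, PERSON_FIELDS + ["phone"])
```

The caller never names the individual fields, so it does not have to change when the definition of a person does.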

Doc Searls says in Let's move tweeting off Twitter that most online communication isn't owned by one company:

Blogging, emailing and messaging aren’t owned by anybody. Tweeting is owned by Twitter. That’s a problem.

It is one of the points I tried to make a few days ago. I don't think we have to move away from Twitter. But I do think we need to at least make it possible for every company (and person) to have their own microblogging instance. And all these instances should be able to work together.

The best solution is to make many thousands of interoperating services, just like email. This allows each company to host their own version. And all these services will work together. That way I can subscribe to your messages and you can subscribe to mine. If no subscriptions exist between two services, there are no messages transferred between them.

There is this question that keeps coming back when I see the code of web applications. It keeps coming back, probably because it relates to the basics of our field.

Why do we forget what we learned about data structures the moment we start to write web applications?

The question points at three things. Why do we forget about data structures? Why don't we write better code for web applications? The last point is a passive-aggressive jab at all web developers. And I'm one of them.

This question seems to come up a lot for me. Probably because more often than not, I write web applications. Last year I wrote about this problem and I tried to think of a different way to tackle the web application problem. Without much progress I might say.

Let's take a look at some of the things written in the past about the importance of data structures and algorithms.

Rob Pike wrote in Notes on Programming in C:

Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self evident. Data structures, not algorithms, are central to programming. (See Brooks p. 102.)

And of course Knuth wrote "The Art of Computer Programming", four huge volumes on algorithms and data structures. So they must be important. Even the title of the book from Wirth, "Algorithms + Data Structures = Programs", says as much. Without data structures or algorithms we aren't even writing programs.

That's a bit lazy, but I think the most important thing we can learn from this, and we all already know this, is that data structures and algorithms are the basic building blocks of the programs we write.

But if data structures and algorithms are so important, why don't we use more of them in web applications?

There are two reasons for this to happen. First there is more emphasis on the algorithms, on the how, than on the data structures. And second, many of these algorithms and data structures are implicit.

Let's start with an example of an implicit data structure (in PHP).

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    db_person_insert($db, $name, $address, $city);
?>

Here the data structure of the person is implicit. There are three variables that are related. Only two things show they're related: the grouping of the variables and the db_person_insert function. There is no mention of something called a person, except in the name of the db_person_insert function.

And now the example of an implicit algorithm (again in PHP).

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    db_person_insert($db, $name, $address, $city);
?>

You could think I made an error by showing you the same code. I'm sorry to disappoint you. Let me explain. The algorithm contains two steps. (1) Get the person object from the $_POST variable. (2) Insert the person object into the database.

One problem with this code is that the moment you want to change what it means to be a person in your application, you need to make many changes all over. Let's say your form also needs to handle phone numbers. Let's make the change in the code.

<?php
    $name    = $_POST['name'];
    $address = $_POST['street'];
    $city    = $_POST['city'];
    $phone   = $_POST['phone'];
    db_person_insert($db, $name, $address, $city, $phone);
?>

I added one line to get the phone number and one parameter to the db_person_insert function, but I also have to check every other place where I call the db_person_insert function. You can make this change for every field that needs to be added, but you need to make a lot of changes in a lot of places. That sucks, but now imagine that the db_person_insert function was actually a little bit of SQL code, sprinkled around your code.

The other problem is that it's not obvious that we are working with people (as in more than one person). This code shows how we do it, but not what we do.

Especially in web applications there seems to be no thought at all going into the structure of the data that's used inside the applications. There are some objects, or maybe there is a database, but these things are not the actual data the program operates on.

Code in web applications feels shoddily written, without a kind of bigger picture. For example if you display a product in the interface, why don't you have a data structure representing a product? Or if you have a form for creating a new product, why don't you have a data structure representing that?

Without data structures you have to write certain code again and again. And while this apparently is not a big enough problem to be solved once and for all, I think some of us know there is something wrong.

Without data structures you can't have a group of operations working on a certain kind of data. You also can't reuse code, because the data is just different enough each time.

Let's look at a few more examples. How many times have you written a login form and authentication? Isn't every login screen the same? The user provides a username and password. That could be a data structure. Which user object is identified by these two values? Could that be a function?
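As a rough sketch of what that could look like, here are the two pieces in Python. The Credentials and User types, the authenticate function and the in-memory user list are all invented for illustration, and a real application would use a random salt per user and its own storage.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Credentials:
    """What every login form submits."""
    username: str
    password: str


@dataclass(frozen=True)
class User:
    username: str
    salt: bytes
    password_hash: bytes


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2 from the standard library."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def authenticate(users: List[User], credentials: Credentials) -> Optional[User]:
    """Return the user identified by the credentials, or None."""
    for user in users:
        if user.username != credentials.username:
            continue
        candidate = hash_password(credentials.password, user.salt)
        if hmac.compare_digest(candidate, user.password_hash):
            return user
    return None


salt = b"example-salt"  # illustration only; use a random salt per user
users = [User("peter", salt, hash_password("secret", salt))]
```

With the credentials as one value and the lookup as one function, the login screen of the next application only has to build a Credentials value and call authenticate.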

Or an interface to order certain objects. There is a structure to ordering things that transcends the type of objects. For example if I want to order images, or photo albums, or comments, or videos, or products, or menu items. The way to do this is always the same. If the data is in memory we know how to do this.

// C++, compile with:
//    gcc -o test test.cpp -Wall -std=c++0x -lstdc++

#include <algorithm> // copy
#include <utility>   // swap
#include <vector>
#include <iostream>
#include <iterator>

using namespace std;
int main()
{
    vector<int> numbers = { 1, 2, 3, 4 };
    swap(numbers[0], numbers[1]);
    copy(numbers.begin(), numbers.end(),
            ostream_iterator<int>(cout, "\n"));
    return 0;
}

If I want to change the order of items in a vector, I call the swap function. Other languages change the order with different functions.

But somehow when we start to create web applications we need to create a new solution for this. It seems almost as if there is a need for people to be original in every line of code they write. Maybe there is something called False Originality, in the tradition of False Laziness and friends. Or maybe it's "Not Invented Here".

If we don't have a data structure, we can't write a function that modifies that data. So instead of creating more general functions, we write functions that call other functions until we get to the bottom of our software stack, which could be the database or the file system.

Without the interface of your data structure to write a function for, there is no way you can write code without duplication. And without a similar looking interface you can't even see the duplication that's there. Without the data structure we can't say that certain things are similar. We can't describe it. We have no place to write the code.

For example, I want to create an interface that allows the user to reorder photos. The web interface could be really simple for this example. Every item has an id and a priority. Sort the ids according to the priorities.

The web framework calls a controller method. This controller method parses the arguments. We can write a general function that parses the arguments and creates a list of number pairs: [ (id, prio)... ]. I hope this notation makes sense. It describes one pair with two fields id and prio contained in a list of these pairs. The parser function returns that list and another function will write this order to a database, one at a time.

The code that reorders items could be used for every table that has a primary key and a priority field. But before that can work, we need ways to describe data structures in the code. If we don't have data structures, we can't write general algorithms to work with them.
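A minimal sketch of these two functions, in Python with an in-memory SQLite database, could look like the code below. The function names, the shape of the form data (a mapping from id to priority) and the photos table are assumptions made for this example.

```python
import sqlite3


def parse_priorities(params):
    """Turn {'<id>': '<prio>', ...} into a list of (id, prio) integer pairs."""
    return [(int(item_id), int(prio)) for item_id, prio in params.items()]


def check_table_name(table):
    # Table names cannot be bound as SQL parameters, so only accept plain identifiers.
    if not table.isidentifier():
        raise ValueError("invalid table name: %r" % table)


def write_order(conn, table, pairs):
    """Store the priority of each item, one row at a time."""
    check_table_name(table)
    for item_id, prio in pairs:
        conn.execute(
            "UPDATE %s SET priority = ? WHERE id = ?" % table, (prio, item_id)
        )
    conn.commit()


def ordered_ids(conn, table):
    """Return the ids of a table sorted by priority."""
    check_table_name(table)
    rows = conn.execute("SELECT id FROM %s ORDER BY priority" % table)
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, priority INTEGER)")
conn.executemany("INSERT INTO photos (id, priority) VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 3)])

# Hypothetical form data: move photo 3 to the front.
pairs = parse_priorities({"3": "1", "1": "2", "2": "3"})
write_order(conn, "photos", pairs)
```

Nothing in write_order or ordered_ids knows about photos. The same calls would work for comments, products or menu items, as long as the table has an id and a priority column.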

Without data structures and general algorithms, we need to create a new solution for this problem every time. Maybe even multiple times in the same codebase.

The reason is that we don't look further than the libraries and frameworks that are handed to us. In the person example above we use the $_POST variable to get the values. This works, and there seems to be no reason to look further: the code is already very simple.

The problem is that if we don't look further, we don't evolve the craft and we have to solve the same problems over and over again. The even bigger problem is that if we don't create data structures to work with, we can't even see the way out of the mess we're in.

Update: Added a new article about the PHP example.

Decentralized services, like email, have a big advantage for users. I can send email to anyone that I have the email address of. The advantage for me is that even if my friends use a different email provider, I can still send them emails. It would be a huge problem if I'd need an email account with every provider that I wanted to send emails to. That I can use one account to send email to anyone that reads email is an advantage for me.

Let's assume for a moment that you could only send email to the people at the provider where you're a customer. My friends would need to be customers at the same provider as well if I wanted to send an email to them. If I wanted to send email to someone completely different, maybe at the other side of the globe, it would probably be impossible. It's important that everyone you want to send email to is a customer at the same provider as you.

Now let's assume that another provider starts to offer a new and cool email-related service, maybe they send a free song every week. Now a few of my friends will move to this new service, they like the free songs. And I can't send email to them anymore. So, now I have to move as well. And then a few more people move. And now the parents move and the grandparents move. They don't like the songs, but they want to send email to their children and grandchildren.

Every time a new and better provider appears people will move, first slowly, but after some time, faster. The centralized structure of this email system makes people move in groups and because some people are part of multiple groups, they'll make other groups move as well, until everyone has moved.

Another disadvantage that appears is that a centralized system needs to be one size fits all. And while this somewhat works for t-shirts, this doesn't work at all for software and websites.

Luckily email doesn't work this way, but some websites do. For example, it's impossible to have multiple Twitter providers. Their terms of service don't allow it. A new service that provides micro blogging can't work together with Twitter. So, if at some point a new service comes along and Twitter isn't the hot new thing anymore, then people will flock to the new service, because everyone is.

On the other hand, if Twitter interoperated with other services, there would be no reason for people to move, because they could still send messages to their friends and everyone they care about. The people who like Twitter would stay and the people who don't like it would move. You'd only switch if a different service better fits your needs.

David Siegel tweets:

Looking to make a positive change in your life? Disable comments on your blog.

When you want to respond to a blog post, get your own blog and write a blog post there.

Let's say you run nmap on your local box and you see an open port, but don't know which program is listening on that port. Wouldn't it be great to be able to find out? The program you need to find this information is lsof, or list open files. This program can show you a lot of information about open files, ports and directories.

Let's go back to the original question. Run the following command.

sudo lsof -i4:80

With this command we get a list of programs which listen or connect to port 80. Its output should look like this:

COMMAND     PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
apache2    3080     root    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2    3123 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2    3124 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2    3125 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2    3126 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2    3127 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2   18335 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)
apache2   18337 www-data    4u  IPv4   13546      0t0  TCP *:www (LISTEN)

This shows you all locally running processes that LISTEN on the www port. This could also show open connections to other web servers.

Now that we have two programs that parse log files, we can start to take a look at how many lines each program parses per second. First we have to make the two programs as similar as possible. In pseudocode it looks like this.

  1. Load all modules
  2. Take the start time using Time::HiRes
  3. Put the code of the program here
  4. Set line_count = 0
  5. Using stdin: loop through all lines
    1. Parse the line
    2. Set line_count++
  6. Find the time difference
  7. Divide and print line_count / time as n lines/s

In Perl this looks like:

use Time::HiRes 'gettimeofday', 'tv_interval';

my $start = [gettimeofday];

# Your program

my $line_count = 0;

while (<>) {
    # Parse one line using your software
    $line_count++;
}

my $diff = tv_interval($start);
printf "%.2f lines/s\n", $line_count / $diff;

Now run the two programs a few times and look at the parsing speed. In my case there was a big difference between the speed of the two programs. I expect a difference in your run as well.

Writing an Apache access log parser isn't that hard. Below is a parser that does just that. It creates Data::Dumper output of all the lines. No warranty.

use Data::Dumper;
use Parse::RecDescent;

$Parse::RecDescent::skip = '';

my $grammar = q{
line: ip ws '-' ws user ws datetime ws request ws status ws responsesize
            ws referrer ws useragent "\n" 
{ $return = {
            ip        => $item[1],
            user      => $item[5],
            datetime  => $item[7],
            method    => $item[9]->{method},
            url       => $item[9]->{url},
            protocol  => $item[9]->{protocol},
            status    => $item[11],
            size      => $item[13],
            referrer  => $item[15],
            useragent => $item[17],
        } }
user: '-' | /\w+/
request: '"' method ws url ws protocol '"' 
    { $return = { method => $item[2], url => $item[4], protocol => $item[6] } }
datetime: '[' date ':' time ws timezone ']' 
    { $return = $item[2] . ' ' . $item[4] . ' ' . $item[6] }
status: /\d{3}/
protocol: 'HTTP/' version
method: 'GET' | 'POST' | 'PUT' | 'DELETE'
ws: /[ ]+/
url: /\S+/
referrer: quotedstring2
responsesize: '-' | /\d+/
useragent: quotedstring2
date: day '/' month '/' year
    { $return = join('/', $item[1], $item[3], $item[5]) }
day: /\d+/
month: 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' |
    'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec'
year: /\d{4}/
time: /\d{2}:\d{2}:\d{2}/
timezone: ('+'|'-') /\d{4}/  { $return = $item[1].$item[2] }
octet: /\d+/
ip: octet ('.' octet)(3) { $return = $item[1] . '.' . join('.', @{$item[2]}) }    
version: /\d\.\d/
quotedstring2: '"' /[^"]+/ '"'   {$return = $item[2]}
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad Grammar";
while (<>) {
    my $ret = $parser->line($_) or print "Parse error\n";
    print Dumper($ret);
}

Today I tried to create a report of some basic statistics about Abacus downloads. Normally I would use grep, awk and a few other commandline tools to find a rough estimate of these numbers. However this time I needed a bit more information than these tools could give me. A problem in need of a solution.

My first question was: how many people have downloaded Abacus? The answer is

grep '/abacus/files/Abacus' | grep -v '<localip>' \
    | grep -v 'somebots' | awk '{print $1}' \
    | sort | uniq | wc -l

The pattern here is the following. First find the lines you want. Then remove the lines you don't want. Print the first field (the client) and make this list unique. I don't want to count multiple downloads from the same IP.

The next question was: where do people who download Abacus come from? For this I take the answer from the last question (without wc -l) and write it to a file. Now I can use the file as extra argument for grep like this:
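The same pattern can be written as one small function. This is only a sketch in Python: it matches plain substrings where grep matches regular expressions, and the log lines are made up for the example.

```python
def unique_clients(lines, wanted, unwanted):
    """Count distinct first fields among lines that contain `wanted`
    and none of the `unwanted` substrings."""
    clients = set()
    for line in lines:
        if wanted not in line:
            continue
        if any(pattern in line for pattern in unwanted):
            continue
        fields = line.split()
        if fields:
            clients.add(fields[0])
    return len(clients)


# Made-up log lines in the usual Apache format, for illustration only.
sample = [
    '203.0.113.5 - - [01/Apr/2011:10:00:00 +0200] "GET /abacus/files/Abacus-1.0.zip HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [01/Apr/2011:10:05:00 +0200] "GET /abacus/files/Abacus-1.0.zip HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [01/Apr/2011:11:00:00 +0200] "GET /abacus/files/Abacus-1.0.zip HTTP/1.1" 200 1024 "-" "somebots/1.0"',
    '192.0.2.9 - - [01/Apr/2011:12:00:00 +0200] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.44 - - [01/Apr/2011:13:00:00 +0200] "GET /abacus/files/Abacus-1.0.zip HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

count = unique_clients(sample, "/abacus/files/Abacus", ["somebots"])
```

The set plays the role of sort | uniq, and its size is what wc -l reports.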

grep -f abacus-downloads.txt -F logs/access.log 

This makes grep use the lines in abacus-downloads.txt as the patterns that it needs to find in logs/access.log. Now I need to find the first line where a match appears, which should contain the referrer where the person comes from. How to do that? I did the following:

  1. Download Parse::AccessLogEntry from CPAN
  2. Write a little script

This script will only print a line if it's the first line containing a client.

use Parse::AccessLogEntry;
my $p = Parse::AccessLogEntry->new();

my %hosts;

while (<>) {
    my $line = $p->parse($_);
    if (!$hosts{$line->{host}}) {
        print;
        $hosts{$line->{host}} = 1;
    }
}

I pipe the output of the previous grep through this program and now I have the lines with the referrers I'm looking for. A small improvement could be to filter out favicons because in my case one browser downloaded the favicon before it got the page itself.

Just add

next if $line->{file} =~ m{^/favicon};

at the appropriate spot. Now I need a list of the referrers from these lines. I could change the print statement in this program to do that for me. That wouldn't be the Unix way. So I wrote another small program that prints the field from the log if it's specified in the arguments.

use Parse::AccessLogEntry;
my $p = Parse::AccessLogEntry->new();
my @args = @ARGV;
@ARGV=();
while (<>) {
    my $line = $p->parse($_);
    print join("\t", map { $line->{$_} } @args) . "\n";
}

This program can be called using one or more arguments. The argument should be a key from the $line hashref, like host, user, date, time, diffgmt, rtype, file, proto, code, bytes, refer or agent.

Using refer as an argument, the program gave me a list of the referrers from the log file. Using sort | uniq -c | sort -rn on this gave me a top X list of the referrers where the people who downloaded Abacus came from.

I'm aware that this is really obvious to some of you. I link to a video here of a presentation by Mike Monteiro of Mule Design about how to get clients to pay you for your services. If you work with clients then you should watch this. It's about how it should be done. Just good advice.

Then after you enjoyed the video, you should listen to this interview of Mike Monteiro by Dan Benjamin.

In the first 5 minutes of the video The Storage Technologies Behind Facebook Messages the bald manager talks about messages and shoeboxes, and how he can't have all his messages in one place. He says: "Where is my box of letters? It's locked up in a phone, it's locked up in email. It's not in one place. Until now." The one place being Facebook. So I ask, how is it better if we lock up all our messages in Facebook?

Friday night I implemented a new feature for my Pompiedom empire for realtime RSS feeds. At some point on Saturday I realised that I had made a small mistake (it depends how you look at it, of course).

Today I thought a bit about the different parts of a Realtime RSS ecosystem and I found the following parts:

  • The cloud, a mechanism for subscription and notification of feeds. You can subscribe to a feed and the cloud will send a notification if something changes.
  • Storage, stores new feed items and provides feeds to readers. Pings the cloud if a new item is added.
  • Authoring, an interface for creating new items in a feed. Sends the new items to the storage.
  • Reading, tools for reading the feeds that can be notified by the cloud when new items appear in feeds.

Parts of this system can be combined into one program, but it's not needed, because of the protocols that are used. Each part can be a separate program and multiple programs can implement the same part.
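To make the division of work concrete, here is a small sketch of the four parts as plain Python objects in one process. The class and method names are invented for this example. In the real ecosystem the parts are separate programs that talk to each other over HTTP using RSS and rssCloud.

```python
class Cloud:
    """Subscription and notification: readers subscribe, storage pings."""

    def __init__(self):
        self.subscribers = {}  # feed URL -> list of callbacks

    def subscribe(self, feed_url, callback):
        self.subscribers.setdefault(feed_url, []).append(callback)

    def ping(self, feed_url):
        for callback in self.subscribers.get(feed_url, []):
            callback(feed_url)


class Storage:
    """Keeps feed items, serves feeds, and pings the cloud on new items."""

    def __init__(self, cloud):
        self.cloud = cloud
        self.feeds = {}  # feed URL -> list of items

    def add_item(self, feed_url, item):
        self.feeds.setdefault(feed_url, []).append(item)
        self.cloud.ping(feed_url)

    def get_feed(self, feed_url):
        return list(self.feeds.get(feed_url, []))


class Author:
    """Authoring: creates items and sends them to storage."""

    def __init__(self, storage, feed_url):
        self.storage = storage
        self.feed_url = feed_url

    def post(self, text):
        self.storage.add_item(self.feed_url, text)


class Reader:
    """Reading: fetches a feed again when the cloud says it changed."""

    def __init__(self, storage):
        self.storage = storage
        self.seen = []

    def notify(self, feed_url):
        self.seen = self.storage.get_feed(feed_url)


cloud = Cloud()
storage = Storage(cloud)
reader = Reader(storage)
cloud.subscribe("http://example.com/rss.xml", reader.notify)
Author(storage, "http://example.com/rss.xml").post("Hello, realtime RSS")
```

Each class only knows the one neighbour it talks to, which is why any of them could be swapped for a different program that does the same job.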

I'm not sure when it happened, but yesterday and part of today I found myself in the company of a blinking cursor. Normally blinking cursors are not a big problem. The thing is, however, that when I'm writing code, or text in my favorite text editor, I want to know where the next character I type will land on the screen. With this blinking cursor, it seemed my mind started to blink in unison. Not very useful when trying to write.

Apparently in a previous update, Vim (or probably the Terminal) started respecting the preferences for the blinking cursor. It could also be that I was experimenting with a setting for the $TERM environment variable. A few days ago I set it to gnome-256color, which enabled 256 colors in the Terminal, something I had been trying to get for some time.

Now, how to get rid of this blinking cursor? Go to System > Preferences > Keyboard > General > Cursor Blinking. Uncheck the checkbox named Cursor blinks in text fields. This works in Ubuntu 10.10 and probably in other versions as well.

Get rid of the blinking cursor

Sometimes you need to create a large tree of subdirectories. But why? Two examples that I think of are structured directories for weblogs, e.g. /[year]/[month]/[day]/, or the automatic backing up of files e.g. invoices/[company]/[year].

Before you begin, you know that using a split and chdir, or some other combination of built-ins, will just make a big mess. A call to mkdir -p could also work, but let's use the available modules this time.

If you use Perl there is always the CPAN that can help you. So, also this time. Enter File::Path.

use File::Path 'make_path';
make_path('posts/2011/04/04');

This will create this structure below the current directory. Simple. There are many of these modules hidden (or less hidden) in the CPAN. They like the light.

A list of Firefox keybindings. I couldn't find this list, because I was searching for key bindings.

Yesterday I created a small program called Windowpipe. It allows you to drop files in a window and it will run a script on that file. To give you a better idea of how this works, I created a small screencast.

So what's happening? First I drag the shoes.jpg file to the windowpipe window, which I started before with createthumb.pl. After the drag a new file is created called thumb_shoes.jpg. Then I show the two files. Simple.

The premise is simple: create a window that runs a script on the items that were dropped on the window. Simple idea, simple program. I took a first crack at it and uploaded the code.

The first release just runs a script and passes the dropped thing.

I have been building a system of posting short posts on top of RSS and rssCloud. This system helps people to create link blogs and short posts about anything they like. Now I was thinking maybe we can build a programming news network on top of this system. There are already two systems that I know of that work with this system. One is my own, called Pompiedom. The other is the system created by Dave Winer.

This news network should be only about programming: languages, tools, code, software and similar things. Just stuff programmers are interested in, just the stuff that bores normal people.

Programmers writing for programmers, a way to stay up to date and become better at programming. No startups, companies, VC or business stuff, just programming. All source open and free. Ready for anyone to use, improve and give away.

How do we make this happen?

A big part of the software is ready to support this, and if you're interested I can help set things up. I will free all the software that I create for this project and a big part is already opened up on GitHub.

The parts:

The last part is still a work in progress and needs some more work. I will release that code shortly.

Interested?

Contact me

I like it that I have to deal with every item in Google Reader. I think it is a good way to do a river of news, especially if it is sorted oldest to newest. Dealing with these items is really simple.

Google Reader sorts my items from oldest to newest and I view every item in turn. To go to the next item I press 'J'. This way I can skim every item. If I find something that I want to read (especially if it's longer) I open it in a new tab.

URLs are the addresses of the internet. Each URL points to a location and your user can go there. However on the internet the addresses are not that important to the users and browsers. Especially since we can read almost any page on such a location. So the location doesn't matter as long as we find what we expect.

The URL and the link text both build an expectation in us. The text and the URL will imply what we will find after we click the link. I think it makes sense to write links without showing the URL.

For example when I wrote some code and pushed it to GitHub, I could write it like this:

I posted my new code to GitHub: http://github.com/pstuifzand/pompiedom-river.

Or I could write:

I just released a new version of Pompiedom, an rssCloud-enabled realtime river.

The second example explains what you will find when you click the link. Another reason for writing text instead of URLs is that the layout engine has more to work with when breaking lines, especially when using justified text.

find . | awk -F/ '{ print $2 }' | sort | uniq |\
   grep '^\.' | sed -e 's/^./+./'