In the article about why we don't use more data structures in web
applications I showed an example of PHP code. I copied the code
below. The example shows how we start to write web applications. I will rewrite the
code so it says one thing: insert a person into the database.
Two things happen in this example. First we get the variables from the
$_POST variable. And then we insert the values into the database.
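A minimal sketch of that starting point, assuming the form posts the fields name, street and city. The ArrayObject that stands in for the database connection, the stub body of db_person_insert and the sample values are hypothetical and only there to make the sketch runnable:

```php
<?php
// Stand-in for the real function: stores the row instead of running SQL.
function db_person_insert($db, $name, $address, $city) {
    $db[] = array('name' => $name, 'address' => $address, 'city' => $city);
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$db = new ArrayObject();

// Step one: get the variables from $_POST.
$name    = $_POST['name'];
$address = $_POST['street'];
$city    = $_POST['city'];

// Step two: insert the values into the database.
db_person_insert($db, $name, $address, $city);
```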
First I'll add the 'code' of the db_person_insert function as well.
<?php
function db_person_insert($db, $name, $address, $city) {
    // some SQL code...
}
?>
The function accepts four parameters: one database connection, a name, an
address and a city. The last three arguments are related because they talk
about the same person. The code however doesn't show us that. Let's make a
change so it does.
<?php
function db_person_insert($db, $person) {
    // some SQL code...
    // use $person['name'], $person['address'] and $person['city']
}
?>
The interface of the function is much cleaner now and doesn't have to change
when we change what data we want to know about people. In the previous
article I showed what happened when we added the phone number. Now only the
implementation of the function needs to change.
Let's go back to the first example. This code has to change to work with the
new db_person_insert function.
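A sketch of the adapted calling code, assuming the form posts name, street and city. The stub body of db_person_insert, the ArrayObject used as a fake connection and the sample values are hypothetical:

```php
<?php
// Stand-in for the real function: stores the person instead of running SQL.
function db_person_insert($db, $person) {
    $db[] = $person;
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$db = new ArrayObject();

$name    = $_POST['name'];
$address = $_POST['street'];
$city    = $_POST['city'];

// The three values now travel together as one argument.
db_person_insert($db, array(
    'name'    => $name,
    'address' => $address,
    'city'    => $city,
));
```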
This code is already better, because it shows that there is only one
argument that is used as a whole. But there is another change we can make:
the three variables at the top refer to the same person. Why don't we have
one variable representing that person?
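With one variable for the person, the code could look roughly like this. It is a sketch under the same assumptions as before: the field names name, street and city, the stub db_person_insert and the ArrayObject connection are hypothetical.

```php
<?php
// Stand-in for the real function: stores the person instead of running SQL.
function db_person_insert($db, $person) {
    $db[] = $person;
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$db = new ArrayObject();

// One variable that represents the person.
$person = array();
$person['name']    = $_POST['name'];
$person['address'] = $_POST['street'];
$person['city']    = $_POST['city'];

db_person_insert($db, $person);
```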
The structure of the program becomes a bit clearer again. Let's take it one
step further and create a function that gets the values from the post
variable.
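A sketch of that step. The name person_from_postarray and the variables $arr and $person come from the text; the field names, the stub db_person_insert and the ArrayObject connection are assumptions for illustration:

```php
<?php
// Stand-in for the real function: stores the person instead of running SQL.
function db_person_insert($db, $person) {
    $db[] = $person;
}

// Builds a person from the posted form values.
function person_from_postarray($arr) {
    $person = array();
    $person['name']    = $arr['name'];
    $person['address'] = $arr['street'];
    $person['city']    = $arr['city'];
    return $person;
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$db = new ArrayObject();

// The calling code now reads as: get the person, insert the person.
$person = person_from_postarray($_POST);
db_person_insert($db, $person);
```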
This still isn't the best version of the code, because there are still a few
pieces of duplication. The person_from_postarray function is a bit sad,
because it only copies particular fields from $arr to the new
array $person. So let's rename the function.
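A sketch of the renamed function, assuming the new name is person_copy. Only the name changes, but the function no longer pretends to be about the post array, so it can be fed any array with the right fields. The second call uses a hypothetical database row to show that:

```php
<?php
// Same body as before; only the name changed (the name is an assumption).
function person_copy($arr) {
    $person = array();
    $person['name']    = $arr['name'];
    $person['address'] = $arr['street'];
    $person['city']    = $arr['city'];
    return $person;
}

// Hypothetical inputs: a form submission and a row from some other source.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$row   = array('id' => 42, 'name' => 'Bob', 'street' => 'Canal 2', 'city' => 'Delft');

$from_form = person_copy($_POST);
$from_row  = person_copy($row);   // extra fields such as id are left out
```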
This piece of code is more general than the previous version, because it can
be used for more problems, while it still makes sense. Now we change it
to look more like a function that copies a person. Here is the first step.
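One possible shape for such a first step is to put the field names in a table and loop over it. This is a guess at an intermediate version, not a reconstruction: the mapping array and the source field name street are assumptions.

```php
<?php
function person_copy($arr) {
    // person field => field in the source array (hypothetical mapping)
    $fields = array(
        'name'    => 'name',
        'address' => 'street',   // the one pair that does not line up
        'city'    => 'city',
    );

    $person = array();
    foreach ($fields as $to => $from) {
        $person[$to] = $arr[$from];
    }
    return $person;
}

// Hypothetical form submission for this sketch.
$_POST  = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$person = person_copy($_POST);
```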
This makes the structure very obvious, but there is one field that's
problematic: the street or address field. The nice thing about where we're
going with this is that we can make our lives simpler by making the fields
of the person the same everywhere. So one thing we could do is
change the street field to the address field, or the other way around.
The solution depends on what you can change. Let's say we name the field
'address' and change the web interface.
function person_copy($arr) {
    $person = array();
    foreach (array('name', 'address', 'city') as $field) {
        $person[$field] = $arr[$field];
    }
    return $person;
}
Done! This function now creates a new person based on values from another
array. By representing a person as one thing, we can simplify the code that
works with it.
Blogging, emailing and messaging aren’t owned by anybody. Tweeting is owned
by Twitter. That’s a problem.
It is one of the points I tried to make a few days ago. I
don't think we have to move away from Twitter. But I do think we need to at
least make it possible for every company (and person) to have their own
microblogging instance. And all these instances should be able to work
together.
The best solution is to make many thousands of interoperating services, just
like email. This allows each company to host their own version. And all
these services will work together. That way I can subscribe to your messages
and you can subscribe to mine. If no subscriptions exist between two
services, there are no messages transferred between them.
There is this question that keeps coming back when I see the code of web
applications. It keeps coming back, probably because it relates to the
basics of our field.
Why do we forget we learned about data structures, the moment
we start to write web applications?
The question points at three things. Why do we forget about data
structures? Why don't we write better code for web applications? The last
point is a passive-aggressive jab at all web developers. And I'm one of
them.
This question seems to come up a lot for me. Probably because more often
than not, I write web applications. Last year I wrote about this
problem and I tried to think of a different way to tackle the web
application problem. Without much progress, I might say.
Let's take a look at some of the things written in the past about the importance
of data structures and algorithms.
Rule 5. Data dominates. If you've chosen the right data structures and
organized things well, the algorithms will almost always be self evident.
Data structures, not algorithms, are central to programming. (See Brooks
p. 102.)
And of course Knuth wrote "The Art of Computer Programming", four huge
volumes on algorithms and data structures. So they must be important. Even
the title of the book from Wirth, "Algorithms + Data Structures = Programs",
says as much. Without data structures or algorithms we aren't even writing
programs.
That's a bit lazy, but I think the most important thing we can learn
from this, and we all already know this, is that data structures and
algorithms are the basic building blocks of the programs we write.
But if data structures and algorithms are so important, why don't
we use more of them in web applications?
There are two reasons for this to happen. First there is more emphasis on
the algorithms, on the how, than on the data structures. And second, many of
these algorithms and data structures are implicit.
Let's start with an example of an implicit data structure (in PHP).
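A minimal sketch of such code, assuming a form that posts name, street and city. The stub body of db_person_insert, the ArrayObject used as a fake connection and the sample values are hypothetical:

```php
<?php
// Stand-in for the real function: stores the row instead of running SQL.
function db_person_insert($db, $name, $address, $city) {
    $db[] = array('name' => $name, 'address' => $address, 'city' => $city);
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array('name' => 'Alice', 'street' => 'Main Street 1', 'city' => 'Utrecht');
$db = new ArrayObject();

// Three loose variables; nothing here says "person".
$name    = $_POST['name'];
$address = $_POST['street'];
$city    = $_POST['city'];

db_person_insert($db, $name, $address, $city);
```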
Here the data structure of the person is implicit. There are three variables
that are related. Only two things show they're related: the grouping of
the variables and the db_person_insert function. There is no mention of
something called a person, except in the name of the db_person_insert
function.
And now the example of an implicit algorithm (again in PHP).
You could think I made an error by showing you the same code. I'm sorry to
disappoint you. Let me explain. The algorithm contains two steps. (1) Get
the person object from the $_POST variable. (2) Insert the person object
into the database.
One problem with this code is that the moment you want to change what it
means to be a person in your application, you need to make many changes all
over. Let's say your form also needs to handle phone numbers. Let's make the
change in the code.
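A sketch of what that change could look like, assuming the new form field is called phone. As before, the stub body of db_person_insert, the ArrayObject connection and the sample values are hypothetical:

```php
<?php
// The signature grows by one parameter for every new field.
function db_person_insert($db, $name, $address, $city, $phone) {
    $db[] = array(
        'name' => $name, 'address' => $address,
        'city' => $city, 'phone'   => $phone,
    );
}

// Hypothetical form submission and a fake "connection" for this sketch.
$_POST = array(
    'name' => 'Alice', 'street' => 'Main Street 1',
    'city' => 'Utrecht', 'phone' => '555-0100',
);
$db = new ArrayObject();

$name    = $_POST['name'];
$address = $_POST['street'];
$city    = $_POST['city'];
$phone   = $_POST['phone'];   // the added line

db_person_insert($db, $name, $address, $city, $phone);   // the added argument
```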
I added one line to get the phone number and one parameter to the
db_person_insert function, but I also have to check every other place
where I call the db_person_insert function. You can make this change for
every field that needs to be added, but you need to make a lot of changes in
a lot of places. That sucks, but now imagine that the db_person_insert
function was actually a little bit of SQL code, sprinkled around your code.
The other problem is that it's not obvious that we are working with people
(as in more than one person). This code shows how we do it, but not
what we do.
Especially in web applications there seems to be no thought at all going
into the structure of the data that's used inside the applications. There
are some objects, or maybe there is a database, but these things are not the
actual data the program operates on.
Code in web applications feels shoddily written, without a kind of bigger
picture. For example if you display a product in the interface, why don't
you have a data structure representing a product? Or if you have a form for
creating a new product, why don't you have a data structure representing
that?
Without data structures you have to write certain code again and again. And
while this apparently is not a big enough problem to be solved once and for
all, I think some of us know there is something wrong.
Without data structures you can't have a group of operations working on a
certain kind of data. You also can't reuse code, because the data is just
different enough every time.
Let's look at a few more examples. How many times have you written a login
form and authentication? Isn't every login screen the same? The user
provides a username and password. That could be a data structure. Which user
object is identified by these two values? Could that be a function?
Or an interface to order certain objects. There is a structure to ordering
things that transcends the type of objects. For example if I want to order
images, or photo albums, or comments, or videos, or products, or menu items.
The way to do this is always the same. If the data is in memory we know how
to do this.
If I want to change the order of items in a vector, I call the swap
function. Other languages change the order with different functions.
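PHP has no built-in swap for array elements, but for data in memory it is only a few lines. A small sketch, where array_swap is a hypothetical helper and the file names are made up:

```php
<?php
// Hypothetical helper: swap two positions in a list, in place.
function array_swap(array &$items, $i, $j) {
    $tmp       = $items[$i];
    $items[$i] = $items[$j];
    $items[$j] = $tmp;
}

$photos = array('beach.jpg', 'city.jpg', 'forest.jpg');

// Move the last photo to the front by swapping the two ends.
array_swap($photos, 0, 2);
```

The helper knows nothing about photos; it would work just as well on comments, products or menu items.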
But somehow when we start to create web applications we need to create a new
solution for this. It seems almost as if there is a need for people to be
original in every line of code they write. Maybe there is something called
False Originality, in the tradition of False Laziness and friends. Or maybe
it's "Not Invented Here".
If we don't have a data structure, we can't write a function that modifies
that data. So instead of creating more general functions, we write
functions that call other functions until we get to the bottom of our
software stack, which could be the database or the file system.
Without the interface of your data structure to write a function for, there
is no way you can write code without duplication. And without a similar
looking interface you can't even see the duplication that's there. Without
the data structure we can't say that certain things are similar.
We can't describe it. We have no place to write the code.
For example, I want to create an interface that allows the user to reorder
photos. The web interface could be really simple for this example. Every
item has an id and a priority. Sort the ids according to the priorities.
The web framework calls a controller method. This controller method parses
the arguments. We can write a general function that parses the arguments
and creates a list of number pairs: [ (id, prio)... ]. I hope this
notation makes sense. It describes a list of pairs, where each pair has two
fields, id and prio. The parser function returns that list
and another function writes this order to a database, one pair at a time.
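The two functions could be sketched like this. Everything here is an assumption made for illustration: the function names, the form layout (a prio field that maps each id to its priority), the prio column and the ArrayObject that records the writes instead of a real database.

```php
<?php
// Hypothetical parser: turns array(id => prio, ...) into a list of pairs.
function parse_priorities($arr) {
    $pairs = array();
    foreach ($arr as $id => $prio) {
        $pairs[] = array('id' => (int) $id, 'prio' => (int) $prio);
    }
    return $pairs;
}

// Stand-in for the database write: one update per pair.
function db_priorities_update($db, $table, $pairs) {
    foreach ($pairs as $pair) {
        // A real version would run something like:
        //   UPDATE <table> SET prio = ? WHERE id = ?
        $db[] = array($table, $pair['id'], $pair['prio']);
    }
}

// Hypothetical form submission: photo 3 first, then 7, then 9.
$_POST = array('prio' => array('7' => '2', '3' => '1', '9' => '3'));
$db = new ArrayObject();

$pairs = parse_priorities($_POST['prio']);
db_priorities_update($db, 'photos', $pairs);
```

Neither function mentions photos, apart from the table name that is passed in, so the same pair of functions would serve albums, comments or menu items.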
The code that reorders items could be used for every table that has a
primary key and a priority field. But before that can work, we need ways to
describe data structures in the code. If we don't have data structures, we
can't write general algorithms to work with them.
Without data structures and general algorithms, we need to create a new
solution for this problem every time. Maybe even multiple times in the same
codebase.
The reason is that we don't look further than the libraries and frameworks
that are handed to us. In the person example above we use the $_POST
variable to get the values. This works and there is no reason to look
further, the code is already very simple.
The problem is that if we don't look further, we don't evolve the craft and
we have to solve the same problems over and over again. The even bigger
problem is that if we don't create data structures to work with, we
can't even see the way out of the mess we're in.
Update: Added a new article about the PHP example.
Decentralized services, like email, have a big advantage for users. I can
send email to anyone that I have the email address of. The advantage for me
is that even if my friends use a different email provider, I can still send
them emails. It would be a huge problem if I'd need an email account with
every provider that I wanted to send emails to. That I can use one account
to send email to anyone that reads email is an advantage for me.
Let's assume for a moment that you could only send email to the people at the
provider where you're a customer. My friends would need to be customers at
the same provider as well if I wanted to send an email to them. If
I wanted to send email to someone completely different, maybe at the other
side of the globe, it would probably be impossible. In such a system it's
important that everyone you want to send email to is a customer at the same
provider as you.
Now let's assume that another provider starts to offer a new and cool
email-related service, maybe they send a free song every week. Now a few of my
friends will move to this new service, because they like the free songs. And I
can't send email to them anymore. So now I have to move as well. And then a few
more people move. And now the parents move and the grandparents move. They don't
like the songs, but they want to send email to their children and
grandchildren.
Every time a new and better provider appears people will move, first slowly,
but after some time, faster. The centralized structure of this email system
makes people move in groups and because some people are part of multiple
groups, they'll make other groups move as well, until everyone has moved.
Another disadvantage that appears is that a centralized system needs to be
one size fits all. And while this somewhat works for t-shirts, this doesn't
work at all for software and websites.
Luckily email doesn't work this way, but some websites do. For example, it's
impossible to have multiple Twitter providers. Their terms of service
don't allow it. A new service that provides micro blogging can't work
together with Twitter. So, if at some point a new service comes along and
Twitter isn't the hot new thing anymore, then people will flock to the new
service, because everyone is.
On the other hand, if Twitter interoperated with other services, there
would be no reason for people to move, because they could still send messages
to their friends and everyone they care about. The people who like Twitter
would stay and the people who don't like it would move. You'd only switch if
a different service better fits your needs.
Let's say you run nmap on your local box and you see an open port, but
don't know which program is listening on that port. Wouldn't it be great to
be able to find out? The program you need to find this information is
lsof, or list open files. This program can show you a lot of information
about open files, ports and directories.
Let's go back to the original question. Run the following command.
sudo lsof -i4:80
With this command we get a list of programs which listen or connect to port
80. Its output should look like this:
Now that we have two programs that parse log files, we can start to take a
look at how many lines the program parses per second. First we have to make
the two programs as similar as possible. In pseudocode it looks like this.
Load all modules
Take the start time using Time::HiRes
Put the code of the program here
Set line_count = 0
Using stdin: loop through all lines
    Parse the line
    Set line_count++
Find the time difference
Print line_count / time difference as n lines/s
In Perl this looks like:
use Time::HiRes 'gettimeofday', 'tv_interval';
# Take the start time
my $start = [gettimeofday];
# Your program
my $line_count = 0;
while (<>) {
    # Parse one line using your software
    $line_count++;
}
# Seconds elapsed since $start
my $diff = tv_interval($start);
printf "%.2f lines/s\n", $line_count / $diff;
Now run the two programs a few times and look at the parsing speed. In my case
there was a big difference between the speed of the two programs. I expect a
difference in your run as well.
Writing an Apache access log parser isn't that hard. Below is a parser that
does just that. It creates Data::Dumper output of all the lines. No warranty.
Today I tried to create a report of some basic statistics about
Abacus downloads. Normally I would use grep, awk and a few other
command-line tools to find a rough estimate of
these numbers. However this time I needed a bit more information than these
tools could give me. A problem in need of a solution.
My first question was: how many people have downloaded Abacus? The answer is
The pattern here is the following. First find the lines you want. Then remove the
lines you don't want. Print the first field (the client) and make this list
unique. I don't want to count multiple downloads from the same IP.
The next question was: where do people who download Abacus come from?
For this I take the answer from the last question (without wc -l) and
write it to a file. Now I can use the file as extra argument for grep like
this:
grep -f abacus-downloads.txt -F logs/access.log
This makes grep use the lines in abacus-downloads.txt as the patterns that
it needs to find in logs/access.log. Now I need to find the first line
where a match appears, which should contain the referrer where the person
comes from. How to do that? I did the following:
This script will only print a line if it's the first line containing a
client.
use Parse::AccessLogEntry;
my $p = Parse::AccessLogEntry->new();
my %hosts;   # clients we have already seen
while (<>) {
    my $line = $p->parse($_);
    # Print only the first line for each client
    if (!$hosts{$line->{host}}) {
        print;
        $hosts{$line->{host}} = 1;
    }
}
I pipe the output of the previous grep through this program and now I have
the lines with the referrers I'm looking for. A small improvement could be
to filter out favicons because in my case one browser downloaded the favicon
before it got the page itself.
Just add
next if $line->{file} =~ m{^/favicon};
at the appropriate spot. Now I need a list of the referrers from these
lines. I could change the print statement in this program to do that for me.
That wouldn't be the Unix way. So I wrote another small program that prints
the fields from the log that are specified in the arguments.
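use Parse::AccessLogEntry;
my $p = Parse::AccessLogEntry->new();
# The arguments are field names, not files: empty @ARGV so <> reads stdin
my @args = @ARGV;
@ARGV = ();
while (<>) {
    my $line = $p->parse($_);
    # Print the requested fields, separated by tabs
    print join("\t", map { $line->{$_} } @args) . "\n";
}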
This program can be called using one or more arguments. The argument should
be a key from the $line hashref, like host, user, date, time, diffgmt,
rtype, file, proto, code, bytes, refer or agent.
Using refer as an argument, the program gave me a list of the referrers
from the log file. Using sort | uniq -c | sort -rn on this gave me a top X
list of the referrers that the people who downloaded Abacus came from.
I'm aware that this is really obvious to some of you. I link to a video here
of a presentation by Mike Monteiro of Mule Design
about how to get clients to pay you for your services. If you work with
clients then you should watch this. It's about how it should be done. Just
good advice.
In the first 5 minutes of the video The Storage Technologies Behind
Facebook Messages the bald manager talks about messages and shoeboxes, and
how he can't have all his messages in one place. He says: "Where is my box
of letters? It's locked up in a phone, it's locked up in email. It's not in
one place. Until now." The one place being Facebook. So I ask, how is it
better if we lock up all our messages in Facebook?
Friday night I implemented a new feature for my Pompiedom empire for
realtime RSS feeds. At some point on Saturday I realised that I had made a
small mistake (it depends how you look at it, of course).
Today I thought a bit about the different parts of a Realtime RSS
ecosystem and I found the following parts:
The cloud, a mechanism for subscription to and notification of
feeds. You can subscribe to a feed and the cloud will send a notification
if something changes.
Storage, which stores new feed items and provides feeds to readers. It pings
the cloud if a new item is added.
Authoring, an interface for creating new items in a feed. It sends the new
items to the storage.
Reading, tools for reading the feeds that can be notified by the cloud
when new items appear in feeds.
Parts of this system can be combined into one program, but it's not needed,
because of the protocols that are used. Each part can be a separate program
and multiple programs can implement the same part.
I'm not sure when it happened, but yesterday and part of today I found
myself in the company of a blinking cursor. Normally blinking cursors are
not a big problem. The thing is, however, that when I'm writing code, or
text in my favorite text editor, I want to know where the next character I
type will land on the screen. With this blinking cursor, it seemed my mind
started to blink in unison. Not very useful when trying to write.
Apparently in a previous update, Vim (or probably the Terminal) started
respecting the preferences for the blinking cursor. It could also be that
I was experimenting with a setting for the $TERM environment variable. A
few days ago I set it to gnome-256color, which enabled 256 colors in the
Terminal, something I had been trying to get for some time.
Now how to get rid of this blinking cursor? Go to System >
Preferences > Keyboard > General > Cursor Blinking. Uncheck
the checkbox named Cursor blinks in text fields.
This works in Ubuntu 10.10 and probably in other versions as well.
Sometimes you need to create a large tree of subdirectories. But why? Two
examples that I can think of are structured directories for weblogs, e.g.
/[year]/[month]/[day]/, or the automatic backing up of files, e.g.
invoices/[company]/[year].
Before you begin, you know that using a split and chdir, or some other combination of
built-ins, will just make a big mess. A call to mkdir -p could also work,
but let's use the available modules this time.
If you use Perl there is always the CPAN that can help
you, and this time is no exception. Enter File::Path.
use File::Path 'make_path';
make_path('posts/2011/04/04');
This will create this structure below the current directory. Simple. There
are many of these modules hidden (or less hidden) in the CPAN. They like the
light.
Yesterday I created a small program called Windowpipe. It allows you to drop
files in a window and it will run a script on each file. To give you a better
idea of how this works, I created a small screencast.
So what's happening? First I drag the shoes.jpg file to the windowpipe
window, which I started before with createthumb.pl. After the drag a new
file is created called thumb_shoes.jpg. Then I show the two files. Simple.
The premise is simple: create a window that runs a script on the items that were
dropped on the window. Simple idea, simple program. I took a first crack at it and
uploaded the code.
I have been building a system of posting short posts on top of RSS and
rssCloud. This system helps people to create link blogs and short posts about
anything they like. Now I was thinking maybe we can build a programming news
network on top of this system. There are already two systems that I know of
that work with this system. One is my own,
called Pompiedom. The other is the system created by Dave Winer.
This news network should be only about programming: languages, tools, code,
software and similar things. Just stuff programmers are interested in, just the
stuff that bores normal people.
Programmers writing for programmers, a way to stay up to date and become better
at programming. No startups, companies, VC or business stuff, just programming.
All source open and free. Ready for anyone to use, improve and give away.
How do we make this happen?
A big part of the software is ready to support this, and if you're interested I
can help set things up. I will free all the software that I create for this project and
a big part is already opened up on GitHub.
I like it that I have to deal with every item in Google Reader. I think it is a
good way to do a river of news, especially if it is sorted oldest to newest.
Dealing with these items is really simple.
Google Reader sorts my items from oldest to newest and I view every item in
turn. To go to the next item I press 'J'. This way I can skim every item. If I
find something that I want to read (especially if it's longer) I open it in a
new tab.
URLs are the addresses of the internet.
Each URL points to a location and your user can go there. However on the
internet the addresses are not that important to the users and browsers.
Especially since we can read almost any page on such a location. So the
location doesn't matter as long as we find what we expect.
The URL and the link text both build an expectation in us. The text and the URL
will imply what we will find after we click the link. I think it makes sense to
write links without showing the URL.
For example when I wrote some code and pushed it to
GitHub, I could write it like this:
The second example explains what you will find when you click the link. Another
reason for writing text instead of URLs is that the layout engine has more to
work with when breaking lines, especially when using justified text.