This is an article that I wrote about two years ago. I'd like to share it with you because I think it contains a few nice rules for writing better software.
When creating websites, there comes a time when you want to send some dynamically generated output to the web browser. With a CGI script this is done by printing to stdout. If you don't take some simple precautions, it is easy for crackers to take over your website.
The simple rule is: encode all output. This means that if data is going from your program to another place, you should encode it. How the data is encoded depends on where it's going. I will show this with three examples.
Rule #1: Encode all dynamic output
HTML
On a webpage with HTML it's possible for an evil person to change parts of the website. Sometimes this isn't a very big problem, like when the injected HTML only shows up on that person's own page. But when user input can end up on another user's page, there is a possibility for cross-site scripting (XSS), which is something you don't want.
Next I will show a simple PHP program with a problem.
    <html>
      <body>
        <form action="badscript.php" method="post">
          Email: <input type="text" name="email" />
          <input type="submit" value="Mail me!" />
        </form>
      </body>
    </html>
This renders as a webpage with a single "Email" text field and a "Mail me!" button.
OK, so this is really simple. The next part needs a badly written PHP script.
    <?php
    // Deliberately bad: user input is echoed without any encoding.
    echo "The text you wrote is: " . $_POST["email"] . "<br />";
    echo 'Try again:
    <form action="badscript.php" method="post">
      Email: <input type="text" name="email" value="' . $_POST['email'] . '"/>
      <input type="submit" value="Mail me!" />
    </form>';
    ?>
So now let's try the script. This script lets you write some text, and the text gets printed on the webpage when you click the button. Now you can try some things. Nice examples are: normal text, some text with a bit of HTML, like bold tags, or things with quotes.
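To see what those inputs do without running the PHP script, here is a minimal sketch in Python (not the article's PHP) that builds the same `value="..."` attribute by plain string concatenation. The function name `render_unencoded` and the sample inputs are made up for illustration.

```python
def render_unencoded(email: str) -> str:
    """Mimic the bad script: paste user input straight into the HTML."""
    return '<input type="text" name="email" value="' + email + '"/>'


samples = [
    "normal text",                  # harmless
    "some <b>bold</b> text",        # markup ends up inside the attribute value
    '"><script>alert(1)</script>',  # the quote closes the attribute early
]

for sample in samples:
    print(render_unencoded(sample))
```

With the third input, the first `"` ends the `value` attribute, so the browser parses everything after it as new markup.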
To save you from some embarrassment, it's better to encode the text before the script outputs it. The parts of the text that can destroy your page are the characters that are interpreted differently in HTML than in plain text.
The characters that you should encode are:
character | encoded |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
" | &quot; |
' | &#39; |
The table above shows the order in which the characters should be encoded. The & comes first because it is part of every other entity: if it were encoded last, the & that starts an already-encoded entity would be encoded a second time. The order of the other characters doesn't matter.
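The effect of the ordering can be shown with a small sketch. It is written in Python rather than PHP, and the function names `encode_html` and `encode_html_wrong_order` are hypothetical; it only illustrates why the ampersand goes first and is not meant to replace a library routine.

```python
import html


def encode_html(text: str) -> str:
    """Encode the five special characters, ampersand first."""
    replacements = [
        ("&", "&amp;"),  # must come first
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ]
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text


def encode_html_wrong_order(text: str) -> str:
    """Same idea, but with the ampersand last (a deliberate bug)."""
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text.replace("&", "&amp;")


print(encode_html("<b>Tom & Jerry</b>"))  # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
print(encode_html_wrong_order("<b>"))     # &amp;lt;b&amp;gt;
# For text without quotes the result agrees with the standard library.
print(encode_html("<a & b>") == html.escape("<a & b>"))  # True
```

In the wrong order, the `&` of the freshly produced `&lt;` is encoded again, so the browser shows the literal text `&lt;b&gt;` instead of `<b>`.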
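In PHP the built-in htmlentities function does this encoding, so the fixed script (goodscript.php) becomes:

    <?php
    // ENT_QUOTES also encodes single quotes, matching the table above.
    $email = htmlentities($_POST["email"], ENT_QUOTES, "UTF-8");
    echo "The text you wrote is: " . $email . "<br />";
    echo 'Try again:
    <form action="goodscript.php" method="post">
      Email: <input type="text" name="email" value="' . $email . '"/>
      <input type="submit" value="Mail me!" />
    </form>';
    ?>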
When you program PHP for a living, it's good to know that magic quotes is a hack, and that it will break your page. The nice thing, however, is that you don't need magic quotes to survive injection attacks. By knowing what, when and how to encode, all problems with injection attacks can be thwarted.
MySQL
A MySQL query is output just like HTML is, but instead of going to the web browser it goes to a MySQL server. With MySQL the rules are even simpler than with HTML.
Rule #2: Every ' that should be inserted into the database, should be replaced with \'.
Note that this is a specialization of rule #1: only the rules for encoding MySQL differ from the rules for encoding HTML. It's all about the special characters. The ' is a special character in MySQL. Because \ is the escape character, a literal \ in the data has to be escaped as well (as \\); otherwise input that already contains \' would undo the replacement. Other databases can have other special characters.
The easiest way to get rid of special characters in SQL queries is by using prepared statements. A prepared statement is an SQL query that is 'prepared' before it is used. In a prepared statement you can use the question mark to mark a place where a variable will be inserted.
In PHP, prepared statements can be found in the mysqli module, by means of the mysqli_prepare function. In Perl the DBI module is all you need, and in Java you should take a look at the PreparedStatement class. All classes, modules and functions work in a similar way.
- Prepare a MySQL query; use ? where variables should go.
- Use the statement many times.
As you can see, it's really simple. In Perl, prepared statements work with the prepare method.
    use DBI;

    my $conn = DBI->connect(...);

    # Find all people in Berlin.
    my $stmt = $conn->prepare(<<"SQL");
    SELECT name, city
    FROM people
    WHERE city = ?
    SQL
    $stmt->execute('Berlin');

    while (my $row = $stmt->fetchrow_arrayref) {
        # do something with $row->[0] and $row->[1]
    }
By using prepare and execute, Perl will make sure that the value 'Berlin' gets quoted. Another nice side effect is that the resulting code will run faster and $stmt can be used multiple times with other arguments. This is especially useful for INSERT queries, where the query stays the same but the arguments change. The prepared statement is faster because the query doesn't have to be parsed every time the statement is executed.
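The same pattern of one query with changing arguments can be sketched in Python. This is an illustration under assumptions, not the DBI API: it uses the standard library's sqlite3 module with an in-memory database as a stand-in for a MySQL server, and the sample rows are invented. sqlite3 has no explicit prepare call; the driver prepares and caches the statement behind execute, so the sketch shows the ? placeholder idea rather than the exact prepare/execute split.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")

# One INSERT statement, reused with different arguments.
insert_sql = "INSERT INTO people (name, city) VALUES (?, ?)"
rows = [
    ("Anna", "Berlin"),
    ("O'Brien", "Dublin"),  # the quote needs no manual escaping
    ("Karl", "Berlin"),
]
conn.executemany(insert_sql, rows)

# The SELECT from the Perl example, run twice with different arguments.
select_sql = "SELECT name, city FROM people WHERE city = ? ORDER BY name"
berlin = conn.execute(select_sql, ("Berlin",)).fetchall()
dublin = conn.execute(select_sql, ("Dublin",)).fetchall()

print(berlin)  # [('Anna', 'Berlin'), ('Karl', 'Berlin')]
print(dublin)  # [("O'Brien", 'Dublin')]
```

The name O'Brien goes into the database unchanged because the value never becomes part of the SQL text; the driver passes it separately from the query.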
Input
This section probably shouldn't be in this article, but as long as there isn't a better place, it will stay here.
Rule #3: Keep the input as close to the original as possible
This means that you shouldn't encode the input from the user in any way if it isn't necessary. For example, a piece of text typed by a user shouldn't be HTML-encoded when it's inserted into the database. If you do encode the text at that time, it's a lot harder to use the text in another medium later on, like plain text or PDF, because those expect another format.
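A minimal sketch of this rule, in Python and with hypothetical names (`stored_comment`, `as_html`, `as_plain_text`): the text is kept exactly as typed, and each output medium applies its own encoding at the moment of output.

```python
import html

# The text exactly as the user typed it; this is what gets stored (rule #3).
stored_comment = 'I <3 "quotes" & ampersands'


def as_html(text: str) -> str:
    """Encode for an HTML page at output time (rule #1)."""
    return "<p>" + html.escape(text, quote=True) + "</p>"


def as_plain_text(text: str) -> str:
    """Plain text (for example an e-mail body) needs no HTML encoding."""
    return text


print(as_html(stored_comment))        # <p>I &lt;3 &quot;quotes&quot; &amp; ampersands</p>
print(as_plain_text(stored_comment))  # I <3 "quotes" & ampersands
```

Had the HTML-encoded version been stored instead, the plain-text output would contain `&lt;3` and would first need decoding.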
Rule #4: Always check user input.
Don't check the input for possible encoding problems like quotes or angle brackets. But do check for the empty string (no value), or for digits only (when expecting a number). The simplest way to check these kinds of things is with regular expressions.
The number check in Perl:
    my $number = $cgi->param('id');
    if ($number !~ /^\s*(\d+)\s*$/) {
        $error = "id should be a number";
        return;
    }
    $number = $1;
This check finds out if there is a number. It can contain optional whitespace at the front and at the end. Also notice the caret (^) and the dollar sign ($), which make sure it's only a number. Otherwise it could match a number in a string like "ad3df" (3), which would probably be the wrong thing.
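The difference the anchors make can be checked with a short sketch. It uses Python's re module instead of Perl, and the names `ANCHORED`, `UNANCHORED` and `parse_id` are made up for this illustration.

```python
import re
from typing import Optional

# Anchored: the whole string must be digits with optional surrounding whitespace.
ANCHORED = re.compile(r"^\s*(\d+)\s*$")
# Unanchored (the mistake): matches digits anywhere in the string.
UNANCHORED = re.compile(r"(\d+)")


def parse_id(raw: str) -> Optional[int]:
    """Return the number, or None when the input is not a number."""
    match = ANCHORED.match(raw)
    if match is None:
        return None
    return int(match.group(1))


print(parse_id(" 42 "))                     # 42
print(parse_id("ad3df"))                    # None
print(UNANCHORED.search("ad3df").group(1))  # 3
```

The empty string is rejected as well, because `\d+` requires at least one digit.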