Jeff on Coding Horror writes:
Among programmers of any experience, it is generally regarded as A Bad Idea (tm) to attempt to parse HTML with regular expressions.
You should read the rest of the article. I mostly share Jeff’s opinion on this. Additionally, I think that parsing HTML is a bad idea in the first place.
One of the examples Jeff gives for parsing HTML is sanitizing user input, such as comments that will be shown on the website. I think allowing people to write HTML and putting it on a web page is the wrong way to go.
If you want to include user input in a web page there are only two ways to do it:
1. Encode all special HTML characters. You can start with <, >, &, " and '. If you encode these, people can send all the HTML they want, but it is encoded and will not affect your web page. For Perl I recommend the HTML::Entities module.
2. Allow HTML verbatim, but only on a page that is the user’s own. If they want to break their own web page, they should be allowed to do that. This doesn’t mean profiles on websites, but actual websites that are completely their own. There you declare that the input is HTML and use it verbatim.
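To make the encoding option concrete, here is a minimal sketch in Python rather than Perl. It uses the standard library function html.escape, which covers the five characters listed above; the function name encode_comment is only a hypothetical helper for illustration.

```python
import html


def encode_comment(text: str) -> str:
    """Encode <, >, &, " and ' so the text is displayed, not interpreted."""
    # quote=True makes html.escape encode both quote characters
    # in addition to <, > and &.
    return html.escape(text, quote=True)


if __name__ == "__main__":
    comment = '<script>alert("hi")</script>'
    # prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
    print(encode_comment(comment))
```

Whatever HTML the user submits ends up on the page as visible text, so it cannot change the markup around it.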
If you want to offer some kind of formatting for the input, you can use a markup language. Choose one that lets you specify as much formatting as you like. This could be an HTML-like language that transforms only the tags you want; the rest of the text should be HTML-encoded.
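One possible reading of such an HTML-like language, sketched in Python: encode everything first, then turn back only a small whitelist of attribute-free tags. The whitelist ALLOWED_TAGS and the function name render_comment are assumptions made for this example, not part of any library.

```python
import html

# Hypothetical whitelist of formatting tags users may write.
ALLOWED_TAGS = ("b", "i", "em", "strong")


def render_comment(text: str) -> str:
    """Encode all input, then restore only whitelisted, attribute-free tags."""
    # Step 1: encode everything, so no user-supplied HTML survives.
    encoded = html.escape(text, quote=True)
    # Step 2: restore the exact encoded forms of the allowed tags.
    # Tags with attributes (e.g. <b onclick="...">) do not match
    # and therefore stay encoded.
    for tag in ALLOWED_TAGS:
        encoded = encoded.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        encoded = encoded.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return encoded


if __name__ == "__main__":
    # prints: <b>bold</b> &lt;script&gt;x&lt;/script&gt;
    print(render_comment("<b>bold</b> <script>x</script>"))
```

This never parses the HTML; it only does plain string replacement on text that is already encoded. The sketch does not check that tags are balanced, so an unclosed <b> would still pass through. A real implementation would probably want to handle that.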