Clean your HTML inputs or the dog-eaters will get to you

Last month, I took a short break from my computer and went on a holiday. When I came back I was surprised to find that, while I was on the beach, Google had sent quite a few people looking for underground Korean adult movies to my web log. I don’t know what is so special about the Korean illegal film industry, but considering that they also eat dogs there, it must be something very interesting to watch. I guess that you can find anything on the Internet these days, but why were they looking for it on my web site? The answer to that question turned out to be another great example of why inputs should be sanitised, no matter how unimportant they seem.

I use WordPress for my blog, and so far I am relatively satisfied with it. As a very popular online software package, it does get attacked a lot, and security updates are released every once in a while. My site was hacked last year and the bastards dumped a bunch of hidden porn site links in about twenty articles, which took me a few days to clear up. So I learned the hard way that when the admin console suggests an upgrade, I should take its advice. I also added a cron task to check for a few keywords in the database and alert me if someone starts advertising limb enlargement devices for free, and since then I have had no real problems. That is, until my site became a hot spot for East Asian smut aficionados overnight.
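For illustration, a keyword check of that kind might look roughly like the Python sketch below. It is not the actual cron script: the keyword list and the function name are made up, and the posts are assumed to arrive as (id, content) pairs already fetched from the database.

```python
import re

# Hypothetical keyword list; a real one would be tuned to the spam actually seen.
SPAM_KEYWORDS = ["viagra", "casino", "enlargement"]


def find_suspicious_posts(posts, keywords=SPAM_KEYWORDS):
    """Return (post_id, matched_keywords) for every post containing a keyword.

    `posts` is any iterable of (post_id, content) pairs, for example rows
    fetched from the blog database by the cron job.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for post_id, content in posts:
        # Collect each distinct keyword found in this post, case-insensitively.
        matched = sorted({m.group(0).lower() for m in pattern.finditer(content)})
        if matched:
            hits.append((post_id, matched))
    return hits
```

A cron job could run this over all published posts and send an e-mail whenever the returned list is not empty. Note that a check like this only looks at stored content, so it would not have caught the problem described in the rest of this article.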

My first guess was that someone was simply spamming me with fake referrer headers, since there was absolutely no reason why my web site would actually appear in Google’s search results for adult movies, Korean or of any other geographic origin. Web sites use the referrer header of a request to identify where visitors are coming from. When you click on a link, the web browser sends the address of the page containing that link to the linked web site, unless you have turned that off. It is not a 100% reliable mechanism for identifying visitor sources, as some people turn that feature off and some browsers have bugs and send rubbish, but in general it works OK. With the recent surge in the number of blogs, a new kind of spamming has started to take place online. Spammers send fake requests to web sites, putting the address of the web site they are advertising into the referrer header. The rationale behind it is, I guess, to make the site owners click on the referrer link to see who is sending people to their web site.
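As a rough illustration of how a referrer header reveals what a visitor searched for, the Python sketch below pulls the search phrase out of a Google referrer URL. The function name and the assumption that the phrase travels in a "q" query parameter are illustrative, not taken from any particular log analyser.

```python
from urllib.parse import urlparse, parse_qs


def search_phrase_from_referrer(referrer):
    """Return the search phrase from a Google referrer URL, or None.

    Assumes the phrase is carried in the "q" query parameter.
    """
    parsed = urlparse(referrer)
    if "google." not in parsed.netloc:
        return None
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None
```

Because the header is supplied by the client, a spammer can put anything into it, so a result from a function like this only says what the request claimed, not where the visitor really came from.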

But there were quite a few of those requests, many more than with typical spam. The visitors were led to the search web page, and did not look at any other page after that, which could be explained by the fact that they were probably disappointed to find only clips of an ugly bald guy talking about agile acceptance testing instead of their favourite underground adult stars. However, along with the web page they downloaded images, CSS and JavaScript files, which spammers typically don’t do. I did not know which article actually brought the unexpected guests, since only the search page was affected. The database lookup did not help either; luckily, this time it seemed that the site was not hacked.

I tried out the query on Google, just for fun, and was absolutely amazed to see that my site was third on the list. Sure enough, my search page was there. I simply had to click on that to see what would happen, and a few seconds later I was looking at a spam web site. My web logs showed a hit from Google again, but I was not looking at my site. Clicking on the “cached” link on Google led to the same outcome. I grabbed the page using wget, which definitely would not jump out directly, and there I found the words “korean underground adult movies”, but only after the “There are no results for…” phrase. More interestingly, after that, there was an HTML image tag with “” as the source, and an onError event redirecting people to the spam web site. When the page loaded, the browser could not find the picture to load and fired the onError event, sending the visitors from my web site to some place where they could probably watch something more to their liking. Not a bad trick at all!
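The mechanics of the trick can be sketched in a few lines of Python. This is not the WordPress PHP code: the rendering function is a deliberately vulnerable stand-in, and the payload, including the spam.example address, is a made-up approximation of what the spammers submitted.

```python
from html.parser import HTMLParser


def render_no_results(search_phrase):
    # Vulnerable on purpose: the phrase is pasted into the markup unchanged.
    return "<p>There are no results for " + search_phrase + "</p>"


class TagCollector(HTMLParser):
    """Records every start tag, roughly the way a browser would see them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


# Hypothetical payload: keywords followed by an image tag that cannot load.
payload = (
    "korean underground adult movies "
    "<img src=\"\" onerror=\"window.location='http://spam.example/'\">"
)

page = render_no_results(payload)
collector = TagCollector()
collector.feed(page)

# The submitted text is now a real <img> element with an onerror handler.
print(collector.tags)
```

The parser reports an img element carrying an onerror attribute, which is exactly what a browser would act on: the search phrase has stopped being text and has become part of the page.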

God knows how they got Google to index my web page with both their keywords and the redirection tag as a search phrase, but they did. And it’s not only my blog; there are a few thousand other sites with the same problem. Search Google for “onerror freeimagew” to see them. The results containing </title> in the site name will probably redirect you automatically to the spam site.

The problem was that my blog just dumped out whatever people put into the search form when it could not find any relevant posts. The input string was properly sanitised before it was sent to the database, and WordPress generally cleans hostile content out of all user-submitted comments, but it looks like they did not think of someone using the search form to hack the web site. In any case, I just changed the theme’s search.php file to print “Sorry, no posts matched your criteria” when there are no results, and that fixed the problem. A proper solution would be to strip out HTML tags from the search phrase, but I was too lazy to look for all the places where the phrase could be set.
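A minimal sketch of a more thorough fix, again in Python rather than the PHP that WordPress uses: instead of stripping tags, it escapes the phrase just before printing it, which is a slightly different technique with the same goal. The function name and the payload are hypothetical.

```python
import html


def render_no_results_safely(search_phrase):
    # Escape <, >, & and quotes so the phrase can only ever be shown as text.
    safe_phrase = html.escape(search_phrase, quote=True)
    return "<p>There are no results for " + safe_phrase + "</p>"


# Hypothetical payload of the kind described above.
payload = (
    "korean underground adult movies "
    "<img src=\"\" onerror=\"window.location='http://spam.example/'\">"
)

safe_page = render_no_results_safely(payload)
print(safe_page)
```

Escaping at the point of output has the advantage that it does not matter how many places the phrase can be set from; whatever arrives is displayed literally and the browser never sees an image tag.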

In any case, this is one more example of how important it is to filter and sanitise everything put in by web site users, regardless of how safe it may seem, and never to print it back on the web site without checking for potential problems.

Image credits: Sonja Gjenero

I'm Gojko Adzic, author of Impact Mapping and Specification by Example. My latest book is Fifty Quick Ideas to Improve Your Tests.

3 thoughts on “Clean your HTML inputs or the dog-eaters will get to you”

  1. Blimey, there’s an eye-opener.

    I always sanitise input that actually does anything, but I’m not sure if I have been careful enough sanitising input that could be abused in the same way as you described.


  2. The *REAL* problem is that you should never let users augment your source code. Sounds pretty basic, but of course people slip up all the time.

    In the case of SQL-insertion attacks, sanitizing the data is one approach – but it’s tedious and error prone, and often leaks back to the user (“sorry Mr O’Darn, but quotes are not allowed!”). A far better approach is to use parameter markers in the SQL (“update table set name = ? where key = ?”) and then the entire issue goes away. As does floating-point precision. And date/time formatting. And performance goes up! (except possibly in Java, but that says more about Java than about SQL).

    Problem is, there’s really no way to do this ‘right’ with HTML! For this reason, I think HTML is fundamentally flawed – there’s no systemic way to do it right which would solve it once and for all; the best you can do is keep patching system after system for exploit after exploit ad infinitum.

  3. …oops, meant to say FP precision LOSS, and date/time formatting issues (like between databases vs C locales vs system settings, etc)
