Monthly Archives: November 2009

Semantic Web, Spam on Steroids and the Importance of an Authenticated Crawl

Robots.txt is old. While there have been extensions to the protocol, we will need a major update in short order.

Why, you ask?

Because if you thought spam was a problem with Web 1.0, when rogue crawlers just grabbed huge chunks of your site and republished it, competing with you in search engines on your own content, welcome to the Semantic Web, a.k.a. Web 3.0.

Spiders that can crawl your content and UNDERSTAND it will bring about a whole new world of pain.

I agree with Incredibill, one of the top experts in blocking bad bots, when he says he's all for an authenticated crawl. The idea has been brought up before, to lukewarm public support, mostly, I think, because people don't understand how much bad bots hurt.

If you wonder how bad it is, ask yourself this: why do robots spoof Googlebot if they aren't walking away with something valuable of yours?

You can't just give away your merchandise (i.e. your content) to anyone who asks without at least asking the robbers to identify themselves.

The search engines supposedly do have a method to prove the crawler on your site is for real. It's called a forward/reverse DNS check.

Last I checked, Yahoo! and Microsoft agreed to support it. But people have found legitimate robots from one of them that weren't on the list, so you can end up blocking a legitimate bot. In other words, this solution isn't trustworthy or serious… Maybe Google will introduce a new bot tomorrow and the engineer working on it never heard of Matt Cutts's post… That happens when there are thousands of employees in your organization.
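For what it's worth, here's a minimal sketch of that forward/reverse DNS check in PHP. The googlebot.com/google.com suffixes are Google's; each engine has its own hostnames, so adjust accordingly. This is illustration, not production code:

<?php
// Rough sketch of forward/reverse DNS verification for a claimed Googlebot.
function isRealGooglebot($ip)
{
    $host = gethostbyaddr($ip);          // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                    // no PTR record for this IP
    }
    // The hostname should end in googlebot.com or google.com
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward lookup must point back at the original IP
    return gethostbyname($host) === $ip;
}

// Example: refuse spoofers claiming to be Googlebot
if (stripos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false
    && !isRealGooglebot($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}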

What's needed is a stronger protocol, with authentication layered on RESTful principles of content negotiation. You want my robots.txt? When you visit, give me a unique key along with your URI so I can check you out.

We can then build backend tools that will identify the robot that visited, how many pages they took and how often they visited. Perhaps I can state that bots I haven’t explicitly given permission to can visit, but they only get N pages. I trust you. But only a little.
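To make the idea concrete, here's a purely hypothetical sketch of what a gatekeeper like that might look like. The X-Crawler-Key header, the key table, the page budgets and the two helper functions are all inventions for illustration; no such protocol exists today:

<?php
// Hypothetical "authenticated crawl" gatekeeper. Everything here is made up
// to illustrate the idea: keys, budgets and helpers are placeholders.

function pagesServedToday($botName)
{
    return 0; // placeholder: a real tool would query your crawl log or database
}

function logCrawlerHit($botName, $uri)
{
    error_log("crawl: $botName fetched $uri"); // placeholder for backend reporting
}

$knownBots = array(
    // key             => daily page budget
    'abc123-goodbot'   => 10000,
);

$key = isset($_SERVER['HTTP_X_CRAWLER_KEY']) ? $_SERVER['HTTP_X_CRAWLER_KEY'] : '';

if (isset($knownBots[$key])) {
    $botName = 'key:' . $key;
    $budget  = $knownBots[$key];
} else {
    // Unknown bots still get a taste: "I trust you. But only a little."
    $botName = 'anonymous';
    $budget  = 50;
}

if (pagesServedToday($botName) >= $budget) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}

logCrawlerHit($botName, $_SERVER['REQUEST_URI']);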

There’s no reason I need to publicly publish my robots.txt file. Every bit of information will be used against you. When machines are talking to machines, the possibility of abuse grows exponentially. And I haven’t seen anyone really talk about it since the day I got involved in the Semantic Web community. I’m no expert, just a student, but I’d have thought someone would have brought it up!

I'm also not an expert on authentication, so a robust method will have to be decided upon. But we do need to realize that for all the promise of the Semantic Web, there are forces who will misuse it badly, and we need to prepare ahead of time, unlike what we did with technologies like email.

Life Lesson: Pay Attention!

Do you ever wonder if there is a subtext to the life around you? That there's a hidden world you just aren't seeing? Well, if you paid more attention, you might see a little of it.

We're legally bound by fine print, but how often do we take the time to read any of it? What made me think about it was the day I decided to try to read the credits after Two and a Half Men.

I froze the screen on the DVR and started reading. It was HILARIOUS! Turns out Chuck Lorre does this for all his shows. Have you ever watched Dharma and Greg, The Big Bang Theory or one of his other shows? Next time, catch the vanity cards.

Want another example of something hidden in plain sight? Try Brett Tabke’s robots.txt file. That’s where he keeps his personal blog!

And that’s all I have to say about that.

Note: please ignore this disclaimer except in court:
By reading this post you are legally bound to be my personal slave. Any disputes arising from this agreement will be decided by an arbitrator in Los Angeles. You certify that you agree to these terms by surfing away from this page or closing your browser. If you do not agree, then stay on this webpage. But be aware that aliens might abduct you if you loiter too long on any one webpage. This is true, I heard about it on Art Bell’s show.

Faux Image Generator

Faux Image generator with PHP (inspired by Faux Columns)

Problem:

Faux Columns is a workaround by Dan Cederholm for the fact that CSS "elements only stretch vertically as far as they need to." This means we can't get a background color that runs the full height of a column without using an image.

If, like me, you prefer vi or Notepad to Photoshop, creating these background images is a pain, but not so bad that you can't live with it here and there.

However, in my copious “spare time”, I’m creating a CMS in Django with a Zend Framework front-end. The CMS will allow administrators to create their own style guide. I’d like to allow them to choose a background color, optionally with a border and have it show up as a background image for the full blown Faux Columns effect.

Note that initially I called this post Faux Image Generator FOR Faux Columns, but then I realized that as soon as the older browser(s) die off in usage, Faux Columns won’t be used as much. Yet this class does have other uses…

As they say in the Open Source Community, projects usually begin when a developer has an itch, so let's start scratching…

Solution:

Use PHP with the GD library to generate these background images on the fly, plus some Apache trickery to fully leverage caching.

I’ve created a class to do just that:

Faux Image Generator

Note that it might be hard to follow the rest of this article without seeing the code. My apologies, but I don't write for a living, I code. I will try to come back to this and walk you through the steps a bit more if people find it hard to understand, so look at this post as a first draft…
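In the meantime, here's a stripped-down sketch of the general approach, not the actual class linked above (the real one handles more cases). It only assumes PHP's GD extension:

<?php
// Minimal faux image sketch: solid background, optional border on one edge.

function hexToColor($img, $hex)
{
    // "dedede" -> GD color for this image
    return imagecolorallocate(
        $img,
        hexdec(substr($hex, 0, 2)),
        hexdec(substr($hex, 2, 2)),
        hexdec(substr($hex, 4, 2))
    );
}

function fauxImage($bgColor, $bgWidth, $bgHeight,
                   $bdLoc = null, $bdColor = null, $bdSize = 0)
{
    $img = imagecreatetruecolor($bgWidth, $bgHeight);
    imagefill($img, 0, 0, hexToColor($img, $bgColor));

    // Optional border along one edge: top|right|bottom|left
    if ($bdLoc !== null && $bdSize > 0) {
        $c = hexToColor($img, $bdColor);
        switch ($bdLoc) {
            case 'top':
                imagefilledrectangle($img, 0, 0, $bgWidth - 1, $bdSize - 1, $c);
                break;
            case 'bottom':
                imagefilledrectangle($img, 0, $bgHeight - $bdSize, $bgWidth - 1, $bgHeight - 1, $c);
                break;
            case 'left':
                imagefilledrectangle($img, 0, 0, $bdSize - 1, $bgHeight - 1, $c);
                break;
            case 'right':
                imagefilledrectangle($img, $bgWidth - $bdSize, 0, $bgWidth - 1, $bgHeight - 1, $c);
                break;
        }
    }
    return $img;
}

// Usage: emit a 5x10 #dedede PNG with a 5px #f7f7ff top border
$img = fauxImage('dedede', 5, 10, 'top', 'f7f7ff', 5);
header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);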

Issues:

Great. So now we can generate a background image on the fly. We can give it a border. But we aren’t done. If we use dynamic scripting to generate layout images, we better make sure there’s some form of caching. There are many types of caching.

There's caching we don't care about here, like a caching proxy server your ISP may run to speed up page loads, or database caches that prevent expensive queries from running over and over when the result set hasn't changed.

We can also apply a server-side cache to the image using Zend_Cache. But while that helps our server deal with repeated hits to the same image, the biggest win comes when the user grabs the image from the browser cache.
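If you do want the Zend_Cache layer, a rough sketch might look like this (the cache id, lifetime and directory are arbitrary choices, and fauxImage() is the sketch from above, not the published class):

<?php
// Sketch of caching the generated bytes with Zend_Cache (Zend Framework 1).
require_once 'Zend/Cache.php';

$cache = Zend_Cache::factory(
    'Core', 'File',
    array('lifetime' => 86400, 'automatic_serialization' => false),
    array('cache_dir' => '/tmp/fauximage-cache')
);

$cacheId = 'faux_' . md5($_SERVER['QUERY_STRING']);

if (($png = $cache->load($cacheId)) === false) {
    // Cache miss: render the image once and store the raw PNG bytes
    $img = fauxImage('dedede', 5, 10, 'top', 'f7f7ff', 5);
    ob_start();
    imagepng($img);
    $png = ob_get_clean();
    imagedestroy($img);
    $cache->save($png, $cacheId);
}

header('Content-Type: image/png');
echo $png;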

To achieve this reliably you need to fool the browsers into thinking this is an image and not a program. There are several techniques out there, but for my money, it would be most elegant to make the image follow a naming convention and be called as if it were an actual image.

This is something Apache can handle. *As a bonus, perhaps we’ll even modify the class to actually create an image with the same name! Then if Apache finds the image, it will serve it. If not, it will serve the PHP file the first time only.

So first, let's try to maximize the benefits of image caching by the browser. Per Google's suggestions, we want a far-future expiry of up to a year but no more; the directives below use 30 days (A2592000 means 2,592,000 seconds after access), which you can raise as needed.

You can read the gory details on how to set caching for images, or you may find Jeremy Zawodny's caching instructions easier to read, as I did. Depending on your needs you can set this in your .htaccess file, or in httpd.conf for sitewide deployment.

Essentially you need to add these directives (they require mod_expires to be enabled):

ExpiresActive On
ExpiresByType image/gif A2592000
ExpiresByType image/png A2592000
ExpiresByType image/jpg A2592000
ExpiresByType image/jpeg A2592000

Now that we have that set up, let's tell Apache to rewrite the request to our PHP code when it doesn't find an image. Before we do anything else, we need to decide on a file naming convention. What variables do we need?

$imgType, $bgColor, $bgWidth, $bgHeight, $bdLoc, $bdColor, $bdSize.

Apache's URL-rewriting module, mod_rewrite, needs a regex that can parse our filename. Here's the naming convention I've gone with, both in the .htaccess and in the file creation routine in the class:

{$bgColor}{$bgWidth}x{$bgHeight}{$bdLoc}{$bdColor}{$bdSize}.{$imgType}

so that would match:

/url/path/dedede5x10topf7f7ff5.png

It isn’t as easy to read as I’d like, but this is the first cut, maybe with some suggestions it can be improved.

The .htaccess looks like this:

RewriteEngine On
# Pass through anything that already exists: a non-empty file (-s),
# a symlink (-l) or a directory (-d)
RewriteCond %{REQUEST_FILENAME} -s [OR]
RewriteCond %{REQUEST_FILENAME} -l [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^.*$ - [NC,L]

#let's only support 6 char colors even though script handles 3 chars        
RewriteRule ^([a-f0-9]{6})([0-9]{1,4})x([\d]{1,5})(top|right|bottom|left)([a-f0-9]{6})([\d]{1,2})\.(png|gif|jpg)$  path/to/fauximage.php?bgColor=$1&bgWidth=$2&bgHeight=$3&bdLoc=$4&bdColor=$5&bdSize=$6&imgType=$7 [NC,L]

This .htaccess serves the image file directly if it exists, which it will from the second request onward. On the first request it passes the variables to our class, which saves the file and then outputs the image.
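Here's a sketch of what path/to/fauximage.php might do with the rewritten request. The parameter names match the RewriteRule above; the included filename, $saveDir and fauxImage() are placeholders from the earlier sketch, not the published class:

<?php
require_once 'fauximage_class.php'; // wherever you keep fauxImage() or the real class

$bgColor  = $_GET['bgColor'];
$bgWidth  = (int) $_GET['bgWidth'];
$bgHeight = (int) $_GET['bgHeight'];
$bdLoc    = $_GET['bdLoc'];
$bdColor  = $_GET['bdColor'];
$bdSize   = (int) $_GET['bdSize'];
$imgType  = $_GET['imgType']; // png|gif|jpg, already constrained by the regex

$img = fauxImage($bgColor, $bgWidth, $bgHeight, $bdLoc, $bdColor, $bdSize);

// Save a real image where the URL points, so Apache's -s check serves the
// static file on every later request. $saveDir here is just this script's
// directory; point it at wherever your image URLs actually resolve.
$saveDir = dirname(__FILE__);
$file = $saveDir . '/' . $bgColor . $bgWidth . 'x' . $bgHeight
      . $bdLoc . $bdColor . $bdSize . '.' . $imgType;

switch ($imgType) {
    case 'gif':
        imagegif($img, $file);
        header('Content-Type: image/gif');
        imagegif($img);
        break;
    case 'jpg':
        imagejpeg($img, $file);
        header('Content-Type: image/jpeg');
        imagejpeg($img);
        break;
    default:
        imagepng($img, $file);
        header('Content-Type: image/png');
        imagepng($img);
        break;
}
imagedestroy($img);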

One thing to watch out for: you don't want other sites hotlinking your PHP script to generate their own background images. Keep an eye on your logs and take action if someone figures out you're using this class.

I’d love feedback, especially if it can make this class better, so please comment.

Also, if you want to be updated on changes to the code, or just hear more about the Los Angeles Dev community, MySQL, Zend Framework, Django, PHP or the Semantic Web, consider following me on Twitter @joedevon.

*Since I wrote the first draft of this post, I’ve updated the class to create the image on first load.