Semantic Web, Spam on Steroids and the Importance of an Authenticated Crawl

Robots.txt is old. While there have been extensions to the protocol, we will need a major update in short order.

Why, you ask?

Because if you thought spam was a problem with Web 1.0, when rogue crawlers simply grabbed huge chunks of your site and republished them, competing with you in the search engines on your own content, then welcome to the Semantic Web, aka Web 3.0.

Spiders that can crawl your content and UNDERSTAND it will bring about a whole new world of pain.

I agree with one of the top experts in blocking bad bots, Incredibill, when he says he’s all for an authenticated crawl. The idea has been raised before, to mediocre public support, mostly, I think, because people don’t understand how much scraping hurts site owners.

If you wonder how bad it is, ask yourself this: why do robots spoof Googlebot if they aren’t walking away with something valuable of yours?

You can’t just give away your merchandise (i.e., your content) to anyone who asks without at least making the robbers identify themselves.

The search engines supposedly do have a method to prove that the crawler on your site is for real. It’s called forward/reverse DNS.
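Google itself documents this check for Googlebot (it’s the subject of the Matt Cutts post mentioned below): reverse-resolve the visiting IP address, confirm the resulting hostname sits under googlebot.com or google.com, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal Python sketch:

```python
import socket

def is_real_googlebot(ip_address: str) -> bool:
    """Forward/reverse DNS check for a client claiming to be Googlebot."""
    try:
        # Step 1: reverse DNS -- what hostname does this IP claim?
        hostname = socket.gethostbyaddr(ip_address)[0]
    except socket.herror:
        return False
    # Step 2: the hostname must be under Google's published crawl domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Step 3: forward DNS -- does that hostname map back to the same IP?
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip_address in forward_ips
```

A spoofer can put anything in its User-Agent header, but it can’t make your reverse lookup of its IP resolve to a googlebot.com hostname, which is why this check works where User-Agent matching fails.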

Last I checked, Yahoo! and Microsoft had agreed to support it. But some have found legitimate robots from one of them that weren’t on the list, which means you can end up blocking a legitimate bot. In other words, this solution isn’t trustworthy or serious. Maybe Google will introduce a new bot tomorrow and the engineer working on it will never have read Matt Cutts’s post. That’s what happens when there are thousands of employees in your organization.

What’s needed is a stronger protocol with authentication, using RESTful principles of content negotiation. You want my robots.txt? When you visit, present a unique key along with your URI so I can check you out.
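No such protocol exists today, so here is only a hypothetical sketch of what the server side of that handshake might look like. Everything in it, the registry, the bot URI, the shared-key HMAC scheme, is an assumption for illustration, not an existing standard:

```python
import hashlib
import hmac

# Hypothetical: keys each crawler registered with the site out of band.
REGISTERED_BOTS = {
    "https://search.example.com/bot": b"shared-secret",
}

def verify_crawler(bot_uri: str, path: str, signature: str) -> bool:
    """Check the signature a crawler sent along with its identifying URI."""
    key = REGISTERED_BOTS.get(bot_uri)
    if key is None:
        return False  # never heard of this bot; it gets nothing
    expected = hmac.new(key, f"{bot_uri} {path}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking the expected digest.
    return hmac.compare_digest(expected, signature)
```

The point isn’t this particular scheme; it’s that every request would carry a verifiable identity instead of a forgeable User-Agent string.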

We can then build backend tools that identify each robot that visited, how many pages it took, and how often it came back. Perhaps I can state that bots I haven’t explicitly given permission to may still visit, but they only get N pages. I trust you. But only a little.
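Again purely hypothetical, a per-bot crawl budget on top of the same registry could be as simple as this (the numbers and the bot URI are made up):

```python
from collections import defaultdict

DEFAULT_BUDGET = 100  # pages allowed for verified-but-unknown bots
TRUSTED_BUDGETS = {
    "https://search.example.com/bot": 100_000,  # explicitly whitelisted
}

pages_served = defaultdict(int)

def allow_fetch(bot_uri: str) -> bool:
    """Serve a page only while the bot is within its crawl budget."""
    budget = TRUSTED_BUDGETS.get(bot_uri, DEFAULT_BUDGET)
    if pages_served[bot_uri] >= budget:
        return False  # "I trust you. But only a little."
    pages_served[bot_uri] += 1
    return True
```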

There’s no reason I need to publish my robots.txt file publicly. Every bit of information will be used against you. When machines are talking to machines, the possibility of abuse grows exponentially. And I haven’t seen anyone really talk about it since the day I got involved in the Semantic Web community. I’m no expert, just a student, but I’d have thought someone would have brought it up!

I’m also not an expert on authentication, so a robust method will have to be decided on by people who are. But we do need to recognize that, for all the promise of the Semantic Web, there are forces that will misuse it badly, and we need to prepare ahead of time, something we failed to do with technologies like email.
