The robots.txt file is supposed to be a tool for keeping search engines away from directories on your web site you don't want spidered or indexed. The major search engines all claim to obey it, but warn that there may be a delay between when a robots.txt file is changed and when the spider reads and follows it. All nice and good in print, but the reality is scary.
To cut down on bandwidth use I recently listed two directories containing seldom-used message boards in my robots.txt as disallowed. Almost immediately Google began hitting those directories with the fervor of a teenage hacker. The index page alone of one received 692 hits in one day from Googlebots.
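For reference, a rule like the one described above looks something like this (the directory names here are made up, not the actual ones):

```text
# robots.txt at the site root
User-agent: *
Disallow: /board1/
Disallow: /board2/
```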
Now add that bit of info to the recent story from Reuters about hackers discovering a “wealth” of information regarding things most people don't want on the internet -- at Google.com. (I mentioned it here.) Could Google be using the robots.txt files to intentionally harvest data people want hidden?
Not scary enough for you? Well, add to that the problems Michelle Malkin, Charles Johnson and other bloggers have had getting their blogs listed on Google News. Apparently Google refused to add Conservative blogs, but has no problem adding Liberal blogs such as Wonkette or the Democrat Underground.
Then it should come as no surprise, given what I reported earlier today about the political contributions of Google employees.
Let's add it up: Google, a blatantly Liberal entity, is found to have tons of sensitive data archived on its site, and seems to be using robots.txt files to sniff out where that sensitive information is hidden. Why would they want it, and what do they plan to do with it? The last election was pretty dirty and stuff was being dug up left and right. Could Google be building a “dirt chest” of secrets to unload during the next election?
Posted by Jack Lewis at February 15, 2005 03:42 PM
» Et Tu, Google? from Kobayashi Maru
Jack Lewis weaves an additional thread into the already-ominous story about Google's apparent left-leaning track record with regards to including conservative blogs in its news section: [Read More]
Tracked on February 16, 2005 07:40 AM
» Google Abusing robots.txt? from Myopic Zeal
Frankly, I'm skeptical, but Jack Lewis has an interesting anecdotal observation and makes a case.
Let's add it up: Google a blatantly Liberal entity, is found to have tons of sensitive data archived on its site, and seems to be using the robots.txt ... [Read More]
Tracked on February 16, 2005 08:39 AM
» Google mind control (beta version) from Mazurland Weblog
I checked out Little Green Footballs for some morning inspiration and found this interesting link to an article claiming that Google is somehow suppressing links to conservative sites through their search engine. [Read More]
Tracked on February 16, 2005 10:25 AM
» Google's Bias from PeteHoliday.com
Jack Lewis plays the "what if" game and suggests that Google is building up a treasure trove of dirt to be used in future elections. He cites an alleged problem getting conservative blogs listed in Google News and the recent info that Google's emp... [Read More]
Tracked on February 16, 2005 11:23 AM
So then, the answer to this problem seems obvious: include references to all the websites that Google won't ordinarily include in your robots.txt file. That way, the Googlebots can hit away to their heart's content, and help drive traffic to the politically incorrect infidel sites.
Posted by: Alexander the Grape at February 15, 2005 08:29 PM
Hmm. Info-mining for political purposes? Hardly impossible. Given the company they keep, it should certainly be considered as a possibility.
If you don't want people to see your most sensitive, confidential data, umm... don't store it on the Internets.
Posted by: Gumby at February 15, 2005 09:57 PM
A robots.txt file is a lousy way to "hide" data that shouldn't be world-readable in the first place.
I don't believe that Google would bother doing this for political effect, but I can't think of any other reason.
For example, Mozilla uses a robots file to discourage crawlers from browsing its automatically-generated source code display (lxr.mozilla.org). Why would a search engine want to fill up its index with that kind of crap?
Posted by: dr_dog at February 15, 2005 10:09 PM
And you guys on the right love to point out how the left is full of lunatic conspiracist theories. What the f*** man. Listen to yourself for a moment.
By the way, go check the LGF archives to find the huge celebration when they complained to Google News about DailyKos and then got them unlisted. Success!
Posted by: wtf at February 15, 2005 10:13 PM
I don't work for Google, but I have some insight into how data designers think. See below.
I am a data hog and have been for some time. We collect and use every bit of data that comes our way on our systems. We track what users do and how they do it. Unless we are specifically told not to, we collect and hold onto anything and everything that comes along.
I think Google is driven in this case by Data Greed (Normal) and the need for Closure.
Google is in the business of collecting information and referencing it. I see no reason why they would not ambitiously seek out every nook and cranny. (I would.) They are even worse data hogs. This is the data greed part.
As for blowing past the robots.txt: mathematically, if you want to exclude some subset, you subtract it from the main set. But first you have to define the subset.
I don't know what the internal ethics are at Google wrt the robots.txt files, but my guess is that Google may want a list as a positive list of what NOT to put up. This enumerated exclusion list may be a better way of assuring themselves they do not publish off-limits links. Mathematically, it makes sense.
And that is how I would design it. The real problem is not crawling YOUR site, but what to do with the data on other sites pointing to the stuff under your robots.txt? How do you exclude these secondary references?
The only way is to develop an exclusion list.
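The "enumerated exclusion list" idea above can be sketched in a few lines. This is just an illustration of the set-subtraction argument, not how Google actually does it; the function names and data shapes are made up:

```python
# Sketch of an exclusion list: collect disallowed path prefixes per site
# from each robots.txt, then subtract matching URLs from the crawl set.

def build_exclusions(robots_rules):
    """robots_rules: dict mapping site -> list of disallowed path prefixes."""
    return {(site, prefix)
            for site, prefixes in robots_rules.items()
            for prefix in prefixes}

def filter_index(candidate_urls, exclusions):
    """Drop any (site, path) pair -- even one discovered via a link on a
    third-party site -- that falls under a disallowed prefix."""
    kept = []
    for site, path in candidate_urls:
        if not any(s == site and path.startswith(p) for s, p in exclusions):
            kept.append((site, path))
    return kept
```

This is the point about secondary references: the exclusion list lets you reject a URL no matter where the link to it was found.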
This may become the Internet's Dark Matter someday - we know it's out there and it's most of the information, but we just can't get to it!!
Posted by: puredata at February 15, 2005 10:13 PM
The behavior you describe is not typically associated with Google. The most logical explanation for this is something else using a google user-agent trolling robots.txt for sensitive data.
It would only take me a few seconds to write a script which does just that, and then hammers the hell out of your server.
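To show how trivial that is: a few lines of Python can claim to be Googlebot and read off exactly which paths a site's robots.txt wants hidden. This is a sketch to illustrate the point, nothing more; the host name would be whatever target the script's author picked:

```python
# Nothing stops a third-party script from sending Googlebot's User-Agent
# string and mining robots.txt for the paths a site wants hidden.
import urllib.request

FAKE_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def disallowed_paths(robots_text):
    """Pull the Disallow: entries out of a robots.txt body."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

def fetch_robots(host):
    """Fetch a site's robots.txt while claiming to be Googlebot."""
    req = urllib.request.Request("http://%s/robots.txt" % host,
                                 headers={"User-Agent": FAKE_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", "replace")
```

Once the script has that list, "hammering" the disallowed directories is just a loop over `disallowed_paths(fetch_robots(host))`.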
Posted by: Mason at February 15, 2005 10:42 PM
Did you try a reverse DNS lookup on the IP addresses which hit your site? Are they actually from Google? A user-agent is suggestive, but proves nothing.
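The check is easy to script: reverse-resolve the hitting IP, see whether the name is in a Google domain, then forward-resolve that name and confirm it maps back to the same IP. A rough sketch (the domain suffixes are the commonly cited ones for Googlebot, and real log IPs would go where the argument is):

```python
# Verify a "Googlebot" hit: reverse DNS, domain check, forward-confirm.
import socket

def is_real_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
    except OSError:
        return False
```

A User-Agent string can be typed by anyone; this double lookup is much harder to fake.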
robots.txt is supposedly there so that dynamically generated content isn't thwacked-upon by the bot, because it wouldn't add anything, and because it would put unnecessary load on the server.
in the search area and press the 'I'm feeling lucky' button
Posted by: Mark Macy at February 15, 2005 10:57 PM
If you don't want a page to be seen in the Google listings and you don't want the page to be spidered:
1) Don't have any "normal" links on other pages pointing to it
2) Put them in a password protected directory
3) If you must, link to the page(s) in question from other pages using the following code example:
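For example, something along these lines (the filename is hypothetical; rel="nofollow" is the attribute Google announced in January 2005, and the meta tag goes on the page you want kept out of the index):

```html
<!-- On the linking page: tell spiders not to follow or credit the link -->
<a href="/private/page.html" rel="nofollow">members page</a>

<!-- On the target page itself: ask engines not to index or follow it -->
<meta name="robots" content="noindex, nofollow">
```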
I think this is just a little beat up...does anyone actually think someone at google has any interest in reading the hundreds of millions of pages that people may not have wanted to have been indexed to try and find something interesting... the googlebomb stuff is interesting though... follow the wikipedia link above....
Posted by: stephen at February 16, 2005 04:16 AM
man, put some tinfoil on your hat and go buy some duct tape. how insane are you?
Posted by: john at February 16, 2005 04:47 AM
I wouldn't put it past the lefties at Googoo.
Democrats have turned into a desperate cult of hateful obstructionists.
Posted by: zvi wolfe at February 16, 2005 06:56 AM
Seriously, you people are insane. I don't even know what to say.
Posted by: wtf at February 16, 2005 08:46 AM
Keep in mind that Google, though technically a "public company," is under no obligation to be politically neutral.
Perhaps some day a more conservative version of the webcrawler will pop up, exposing people to the wisdom of Michelle Malkin above all others.
Let's face it, Google is a bunch of California nerds who are encouraged to take ping-pong breaks between coding sessions. It would not surprise me that they'd want to shut out sites that they find ideologically disagreeable. I certainly would not want to associate myself with the likes of LGF. Have you seen the kind of comments it allows?
Posted by: Johnny Mainstream at February 16, 2005 09:08 AM
Odds are your robots.txt file is incorrect. It is also possible that some other site is using a fake name and searching for files hidden by robots.txt to look for sensitive information.
"And that is how I would design it. The real problem is not crawling YOUR site, but what to do with the data on other sites pointing to the stuff under your robots.txt? How do you exclude these secondary references?
The only way is to develop an exclusion list."
Umm... no. The data is stored in a hierarchy -- robots.txt is going to either list specific files, or roots in the hierarchy... you don't need a fully enumerated list to know that something exists in a certain branch of the hierarchy and, in fact, that's probably the worst way to do it.
If all of the files in /a/ are excluded, and google wants to index /a/foo.htm ... why on earth would you want to enumerate all of the files in /a/ just to find out that foo is there? You already know it's there. Besides that, the method would require a continual re-indexing of the folder... which is what robots.txt is there to avoid.
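In code, the whole decision is a prefix test -- no enumeration of the folder's contents required. A minimal sketch:

```python
# robots.txt disallow rules are path prefixes, so deciding whether
# /a/foo.htm is off-limits never requires listing what's inside /a/.

def is_excluded(path, disallow_prefixes):
    return any(path.startswith(prefix) for prefix in disallow_prefixes)
```

Given `Disallow: /a/`, any URL under `/a/` is rejected the moment it's seen, whether it came from the site itself or from a link elsewhere.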
"it was pulling up random diaries and pasting them on the google news homepage, with the implication that it was "sanctioned" content.
Given what's sometimes written in the diaries, I was uncomfortable with that. I don't mind taking heat for things I write, but for things that other people write? I didn't want to deal with that."
Posted by: Stephen Tyson at February 16, 2005 11:58 AM
How does Google unindex previously indexed material? If Google previously indexed content that is now blocked by robots.txt, maybe it needs to see that content again to remove it from its archives?
But a smart bot should be able to delete any indexed material that then gets referenced by robots.txt.
Posted by: bill at February 16, 2005 03:35 PM
This is absurd. I'm curious as to how much you know about the underlying technology used by Google.
Posted by: Chuck at February 16, 2005 06:46 PM
I believe skynet is becoming self-aware ;)
Posted by: John Connor at February 17, 2005 01:42 AM