Google changes meaning of robots.txt unilaterally

(Carl Zimmerman) #1

For years, a simple robots.txt file has kept search engines out of two subdirectories on my Website, where there is nothing but mapping routines that are called by visible pages elsewhere on the Website, passing parameter strings. Suddenly, Google is treating those parameterized URLs as pages that it can present in search results, while claiming that the robots.txt directives prevent Google from giving a preview of the page. Worse yet, Google puts these unwanted items first in a search results list. (Try googling Wappingers and then try the same thing on DuckDuckGo or Bing or Yahoo. The first item in Google’s list is actually a link from within the second item!)

Google’s explanation of why this happens and what to do about it is classic doublespeak. On the one hand, it claims that a robots.txt file is useful to keep a Website from being overloaded with requests. On the other hand, it says that the only way to keep a Webpage out of Google searches is to put a meta statement with a “noindex” parameter into the HTML for that page and then eliminate the robots.txt file so Google can find that meta statement. What nonsense! Pretending that the existence of a URL pointing into a protected subdirectory justifies presenting that URL in search results is just plain stupid, because such URLs are the only way any Webcrawler has of even noticing that something actually exists within such a subdirectory. And following Google’s suggested “solution” not only increases the workload on the crawled Website; it also increases the workload on both the Webcrawler and the search engine, all for the purpose of presenting to a searcher something that doesn’t belong in a results list anyway!

Google ought to revert to what it had been doing until recently, and what other Webcrawlers & search engines have done all along – ignore everything that is hidden by a robots.txt directive.

If anyone complains to me that my Website doesn’t follow Google’s recommendations, they’ll get this lesson on Google’s bad judgment.

(Curtis Wilcox) #2

The Robots exclusion standard is not a directive for search result publishing; it is, as the filename suggests, a directive to web crawler behavior. The web crawling software should not retrieve a URL of a form disallowed by robots.txt but it’s fine for the search engine “fed” by the crawler to publish information from URLs it can crawl, including all links in the content of that URL. In the example you’ve given, it appears that the link shown in Google’s search results is found in at least one other page on the site which Google is allowed to index.
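To make the distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` domain, the `ExampleBot` user agent, and the `lat`/`lon` parameters are all assumptions made up for illustration; only the `data/atlas/GMapNGeocoder.html` path comes from this thread. It shows that a disallow rule answers "may I fetch this URL?" and says nothing about whether a URL discovered on a crawlable page may be listed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real file on the site may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/atlas/
"""


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots.txt permits this user agent to retrieve the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Links a crawler might discover while reading an allowed page (made-up URLs).
discovered = [
    "https://example.org/data/atlas/GMapNGeocoder.html?lat=1&lon=2",
    "https://example.org/pages/index.html",
]

# The crawler may retrieve only these...
fetchable = [u for u in discovered if may_fetch(u)]
# ...but it still knows these URLs exist, because it saw them as links.
known_but_unfetched = [u for u in discovered if not may_fetch(u)]

print(fetchable)
print(known_but_unfetched)
```

Whether a search engine then shows the URLs in `known_but_unfetched` is a policy decision made outside the exclusion standard; the sketch only demonstrates the fetch check.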

I don’t think this is a recent change in the Google search presentation. It’s rare, but I think I’ve seen links in search results like that for years. robots.txt was invented at a time, 1994, when many sites were hosted on weak hardware so the traffic from crawlers could be a significant burden. It has long been the case that crawler traffic is trivial.

Adding the rel=“nofollow” attribute to such internal site links won’t remove them from search results entirely but should lower them in the search result rankings. Of course, other web sites can include links in their pages and you have no control over those unless you let Google crawl the pages and include a directive for the crawled page to not be included in search results. Alternately, the links on the site could be changed to form submit buttons, with the parameters sent to a form processor instead of included in a link URL.

(Carl Zimmerman) #3

You’ve explained what Google does, in different words, but I still don’t understand why Google does that. By the time a crawler finds a link that points into an off-limits directory, it has already seen the robots.txt file, knows that the directory is off-limits, and is in the process of indexing the page that contains the link. Google could have programmed its crawler to ignore such links to un-indexable pages; instead, it not only retains them but takes each such URL apart to index the pieces so that the search engine can use them to trigger inclusion of the URL in a results list with a relevance score greater than that of the indexable page in which the URL was originally found.

It is as if Google is saying to the searching person, “I’m not allowed to know what the contents of this page are, therefore it must be more relevant to you than any page for which I do know the contents.” Huh?

Other search engines and their supporting crawlers are smarter than that, and I believe that Google used to be. The directory in question has been off-limits for seven years, with a link pointing into it from each of more than 2500 indexable pages on my Website. I frequently use a “site:” search to find a page that I need to re-read, but never before this week has a results list included a bare URL.

You mention that other Websites could link to an uncrawlable page. That is true, but it is also irrelevant, because there is in this case no motivation for anyone else to do so. I didn’t put the directory off-limits to hide anything from public view; after all, it exists in order to display information relevant to the indexable pages that use it. It’s off-limits to crawlers so that they don’t waste their time (and my hosting company’s resources) in trying to index files whose contents are meaningless outside the context of those indexable pages. But Google persists anyway. Sigh…

Carl Scott Zimmerman, Campanologist
Webmaster for

(Curtis Wilcox) #4

“Why is Google putting URLs that are uncrawlable at the top of search results” is a different question from why they are including them in search results at all. I don’t have a definitive answer but I have a guess. I searched for the Wappingers one of the “forbidden” URLs with the link: parameter to see if it would turn up a page on a different domain with that exact link; it didn’t but it did return five pages on, none of which contain the exact link searched for. My guess is Google is treating all the data/atlas/GMapNGeocoder.html URLs as equivalent instead of treating the GET parameters after the question mark as meaningful. If they’re equivalent, the fact that you have many such links suggests they’re important and should be at the top of the results (if people actually follow those links rather than the links to “normal” web pages, that will reinforce their rank).

Why would it treat the URLs as equivalent? Lots of URLs in the world have GET parameters that have no bearing on their destination; they’re used purely for tracking purposes. Google’s algorithm may have a heuristic that says these long strings of parameters aren’t relevant, especially since the filename extension is not one that conveys there’s server-side code at work. Changing data/atlas/GMapNGeocoder.html? to data/atlas/GMapNGeocoder.php?, data/atlas/GMapNGeocoder.cgi? or even data/atlas/? may be enough to bump the algorithm into thinking those parameters are meaningful. But this is so speculative, SEO is so complex, and it takes so long to see the results of experiments that I wouldn’t try changing it for this reason alone.
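The guess above can be sketched in a few lines of Python. This is not a description of Google's actual algorithm, which is not public; it only illustrates how dropping the query string would make many distinct links collapse into one heavily linked URL. The `example.org` domain, the `lat`/`lon` parameters, and the page names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Made-up links of the kind found on the site's indexable pages.
links = [
    "https://example.org/data/atlas/GMapNGeocoder.html?lat=41.6&lon=-73.9",
    "https://example.org/data/atlas/GMapNGeocoder.html?lat=40.7&lon=-74.0",
    "https://example.org/data/atlas/GMapNGeocoder.html?lat=42.3&lon=-71.1",
    "https://example.org/pages/wappingers.html",
]

# If parameters are ignored, every map link counts toward the same target.
inbound = Counter(strip_query(u) for u in links)
for target, count in inbound.most_common():
    print(count, target)
```

With more than 2500 such links on the site, a parameter-blind count like this would make the map page look like the most linked-to URL by a wide margin, which is consistent with (but does not prove) the ranking seen in the example search.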

So what? It’s a drop in the bucket. That waste is too small to think about. Better quality, more useful search results to actual people may be something worth considering (or not, the example search you gave returns one crummy-looking URL followed by six pretty good results). There are a number of suggestions for improving findability or search result quality, only some of which are specific to Google’s search.

(Carl Zimmerman) #5

Thanks for taking the time to reply, Curtis. I’m delighted to learn that Google has restored the “link:” search method. Unfortunately, its implementation is so sloppy that it’s not as useful as it could be, as evidenced by your experience. It ought to have given you precisely one result, namely the second item in my example, where you would find that link as the “Site locator map via Google™”.

Indeed there is no server-side code at work; the map customization is all done by JavaScript on the visitor’s machine, guided by parameters passed in the link. So the customized map is of no particular interest to anyone except those who request it from the page containing the link.