Unrecognized file type by Google


GoogleGuy Says

Posted: July 30, 2003 11:49 AM

Importance: Medium

A member reports that Google lists their site as an "unrecognized file type". GoogleGuy offers some troubleshooting advice.

GoogleGuy Says:

Hmm. If I had to take a guess, I'd look for a misconfigured webserver. Just a shot in the dark, but I would guess that the webserver isn't returning text/html as the content type.
Here's how you can debug it yourself from Unix/Linux--you basically imitate a web browser or spider. Here's an example of fetching a page by hand from Google:

telnet www.google.com 80
Connected to www.google.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.google.com
(hit return once or twice until you get a response, which will look like the text below:)

HTTP/1.1 200 OK
Date: Wed, 30 Jul 2003 16:38:18 GMT
Cache-control: private
Content-Type: text/html <--- this line says what type of file it is.
Server: GWS/2.1
Content-length: 2691

Now if your page is www.foo.com/user1/test.html, you would type
telnet www.foo.com 80
and then do
GET /user1/test.html HTTP/1.1
Host: www.foo.com

and see what the webserver returns. This is all that a crawler does, except it also looks for links and follows them several billion times. ;)
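
If you'd rather script the telnet session than type it, here is a minimal Python sketch of the same exchange. The function names (build_request, parse_head, fetch_head) are made up for this example, and the "Connection: close" header is an addition that is not in the hand-typed request; it is there only so the server hangs up once it has answered.

```python
import socket


def build_request(host, path="/"):
    """Build the same request that is typed by hand in the telnet session."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # assumption: added so the server closes the socket
        "",                   # blank line ends the headers ("hit return")
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_head(raw):
    """Split a raw response into (status_line, headers) with lowercase header names."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status_line, headers


def fetch_head(host, path="/", port=80, timeout=10):
    """Connect like telnet does, send the request, and read until the headers end."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        raw = b""
        while b"\r\n\r\n" not in raw:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    return parse_head(raw)
```

Calling fetch_head("www.foo.com", "/user1/test.html") should give back the status line plus a dictionary of headers, and headers.get("content-type") is the value to look at.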

By the way, the "Host:" line is what allows an ISP to support virtual hosting--the bot says which domain it wants to fetch the page from. That's what allows an ISP to host many domains on one IP address. You can also use this technique to verify that an ISP is doing virtual hosting correctly. If you ask for pages from foo.com and get pages from someothercompany.com or yourisp.net, then tell your ISP to fix their virtual hosting. If you find virtual hosting errors, it could be that your ISP made a mistake, or maybe you didn't pay your ISP bill, so they've started serving their own content instead of yours. :)
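
To make the "Host:" idea concrete, here is a rough sketch of the lookup a name-based virtual host does on the server side. The select_site function and the hosting table are hypothetical, invented for illustration; real webservers have their own configuration and matching rules.

```python
def select_site(host_header, sites, default_site):
    """Return the site a name-based virtual host would serve for a Host: value."""
    # Host names are case-insensitive and may carry a port, e.g. "www.foo.com:80".
    # (IPv6 literals are ignored in this simplified sketch.)
    name = host_header.strip().lower().split(":", 1)[0]
    # An unknown or misconfigured name falls through to the default site,
    # which is typically the ISP's own content.
    return sites.get(name, default_site)


# Hypothetical hosting table: many domains on one IP address.
sites = {
    "www.foo.com": "foo.com pages",
    "www.someothercompany.com": "someothercompany.com pages",
}

print(select_site("www.foo.com", sites, "yourisp.net pages"))
print(select_site("www.unknown.example", sites, "yourisp.net pages"))
```

If www.foo.com were missing from that table, a request for it would come back with the ISP's pages, which is the kind of virtual hosting error described above.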

So try that out. If the Content-Type: line doesn't say text/html, that's what needs to be fixed. If it does say text/html, then you might want to look into whether the webserver is sending binary data (e.g. an executable, or bad character encodings for non-English pages, etc.). Let us know what you find out, and good question! :)
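
Both checks can be roughed out in a few lines of Python. This is only a heuristic sketch: the diagnose function is made up for this example, and treating NUL bytes as a sign of binary data is an assumption (it would also flag a legitimate UTF-16 page).

```python
def diagnose(content_type, body):
    """Return a list of likely problems for one fetched page (heuristic only)."""
    problems = []

    # "text/html; charset=utf-8" -> media type "text/html", params " charset=utf-8"
    media_type, _, params = (content_type or "").partition(";")
    media_type = media_type.strip().lower()
    if media_type != "text/html":
        problems.append(f"Content-Type is {media_type or 'missing'!r}, not 'text/html'")

    # Look for a charset=... parameter; its absence is not treated as an error here.
    charset = None
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset":
            charset = value.strip().strip('"')

    if b"\x00" in body:
        # Assumption: NUL bytes rarely appear in a real HTML page.
        problems.append("body contains NUL bytes, which suggests binary data")
    elif charset:
        try:
            body.decode(charset)
        except (LookupError, UnicodeDecodeError):
            problems.append(f"body does not decode as declared charset {charset!r}")
    return problems
```

An empty list means neither check found anything; it does not prove the page is fine, only that these two guesses did not pan out.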

P.S. I kinda spilled that out fast; definitely let me know if I made a typo or mistake in the above.

