Did the Wayback machine break?

Beau Schwabe · 2013-10-25 22:19

I was looking for something and needed to use the Wayback machine .... http://archive.org/web/ ... and can't get anything but "Page cannot be crawled or displayed due to robots.txt" everywhere I try to go. Even my own website that I had in the early 90's isn't there anymore... I get the same message.

Bummer !!

Ym2413a · 2013-10-25 22:23

Seems to work for me when I tried it. : ]

Beau Schwabe · 2013-10-25 22:34

Just curious... Can you go here?

http://www.ionet.net/~bschwabe/index.html.

This is the message I keep getting ....

Phil Pilgrim (PhiPi) · 2013-10-25 22:34

'Works for me when I use it on parallax.com. Which website were you trying to explore?

-Phil

Phil Pilgrim (PhiPi) · 2013-10-25 22:37

I think you're screwed. Here's the robots.txt file:

User-agent: *
Disallow: /

IOW, keep out!
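For anyone who wants to check what those two lines mean to a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed` and the bot name `ExampleBot` are made up for illustration. The rules are pasted in as a string, so nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, pasted in as text (no network access needed).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this user agent fetch this URL under ROBOTS_TXT?"""
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an invented crawler name; the wildcard rule covers every agent.
print(is_allowed("ExampleBot", "http://www.ionet.net/~bschwabe/index.html"))
print(is_allowed("ExampleBot", "http://www.ionet.net/"))
```

Both checks should come back disallowed, since `Disallow: /` under `User-agent: *` covers every path for every well-behaved crawler.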

-Phil

Beau Schwabe · 2013-10-25 22:53

Yes, I know what the robots.txt is for, but it doesn't explain how I have gone to my own site ... and others in the past using the WBM.

bomber · 2013-10-26 00:17

I tried the link (probably too late) and got a generic 500 server error.

Heater. · 2013-10-26 01:10

No idea what is going on here but archive.org and hence the wayback machine have made some changes recently which may have something to do with it.

http://blog.archive.org/2013/10/25/reader-privacy-at-the-internet-archive/

Christoph_H · 2013-10-26 02:50

The wayback machine applies the robots.txt contents retroactively. You could access your site in the past because there probably wasn't a robots.txt in place at that time.

LoopyByteloose · 2013-10-26 03:50

Beau Schwabe (Parallax) wrote: »

Just curious... Can you go here?

http://www.ionet.net/~bschwabe/index.html.

This is the message I keep getting ....

I seem to get a 500 Internal Server Error when I try your link. Hopefully that helps you in some way.

KMyers · 2013-10-26 09:46

I get the same error here. Internal server error....

GordonMcComb · 2013-10-26 11:12

Christoph is right -- the robots.txt file applies retroactively. Though the policy has changed over the years, if the robots file disallows crawling, Wayback will neither crawl pages nor display them. If the restriction is subsequently removed, captures from the disallowed period are still not shown.

As others have mentioned, your site now displays a 500 server error.
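To make the retroactive rule concrete, here is a toy model of the behaviour as described in this thread. It is only a sketch of the posters' description, not the Internet Archive's actual code, and every name in it (`capture_is_shown`, `blocked_now`, `blocked_periods`) is hypothetical.

```python
from datetime import date


def capture_is_shown(capture_day, blocked_now, blocked_periods):
    """Toy model of the rule as described in this thread (not Wayback's real logic).

    capture_day     -- date the archived copy was taken
    blocked_now     -- True if the live site's robots.txt currently disallows crawling
    blocked_periods -- list of (start, end) dates when robots.txt disallowed crawling
    """
    # A current exclusion hides every capture, however old it is.
    if blocked_now:
        return False
    # Even after the exclusion is lifted, captures from a blocked period stay hidden.
    for start, end in blocked_periods:
        if start <= capture_day <= end:
            return False
    return True


# A 1998 capture is hidden as soon as the live site adds "Disallow: /" ...
print(capture_is_shown(date(1998, 5, 1), True, []))
# ... and becomes visible again if the exclusion goes away.
print(capture_is_shown(date(1998, 5, 1), False, []))
```

The first branch is the one that matters here: an exclusion added today is enough to hide a page archived years before it existed.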

Beau Schwabe · 2013-10-26 11:17

"The wayback machine applies the robots.txt contents retroactively. You could access your site in the past because there probably wasn't a robots.txt in place at that time." .... Does this mean someone came along 15 years later and placed a "no robots" in their HTML code or a robots.txt on their main site, and it effectively prevents me from looking at my own material? Or the material of others that falls into a similar loophole? .... If that truly is the case, then what a waste of buried knowledge.

Christoph_H · 2013-10-26 12:58

Beau Schwabe (Parallax) wrote: »

Does this mean someone came along 15 years later and placed a "no robots" in their HTML code or a robots.txt on their main site, and it effectively prevents me from looking at my own material?

It seems to work that way but in my opinion this is not the way it ought to be done.

The behaviour is mentioned in the Wikipedia article:

The Wayback Machine also retroactively respects robots.txt files, i.e. pages which are currently blocked to robots on the live web will be made temporarily unavailable from the archives as well.

Source for this statement is probably The Internet Archive FAQ

GordonMcComb · 2013-10-27 12:57

The ionet.net domain appears to be dead, with the domain redirecting to Earthlink. You could always hope that they redirect everything, and if so, the Earthlink site does not have the robots.txt exclusion, and you would once again be able to see your archived content. You could also write to Earthlink and ask them to remove the exclusion, which may be unintentional. For the live site it is irrelevant now anyway, as the personal pages themselves seem to be trashed.

This really points up why you shouldn't use personal pages provided by an ISP or school site for anything you want to keep: you have less than zero control. It should all be considered transient. Google offers free pages, and they are highly unlikely to ever apply a global robots exclusion. If you don't want to pay for hosting, there are also freebie blog hosts you can use where your blog is a subdomain, in which case each subdomain can have a unique robots.txt file.

kwinn · 2013-10-27 15:30

Another example of why I don't trust "the cloud" for storing or backing up my data. Your data can disappear or become inaccessible for a multitude of reasons and there is almost nothing you can do to recover it. I prefer to maintain my own backup media even though it is more work and expense.

evanh · 2013-10-27 22:22

Another example of how the web is no longer the web. Using scripting is on that list too.

The Web was designed for pulling info, but more and more of the new bits are for pushing ads and tracking and controlling the end users. I'm becoming more disconnected all the time.

Peter KG6LSE · 2013-10-27 22:47

Christoph_H wrote: »

It seems to work that way but in my opinion this is not the way it ought to be done.

The behaviour is mentioned in the wikipedia article:
Source for this statement is probably The Internet Archive FAQ

I just added a NO TRACK robots.txt on my site a few months ago .......... Let's see if my site is still there on WBM....... granted, there is a way to add an exclusion to the file to let the Wayback Machine do its thing.

Well, I guess it's not as retroactive as we think it is .....

I can see my stuff back to 05

evanh · 2013-10-28 04:35

It'll take time for it to get around to re-crawling and updating. Either that or there is explicit instruction required in the robots.txt file.
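The per-crawler exclusion mentioned above can be sketched like this. It assumes the Archive's crawler identifies itself as `ia_archiver`, the name it has historically used, so check the Archive's current documentation before relying on it. The helper `is_allowed` and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: let "ia_archiver" in, keep every other crawler out.
# An empty "Disallow:" line means nothing is disallowed for that agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this user agent fetch this URL under ROBOTS_TXT?"""
    return parser.can_fetch(user_agent, url)


print(is_allowed("ia_archiver", "http://example.com/index.html"))
print(is_allowed("SomeOtherBot", "http://example.com/index.html"))
```

With rules like these, a blanket exclusion for everyone else would not by itself ask the Archive to hide the site, which might be one explanation for older captures still being visible.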
