Did the Wayback machine break? — Parallax Forums

Beau Schwabe Posts: 6,566
edited 2013-10-28 04:35 in General Discussion
I was looking for something and needed to use the Wayback machine .... http://archive.org/web/ ... and can't get anything but "Page cannot be crawled or displayed due to robots.txt" everywhere I try to go. Even my own website that I had in the early 90's isn't there anymore... I get the same message.

Bummer !!

Comments

  • Ym2413a Posts: 630
    edited 2013-10-25 22:23
    Seems to work for me when I tried it. : ]
  • Beau Schwabe Posts: 6,566
    edited 2013-10-25 22:34
    Just curious... Can you go here?

    http://www.ionet.net/~bschwabe/index.html.

    This is the message I keep getting ....
    [Attached screenshot: WB.png]
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-10-25 22:34
    'Works for me when I use it on parallax.com. Which website were you trying to explore?

    -Phil
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-10-25 22:37
    I think you're screwed. Here's the robots.txt file:
    User-agent: *
    Disallow: /
    

    IOW, keep out!

    -Phil
  • Beau Schwabe Posts: 6,566
    edited 2013-10-25 22:53
    Yes, I know what the robots.txt is for, but it doesn't explain how I have gone to my own site ... and others in the past using the WBM.
  • bomber Posts: 297
    edited 2013-10-26 00:17
    I tried the link (probably too late) and got a generic 500 server error.
  • Heater. Posts: 21,230
    edited 2013-10-26 01:10
    No idea what is going on here but archive.org and hence the wayback machine have made some changes recently which may have something to do with it.

    http://blog.archive.org/2013/10/25/reader-privacy-at-the-internet-archive/
  • Christoph_H Posts: 31
    edited 2013-10-26 02:50
    The wayback machine applies the robots.txt contents retroactively. You could access your site in the past because there probably wasn't a robots.txt in place at that time.
  • LoopyByteloose Posts: 12,537
    edited 2013-10-26 03:50
    Beau Schwabe wrote: "Just curious... Can you go here?

    http://www.ionet.net/~bschwabe/index.html.

    This is the message I keep getting ...."

    I seem to get a 500 Internal Server Error when I try your link. Hopefully that helps you in some way.
  • KMyers Posts: 433
    edited 2013-10-26 09:46
    I get the same error here. Internal server error....
  • GordonMcComb Posts: 3,366
    edited 2013-10-26 11:12
    Christoph is right -- the robots.txt file applies retroactively. Though the details have changed over the years, if the robots file disallows crawling, Wayback will neither skim pages nor display them. If the restriction is subsequently removed, skims from the disallowed period are still not shown.

    As others have mentioned, your site now displays a 500 server error.
  • Beau Schwabe Posts: 6,566
    edited 2013-10-26 11:17
    "The wayback machine applies the robots.txt contents retroactively. You could access your site in the past because there probably wasn't a robots.txt in place at that time." .... Does this mean someone came along 15 years later and placed a "no robots" in their HTML code or a nobots.txt on their main site and it effectively prevents me from looking at my own material? Or the material of others that falls into a similar loophole? .... If that truly is the case, then what a waste of buried knowledge.
  • Christoph_H Posts: 31
    edited 2013-10-26 12:58
    Beau Schwabe wrote: "Does this mean someone came along 15 years later and placed a "no robots" in their HTML code or a nobots.txt on their main site and it effectively prevents me from looking at my own material?"

    It seems to work that way, but in my opinion this is not the way it ought to be done.

    The behaviour is mentioned in the Wikipedia article:
    "The Wayback Machine also retroactively respects robots.txt files, i.e. pages which are currently blocked to robots on the live web will be made temporarily unavailable from the archives as well."
    The source for this statement is probably the Internet Archive FAQ.
  • GordonMcComb Posts: 3,366
    edited 2013-10-27 12:57
    The ionet.net domain appears to be dead, with the domain redirecting to Earthlink. You could always hope that they redirect everything; the Earthlink site does not have the robots.txt exclusion, so in that case you would once again be able to see your archived content. You could also write to Earthlink and ask them to remove the exclusion, which may be unintentional, though that is irrelevant now anyway as the personal pages themselves seem to be trashed.

    This really points up the case against using personal pages provided by an ISP or school site for anything you want to keep, because you have less than zero control. It should all be considered transient. Google offers free pages, and they are highly unlikely to ever apply a global robots exclusion. If you don't want to pay for hosting, there are also freebie blog hosts you can use where your blog is a subdomain, in which case each subdomain can have a unique robots.txt file.
  • kwinn Posts: 8,697
    edited 2013-10-27 15:30
    Another example of why I don't trust "the cloud" for storing or backing up my data. Your data can disappear or become inaccessible for a multitude of reasons and there is almost nothing you can do to recover it. I prefer to maintain my own backup media even though it is more work and expense.
  • evanh Posts: 15,941
    edited 2013-10-27 22:22
    Another example of how the web is no longer the web. Using scripting is on that list too.

    The Web was designed for pulling info, but more and more of the new bits are for pushing ads and tracking and controlling the end users. I'm becoming more disconnected all the time.
  • Peter KG6LSE Posts: 1,383
    edited 2013-10-27 22:47
    Christoph_H wrote: "It seems to work that way but in my opinion this is not the way it ought to be done.

    The behaviour is mentioned in the wikipedia article:
    Source for this statement is probably The Internet Archive FAQ"

    I just added a NO TRACK robots.txt on my site a few months ago .......... Let's see if my site is still there on WBM ....... granted, there is a way to add an exclusion to the file to let the Wayback Machine do its thing.


    Well, I guess it's not as retroactive as we think it is .....

    I can see my stuff back to '05.
  • evanh Posts: 15,941
    edited 2013-10-28 04:35
    It'll take time for it to get around to re-crawling and updating. Either that or an explicit instruction is required in the robots.txt file.
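
For anyone who wants to check how a robots.txt like the one Phil posted is interpreted, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how a standard parser reads the rules; it does not reproduce the Wayback Machine's retroactive behaviour. The `ia_archiver` user-agent name is an assumption (it is the name the Internet Archive is generally reported to have honoured around this time, so check their FAQ before relying on it), and the `allowed` helper and `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The file Phil posted: every crawler is kept out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Same blanket block, plus an exception for one named crawler.
# ASSUMPTION: "ia_archiver" is the token the Wayback Machine's crawler honours.
ALLOW_ARCHIVER = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

URL = "http://www.ionet.net/~bschwabe/index.html"

print(allowed(BLOCK_ALL, "ia_archiver", URL))        # blanket block applies
print(allowed(ALLOW_ARCHIVER, "ia_archiver", URL))   # named exception applies
print(allowed(ALLOW_ARCHIVER, "SomeOtherBot", URL))  # falls back to the * rule
```

An empty `Disallow:` line means "nothing is disallowed" for that agent, which is the kind of exclusion Peter mentions. Whether it brings old snapshots back is up to the archive, not the parser.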