Github is really broken today.

Heater. Posts: 21,233
edited 2018-10-22 - 10:45:09 in General Discussion
Github has been really messed up this morning.

I have this repository: which is a week old and to which I pushed changes some hours ago. If I poke around in the files there I often get a 404 Not Found Error. Sometimes a red alert appears on the page announcing that it cannot supply the latest commits at this time.

To make things more interesting some hours ago I renamed that repo to just "xoroshiro". Well is 404 Not Found.

But in my home page xoroshiro was listed as one of my repos earlier today and I was able to surf it. Not any more.

Basically, github is serving up random files or randomly saying things are not found.

At one point I could not clone either of these as git clone said it could not find them at all!

Seems it's not just me:

And MS hasn't even taken over github yet :)

Luckily git is a distributed system. Even if github disappeared altogether, things can just keep rolling along here.


  • Dang, I was about to blame M$!
  • Tor Posts: 1,995
    I'm not sure I see a problem, except for when you're just browsing around the site. The backend git services are supposed to be working, but even if they weren't - we're talking Git here, we all have fully distributed copies of the repos and we can just keep working. Catch up or sync with github another day. Just as we do internally at work, with the internal git server. And we can always fetch from each other, without going via any server.
  • Sorry folks, I meant to post this thread under "General Discussion". Perhaps someone could move it there.

    It's weird, my repo is really out of sync. Sometimes I see "xoroshiro" in my repo list, sometimes its old name "xoroshiro-plusplus". Sometimes I can access the files of one or the other. Then again not. Sometimes I can clone one or the other, sometimes not.

    The problem for many is that they have become dependent on github's issue tracker. They have no work to do today as they see no issues coming in. Or they cannot show any progress.

    This is why one should use cloud systems that are distributed and fault tolerant, not "cloud" systems that are that in name only and really highly centralized.
  • Maybe it's the MS W10 1809 bug deleting files ;)
  • Moved!
  • Yanomani Posts: 977
    edited 2018-10-22 - 11:14:58
    Hi Heater

    I've just tried to access your repo, and apparently, it's coming back, slowly, file by file.

    Sometimes a 404 error shows its ugly face, then, one more try and things seem to be working, as expected.
  • Except xoroshiro-plusplus was renamed xoroshiro and so should not exist. They both do. They are both available and not available at random. They alternate in my repo list.

    It will be interesting to see where my repo lands when they have everything synced up again.

    For a short while there I could not sign in at all.
  • Hi Heater

    Not completely working anymore. Unstable!

    Another set of retries, and back again at 404. Weird behaviour, like a vintage Cuckoo Clock!
  • Busy pushing to bitbucket and gitlab :)
  • Tor Posts: 1,995
    Well, they *are* trying to live-repair a system with one or more broken disk drives.. can't expect things to be much better. The alternative would be downtime until it's fixed.
  • Yes, there was some serious breakage that started last night. Some red-eyed people working on it by now I'm sure:
  • Yes, I'm sure it's not trivial.

    However, I naively assumed that the issue of failed disks/machines and software in distributed cloud services was a solved problem. That, basically, there are dozens, hundreds, thousands of nodes. That nodes being broken or off line was expected to be a normal situation and systems are designed to keep working despite that, without down time or interruption.

    Why would I assume that? Well, it's been a long time since Lamport, Shostak, and Pease wrote their famous paper on the problem of "Byzantine" fault tolerance. Since then other solutions have come into wide use, notably the Paxos and Raft algorithms.

    This past year I have been using the CockroachDB database, which is a distributed SQL database that will remain functional and, importantly, consistent, provided a majority of nodes can agree on the state of things. Cockroach uses the Raft consensus algorithm. It's been kind of fun playing with this, killing off nodes and watching how well it survives. It does.

    Still. On the whole I think Github has done very well over the years. They did survive the biggest DDOS attack in history some time back. I have never noticed any such prolonged unavailability before.

    One wonders what they use for storage. Googling around nobody seems to know.

    All seems to be up and running stably again.
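
The "majority of nodes" condition mentioned above for Raft-based stores can be put into a few lines of arithmetic. This is only a sketch of the quorum rule, not an implementation of Raft or of anything CockroachDB actually runs; the function names are made up for illustration.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def can_make_progress(total_nodes: int, live_nodes: int) -> bool:
    """True if the surviving nodes can still form a majority."""
    return live_nodes >= quorum_size(total_nodes)


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes may be lost while a majority still remains."""
    return total_nodes - quorum_size(total_nodes)


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only to show the pattern.
    for n in (3, 4, 5, 7):
        print(n, "nodes: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failures")
```

Note that going from 3 to 4 nodes does not buy any extra fault tolerance under this rule, which is one reason such clusters are usually sized with an odd number of nodes.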
  • I'm sure the infrastructure (object stores?) is pretty interesting and distributed. Maybe an indexing service issue? Anywho, other than a backup of page build jobs all seems to be well.
  • I'd love to read an analysis of today's events.

    For example after I renamed my repo from "xoroshiro-plusplus" to "xoroshiro" it was oscillating between those two names in my repo list on my front page. That and sometimes a page was available and sometimes 404'ed under either name of the repo. Clearly different parts of their store were in different stages of update, out of sync.

    From this I think we can conclude they are not ensuring agreement between nodes with any consensus algorithm like Paxos or Raft. Rather they have an "eventually consistent" distributed store. Something like MongoDB perhaps.

    This presents the slight worry that when one does a download or git clone it may be possible to get an out of date version of a repo on occasion.

    Yep, all seems well just now.
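
The oscillation between the two repo names described above is what a toy eventually consistent store produces when a read can land on a replica that has not caught up yet. The sketch below is purely illustrative: the class names, the last-writer-wins rule and the single `sync()` step are assumptions made for the example, and nothing here is known about how github's storage really works.

```python
import random


class Replica:
    """One copy of the data, tagged with the version it last accepted."""

    def __init__(self):
        self.version = 0
        self.repo_name = None

    def apply(self, version, repo_name):
        # Last-writer-wins: only accept strictly newer versions.
        if version > self.version:
            self.version = version
            self.repo_name = repo_name


class EventuallyConsistentStore:
    """Toy store: writes hit one replica, reads hit any replica."""

    def __init__(self, replica_count, seed=0):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.rng = random.Random(seed)
        self.next_version = 1

    def write(self, repo_name):
        # The write is acknowledged after reaching a single replica.
        self.replicas[0].apply(self.next_version, repo_name)
        self.next_version += 1

    def read(self):
        # Any replica may answer, whether or not it has caught up.
        return self.rng.choice(self.replicas).repo_name

    def sync(self):
        # Background propagation: copy the newest state to every replica.
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.apply(newest.version, newest.repo_name)


if __name__ == "__main__":
    store = EventuallyConsistentStore(replica_count=3)
    store.write("xoroshiro-plusplus")
    store.sync()
    store.write("xoroshiro")  # the rename, not yet propagated
    print("before sync:", sorted({store.read() for _ in range(50)}))
    store.sync()
    print("after sync: ", sorted({store.read() for _ in range(50)}))
```

Until `sync()` runs, a reader can see either name depending on which replica answers. A consensus-based store avoids this by not acknowledging the rename until a majority of replicas have accepted it.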