Parallax forums website: 504 Gateway Time-out
evanh
Posts: 16,028
I'm getting recurring 504 errors. Lasts for about 15-20 mins at a time over the last 12 hours or so.
Wondering if it may be due to the wider Let's Encrypt certificate expiry issues?
Comments
I experienced the same issue yesterday that practically rendered the forum site unusable. Today's fine but will see what happens next.
Got more 504's this morning. Will have to call the mothership on Monday to see what is going on.
Hello!
I saw one as well. It might be the failure of the certs from that company. Then again it might be something else.
A good point. I wonder how many other sites are affected?
Hello!
So what happened? Everything is up today, so it must have been fixed.
Still happening to me less than 12 hours back.
Hello,
We started seeing increased error rates on Sunday and into today. Specifically, we are seeing that the gateway is reporting that the server is unreachable and cannot deliver traffic. This appears to happen more often when the site is being crawled by various search engines.
The server that is hosting the website is an older configuration that meters its resources when it gets under load. When it gets too busy, it idles out for 60-120 seconds. The gateway sees that and starts blocking traffic. Once the web server goes active again, the gateway will begin forwarding traffic again.
We are now working to upgrade the underlying bits to replace the resource-sensitive server with a scalable cluster that can scale up or down to meet the traffic.
I will provide updates as this moves along.
Thanks Jim, that'll be appreciated.
So the website is getting more users surfing it in general? Or just more background? :)
The user traffic is about the same over the past 4 weeks. The search engines are generally polite but I am seeing two and sometimes three crawling the site at the same time.
With the new configuration, we will automatically spin up additional servers to ensure that they don't interfere with user activities on the site.
Thanks for the update Jim.
Hello!
Yes, thank you Jim. You've done yeoman's service getting this, ah, spun up, and still working. I was going to suggest that you (or someone in the offices) set up a mirror of everything and point the search bots to it, to keep them off the regular forum, but I suspect that would be too much work.
But that does not explain why there is a nest of mountain lions taking over the backlawn of the firm to raise kittens. At least four families worth are there.
The search bots are looking for forums.parallax.com, so there isn't any way that I am aware of to force bots to look at a different address. We can reroute them once we have received the headers, but that is going to slow down all the other traffic since we have to look at the headers in every request.
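For illustration, the header inspection described above could look roughly like the sketch below. It assumes crawlers identify themselves in the User-Agent header; the token list, the function names, and the pool names are all hypothetical and not the forum's actual configuration.

```python
# Hypothetical substrings to look for; real crawlers vary and the header can be spoofed.
CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot")


def is_crawler(headers: dict) -> bool:
    """Return True if the User-Agent header contains a known crawler token."""
    # HTTP header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    return any(token in user_agent for token in CRAWLER_TOKENS)


def choose_backend(headers: dict) -> str:
    """Pick a (hypothetical) backend pool name for this request."""
    return "crawler-pool" if is_crawler(headers) else "user-pool"
```

Note that this check has to run on every request, which is the per-request overhead mentioned above.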
The plan is to scale horizontally as the load increases - keep throwing servers at the site until the load is mitigated. The system will start removing servers as the load drops to keep the accountants happy.
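As a rough sketch of that scale-out/scale-in decision, assuming it is driven by average CPU load: the thresholds, the server limits, and the function name below are made up for illustration and are not the settings used on the forum.

```python
def desired_servers(current: int, avg_cpu: float,
                    scale_out_at: float = 70.0, scale_in_at: float = 20.0,
                    min_servers: int = 2, max_servers: int = 6) -> int:
    """Return the server count to aim for, given the average CPU percentage.

    All default values are hypothetical placeholders.
    """
    if avg_cpu > scale_out_at:
        target = current + 1   # load is high: add a server
    elif avg_cpu < scale_in_at:
        target = current - 1   # load is low: remove a server to save cost
    else:
        target = current       # load is in the comfortable band: no change
    # Never go below the minimum or above the maximum fleet size.
    return max(min_servers, min(max_servers, target))
```

For example, `desired_servers(2, 85.0)` asks for a third server, while `desired_servers(3, 10.0)` drops back to two.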
Another protection could be to drop packets, or route them to a single low-cost, low-resource "overload" server, based on source IP and requests per second. Something like anti-hammering protection. It saves the risk of extra server cost if the issue really is just robots and homebrew hijinks. Various other benefits too.
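A minimal sketch of that anti-hammering idea, assuming a sliding one-second window per source IP. The class name and the limits are hypothetical; a real deployment would more likely do this in the load balancer or firewall than in application code.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per window_seconds for each source IP."""

    def __init__(self, max_requests: int, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # src IP -> timestamps of accepted requests

    def allow(self, src_ip: str, now: float) -> bool:
        """Return True to serve the request, False to drop or divert it."""
        hits = self._hits[src_ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A caller would pass the current time (for instance from `time.monotonic()`) and send rejected requests to the "overload" server or drop them.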
Update:
We have implemented fully load-balanced multi-server support for the forums. We are working now on retiring the original server to eliminate the 504 error from the site.
Please let me know if anything is amiss and we'll get on it right away.
Update:
The original web server has been taken offline and is retired. We will continue to monitor the site closely.
It appears that the site is unable to send email. We are working to correct that now.
Email is working again.
Seems like I only get 504's on my iPad, and on a different ISP.
The new server configuration appears to have resolved the 504 Gateway errors. We have not logged a single 504 error since the evening of October 5th.
The site is now running on two dual-core ARM processor servers. Traffic is load balanced between the two servers. Over the past 24 hours, each server has been running at between 5% and 8% CPU. The higher value is approximately when the Microsoft Bing crawler was scouring the site.
Nice. Yep, no issues for me in past couple of days. Neat to hear ARM servers in use.
Same here in the EU. All good. With electricity prices rising sharply, these ARM servers sound like a smart choice, energy-wise.
Amazon is incentivizing customers to use ARM servers. They are about 20% less expensive to operate and they benchmark at about 20-30% faster than the equivalent Intel-based servers.
6 days without a single 504-Gateway error. Time for the next step.
I don't know if this is the correct thread, but the website www.parallax.com is getting the same symptoms. I get a 504 error trying to access any page except the forums.
The forums and also the learn site seem to be ok; parallax dot com and its shop site are experiencing the 504 Gateway Time-out. Too bad for the business...
Hello!
Actually, yes it is. The chap who manages the site did a major update here for the Forum a while back because that happened as well. It is certainly possible it will need to be done there as well.
The chap being Jim Ewald a few posts back.
Seems to be fine now.
When you're in IT, every day is a Monday.
The main Parallax website is back online this morning. Thanks to everyone who let me know it was broken.
What went wrong? It appears that the web server was running but the network interface to the load balancer failed. I have been working with AWS infrastructure since 2014 and this is a first for me. Everything was back to normal once the interface was restarted.
I am making preparations today to migrate the site to an infrastructure similar to the one we use here on the Forums. It seems to work so much better than the last setup.
Hello!
Good to know. I was once told something similar: "When you're in IT, every day is the day something will break, and people want it fixed right this minute. Never mind that the parts to fix it were ordered a year earlier, and never arrived." In that case it was a copier that also worked as a printer and kept jamming, along with the thing it used to connect properly to the office LAN.