Fun Friday, or 504 out of nowhere
You probably know the advice not to deploy on Fridays, because no sane person wants to deal with possible problems over the weekend. But sometimes that's just not enough...
Last Friday was full of fun: Fascinating Unfolding Network issues. So... imagine a peaceful Friday. Blue sky, birds singing, everyone ready to wrap up tickets, grab a glass of wine or beer, hug their cat/dog/partner, and have a great evening after a hard week of work. And suddenly an alarm goes off: a new web app has stopped working everywhere. It wasn't deployed to production, but it was down in all 3 of its environments: dev, test, and stage. Instead of a nice React interface, we got a 504 timeout error.
And it wasn't just one team affected. Several teams were hit; each has its own AWS account, and inside it they all use the same modules for running the app: Route53, ECS, and ECR. Our team had recently added CloudFront, so our traffic pattern looks like: CloudFront -> ALB -> Container.
If you're not familiar with all these abbreviations, that's okay — and I'm slightly jealous.
When you know — you know
Of course, we tried to debug using AI. It gave contradictory explanations, recommended looking at different pieces of our AWS infrastructure, and suggested updating our app. But we hadn't updated anything in our app or its infrastructure; it just stopped working, and since it happened to other teams too, we suspected a more common problem.
Network.
We noticed that some requests weren't failing with 504. For example, we were able to get a redirect or ping our /health endpoint — so the service and routing worked, but all our JS and CSS files were inaccessible.
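Here is a minimal Python sketch of that kind of check: request several kinds of resources and compare what comes back. It is not the tooling we used that day. The function names are made up for illustration, and apart from /health every path and host you pass in is your own assumption about your app.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 on a network-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (a 504 included) are raised as HTTPError.
        return err.code
    except OSError:
        # DNS errors, refused connections, and socket timeouts end up here.
        return 0


def probe(base_url: str, paths: Iterable[str],
          fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Request each path once and map it to the status code that came back."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) for path in paths}


def failing(results: Dict[str, int]) -> List[str]:
    """Paths that did not come back with a 2xx or 3xx status."""
    return sorted(p for p, status in results.items() if not 200 <= status < 400)
```

Calling `probe("https://your-app.example", ["/health", "/", "/assets/app.js"])` and passing the result to `failing` would list only the paths that break. Note that `urlopen` follows redirects, so a redirect shows up as the status of its final target. The `fetch` parameter exists so the logic can be tried with canned responses, without touching the network.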
The main question was: were there any changes in network configuration that could have affected this? It's not an easy question, because as always — a lot happens and changes in a company. But one of the answers was "updating FTD."
The call had been going on for several hours by then, so my genuine question was: what the heck is FTD? Now I know it's Firepower Threat Defense, and that's how Cisco prevents attacks and helps businesses run their websites without interruptions (ha-ha).
Cisco Firepower Threat Defense (FTD) is an integrated security platform that combines firewall capabilities, intrusion prevention, and advanced threat protection to detect, block, and remediate cybersecurity attacks. Source
It's an NGFW, and all those letters stand for Next-Generation Firewall, meaning it's intelligent enough to decide what is a threat. Not sure if it, by blocking our new web app, is trying to tell me something. (Here you can read more about NGFW — from Cisco or Palo Alto.)
Next generation here
The NGFW was designed to provide deeper visibility and smarter enforcement. They combined traditional firewall capabilities with integrated intrusion prevention and full-layer inspection. They recognized traffic based on apps, not just ports. — Palo Alto
Then we found what our failing services had in common: they were all serving their resources from /assets. For some reason, the firewall decided that the files there were too big and dangerous, and started blocking requests to that path. Because of that, our container couldn't prepare the HTML and failed with a timeout. In our case it wasn't so obvious, because everything happened on the server and failed while rendering the HTML page. For other services using the SPA approach, though, you could see in the network tab how they'd get index.html, while the subsequent requests for CSS and JS would fail.
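Spotting that common factor is easier with the failing URLs from every team side by side. Below is a rough Python sketch of the idea, assuming you have collected the failing URLs per service by hand. The service names and URLs are hypothetical, and grouping by the first path segment is my simplification, not something the firewall or AWS reports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set
from urllib.parse import urlsplit


def first_segment(url: str) -> str:
    """Return the first directory of a URL path, e.g. '/assets' for '/assets/app.js'."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    # A one-part path such as '/index.html' is a file at the root, not a directory.
    return "/" + parts[0] if len(parts) > 1 else "/"


def common_failing_prefixes(failures: Dict[str, Iterable[str]]) -> Set[str]:
    """Prefixes under which every service has at least one failing URL."""
    services_by_prefix = defaultdict(set)
    for service, urls in failures.items():
        for url in urls:
            services_by_prefix[first_segment(url)].add(service)
    return {prefix for prefix, seen in services_by_prefix.items()
            if len(seen) == len(failures)}


# Hypothetical input: failing URLs noted down for two teams.
example = {
    "team-a": ["https://a.example/assets/app.js", "https://a.example/assets/app.css"],
    "team-b": ["https://b.example/assets/main.js", "https://b.example/api/users"],
}
shared = common_failing_prefixes(example)
```

With this input, `shared` contains only `/assets`, because `/api` fails for one team but not the other. A prefix that fails for everyone is a hint worth chasing, not proof of the cause.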
Our network specialists went to Cisco and we got the issue resolved — what a relief.
A few takeaways from that Friday:
- Try running requests from the browser and look at the network tab; also try requesting resources using curl — it can help, because sometimes server-to-server requests surface more information.
- Make requests for different resources. A failure on your index.html doesn't mean everything is inaccessible.
- Identify the time when the problem started and ask what changes were applied to the system around that time. Think about how to prove that a change did or didn't cause the current behavior.
- Think critically and don't blindly trust AI or people's advice. Try to verify every statement.
- Think about the big picture, not only about your piece.
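The advice about pinning down when the problem started can be made a bit more mechanical. Here is a small Python sketch, assuming you can get a list of recent changes with timestamps from your change log or chat history. The `Change` type, the 24-hour window, and the sample entries are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    applied_at: datetime
    description: str


def suspects(changes: List[Change], incident_start: datetime,
             lookback: timedelta = timedelta(hours=24)) -> List[Change]:
    """Changes applied within `lookback` before the incident, most recent first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


# Hypothetical change log; the dates are invented for the example.
incident = datetime(2024, 1, 5, 15, 0)
log = [
    Change(datetime(2024, 1, 5, 14, 30), "firewall update"),
    Change(datetime(2024, 1, 2, 9, 0), "app deploy"),
    Change(datetime(2024, 1, 5, 16, 0), "rollback attempt"),
    Change(datetime(2024, 1, 5, 9, 0), "DNS change"),
]
shortlist = suspects(log, incident)
```

Here `shortlist` holds the firewall update and the DNS change, in that order. The deploy from three days earlier and the rollback made after the incident started are left out. A change landing inside the window only makes it a candidate; you still have to prove or disprove it, for example by reverting it in one environment.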