Google Cloud trips up on 'physical damage' to network fiber - but it's not an outage

Google Cloud Platform has been hit by more disruptions, which have caused some users a spike in latency.
Written by Liam Tung, Contributing Writer

After last month's 'catastrophic failure' on Google Cloud due to a bad configuration, Google Cloud Platform (GCP) has suffered another multi-hour problem. This time, Google doesn't quite consider it an outage, even though it was causing a spike in latency for customers. 

Google said Tuesday's "disruptions" to Google Cloud Networking and Load Balancing were caused by physically damaged fiber bundles serving its us-east1 data center in South Carolina. 

The issues were first posted at 10:25 Pacific Time on Google's status page, which has several updates detailing its response and explanation for why the disruptions were happening. 

SEE: Cloud v. data center decision (ZDNet special report) | Download the report as a PDF (TechRepublic)

Google has mitigated the damaged fiber by "electively rerouting some traffic to ensure that customers' services will continue to operate reliably until the affected fiber paths are repaired".

Despite these measures, the cloud provider warned that some customers will still see higher than usual latency until it fixes the damaged fiber, which it expects to do within the next 24 hours and will fully resolve the latency problems.  

An individual who claimed to work for Google Cloud using the handle 'boulos' on Hacker News popped into a thread on the site to correct comments that the networking issue meant the region was "down" – although boulos admitted that "network latency spiking up for external connectivity is bad".

Another Hacker News user, 'mrweasel' challenged the explanation that the region technically wasn't down. 

"As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN," wrote mrweasel. 

Another user said mrweasel's boss was "nitpicking" over words during a crisis and sacrificing "accuracy and precise understanding". 

Mrweasel countered that it was accurate: "From a business perspective the site was down. Nitpicking is telling him: No it is in fact up, the customer just can't use it." 

Bolous explained that they had intervened because of "confusion" among commenters who claimed the region was down. 

"During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done," wrote boulos. 

A 'david-cako' who claimed to work for AWS chimed in: "I work for AWS. There is typically a balance that has to be struck when sharing information with customers. I would imagine this goes for most companies, which is why it isn't until a post-mortem that the messaging is fully refined."

Like Google, popular CDN provider Cloudflare has suffered two outages in the past week and has done plenty of explaining. The first was blamed on a Verizon internet-routing error that caused a "catastrophic cascading failure". 

The second, on Tuesday, was caused by an internal "bad software deploy" that triggered an unprecedented CPU spike on its equipment. The outage only lasted 30 minutes, but impacted every single data center Cloudflare operates across the globe. 

SEE: The future of Everything as a Service (free PDF)

Visitors to sites that depend on Cloudflare were met with 502 'bad gateway' error messages.       

Cloudflare CTO John Graham-Cumming has since revealed the bad software deploy was actually a "single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules" intended to bolster defenses against JavaScript attacks. 

"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100 percent on our machines worldwide. This 100 percent CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82 percent," wrote Graham-Cumming. 

He admitted the company's testing procedures were "insufficient" and said they are currently under review. The widespread impact was because the new WAF rules were "deployed globally in one go". 

UPDATE July 4: Google said the Google Cloud Networking and Load Balancing issues were resolved as of Wednesday, July 3, 7:35 Pacific Time, meaning the disruptions had lasted about 21 hours. 

The company confirmed that it has repaired the damaged fiber bundles and returned the us-east1 region to normal routing. It will also be conducting an internal review of the incident and make appropriate improvements.  

More on Google and cloud outages

Editorial standards