GCP is incredibly bad at communicating when there are problems with their systems. Just terrible. It's only when our apps start to break that we notice something is down, then look at the green dashboard, which is even more infuriating.
I suspect there's a correlation between outages that are easy to detect and communicate and outages that automation can recover from so easily that you hardly notice.
I really don’t get this. There’s a huge number of complaints about poor communication from companies like Google and AWS during every outage. Yet they remain seemingly indifferent to how much customer trust they are losing, and the competitive edge the first one to get this right could gain.
I don't think they are losing any kind of customer trust.
Unless something is really fucked (like both GCP and AWS being down for us-east) incidents like these are not going to impact them at all.
The cost of either migrating to the other provider or, even worse, migrating to more traditional hosting companies is enormous, and justifying it will take much more than "service was down for 2 hours in 2019". The contracts also cover cases like this and even if they don't, Google and Amazon can and will throw in some free treat as an apology.
On one hand I find this quite sad, but from a pragmatic point of view it makes sense.
If 20% of Google Cloud's customers leave after this outage because of poor communication, they'll prioritise accordingly and apply all those nice SRE theories to their infra. But this isn't happening, because <various reasons>, so... who cares?
> "I care about how my providers behave when they have issues"
We all do.
As the other commenters stated, the communication is poor because the clouds are still growing rapidly and there's not much reason to be better. We might also be underestimating just how much more it would cost to provide better service, and whether that cost is worth it given the revenue lost (if any). Are you really going to shift all of your spend overnight because of an outage? And where are you going to go?
The reality of these decisions is far more nuanced than it may seem and the current state of support is probably already optimized for revenue growth and customer retention.
Why aren’t these on separate systems? I never had the impression that Google cheaps out on things, but this sounds exactly like the sort of shit that happens when people cheap out. Not even a canary system?
The idea that Google spends big on expensive systems is a huge lie.
Google started out using a Beowulf cluster that the founders wired themselves. From the very beginning, the goal of metrics collection was to optimize costs. While today it’s seen as the cash cow, the focus has always been on cheap components strung together, relying on algorithms and code for stability and making the least possible demands of the underlying hardware.
To think that they won’t try to save money any time they can seems implausible.