Exposing cloud failures
The Amazon EC2 failure this week has exposed a number of technology strategies in cloud infrastructure as being less than perfect.
Complex systems have complex failures
The most vexing problem of Cloud Computing is that these systems are complex, and the more complex the system, the more complex the failure. Those people who run large clusters of virtualized servers with extensive storage farms and networking backbones will know exactly what I mean when I talk about complex failures.
Although evidence is scant, it would seem that EC2 was so complex that locating the source of the problem was extremely difficult. And complex systems also take a lot of time to recover. There is no solution for complexity; the customer can only make sure that the value provided by the system outweighs the risk of complex failure.
History is littered with companies whose complex systems failed, usually accounting, ERP or supply chain management projects, and those companies then failed too and went out of business.
Programmers and infrastructure
Another, more amusing, outcome is that programmers working for fancy start-ups in San Francisco are having a brutal lesson that infrastructure matters and that careful design of the infrastructure needs to be a part of their design processes. A start-up that wants to rapidly “ship a product and iterate” usually takes the easy route, chooses to deploy within a single Amazon datacenter and “accepts” the risk that there might be a failure, if indeed they ever thought of it.
I imagine that the recruitment agencies around San Jose are very busy looking for infrastructure architects this week.
The simple fact of “redundant systems” means that services must be in two locations. The scaling challenge of replicating data to multiple locations must be addressed. This seems to be missing for a number of high-profile online businesses who are still having problems today. Sadly, they probably deserve what has happened.
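To put rough numbers on why a second location matters, here is a minimal sketch in Python. It assumes each location fails independently of the other, which is exactly the assumption that a systemic failure breaks, and the 99% availability figure is purely illustrative, not a number published by any provider. The function names are my own.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_location: float, locations: int) -> float:
    """Availability of a service that stays up while at least one location is up.

    Assumes locations fail independently of each other (an assumption,
    and an optimistic one if the locations share any infrastructure).
    """
    if not 0.0 <= per_location <= 1.0:
        raise ValueError("per_location must be between 0 and 1")
    if locations < 1:
        raise ValueError("locations must be at least 1")
    # The service is down only when every location is down at the same time.
    return 1.0 - (1.0 - per_location) ** locations


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2):
        a = combined_availability(0.99, n)  # 0.99 is an illustrative figure
        print(f"{n} location(s): {a:.4%} available, "
              f"about {expected_downtime_hours(a):.1f} hours down per year")
```

The arithmetic only holds if the two locations really are independent. If a failure in one can drag down the other, the second location buys far less than the formula suggests.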
No gloating – it means more for us
Others will be regretting the decision to save some dollars by not purchasing the extra capacity. It is notable that the other Cloud Providers haven’t been attacking Amazon, for two reasons. First, most Cloud Infrastructure is flaky and those companies know that their systems are not very stable today. It could easily be them tomorrow.
Second, as customers take stock of the damage, they are going to boost their redundancy planning. This means more systems, more storage, more network and wads more cash. Criticising their competitor is clearly bad for that business – customers might implement it themselves.
And for customers/end users? The price of using the Cloud has increased by about 150% because you can no longer ignore the HA features by claiming “it’s the cloud”. The hallowed halls of cloudy infallibility have been undeniably breached and security practice around business continuity has been found wanting.
Guarantees are worthless, audits are mandatory.
The first thing to note is that Amazon gave service guarantees that the EC2 instances within the single datacenter location were fully isolated, and provided undertakings that systemic failures could not spread within their data centre location. Clearly the zones were not isolated and their undertakings are worthless, now and into the future.
What this means is that compliance and checking must be a part of any future strategy if you wish to place critical services into the cloud. That is, if a Cloud Provider claims that their services are truly isolated, or highly available, they must be willing to withstand an external investigation, audit or challenge and prove that this is the case. The audit can be performed by a third party such as a certification board, but preferably by the customer themselves. If your Cloud provider claims there are secrets and will not allow audits, then move to another provider – it’s probably a smokescreen for bad infrastructure.
Remember that Amazon has refused to allow access for audits and I wonder if audits by qualified and experienced individuals would have exposed these problems.
The use of multiple data centre locations, or multiple cloud providers, will require additional investment in the relatively new field of “development operations” or DevOps. This role looks something like system administration but with much wider scope and expertise: the team programming the operations of the infrastructure must also be able to integrate network, storage, virtualization, firewall and load balancer into a fully functional operational platform. Consider system administrators writing bash scripts for Linux administration, but expanded to include all elements of the datacenter.
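As a rough illustration of what that wider scope looks like, here is a minimal Python sketch that treats a location as usable only when every layer of its infrastructure reports healthy, and picks the first usable location in priority order. Everything in it is hypothetical: the `Location` class, the layer names and the idea that some probe has already filled in the health flags are my assumptions, not any provider's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Location:
    """One datacenter location and the health of each layer inside it."""
    name: str
    # Layer name -> healthy flag, filled in by hypothetical external probes.
    checks: dict[str, bool] = field(default_factory=dict)

    def healthy(self) -> bool:
        # Usable only if at least one check exists and every layer passes.
        return bool(self.checks) and all(self.checks.values())


def pick_active(locations: list[Location]) -> Location | None:
    """Return the first fully healthy location in priority order, else None."""
    for location in locations:
        if location.healthy():
            return location
    return None


if __name__ == "__main__":
    primary = Location("primary", {"network": True, "storage": False,
                                   "firewall": True, "load_balancer": True})
    secondary = Location("secondary", {"network": True, "storage": True,
                                       "firewall": True, "load_balancer": True})
    active = pick_active([primary, secondary])
    print(active.name if active else "no healthy location")
```

The point of the sketch is the shape of the job: the decision depends on storage, network, firewall and load balancer together, not on whether a single server answers a ping.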
The EtherealMind View
You still need to understand your infrastructure and how it works to determine the impacts on your applications. Just because you put it into the cloud doesn’t make it “work”.
To understand Cloud as a piece of infrastructure means you need to be able to perform compliance and audit of the cloud infrastructure to verify how it works. Your cloud provider can’t keep secrets as Amazon has managed to do.
Building a Cloud System of any sort means building an overly complex system that needs enormous scale and capability. Inherently, it must support the widest possible array of requirements to be as profitable as possible and justify the capital deployed to bootstrap the first stage of development.
This breadth of scope remains the greatest weakness in Cloud Computing because it leads to systems that are inherently failure prone due to over-complexity. And complex systems have complex failures that do not recover gracefully. As we saw in the Amazon outages, the recovery process took much longer than anyone expected. Except those of us who have done it before and seen it happen.
Good luck to Amazon and the rest of the Cloud community. You are going to need it.
Postscript – Cloud Security Considerations
The Australian Government Defence Signals Directorate released a PDF last week titled Cloud Computing Security Considerations. It’s the most practical and useful guide to evaluating Cloud Security I’ve seen. Remember that “security” also means “business continuity” and protecting the business from operational failure.
It’s blunt, functional, and brief. I highly recommend it to you.