What Sort of Scaling Issues Did You Face?
The NASDAQ OMX Corporate Solutions platform has extremely volatile usage patterns. At the end of the financial year, thousands of companies may report at almost the same time; delivering those events requires at least double that number of encoders, along with several management and reporting servers. This level of spikiness is an outlier even for a provider the size of Amazon EC2, so we had to negotiate specific permission to place such large demands on its infrastructure at short notice.
One crucial capability was moving encoders from one availability zone to another when the initial target zone lacked capacity. Our technology completely abstracts NASDAQ OMX Corporate Solutions' working processes from all that underlying complexity.
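The zone fallback described above can be illustrated with a minimal sketch. It is not the platform's actual code: the `launch` callable, the `InsufficientCapacity` exception, and the zone names are all hypothetical stand-ins for whatever provisioning layer is in use.

```python
from typing import Callable, List, Sequence, Tuple


class InsufficientCapacity(Exception):
    """Raised by the (hypothetical) launcher when a zone cannot satisfy a request."""


def launch_with_fallback(
    zones: Sequence[str],
    count: int,
    launch: Callable[[str, int], List[str]],
) -> Tuple[str, List[str]]:
    """Try each zone in order of preference; return the first zone that has capacity."""
    for zone in zones:
        try:
            return zone, launch(zone, count)
        except InsufficientCapacity:
            continue  # this zone is full, so try the next one
    raise InsufficientCapacity(f"no capacity for {count} encoders in {list(zones)}")


def fake_launch(zone: str, count: int) -> List[str]:
    """Stand-in launcher with made-up capacities, used only for this illustration."""
    capacity = {"zone-a": 10, "zone-b": 500}
    if count > capacity.get(zone, 0):
        raise InsufficientCapacity(zone)
    return [f"{zone}-encoder-{i}" for i in range(count)]


zone, encoders = launch_with_fallback(["zone-a", "zone-b"], 200, fake_launch)
print(zone, len(encoders))  # zone-b 200
```

The caller only sees which zone was finally used, which is the kind of abstraction that keeps the capacity problem out of the customer's working processes.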
How about SLA?
Amazon EC2 offers at least 99.95% availability (aws.amazon.com/ec2-sla). This means the entire service may be unavailable for up to 4.38 hours a year. Our application always runs in at least two regions. Broadly speaking, and assuming the regions fail independently, this means we end up with an overall service-level agreement (SLA) for our application of 100% - (0.05% x 0.05%) = 99.999975%. The key to maintaining this availability is the autonomy of the different regions and the applications. The chances of something going wrong on a server in a public cloud data center are hardly different from the odds of something going wrong on a machine you own and host in your own location. In the case of an IaaS public cloud, however, you have instant access to many thousands of other resources to use in its place, and you can, and should, be using multiple systems for redundancy all the time.
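The availability arithmetic in this answer can be checked with a short sketch. It treats 0.05% as the fraction 0.0005 and assumes the regions fail independently, with one healthy region being enough to keep the service up; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down at the given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, regions: int) -> float:
    """Availability when any one of N independent regions is enough to serve."""
    return 1.0 - (1.0 - availability) ** regions


single = 0.9995  # the 99.95% figure from the EC2 SLA
print(f"{annual_downtime_hours(single):.2f} hours/year")   # 4.38 hours/year
print(f"{combined_availability(single, 2) * 100:.6f}%")    # 99.999975%
```

The independence assumption matters: failures that affect more than one region at once would lower the real figure.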
I have written before about people who claimed Amazon EC2 was not reliable after the well-publicized Reddit and Netflix outages. Amazon EC2 is usually operating well within its SLA; the issue was that Reddit and Netflix had not designed their applications to tolerate outages or failures. In contrast, the platform we delivered to NASDAQ OMX Corporate Solutions automatically operates hundreds of servers in multiple regions of Amazon EC2, and we only knew that the previously mentioned outages had had any effect on our delivery by inspecting our logs. Our applications simply fail over between machines (within a single frame of video or audio), the downstream origination, and the upstream CDNs, so the clients would have been unaware.