Network Faults in Practice
We have been building computer networks for decades, so one might hope that by now we would have figured out how to make them reliable. However, it seems that we have not yet succeeded.
There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company. One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
Public cloud services such as EC2 are notorious for having frequent transient network glitches, whereas well-managed private datacenter networks can be more stable environments. Nevertheless, nobody is immune from network problems: for example, a problem during a software upgrade for a switch could trigger a network topology reconfiguration, during which network packets could be delayed for more than a minute. Sharks might bite undersea cables and damage them. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.
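To make the one-directional fault concrete, here is a minimal Python sketch that runs entirely on the local machine. The function names probe and echo_once are hypothetical and chosen only for illustration, and a peer that never replies stands in for an interface that drops inbound packets. The send completes without error in both cases, so only a timeout on the reply reveals that something is wrong.

```python
import socket
import threading


def probe(peer, timeout=0.5):
    """Send one UDP datagram to peer and wait up to `timeout` seconds for an echo.

    Hypothetical helper for illustration. A successful sendto() only means the
    local network stack accepted the datagram, not that anything came back.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(b"ping", peer)  # outbound direction: no error is raised here
        try:
            data, _ = s.recvfrom(1024)  # inbound direction: may never arrive
        except socket.timeout:
            return "sent-but-no-reply"
        return "round-trip-ok" if data == b"ping" else "unexpected-reply"


def echo_once(sock):
    """Reply to a single datagram with the same payload (a healthy peer)."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


if __name__ == "__main__":
    # A peer that receives our datagram but never answers. From the sender's
    # side this looks the same as a link that only works in one direction.
    silent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    silent.bind(("127.0.0.1", 0))
    print(probe(silent.getsockname(), timeout=0.2))
    silent.close()

    # A healthy peer that echoes the datagram back.
    healthy = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    healthy.bind(("127.0.0.1", 0))
    worker = threading.Thread(target=echo_once, args=(healthy,))
    worker.start()
    print(probe(healthy.getsockname(), timeout=2.0))
    worker.join()
    healthy.close()
```

Note that the timeout alone cannot tell the sender which direction failed: the request may have been lost, the peer may be down, or only the reply may have been dropped.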