Microsoft released a research paper that describes how it emulates its network to test and validate network changes in order to meet SLAs:
CrystalNet is a high-fidelity, cloud-scale network emulator in daily use at Microsoft. We built CrystalNet to help our engineers in their quest to improve the overall reliability of our networking infrastructure. A reliable and performant networking fabric is critical to meet the availability SLAs we promise to our customers.
Some notes from reviewing the paper:
- Tools that ingest configurations, routing databases and state do not account for faulty device software. Specifically refers to Batfish.
  * Not able to test for multi-vendor interoperability issues.
  * “In our network nearly 36% of the problems are caused by such software errors”
  * “not suitable for preventing human errors, which are responsible for a non-negligible 6% of the outages in our network.”
Microsoft Azure analysis of network incidents root cause
Sophisticated failure modes that are hard to detect:
For instance, a software load balancer owned a /16 IP prefix. However, it was asked to release some IP blocks in the prefix and give them to other load balancers. It then broke the /16 IP prefixes into 256 × /24 IP blocks and announced the blocks (about 100) that it held. However, a router connected to the load balancer was short of FIB space and dropped many of these announcements, causing traffic black holes.
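The arithmetic behind that failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `10.1.0.0/16` prefix, the figure of 40 free FIB slots and the helper names are hypothetical, and the "drop whatever does not fit" behaviour is a simplification of what a real router does when it runs short of FIB space.

```python
import ipaddress


def split_prefix(prefix, new_prefixlen):
    """Split an IP prefix into equal-sized, more specific blocks."""
    return list(ipaddress.ip_network(prefix).subnets(new_prefix=new_prefixlen))


def install_routes(announcements, free_fib_slots):
    """Toy model (assumption): the router installs announcements until its
    free FIB slots run out and silently drops the rest."""
    installed = announcements[:free_fib_slots]
    dropped = announcements[free_fib_slots:]
    return installed, dropped


# Hypothetical /16 owned by the load balancer, broken into /24 blocks.
blocks = split_prefix("10.1.0.0/16", 24)

# The load balancer keeps (and announces) about 100 of those blocks.
held = blocks[:100]

# Hypothetical router with room for only 40 more FIB entries.
installed, dropped = install_routes(held, free_fib_slots=40)

print(f"/24 blocks in the /16: {len(blocks)}")
print(f"announced: {len(held)}, installed: {len(installed)}, dropped: {len(dropped)}")
```

One /16 route has become roughly 100 more specific routes, and in this toy model every announcement that does not fit is a destination the router can no longer reach through the load balancer. A tool that only analyses configurations would not see this, because nothing in the configuration is wrong.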
Human errors are due to lack of practice:
Human errors surprisingly cause a non-negligible portion (6%) of the incidents. One might argue that this is due to carelessness and cannot be remedied. However, after conversations with experienced operators, we found a more important systematic reason is that operators do not have a good environment to test their plans and practice their operations with actual device command interfaces.
The EtherealMind View
This paper makes an excellent reference for Enterprise IT architects/designers on the value of network emulation. The data here could be used to build a case for network testing through software, and for developing test beds and emulation to reduce the human errors that come from lack of practice.
It's also the reason why I'm so critical of Juniper & Cisco for charging high prices for their emulation platforms and putting them out of reach of Enterprise budgets. Because vendors deliver such poor quality products, support for customer testing/validation should be free of purchase cost so that our own testing will improve the value of vendor products.
References
CrystalNet: Faithfully Emulating Large Production Networks – https://www.microsoft.com/en-us/research/wp-content/uploads/2017/10/p599-liu.pdf – Retrieved 16 Nov, 2017
NetDevOpEd: The power of network verification – Cumulus Networks Blog – Retrieved 16 Nov, 2017