Can you imagine focussing so much on the performance of a single application, and spending so much time and resources on it, that you eventually conclude the problem is happening inside the switch?
Well, clearly Facebook has that luxury. In this article they talk about what drives their open networking strategy:
A few months ago, we were seeing some issues with Memcached in our environment, with transactions taking longer and there being a lot of retransmissions. So we were trying to debug it to find out what the heck was going on. And we just couldn’t find it. And this went on for a couple of weeks. And then our switch vendor and one of its developers came out to help us troubleshoot. And the developer said, “Wait, hang on, let me log into the ASIC.” This is a custom ASIC. There was a hidden command, and he could see that the ASIC was dropping packets. And we had just wasted three weeks looking for where the packets were going. They had a secret command, and the developer knew it, but the support staff didn’t and it wasn’t documented.
You have enough pull with the switch vendor to go through the escalation process, then demand that they do more, and more, and more until they send a developer out to your site to help diagnose the fault? Fantastic. Vendors don’t do that for just anyone, you know. Shame you chose a vendor with bad products. Oh wait, this is common to all vendors in networking.
We figured this out at 5:30 in the evening, and we had to log into every damned ASIC on hundreds of boxes – and most of them have multiple ASICs per box – and you run this command, it throws out text, and you screen-scrape it, you get the relevant piece of data out, and then you push it into our automated systems, and the next morning there were alerts everywhere. We had packet loss everywhere. And we had no clue.
Fair enough. We’ve all experienced this, though in my case mostly because there are so many systems in my data centre that I’m just not expert at most of them. Bully for you for having just a couple of applications to look after. Oh, and enough people to write code to do the screen scraping, roll-up and reporting. A nice place to be.
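For the curious, the screen scraping and roll-up is not much code once you know the command exists. Here is a minimal sketch in Python. To be clear, the counter line format (`rx_drop: 12`) and the names `parse_drop_counters` and `roll_up` are my own assumptions for illustration; the article doesn't show the real ASIC output, and the command itself was undocumented.

```python
import re

# Assumed line format such as "rx_drop: 12" or "tx_drop = 3".
# The real ASIC output is not shown in the article, so this is hypothetical.
DROP_LINE = re.compile(
    r"^\s*(?P<counter>[A-Za-z_]*drop[A-Za-z_]*)\s*[:=]\s*(?P<value>\d+)\s*$",
    re.IGNORECASE,
)


def parse_drop_counters(text):
    """Pull every drop counter out of one blob of scraped command output."""
    counters = {}
    for line in text.splitlines():
        match = DROP_LINE.match(line)
        if match:
            counters[match.group("counter")] = int(match.group("value"))
    return counters


def roll_up(outputs):
    """Summarise drops per ASIC.

    outputs maps (switch, asic) to the raw text scraped from that ASIC.
    Returns (switch, asic, total_drops) tuples for ASICs with any drops,
    worst offender first.
    """
    report = []
    for (switch, asic), text in outputs.items():
        total = sum(parse_drop_counters(text).values())
        if total > 0:
            report.append((switch, asic, total))
    return sorted(report, key=lambda row: row[2], reverse=True)
```

Feed `roll_up` the text collected from every ASIC and you get the list that presumably lit up their alerting the next morning. The hard part was never the parsing; it was finding out that the command existed.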
And you just shake your head and ask, “How did I get here?” This is not going to work, this is not going to scale.
You are right. Two network engineers for a thousand devices doesn’t scale. Oh, wait, Facebook. Unlimited cash, unlimited space, unlimited resources. Sorry, forgot myself.
This is the kind of thing I want to get rid of. We want complete access to what is going on, and I don’t want to fix things this way. We want to run agents on the boxes that are doing a lot of health checking, aggregate their data, and send it off to an alert management system. SNMP is dead.
What I think you really want is for vendors to ship products that are reliable and trustworthy. Just the same as I do. Doing it yourself will scale for a while, but real scale will be achieved when vendors deliver on reliability.
That’s what we all really want.
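Until then, the agent idea Facebook describes (health checks on the box, aggregated and sent to an alert system) could look roughly like the sketch below. Everything here is an assumption on my part: the `Sample` record, the `deltas` and `build_alerts` functions and the JSON alert shape are invented for illustration, not anything Facebook has published.

```python
import json
from dataclasses import dataclass


@dataclass
class Sample:
    """One poll of one cumulative counter on one device (hypothetical shape)."""
    device: str
    counter: str
    value: int


def deltas(previous, current):
    """Return {(device, counter): increase} between two polls."""
    prev = {(s.device, s.counter): s.value for s in previous}
    out = {}
    for s in current:
        key = (s.device, s.counter)
        if key in prev:
            # A counter that went backwards was probably reset; report no growth.
            out[key] = max(s.value - prev[key], 0)
    return out


def build_alerts(previous, current, threshold=0):
    """Build one JSON alert per counter that grew by more than threshold."""
    alerts = []
    for (device, counter), increase in sorted(deltas(previous, current).items()):
        if increase > threshold:
            alerts.append(json.dumps(
                {"device": device, "counter": counter, "increase": increase},
                sort_keys=True,
            ))
    return alerts
```

An agent on each box would poll its counters on a timer, call `build_alerts` with the last two polls, and ship the resulting JSON to whatever alert management system is in use.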