Scripting Does Not Scale For Network Automation

There is a lot of interest in using scripting for automation and, for a few scripts or tasks, you can get a lot done for not much effort. My experiences with scripting have left me bitter and jaded. Here is why.

Nearly all current scripting is done by screen scraping with Expect. That is fine until the vendor recompiles the OS and adds a single whitespace character in the middle of the command output, which breaks your oh-so-clever regex in Python or Perl or Ruby (or whatever is in fashion this week).
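
To make that failure concrete, here is a minimal sketch in Python. The two output lines and the function name `parse_interface` are made up for illustration and are not taken from any particular vendor; the only difference between the lines is one extra space.

```python
import re

# Hypothetical CLI output; the second line gained one extra space after an OS rebuild.
OLD_OUTPUT = "GigabitEthernet0/1 is up, line protocol is up"
NEW_OUTPUT = "GigabitEthernet0/1 is up,  line protocol is up"

# Brittle: hard-codes exactly one space between every token.
BRITTLE = re.compile(r"^(\S+) is (\w+), line protocol is (\w+)$")

# More tolerant: accepts any run of whitespace between tokens.
TOLERANT = re.compile(r"^(\S+)\s+is\s+(\w+),\s+line\s+protocol\s+is\s+(\w+)\s*$")


def parse_interface(line, pattern):
    """Return (name, admin_state, protocol_state), or None if the line does not match."""
    match = pattern.match(line)
    return match.groups() if match else None
```

The tolerant pattern survives this particular change, but it is still screen scraping: a renamed keyword or a reordered field would break it just the same.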

So you change to using an API – XML, JSON, REST, SNMP … (whatever the current fashion is this week). You rewrite your script to use a better data source and start performing error checking on the data. Which is fine until the API version changes, or new data is added, but that’s OK because you know it will happen eventually. You probably learned the hard way to do better data validation.
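
One way to do that validation is sketched below, under stated assumptions: the field names in `REQUIRED_FIELDS` and the JSON shape are hypothetical, not any vendor's real schema. The idea is to check only the fields the script depends on, fail loudly when one is missing or has the wrong type, and ignore anything new the API starts returning.

```python
import json

# Hypothetical fields this script relies on, with the exact types it expects.
REQUIRED_FIELDS = {"name": str, "admin_up": bool, "mtu": int}


def validate_interface(payload):
    """Return only the validated fields we rely on; extra fields are ignored."""
    record = {}
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        value = payload[field]
        # Exact type check, so that True is not silently accepted as an int.
        if type(value) is not expected_type:
            raise ValueError(f"{field} should be {expected_type.__name__}")
        record[field] = value
    return record


def parse_response(raw):
    """Decode a JSON response body and validate it."""
    return validate_interface(json.loads(raw))
```

A field added in a later API version passes straight through this check, while a removed or retyped field raises an error at the point of parsing rather than somewhere deep inside the script.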

There is some progress with Ansible or one of the other frameworks that are being adapted to networking but the coverage is still limited and it’s still screen scraping.

So you write a few scripts, then a few more. You have some problems with exhausting the device memory, hitting a memory leak in the OS or spiking the CPU during the script run. You keep adding validation and data sanity checks every time you find a problem. Then one day you realise that the cost of script maintenance is out of control. Or you are the only fool writing them. Or you have a network outage because one of the 20 or 30 automation scripts you set up creates a race condition where one undoes the action of another, or creates a failure condition that you didn’t see coming.

And that is when you realise that you have reinvented the wheel. Except it only has five sides and the hole isn’t quite in the middle. When you really look at it, you realise it is a crappy wheel.

I’ve seen, participated in, or used about seven comprehensive scripting “automation” platforms developed and deployed over the last two decades, each with thousands of man hours of development and testing. None of them has survived. They all died when the smart guy who knew how to bridge between programming and networking got tired and quit the company. Or the constant failures gave management the hump and they canned the project. Whatever, the result was always the same.

Here is what I’ve learned: Scripting doesn’t scale.

But what I see in SDN is a solution that can scale way beyond scripting, because much of the scut work, the tedious part of scripting such as parsing data streams, identifying OS versions, or validating data, is all handled by SDN constructs.

I’m not investing too much effort in making scripts because I know it is pointless in the end. It’s all good fun, no one is getting hurt and probably those scripts will be useful to the companies who are paying for you to deliver them. But I’m quite sure that within 2 years, those scripts will be dead.

Because they don’t scale.

  • http://www.vtesseract.com Josh A

    Totally get where you’re coming from on this post. As a guy who’s built a career around scripting and automation I agree it can be very frustrating and challenging to maintain code through product life cycles. However, I’ve also found that the quality and extent of my output remains higher than when I do my work without scripts. Here are three reasons why I think people need to continue to script, and WILL script even with the approaching Software-Defined age.

    1. SDN isn’t fully realized yet and, as you say, its time to value is probably 2 years away. That’s 2 years where the need for automation still exists and the benefits of dealing with the issues you described still outweigh the efforts.

    2. SDN is going to provide a considerable amount of intelligent functionality. However, that functionality is still going to require input from architects/engineers/admins to put the rails around its capabilities. As such, scripting is a great mechanism for identifying the needs and those rails. Work in scripting today increases the likelihood that organizations will feel less overwhelmed as they make the transition to software defined. I think that response will vary from environment to environment but in the end value will still persist.

    3. Pretty sure that scripting will still be required in the Software Defined age. The use cases around specific reporting, batch changes across multiple policies, batch adding or removing of various elements will likely still persist. As you say, just another language/framework to work with.

    Final Note: Always script in test/dev before production :-)

  • http://lamejournal.com/ jgherbert

    So, yes – ultimately scripting does not scale; and if there’s a product out there that can do what you want, then you have the option to use that rather than develop in-house. In many ways this should be obvious – if you’re not a software development house (or at least willing to support a dedicated devops team), don’t try to develop software. Where scripting doesn’t scale is when one person’s little hobby becomes mission-critical and the company jumps on it because it’s “free” (which it really stops being, the moment it’s essential to the business).

    That said, with SDN I’d posit that those off-the-shelf products are not really there yet with the breadth of support and functionality that can make them a true Swiss Army knife of network automation tools. I think we’re going to see a lot of hybrid tools that make use of what functionality is out there from others, front-ended by, you guessed it, scripts. Or are they playbooks? Or templates? OpenStack Neutron is probably a good example of hybrid functionality – somebody creates the configuration snippets for NETCONF with spaces for the bits you need (i.e. they do the fiddly part), and offers a transport over which to do this stuff, but you ultimately control what gets done. Does that scale? Well maybe there’s a need for a layer of abstraction above that too. Gah.

    Despite the hopes of avoiding scripting, at what point does the smart stuff get automated, or do we always need a brain to take control of the overall sequence? By the way, I believe quite strongly that once we have intelligent enough automated tools that can make most ongoing decisions autonomously and correctly, that’s when SDN will be ripe to be embraced by a wider audience.

  • Ivan Pepelnjak

    After you’ve seen all the network management products and their failure to meet their promises, I’m amazed that you still believe in the tooth fairy … oops, a well-written SDN controller that will do what you expect it to do.

    There’s nothing fundamentally new or different in the currently hip brand of IT washing powder, so don’t expect miracles.

    As long as every network remains a unique snowflake, we can’t expect network automation to perform better than a botched SAP deployment with life-long on-site consultants.

    On the other hand, people who approach network automation with proper discipline, design, abstraction layers, software development tools and skills … do get things off the ground.

    • http://etherealmind.com Etherealmind

      I guess our combined cynicism approaches some sort of convergence. My view is that scripting is a good tool but not a great one. Scripting frameworks like Django and Ansible improve what scripting can do and its reliability but that still doesn’t scale well.

      We need APIs for device consistency, frameworks for validation and common actions. But above that we need platforms that solve big problems – scripting can only solve little problems.

  • David Barroso

    If by scripting you mean several scripts running independently without control you are right, but you are doing it wrong. It is very easy to implement a simple job scheduler to avoid race conditions or jobs undoing some other job’s work. Most of the problems you mention are easily solvable or they are just business as usual (you haven’t had to kill/restart a process/service in production because suddenly it was consuming all your RAM or CPU resources, have you? ;) )
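
A minimal sketch of one simple form of that control, written in Python: a per-device lock, so that two jobs never touch the same device at the same time. The class name `DeviceJobRunner` and its interface are hypothetical, and a real scheduler would also need queuing, ordering between dependent jobs, and failure handling, none of which is shown here.

```python
import threading
from collections import defaultdict


class DeviceJobRunner:
    """Run jobs so that only one job touches a given device at a time."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)  # one lock per device name
        self._guard = threading.Lock()  # protects the lock table itself

    def run(self, device, job):
        """Call job(device) while holding that device's lock."""
        with self._guard:
            lock = self._locks[device]
        with lock:
            return job(device)
```

Jobs for different devices still run in parallel; only jobs aimed at the same device are serialised. This removes overlapping changes, but it does not stop one job from later undoing what an earlier job did.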

    The real problem with the network is that we have to configure it as if we were coding in assembler. One “network instruction” at a time and, obviously, executed in real time (stupid, isn’t it?). If you move away from that concept and you tell your devices “I do not care about your current state, I want you to look like this”, then all your problems are solved. Have you seen any sysadmin caring about the state of their servers when running ansible, puppet or chef? They simply do not care, they care about where they want to go. That is why they succeeded where most network operators failed, because network operators still treat network devices as pets instead of cattle.
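
The “I want you to look like this” idea can be sketched in a few lines of Python. This is an illustration only: the function `plan_changes` and the flat dict of settings are assumptions, not how ansible, puppet or chef model state internally.

```python
def plan_changes(current, desired):
    """Return (to_set, to_remove) needed to turn `current` into `desired`.

    Both arguments are dicts mapping a setting name to its value.
    """
    # Settings that are missing or hold a different value must be (re)applied.
    to_set = {
        key: value
        for key, value in desired.items()
        if key not in current or current[key] != value
    }
    # Settings present on the device but absent from the desired state go away.
    to_remove = sorted(key for key in current if key not in desired)
    return to_set, to_remove
```

The operator only writes `desired`. The tool reads `current` from the device and works out the steps, and running it again against a device that already matches produces an empty plan.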

    I am not sure what you call SDN in this context but my guess is that it will be nothing other than some solution implementing what I just described above.

  • Dee

    Scripting never scales; however, automation IS required. I believe once we have a common framework that all vendors support, which should be a web-services framework, life will be much easier. I have a customer who uses TCL for automation and it is a pain. Same for Python. Mainly because you require a TCL/Python expert when new code releases modify the syntax (as you state). All of this stated from a Computer Science guy. Exposing a web-services API and not scripting based upon vendor CLI is the key. We need a common framework to reference.

  • Rob

    I am not sure one can blame the CPU or RAM usage issues on the idea of scripting. Programming practices in all languages provide the same ill opportunities, and many services/programs written in every programming language possible have had such issues – and yes – even Java! Also – I think a lot of very large scale test automation frameworks have been written and are successful today, many many years after they were written.

    • http://etherealmind.com Greg Ferro

      I was referring to CPU/RAM in the device, not the server. I’m not sure that I made that clear. And test frameworks are not able to accurately test devices.

  • http://etherealmind.com Etherealmind

    Let me pick two scaling problems with scripts that are at the top of my mind that I didn’t mention here.

    1. Scripting is not easily passed between owner/programmer/authors. For most companies, when the person who wrote the script leaves the company then the script function will fail and revert to dumb CLI configuration. This isn’t scalable in practice.
    2. I agree that scripting to thousands of devices works fine but would say that scripting for 100 different devices, each with a different operating system, is not scalable. Scale happens in many dimensions and device count is just one aspect.

    Otherwise you make good points that scripting is a good automation tool and I agree with your view. My point is to highlight the limitations of scripting and when to consider a framework or a platform.

    • Brandon Bennett

      1. That is a management/company problem caused by not putting any emphasis or focus on scripting and documentation. The same issue can happen with any project; it’s just hard to sell most companies on scripting so it becomes a pet project instead of a supportable part of the network.

      It is solved the same way you solve larger and more complex network infrastructures. You document and you train. But I agree that you cannot have these side projects run by one or two members of the team without support from the rest of the company.

      2. Manually configuring 100 devices, each with a different operating system, or even trying to use SDN on them, doesn’t scale either. I don’t have much faith in some magical framework that will unify this. We can only hope for YANG at this point (or a new challenger) but this takes value away from vendors so they will fight kicking and screaming.

    • queridiculo

      It seems to me that you’re describing issues that neither have to do with the viability of scripting for the purposes of automation, nor with failure of scripting to scale.

      Like any project, the success of the implementation hinges on how much buy in you have to get the resources to complete it, and whether you employ a development model that goes beyond, oh gee, how can I make this job easier right now.

      You raise a number of valid points about how scripting can blow up in your face, but none that cannot be solved by things as simple as documentation, dedicating proper development resources and prototyping/testing.

      What I personally find really surprising is that somebody with as much experience in the industry as you is championing SDN as the holy grail.

      When on earth did that ever work out?

      I’ll take my “scripts”, and find a way to make them “scale”, because I know better than to rely on somebody else to solve my problems for me.

      • http://etherealmind.com Etherealmind

        I’m sure you can solve your problems, but after you leave the company (you are planning on leaving, if not you should be) then how well will those scripts scale without you there to maintain them?

        Lots of ways to define “scale”

  • http://etherealmind.com Etherealmind

    When _I_ am talking about SDN I’m normally considering how to orchestrate network configuration and operation in co-ordination with server and storage automation.

    Networking doesn’t live in a bubble anymore, it’s all integrated.

  • Jeremy Schulman

    This is a great discussion, thank you Greg for the blog. I’ve spent time with the DevOps community, and they too went through a period of scripting chaos that my friend John Willis (@botchagalupe) talks about as “Bob’s scripts”. A big pivot point for DevOps (before it was called that) was when tools like Puppet and Chef came on the scene. These frameworks not only gave admins a tool to use (a choice vs. writing their own), but it also helped them to re-think the approach to managing infrastructure. Networking professionals need the same caliber of tools; but tools that “fit the brain” of NetOps. I passionately believe we need these tools. I’ve resigned my most awesome job at Juniper so I could start Schprokits. http://www.schprokits.com/going-all-in/

  • Peter Silva

    Scripting doesn’t scale because every vendor of random network boxes makes their own GUI or CLI interfaces. If the whitebox networking trend catches on, and people start using Linux CLI for their switches, suddenly networking mostly boils down to configuring the ip and iptables commands in Linux. Linux has always been about automation, and having the software (and its interface) de-coupled from vendor whim. Once the underlying environment being scripted becomes much more stable, flexible, and standard, things will be able to scale.

  • Stefan