Golf and the DevOps 3RA

Last week, while on vacation in San Diego, I took a golf lesson from a veteran golf pro, Bob Madsen. For the record, I am not a good golfer. While I have swung a club at a white ball for many years, I have failed to improve during that time because I have never dedicated myself to improvement (I have only taken a few lessons, and I don’t invest time in practicing regularly). In short, I am a hacker. I know how to golf, but I suck at executing the golf process.

During my lesson, Bob and I were talking about improvement in golf, and what he described sounded an awful lot like how I think about improvement in delivering software. That leads me to believe that improving in DevOps and agility is not markedly different from improving in any other skill, like golf. It begins with recognizing the barriers that are preventing or blocking improvement.

Barriers to Improvement

During my golf lesson, Bob described the typical barriers golfers hit as they attempt to improve and reduce their score from the 100s all the way down to the low 70s (also known as ‘scratch’). He drew a diagram that looked something like this.

[Diagram: Barriers]

With this diagram Bob described the barriers a golfer faces as they attempt to lower their score from the 100s to the 90s to the 80s to the 70s. The challenges at each barrier are different, and they get more difficult to overcome at each stage (we spent our time talking about the 100s-to-90s barrier, if that tells you anything). Each barrier requires both different skills and overall improvement of existing skills to overcome it. In other words, “what got you here won’t get you there.” Interestingly, regardless of the level of skill, golfers use basically the same tools throughout, although they learn how to use them better (foreshadowing of another analogy…perhaps).

3RA – DevOps Improvement

I started aligning this with how I think about software delivery and DevOps improvement. Like golfers trying to improve their game, organizations face different barriers at each stage as they attempt to improve in delivering software. I assert that there are four stages on a continuum of improvement that organizations spend time in as they build the skills to overcome the next barrier to their improvement. With that said, I don’t think an organization is in one stage one day, and then clicks over into the next stage. I think it is a slow penetration through the barrier into the next stage. Like my golf game, I won’t suddenly have consistent scores in the next lower score bracket (or stage) from one day to the next. Instead, I will see some success and some failure, hopefully improving the frequency of success until I am consistently performing in the next better stage.

Any skills improvement (whether it is my golf game or an organization’s approach to delivering software) advances progressively through a series of stages as the person or organization overcomes the barriers blocking them from improving. I call these the 3RA stages (pronounced ‘Era’): Reactionary, Repeatable, Reliable, and Aspirational.

[Diagram: Barriers to DevOps improvement]


Reactionary

As the name implies, this state is demonstrated by a reactionary approach to the skill. The behaviors of the individual or organization are typically ad hoc and success is achieved mostly through luck. This state is often correlated with significant inter-team conflict and finger-pointing (or in the case of golf, lots of swearing and club throwing). For the record, this is where I am at in my golf skills. Like I said, I have been playing golf for years, but time alone doesn’t yield improvement. In fact, if you believe in the adage “Practice Makes Permanent” then the time I spent doing it the wrong way only makes improvement more difficult. In my case, I have never put in the time and effort to improve my skills enough to achieve anything that would resemble success. I simply keep doing the same dumb stuff I have always done and wonder why I’m not getting better. I have never scored below 100 because I continue to make the same mistakes in how I execute everything from the golf grip and swing to how I think about (or fail to think about) course management. In my case, I have the knowledge (I can describe the grip and swing), but I haven’t learned to successfully execute what I know repeatedly. I have met many organizations that deliver software like this—they may know what successful software delivery looks like, but they don’t know how to execute it and don’t put in the work to improve.

Repeatable

By deciding to focus on improvement through training and regular, consistent practice, we can slowly increase the frequency of success and achieve a state of repeatability. For most this is a culture change that requires support at all levels (I better get my wife to support me in spending more time on golf). In this state individuals or organizations start to see some success based on the application of the skills they are developing, and not just as the result of good fortune. They aren’t perfect yet, but at least they can perform the same thing over again, such as swinging the golf club correctly most of the time, releasing software repeatedly without having to invent the release process each time, or implementing a change management process so that changes are handled the same way each time they arise. While the individual or organization can repeat these behaviors based on the skills they have developed, they fail to repeat them with the necessary regularity and still have more failures and fire drills than they would like. In the golf analogy this is my next goal: bogey golf. I am not trying to become a scratch golfer next—that would be an unreasonable expectation—I just want to lower my typical score into the 90s (my goal for this summer is to get a score in the 90s). In my case, I have begun the culture change (I have stakeholder agreement; now I have to commit to investing the time), I have the tools (TaylorMade clubs and balls, Adidas shoes, Nike clothes, etc.), and I have the core knowledge (I know how to swing the club). Now I must build my skill in execution through regular and consistent practice.

Reliable

As individuals or organizations continue to improve, they next achieve a state of consistency where they are beginning to master the repeatable behaviors they have learned and are executing them with the needed regularity. In other words, they are becoming reliable in their execution of their skills. Whether and when I ever achieve this state in golf will depend on how much dedication I put into building my skill. In this state I would swing the club and hit the ball reliably, making few mistakes in the execution (although occasional mistakes should still be tolerated), and would focus more and more on course management and good decision making to minimize risk. I would expect to par at least ten of 18 holes and bogey the rest. In other words, I would expect success most of the time and minor issues some of the time. For organizations that achieve this state, the frequency of issues has decreased and the velocity at which they are able to deliver software is increasing. They are getting better at using data in their decision making and are therefore delivering products that their users are more satisfied with. While they are consistent, there is still room to improve and to deliver software faster and more frequently to improve their business results.

Aspirational

The Aspirational state represents the ideal—a scratch golfer, maybe even a touring professional. This is a state that most of us mere mortals will likely never achieve, but some do, and they can show us how it is done. Those few who get to this state make it look easy because they have invested the time and effort to build their skills to such a level of reliability that success is nearly a given (although their definition of success has likely evolved into something that is difficult for us normal folk to understand). Regardless of whether we are talking about golf or delivering software, the Aspirational state is one that we may continue to strive for knowing we may never fully achieve it (I doubt I will ever be a touring pro golfer, but I am sure that I will always want to improve). For organizations, the Aspirational state is one where true transparency and collaboration enable them to deliver software to production as often as they want, including multiple times per day if they choose. The Aspirational state is one that only a few organizations will fully achieve (some may not even desire to achieve it), but it will remain the ideal that is referenced when discussing the value of DevOps.

Great, So What’s Next?

I am only scratching the surface here. In any effort to improve skills, whether in golf or in software delivery, it is important to gauge where you are so you can identify what is next. Just like I shouldn’t focus on trying to get par on every (or any) hole, you shouldn’t try to deliver software faster than is reasonable without building up to it. Knowing where you are starting from is important.

There is a lot under the surface of the 3RA Framework that I will share with you in some upcoming posts. For now, take the time to internalize the four stages—Reactionary, Repeatable, Reliable, and Aspirational. Understanding the differences I have described is important in learning how to assess where you are and what you should focus on next.

Knightmare: A DevOps Cautionary Tale

I was speaking at a conference last year on the topics of DevOps, Configuration as Code, and Continuous Delivery, and I used the following story to demonstrate the importance of making deployments fully automated and repeatable as part of a DevOps/Continuous Delivery initiative. Since that conference I have been asked by several people to share the story through my blog. This story is true – this really happened. This is my telling of the story based on what I have read (I was not involved in this).

This is the story of how a company with nearly $400 million in assets went bankrupt in 45 minutes because of a failed deployment.

Background

Knight Capital Group is an American global financial services firm engaged in market making, electronic execution, and institutional sales and trading. In 2012 Knight was the largest trader in US equities, with a market share of around 17% on both the NYSE and the NASDAQ. Knight’s Electronic Trading Group (ETG) managed an average daily trading volume of more than 3.3 billion trades, trading over 21 billion dollars…daily. That’s no joke!

On July 31, 2012 Knight had approximately $365 million in cash and equivalents.

The NYSE was planning to launch a new Retail Liquidity Program (a program meant to provide improved pricing to retail investors through retail brokers, like Knight) on August 1, 2012. In preparation for this event, Knight updated its automated, high-speed, algorithmic router, known as SMARS, which sends orders into the market for execution. One of the core functions of SMARS is to receive orders from other components of Knight’s trading platform (“parent” orders) and then send one or more “child” orders out for execution. In other words, SMARS would receive large orders from the trading platform and break them up into multiple smaller orders in order to find a buyer/seller match for the volume of shares. The larger the parent order, the more child orders would be generated.
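
To make the parent/child relationship concrete, here is a minimal, hypothetical sketch in Python (not Knight’s actual SMARS code) of how a router might break a large parent order into smaller child orders; the Order type, sizes, and symbol are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Order:
    symbol: str
    quantity: int

def split_parent(parent: Order, max_child_size: int = 100):
    """Yield child orders until the parent's full quantity is covered."""
    remaining = parent.quantity
    while remaining > 0:
        size = min(max_child_size, remaining)
        yield Order(parent.symbol, size)
        remaining -= size

# Example: a 1,000-share parent order becomes ten 100-share child orders.
children = list(split_parent(Order("XYZ", 1_000)))
assert sum(child.quantity for child in children) == 1_000
assert len(children) == 10
```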

The update to SMARS was intended to replace old, unused code referred to as “Power Peg” – functionality that Knight hadn’t used in 8 years (why code that had been dead for 8 years was still present in the code base is a mystery, but that’s not the point). The code that was updated repurposed an old flag that had been used to activate the Power Peg functionality. The new code was thoroughly tested and proven to work correctly and reliably. What could possibly go wrong?
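
To see why repurposing a flag is so dangerous, consider this hypothetical sketch (the flag value and handler names are invented, not Knight’s code): the same flag selects the new RLP logic in the updated build, but still activates the retired Power Peg logic on any server left running the old build.

```python
REPURPOSED_FLAG = "E"   # invented value, for illustration only

def handle_standard(order):  return ("STANDARD", order)
def handle_rlp(order):       return ("RLP", order)        # new feature
def handle_power_peg(order): return ("POWER_PEG", order)  # retired feature

# Updated build: the flag now means "use the Retail Liquidity Program logic".
def route_order_new_build(order, flags):
    return handle_rlp(order) if REPURPOSED_FLAG in flags else handle_standard(order)

# Old build (the server that never received the update): the same flag still
# means "activate Power Peg".
def route_order_old_build(order, flags):
    return handle_power_peg(order) if REPURPOSED_FLAG in flags else handle_standard(order)

# The same inbound order behaves completely differently depending on which
# build happens to be running on the server that receives it.
assert route_order_new_build("order-1", {"E"})[0] == "RLP"
assert route_order_old_build("order-1", {"E"})[0] == "POWER_PEG"
```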

What Could Possibly Go Wrong? Indeed!

Between July 27, 2012 and July 31, 2012, Knight manually deployed the new software to a limited number of servers per day – eight (8) servers in all. This is what the SEC filing says about the manual deployment process (BTW – if there is an SEC filing about your deployment, something may have gone terribly wrong).

“During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.”
SEC Filing | Release No. 70694 | October 16, 2013

At 9:30 AM Eastern Time on August 1, 2012, the markets opened and Knight began processing orders from broker-dealers on behalf of their customers for the new Retail Liquidity Program. The seven (7) servers that had the correct SMARS deployment began processing these orders correctly. Orders sent to the eighth server triggered the supposedly repurposed flag and brought the old Power Peg code back from the dead.

Attack of the Killer Code Zombies

It’s important to understand what the “dead” Power Peg code was meant to do. This functionality was meant to count the shares bought/sold against a parent order as child orders were executed. Power Peg would instruct the system to stop routing child orders once the parent order was fulfilled. Basically, Power Peg would keep track of the child orders and stop them once the parent order was completed. In 2005 Knight moved this cumulative tracking functionality to an earlier stage in the code execution (thus removing the count tracking from the Power Peg functionality).

When the Power Peg flag on the eighth server was activated, the Power Peg functionality began routing child orders for execution, but it wasn’t tracking the number of shares executed against the parent order – somewhat like an endless loop.
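
Here is a hypothetical sketch of that failure mode (illustrative only, not the real code): the routing loop keeps sending child orders until the parent’s quantity is accounted for, but with the cumulative fill tracking removed, the parent order never appears complete.

```python
def route_children(parent_qty, child_size, send_child, fills_tracked=True):
    """Send child orders until parent_qty shares are accounted for."""
    filled = 0
    sent = 0
    while filled < parent_qty:
        send_child(child_size)
        sent += 1
        if fills_tracked:
            filled += child_size     # the cumulative tracking Power Peg relied on
        elif sent >= 1_000:          # safety cap so this demo terminates
            break                    # (the real system had no such cap)
    return sent

# With tracking in place, a 500-share parent order needs five 100-share children.
assert route_children(500, 100, lambda qty: None) == 5

# With tracking gone (as on the mis-deployed server), the loop never sees the
# parent as filled and keeps sending children until something external stops it.
assert route_children(500, 100, lambda qty: None, fills_tracked=False) == 1_000
```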

45 Minutes of Hell

Imagine what would happen if you had a system capable of sending automated, high-speed orders into the market without any tracking to see if enough orders had been executed. Yes, it was that bad.

When the market opened at 9:30 AM, people quickly knew something was wrong. By 9:31 AM it was evident to many people on Wall Street that something serious was happening. The market was being flooded with orders that were out of the ordinary for regular trading volumes on certain stocks. By 9:32 AM many people on Wall Street were wondering why it hadn’t stopped. That was an eternity in high-speed trading terms. Why hadn’t someone hit the kill switch on whatever system was doing this? As it turns out, there was no kill switch. During the first 45 minutes of trading, Knight’s executions constituted more than 50% of the trading volume, driving certain stocks up more than 10% in value. As a result, other stocks decreased in value in response to the erroneous trades.
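
For comparison, here is a minimal sketch of what a kill switch can look like (the file path and wiring are assumptions, not anything Knight had): a single, well-known control that operations can flip to halt all automated routing immediately.

```python
import os

KILL_SWITCH_PATH = "/etc/trading/HALT"   # assumed location, for illustration only

def routing_allowed():
    """Routing is allowed only while the kill-switch file is absent."""
    return not os.path.exists(KILL_SWITCH_PATH)

def route_child_orders(children, send):
    for child in children:
        if not routing_allowed():
            raise RuntimeError("Kill switch engaged; halting automated routing.")
        send(child)

# Operations can stop the flood with one command: `touch /etc/trading/HALT`.
```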

To make things worse, Knight’s system had begun sending automated email messages earlier in the day – as early as 8:01 AM (when SMARS had processed orders eligible for pre-market trading). The email messages referenced SMARS and identified an error as “Power Peg disabled.” Between 8:01 AM and 9:30 AM, 97 of these emails were sent to Knight personnel. Of course, these emails were not designed as system alerts, and therefore no one looked at them right away. Oops.
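
One way to avoid the same trap, sketched hypothetically below (the pattern string and the page/email callables are assumptions): treat known-bad system messages as paging alerts instead of routine email.

```python
ALERT_PATTERNS = ("Power Peg disabled",)

def triage(message, page, email):
    """Escalate known-bad messages to a pager; everything else goes to email."""
    if any(pattern in message for pattern in ALERT_PATTERNS):
        page(f"SMARS error requires immediate attention: {message!r}")
    else:
        email(message)

# Example wiring for the sketch (both channels just print here):
triage("Power Peg disabled", page=print, email=print)
```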

During the 45 minutes of Hell that Knight experienced, they attempted several countermeasures to try to stop the erroneous trades. There was no kill switch (and no documented procedures for how to react), so they were left trying to diagnose the issue in a live trading environment where 8 million shares were being traded every minute. Since they were unable to determine what was causing the erroneous orders, they reacted by uninstalling the new code from the servers it had been deployed to correctly. In other words, they removed the working code and left the broken code. This only amplified the issue, causing additional parent orders to activate the Power Peg code on all servers, not just the one that hadn’t been deployed to correctly. Eventually they were able to stop the system – after 45 minutes of trading.

In the first 45 minutes the market was open, the Power Peg code received and processed 212 parent orders. As a result, SMARS sent millions of child orders into the market, resulting in 4 million transactions against 154 stocks for more than 397 million shares. For you stock market junkies, this meant Knight assumed approximately $3.5 billion net long positions in 80 stocks and $3.15 billion net short positions in 74 stocks. In layman’s terms, Knight Capital Group realized a $460 million loss in 45 minutes. Remember, Knight only had $365 million in cash and equivalents. In 45 minutes Knight went from being the largest trader in US equities and a major market maker on the NYSE and NASDAQ to bankrupt. They had 48 hours to raise the capital necessary to cover their losses (which they managed to do with a $400 million investment from around a half-dozen investors). Knight Capital Group was eventually acquired by Getco LLC (December 2012) and the merged company is now called KCG Holdings.

A Lesson to Learn

The events of August 1, 2012 should be a lesson to all development and operations teams. It is not enough to build great software and test it; you also have to ensure it is delivered to market correctly so that your customers get the value you are delivering (and so you don’t bankrupt your company). The engineer(s) who deployed SMARS are not solely to blame here – the process Knight had set up was not appropriate for the risk they were exposed to. Additionally, their process (or lack thereof) was inherently prone to error. Any time your deployment process relies on humans reading and following instructions, you are exposing yourself to risk. Humans make mistakes. The mistakes could be in the instructions, in the interpretation of the instructions, or in the execution of the instructions.

Deployments need to be automated and repeatable, and as free from potential human error as possible. Had Knight implemented an automated deployment system – complete with configuration, deployment, and test automation – the error that caused the Knightmare would have been avoided.
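
As one example of the kind of safeguard an automated pipeline provides, here is a minimal sketch (the hostnames, version file, and SSH access are all assumptions) that refuses to go live unless every server reports the same build:

```python
import subprocess

EXPECTED_VERSION = "smars-rlp-2012.08.01"                          # assumed build id
SERVERS = [f"smars{i:02d}.example.internal" for i in range(1, 9)]  # the 8 servers

def deployed_version(host):
    """Read the build id a host reports (assumes a version file on each server)."""
    result = subprocess.run(
        ["ssh", host, "cat", "/opt/smars/VERSION"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def verify_rollout():
    stale = [host for host in SERVERS if deployed_version(host) != EXPECTED_VERSION]
    if stale:
        raise SystemExit(f"Do NOT enable trading; stale build on: {stale}")
    print("All servers report the expected build; safe to proceed.")

if __name__ == "__main__":
    verify_rollout()
```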

A couple of the principles for Continuous Delivery apply here (even if you are not implementing a full Continuous Delivery process):

  • Releasing software should be a repeatable, reliable process.
  • Automate as much as is reasonable.