Posts Tagged ‘Root Cause Analysis’
As you would expect in a conference entitled velocity, and in a follow-on devops day, speeding up things was an overarching theme. In the context of devops, the theme primarily manifested itself in lively discussions about the number of deploys per day. Comments such as the following reply to my post Ops Driven Dev were typical:
Conceptually, I move the whole business application configuration into the source code…
The theme that was missing for me in many of the presentations and discussions on the subject was the striking of a balance between velocity and quality. The classical trade-off in process control is between production rate and product quality (and safety, but that aspect [safety] is beyond the scope of this post). IMHO this trade-off applies to software just as it applies to mechanical or chemical processes.
The heart of the “deploy early and often” strategy hailed by advocates of continuous deployment is known deployment state to known deployment state. You don’t let the deployment evolve from one state to another before it has stabilized to a robust state. The power of this incremental deployment is in dealing with single-piece (or as small number of pieces as possible) flow rather than dealing with the effects of multiple-piece flow. When the deployment increments are small enough, rollback, root cause analysis and recovery are relatively straightforward if a deployment turns sour. It is a similar concept to Agile development, extending continuous integration to continuous deployment.
While I am wholeheartedly behind this devops strategy, I believe it needs to be reinforced through rigorous quality criteria the code must satisfy prior to deployment. The most straightforward way for so doing is through embedding technical debt criteria in the release/deploy process. For example:
- The code will not be deployed unless the overall technical debt per line of code is lower than $2.
- To qualify for deployment, code duplication levels must be kept under 8%.
- Code whose Cyclomatic complexity per Java class is higher than 15 will not be accepted for deployment.
- 50% unit test coverage is the minimal level required for deployment.
- Many others…
I have no doubt whatsoever that code which does not satisfy these criteria might be successfully deployed in a short-term manner. The problem, however, is the accumulative effect over the long haul of successive deployments of code increments of inadequate quality. As Figure 1 demonstrates, a Java file with Cyclomatic complexity of 38 has a probability of 50% to be error-prone. If you do not stop it prior to deployment through technical debt criteria, it is likely to affect your customers and play havoc with your deployment quite a few times in the future. The fact that it did not do so during the first hour of deployment does not guarantee that such a file will be “well-behaved” in the future.
Figure 1: Error-proneness as a Function of Cyclomatic Complexity (Source: http://www.enerjy.com/blog/?p=198)
To attain satisfactory long-term quality and stability, you need both the right process and the right code. Continuous deployment is the “right process” if you have developed the deployment infrastructure to support it. The “right code” in this context is code whose technical debt levels are quantified and governed prior to deployment.
Readers of The Agile Executive have been exposed to the “All In!” strategy used by Erik Huddleston to transform the software engineering process at Inovis and make it uniquely streamlined. In this post we follow up on the original discussion of the subject to explore the effect of Agile on IT Operations. As the title implies, Agile at Inovis served as a flywheel which created the momentum required to transform IT Operations and blend the best of Agile with the best of ITIL.
This guest post was written by Ray Riescher – a Six Sigma Black Belt, Agile evangelist and a business process change agent. Ray is currently responsible for business process management and IT governance at Inovis, a leading provider of business-to-business (B2B) e-commerce services, in Alpharetta, GA
Here is Ray:
When we converted to an Agile Scrum software methodology some 24 months ago, I never imagined the lessons I’d learn and the organizational change that would be driven by the adoption of Scrum.
I’ve lived by the philosophy that managing a business is managing its processes and that all of those processes, especially the operational processes, are interconnected. However, I don’t think I was fully prepared for effect Agile Scrum would have on our company operations.
We dove head first into Agile Scrum and adapted to it very quickly. However, it wasn’t until we landed a very large and demanding customer that Scrum was really put to the test. New enhancements, new features, and new configurations were all needed ASAP. Scrum delivered with rapid development and deployment in the form of releases that were moving into production with amazing velocity. Our release cadence hit warp drive and at one point we experienced several months where multiple teams’ production releases were deploying at the end of every two week sprint.
We’ve subscribed to the ITIL service support processes for Release, Change, Incident, Problem and Configuration Management. ITIL has served us well, giving us a common language and a clear understanding of process boundaries.
As the Scrum release cadence kicked in, the downstream ITIL processes had to keep up, adapt, and support the dynamics of rapid production changes. What happened was enlightening and maybe a bit ground breaking.
The Release Management process had to reassess its reliance on artifacts for gate keeping. The levels of sign offs had to be streamlined, the heavyweight deployment documentation had to be lightened, yet the process still had to control the production release to ensure deployment success. The rapidity of the release cycles meant that maintenance window downtime would be experienced too frequently by customers, so “rolling bounce” deployment strategies were devised and implemented.
Change requests could no longer wait for a weekly Change Management review board to approve and schedule the changes. Change management risk models had to be relied on for accurate detection of risky changes.
Early on in this dynamic environment, we weren’t quite as good as we needed to be and our Incident Management process was put to the test. Faster releases meant more opportunity for problems with service degradation and outages. This reality manifested itself more frequently than we’d ever experienced. Monitoring, detecting and repairing became paramount for environment stability and customer satisfaction.
What we found out was that we became very agile at this break/fix game. We developed a small team approach to managing incidents and leveraged the ITIL Problem Management process to rapidly perform root cause analysis. Once the true root cause was determined, a fix would be defined and deployed. Sometimes the fix was software related and went through the Scrum process, sometimes the fix was hardware related and went through the Configuration Management process, other times it was more operational and the fix took the form of training or corrections to procedural documentation.
The point is we’ve become agile across the entire IT spectrum. Whether it’s development via Scrum, the velocity with which we now operate our ITIL processes, or the integrated break/fix operational support processes, we are performing all of these with an agile mindset and discipline. We have small teams, working on priorities, and completing what needs to be completed now.
Scrum set the flywheel in motion and caused the rest of the IT process life cycle to respond. ITIL’s processes still form the solid core of service support and we’ve improved the processes’ capability to handle intense work velocity. The organization adapted by developing unprecedented speed in the ability to deliver production fixes and to solve root cause problems with agility.
What I think we are witnessing is a manifestation of Agile Business Service Management; a holistic agile methodology running across the IT process spectrum that’s delivering eye popping change and tremendous results.