DevOps

DevOps seeks to improve and speed up the delivery of software and services from development to operations to the hands of the customers.

This post is an attempt to summarize my key takeaways from the DevOps Handbook.

Brief History

DevOps and its resulting technical, architectural and cultural practices represent a convergence of many philosophical and management movements.

DevOps is the outcome of applying the most trusted principles from the domain of physical manufacturing and leadership to the IT value stream. It relies on bodies of knowledge from Lean, Theory of Constraints, The Toyota Production System, resilience engineering, learning organizations, safety culture, human factors and many others – such as high-trust management cultures, servant leadership and organizational change management.

Lean Movement

Major tenets:

manufacturing lead time was the best predictor of quality, customer satisfaction and employee happiness
small batch sizes of work were one of the best predictors of short lead times

Agile Manifesto

Key principles:

deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale
emphasize the desire for small batch sizes, incremental releases instead of large, waterfall releases
emphasize the need for small, self-motivated teams, working in a high-trust management model

Agile Infrastructure and Velocity Movement

At the 2008 Agile conference in Toronto, Canada, Patrick Debois and Andrew Shafer held a session on applying Agile principles to infrastructure as opposed to application code.

At the 2009 Velocity conference, John Allspaw and Paul Hammond from Flickr gave a presentation where they described how they created shared goals between Dev and Ops and used continuous integration practices to make deployment part of everyone's daily work.

The Continuous Delivery Movement

Jez Humble and David Farley extended the concept of continuous integration to continuous delivery which defined the role of a "deployment pipeline" to ensure:

that code and infrastructure are always in a deployable state
that all code checked in to trunk can be safely deployed into production

Toyota Kata

Improvement kata requires creating structure for the daily, habitual practice of improvement work because daily practice is what improves outcomes.

Agile, Continuous Delivery and the Three Ways

Manufacturing Value Stream

In Lean, the value stream is defined as "the sequence of activities an organization undertakes to deliver upon a customer request".

This is often easy to see and observe in manufacturing operations as it starts when a customer order is received and the raw materials are released onto the plant floor.

Technology Value Stream

In DevOps, the value stream is defined as the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer.

The input to the process is the formulation of a business objective, concept, idea or hypothesis, and the process starts when work is added to the backlog for Development.

Development teams that follow a typical Agile or iterative process will likely transform the idea into user stories, which are then implemented in code in the application or service being built.

Deployment Lead Time

Deployment lead time, a subset of the technology value stream, begins when an engineer checks in a change to version control and ends when the change is successfully running in production, providing value to the customer and generating useful feedback and telemetry.

Phases of Work

Design and Development (akin to Lean Product Development)
- highly variable and highly uncertain
- requires high degrees of creativity and one-off work
- results in high variability of process times
Testing and Operations (akin to Lean Manufacturing)
- requires creativity and expertise
- strives to be predictable and mechanistic
- goal of achieving work outputs with minimized variability (e.g., short and predictable lead times, near zero defects)

The goal is to have testing and operations happening simultaneously with design/development, enabling fast flow and high quality. This succeeds when we work in small batches and build quality into every part of the value stream.

The lead time clock starts when the request is made and ends when it is fulfilled. The process time clock starts when we begin work on the customer request; specifically, it omits the time that the work is in queue, waiting to be processed.

Lead time is what the customer experiences so we typically focus our process improvement attention there instead of on process time.
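
To make the difference concrete, here is a minimal sketch in Python. The function name and the ticket timestamps are made up for illustration; they are not from the book.

```python
from datetime import datetime


def lead_and_process_time(requested, started, fulfilled):
    """Return (lead_time, process_time, queue_time) as timedeltas."""
    if not (requested <= started <= fulfilled):
        raise ValueError("expected requested <= started <= fulfilled")
    lead_time = fulfilled - requested    # what the customer experiences
    process_time = fulfilled - started   # time spent actually working
    queue_time = started - requested     # time spent waiting in queue
    return lead_time, process_time, queue_time


# Hypothetical ticket: requested on a Monday, picked up on Thursday,
# running in production on Friday.
lead, process, queue = lead_and_process_time(
    requested=datetime(2024, 1, 1, 9, 0),
    started=datetime(2024, 1, 4, 9, 0),
    fulfilled=datetime(2024, 1, 5, 9, 0),
)
print(lead.days, process.days, queue.days)  # 4 1 3
```

In this made-up case three of the four days of lead time are queue time, which is why looking only at process time would hide most of the delay.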

DevOps Principles

Enable fast left-to-right flow of work from Development to Operations to the customer. This is done by making work visible, reducing batch sizes and intervals of work, building in quality by preventing defects and constantly optimizing for global goals.
Enable fast and constant flow of feedback from right to left at all stages of the value stream. By seeing problems as they occur and swarming them until effective countermeasures are in place, feedback loops are shortened and amplified.
Enable creation of a generative high-trust culture that supports a dynamic, disciplined and scientific approach to experimentation and risk taking.

Principles of Flow

The First Way requires the fast and smooth flow of work from Development to Operations to deliver value to the customers quickly.
Increase flow by making work visible, by reducing batch sizes and intervals of work and by building quality in, preventing defects from being passed to downstream work centers.
- Goal is to decrease the amount of time required for changes to be deployed in production and to increase the reliability and quality of those services.

This can be achieved by making work visible, limiting WIP, reducing batch sizes and the number of handoffs, continually identifying and elevating constraints and eliminating hardships in daily work.

Make Work Visible

A significant difference between technology and manufacturing value streams is that our work is invisible.
- To help see where work is flowing well and where work is queued or stalled, we need to make work as visible as possible.
- use visualization boards, e.g. kanban or sprint planning boards, to represent work on physical or electronic cards.
- ideally the board spans the entire value stream, defining work as completed only when it reaches the right side of the board.
- work is not done when Development completes the implementation of a feature; it is only done when the application is running successfully in production, delivering value to the customer.

Limit Work In Progress (WIP)

Studies have shown that the time to complete even simple tasks significantly degrades when multitasking as it incurs all the costs of having to re-establish context, as well as cognitive rules and goals.
Limit multitasking by using a kanban board and codifying/enforcing WIP (work in progress) limits.
- Controlling queue size (WIP) is an extremely powerful management tool, as it is one of the few leading indicators of lead time.
- Limiting WIP also makes it easier to see problems that prevent the completion of work.
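
As a rough illustration of what "codifying" a WIP limit could look like, here is a small Python sketch. The `KanbanBoard` class, the column names and the limits are all hypothetical; real tools enforce this in their own way.

```python
class WipLimitExceeded(Exception):
    """Raised when a card would push a column past its WIP limit."""


class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits maps column name -> max cards (None means unlimited)
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in self.wip_limits}

    def add(self, column, card):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"'{column}' is at its WIP limit of {limit}")
        self.columns[column].append(card)

    def move(self, card, source, target):
        if card not in self.columns[source]:
            raise ValueError(f"{card!r} is not in '{source}'")
        self.add(target, card)             # raises before removal if target is full
        self.columns[source].remove(card)


board = KanbanBoard({"todo": None, "doing": 2, "done": None})
for card in ("A", "B", "C"):
    board.add("todo", card)
board.move("A", "todo", "doing")
board.move("B", "todo", "doing")
try:
    board.move("C", "todo", "doing")       # third card exceeds the limit
except WipLimitExceeded as err:
    print(err)
```

The refusal is the useful part: card C stays visibly stuck in "todo" until something in "doing" is finished, which surfaces the blockage instead of hiding it behind more multitasking.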

Reduce Batch Sizes

Another key component to creating smooth and fast flow is performing work in small batch sizes instead of large batch sizes.
One of the key lessons in Lean is that in order to shrink lead times and increase quality, we must strive to continually shrink batch sizes.
The theoretical lower limit for batch size is single-piece flow (batch size of one, 1x1 flow), where each operation is performed one unit at a time. Small batch sizes result in less WIP, faster lead times, faster detection of errors, and less rework.
The equivalent of this in the technology value stream is realized with continuous deployment, where each change committed to version control is integrated, tested and deployed into production.

Reduce Number of Handoffs

In the technology value stream, long deployment lead times are often caused by the hundreds (or even thousands) of operations required to move code from version control into the production environment.
- functional testing, integration testing, environment creation, server administration, storage administration, networking, load balancing, and information security
- too many handoffs
Even under the best circumstances, some knowledge is inevitably lost with each handoff.
- with enough handoffs, the work can completely lose the context of the problem being solved.
To mitigate these types of problems, we strive to reduce the number of handoffs, either by automating significant portions of the work or by reorganizing teams so they can deliver value to the customer themselves, instead of having to be constantly dependent on others.

Continually Identify and Elevate Our Constraints

To reduce lead times and increase throughput, we need to continually identify our system’s constraints and improve their work capacity.
Dr. Goldratt’s five focusing steps in addressing constraints:
1. Identify the system’s constraint.
2. Decide how to exploit the system’s constraint.
3. Subordinate everything else to the above decisions.
4. Elevate the system’s constraint.
5. If in the previous steps a constraint has been broken, go back to step one, but do not allow inertia to cause a system constraint.
In typical DevOps transformations, the constraints usually follow this progression:
- Environment creation - countermeasure is to have environments that can be created on demand, and completely self-serviced so that they are available when we need them.
- Code deployment – countermeasure is to automate deployments as much as possible, with the goal of having them completely automated so they can be done self-service by any developer.
- Test setup and run - countermeasure is to automate tests so we can execute deployments safely and to parallelize them so that the test rate can keep up with the development rate.
- Overly tightly coupled architecture – countermeasure is to create loosely coupled architecture so that changes can be made safely and with more autonomy, increasing developer productivity.
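
The identify, elevate and repeat loop can be sketched in a few lines of Python. This is only an illustration: the capacity numbers are invented, and treating the constraint as simply "the stage with the lowest daily capacity" is a simplifying assumption of mine.

```python
def find_constraint(capacity_per_day):
    """Step 1: treat the stage with the lowest capacity as the constraint."""
    stage = min(capacity_per_day, key=capacity_per_day.get)
    return stage, capacity_per_day[stage]


# Hypothetical capacities: changes each stage can handle per day.
pipeline = {
    "environment creation": 2,
    "code deployment": 5,
    "test setup and run": 8,
}

constraint, throughput = find_constraint(pipeline)
print(constraint, throughput)  # environment creation 2

# Step 4 (elevate): assume self-service environments raise that capacity.
pipeline[constraint] = 20

# Step 5: go back to step one; the constraint has moved downstream.
constraint, throughput = find_constraint(pipeline)
print(constraint, throughput)  # code deployment 5
```

Whatever the real numbers are, the point of the sketch is that overall throughput is capped by the weakest stage, so improving any other stage first changes nothing for the customer.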

Eliminate Hardships and Waste in the Value Stream

Waste and hardship in the software development stream is anything that causes delay for the customer, such as activities that can be bypassed without affecting the result.
Categories (from Implementing Lean Software Development)
- Partially done work – becomes obsolete and loses value as time progresses
- Extra processes – add effort and increase lead times
- Extra features – add complexity and effort to testing and managing functionality
- Task switching - leads to additional time and effort
- Waiting – increases cycle time and prevents the customer from getting value
- Motion – Handoffs create motion wastes and often require additional communication to resolve ambiguities
- Defects – the longer the time between defect creation and defect detection the more difficult it is to resolve the defect
- Nonstandard or manual work – reliance on nonstandard or manual work from others such as using non-rebuilding servers, test environments, configurations etc. causes issues
- Heroics – heroic deeds, such as regularly working late hours or deploying at odd hours, sap the energy and enthusiasm of the team.
The goal is to make these wastes and hardships visible, and to systematically do what is needed to alleviate or eliminate them to achieve fast flow.

Improving flow through the technology value stream is essential to achieving DevOps outcomes. We do this by making work visible, limiting WIP, reducing batch sizes and the number of handoffs, continually identifying and elevating our constraints, and eliminating hardships in our daily work.

Principles of Feedback

Enable the reciprocal fast and constant feedback from right to left at all stages of the value stream.
- Goal is to create an ever safer and more resilient system of work
We make our system of work safer by creating fast, frequent, high quality information flow throughout the value stream and our organization, which includes feedback and feedforward loops.
- this allows us to detect and remediate problems while they are smaller, cheaper, and easier to fix; avert problems before they cause catastrophe; and create organizational learning that we integrate into future work.
- when failures and accidents occur, we treat them as opportunities for learning as opposed to a cause for punishment and blame.

Working Safely Within Complex Systems

Characteristics of a complex system
- defies any single person's ability to see the system as a whole and understand how all the pieces fit together.
- doing the same thing twice will not predictably or necessarily lead to the same result.
Complex systems typically have a high degree of interconnectedness of tightly coupled components and system-level behavior cannot be explained merely in terms of the behavior of the system components.
- failure is inherent and inevitable in such complex systems which demands the need for designing a safe system of work.
Dr. Steven Spear stated that designing perfectly safe systems is likely beyond our abilities
- we can make it safer to work in complex systems when the following conditions are met:
  - Complex work is managed so that problems in design and operations are revealed (feedback and feedforward loops)
  - Problems are swarmed and solved, resulting in quick construction of new knowledge (Andon cord and swarming)
  - New local knowledge is exploited globally throughout the organization
  - Leaders create other leaders who continually grow these types of capabilities
- Each of these capabilities is required to work safely in a complex system.

See Problems As They Occur

In a safe system of work, we must constantly test our design and operating assumptions.
- Goal is to increase information flow in our system from as many areas as possible, sooner, faster, cheaper, and with as much clarity between cause and effect as possible.
- The more assumptions we can invalidate, the faster we can find and fix problems, increasing our resilience, agility and ability to learn and innovate.
This can be done by creating feedback and feedforward loops which are a critical part of learning organizations and systems thinking.
- The absence of effective feedback often contributes to major quality and safety problems.
- Goal is to create fast feedback and feedforward loops wherever work is performed, at all stages of the technology value stream, encompassing Product Management, Development, QA, Infosec and Operations.
  - This includes creation of automated build, integration, and test processes so that we can immediately detect when a change has been introduced that takes us out of a correctly functioning and deployable state
Create pervasive telemetry so we can see how all our system components are operating in the production environment, so that we can quickly detect when they are not operating as expected.
- This allows us to measure whether we are achieving our intended goals and ideally is radiated to the entire value stream so we can see how our actions affect other portions of the system as a whole
Feedback loops not only enable quick detection and recovery of problems but they also inform us on how to prevent these problems from occurring again in the future
- This increases the quality and safety of our system of work and creates organizational learning

Swarm And Resolve Problems To Build New Knowledge

It is not sufficient to merely detect issues when the unexpected happens – we must swarm them, mobilizing whoever is required to solve the problem.
The goal of swarming is to contain problems before they have a chance to spread, and to diagnose and treat the problem so that it cannot recur.
- In doing so, Dr. Spear says, "they build ever-deeper knowledge about how to manage the systems for doing our work, converting inevitable up-front ignorance into knowledge."
The paragon of this principle is the Toyota Andon Cord which is a cord that every worker and manager is trained to pull when something goes wrong.
- when the Andon cord is pulled, the team leader is alerted and immediately works to resolve the problem; if it cannot be resolved within a specified time, the production line is halted so that the entire organization can be mobilized to assist with problem resolution until a successful countermeasure has been developed.
Swarming is necessary due to the following reasons:
- it is required to prevent the problem from going downstream where the cost and effort to repair it increases exponentially and technical debt is allowed to accumulate
- it prevents the work center from starting new work which will likely introduce new errors into the system
- if the problem is not addressed, the work center may hit the same problem in the next operation, requiring more fixes and work
Swarming enables learning as it prevents the loss of critical information due to fading memories or changing circumstances.
Swarming is the "disciplined cycle of real-time problem recognition, diagnosis, ...and treatment. It is the discipline of the Shewhart cycle --plan, do, check, act-- popularized by W. Edwards Deming, but accelerated to warp speed."
To enable fast feedback in the technology value stream, we must create the equivalent of an Andon cord and the related swarming response.
- requires creating the culture that makes it safe and even encouraged to pull the Andon cord when something goes wrong, whether it is when a production incident occurs or when errors occur earlier in the value stream, such as when someone introduces a change that breaks our continuous build or test processes
- when conditions trigger an Andon cord pull, we swarm to solve the problem and prevent the introduction of new work until the issue has been resolved
- this provides fast feedback for everyone in the value stream (especially the person who caused the system to fail), enables us to quickly isolate and diagnose the problem and prevents further complicating factors that can obscure cause and effect.
Preventing the introduction of new work enables continuous integration and deployment, which is single-piece flow in the technology value stream.
- All changes that pass our continuous build and integration tests are deployed into production, and any changes that cause any tests to fail trigger our Andon cord and are swarmed until resolved.
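
Here is a toy Python model of that behaviour, just to make the rule explicit. The `Trunk` and `AndonCord` classes, the `is_fix` flag and the fake `run_tests` function are all invented for this sketch; a real CI system would implement the gate differently.

```python
class AndonCord:
    def __init__(self):
        self.reason = None

    @property
    def is_pulled(self):
        return self.reason is not None

    def pull(self, reason):
        self.reason = reason

    def release(self):
        self.reason = None


class Trunk:
    def __init__(self, run_tests):
        self.run_tests = run_tests   # callable: change -> True if tests pass
        self.cord = AndonCord()
        self.deployed = []

    def submit(self, change, is_fix=False):
        # While the cord is pulled, only fixes are accepted: no new work.
        if self.cord.is_pulled and not is_fix:
            return "blocked"
        if self.run_tests(change):
            self.cord.release()      # green again, normal flow resumes
            self.deployed.append(change)
            return "deployed"
        self.cord.pull(f"tests failed for {change}")
        return "failed"


# Hypothetical test suite: any change with "bug" in its name fails.
trunk = Trunk(run_tests=lambda change: "bug" not in change)
results = [
    trunk.submit("feature-a"),
    trunk.submit("feature-b-bug"),        # breaks the build, pulls the cord
    trunk.submit("feature-c"),            # new work is refused
    trunk.submit("fix-b", is_fix=True),   # the swarm lands a fix
    trunk.submit("feature-c"),            # flow resumes
]
print(results)
```

Only "feature-a", "fix-b" and "feature-c" end up in `trunk.deployed`; the broken change never reaches production, and nothing else piles on top of it while it is being fixed.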

Keep Pushing Quality Closer To The Source

In complex systems, adding more inspection steps and approval processes actually increases the likelihood of future failures.
- the effectiveness of approval processes decreases as we push decision-making further away from where the work is performed
- this lowers the quality of decisions and also increases cycle time thus decreasing the strength of the feedback between cause and effect and reducing our ability to learn from successes and failures
Examples of ineffective quality controls:
- requiring another team to complete tedious, error-prone and manual tasks that could be easily automated and run as needed by the team who needs the work performed
- requiring approvals from busy people who are distant from the work, forcing them to make decisions without an adequate knowledge of the work or the potential implications, or to merely rubber stamp their approvals
- creating large volumes of documentation of questionable detail which become obsolete shortly after they are written
- pushing large batches of work to teams and special committees for approval and processing and then waiting for responses
We need everyone in our value stream to find and fix problems in their area of control as part of our daily work.
- by doing this, we push quality and safety responsibilities and decision-making to where the work is performed, instead of relying on approvals from distant executives
We use peer reviews of our proposed changes to gain whatever assurance is needed that our changes will operate as designed
We automate as much as possible of the quality checking typically performed by a QA or Information Security department.
- instead of developers needing to request or schedule a test to be run, these tests can be performed on demand, enabling developers to quickly test their own code and even deploy those changes into production themselves
- this truly makes quality everyone's responsibility as opposed to it being the sole responsibility of a separate department.
  - Information security is not just Information Security's job, just as availability isn't merely the job of Operations
Having developers share responsibility for the quality of the systems they build not only improves outcomes but also accelerates learning.
- especially important for developers as they are typically the team that is furthest removed from the customer
- Gary Gruver observes “It is impossible for a developer to learn anything when someone yells at them for something they broke six months ago - that is why we need to provide feedback to everyone as quickly as possible, in minutes, not months.”

Enable Optimizing For Downstream Work Centers

Lean defines two types of customers that we must design for:
- the external customer (who most likely pays for the service we are delivering)
- the internal customer (who receives and processes the work immediately after us)
According to Lean, our most important customer is our next step downstream
- optimizing our work for them requires that we have empathy for their problems in order to better identify the design problems that prevent fast and smooth flow
In the technology value stream, we optimize for downstream work centers by designing for operations, where operational non-functional requirements (e.g. architecture, performance, stability, testability, configurability and security) are prioritized as highly as user features.
- by doing this, we create quality at the source, likely resulting in a set of codified non-functional requirements that we can proactively integrate into every service we build.

Creating fast feedback is critical to achieving quality, reliability and safety in the technology value stream. We do this by seeing problems as they occur, swarming and solving problems to build new knowledge, pushing quality closer to the source and continually optimizing for downstream work centers.

Principles of Continual Learning and Experimentation

Focuses on creating a culture of continual learning and experimentation that enables constant creation of individual knowledge, which is then turned into team and organizational knowledge.
Goal is to create a high-trust culture, reinforcing that we are all lifelong learners who must take risks in our daily work
- applying a scientific approach to both process improvement and product development, we learn from successes and failures, identifying which ideas don't work and reinforcing those that do.
- reserve time for the improvement of daily work and to further accelerate and ensure learning.
  - consistently introduce stress into our systems to force continual improvement
  - simulate and inject failures in production services under controlled conditions to increase resilience

Enabling Organizational Learning and a Safety Culture

When working within a complex system, by definition it is impossible to perfectly predict all the outcomes for any action we take
- this is what contributes to unexpected, or even catastrophic, outcomes and accidents in our daily work, even when we take precautions and work carefully
- when accidents affect customers, we seek to understand the root cause, which is often deemed to be human error, and the common management response is to "name, blame, and shame" the person who caused the problem
Dr. Sidney Dekker, who codified some of the key elements of safety culture and coined the term "just culture", said, "Responses to incidents and accidents that are seen as unjust can impede safety investigations, promote fear rather than mindfulness in people who do safety-critical work, make organizations more bureaucratic rather than more careful, and cultivate professional secrecy, evasion and self-protection."
These issues are especially problematic in the technology value stream, where work is almost always performed within a complex system. How management chooses to react to failures and accidents can lead to a culture of fear, which then makes it unlikely that problems and failure signals are ever reported.
Dr. Ron Westrum, one of the first to observe the importance of organizational culture on safety and performance, defined three types of culture:
- Pathological organizations – characterized by large amounts of fear and threat. People often hoard information, withhold it or distort it for their own good. Failure is hidden.
- Bureaucratic organizations – characterized by rules and processes, often to help individual departments hold on to their “turf”. Failure is processed through a system of judgement resulting in either punishment or justice and mercy.
- Generative organizations – characterized by actively seeking and sharing of information to better enable the organization to achieve its mission. Responsibilities are shared and failure results in reflection and genuine inquiry.
In the technology value stream, we establish the foundations of a generative culture by striving to create a safe system of work
- when accidents occur, instead of looking for human error, we look for how we can redesign the system to prevent the accident from happening again
- for instance, conduct a blameless post-mortem after every incident to gain the best understanding of how the accident occurred and agree upon what the best countermeasures are to improve the system, ideally preventing the problem from occurring again and enabling faster detection and recovery.
- by doing this, we create organizational learning
Bethany Macri from Etsy stated, "By removing blame, you remove fear; by removing fear, you enable honesty; and honesty enables prevention."
Dr. Spear observes that the result of removing blame and putting organizational learning in its place is that "organizations become ever more self-diagnosing and self-improving, skilled at detecting problems and solving them."

Institutionalize the Improvement of Daily Work

In the absence of improvements, processes don't stay the same; due to chaos and entropy, processes actually degrade over time.
In the technology value stream, when we avoid fixing our problems, relying on daily workarounds, our problems and technical debt accumulate until all we are doing is performing workarounds, trying to avoid disaster, with no cycles left over for doing productive work.
Mike Orzen (Lean IT) observed "Even more important than daily work is the improvement of daily work."
We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments by reserving cycles in each development interval or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want.
The result of these practices is that everyone finds and fixes problems in their area of control, all the time, as part of their daily work, which makes the fixes easier and cheaper and keeps the consequences smaller.

Transform Local Discoveries Into Global Improvements

When new learnings are discovered locally, there must also be some mechanism to enable the rest of the organization to use and benefit from that knowledge.
When teams or individuals have experiences that create expertise, our goal is to convert that tacit knowledge into explicit, codified knowledge, which becomes someone else's expertise through practice.
In the technology value stream, we must create similar mechanisms to create global knowledge, such as making all blameless post-mortem reports searchable by teams trying to solve similar problems, and by creating shared source code repositories that span the entire organization, where shared code, libraries, and configurations that embody the best collective knowledge of the entire organization can be easily utilized.
All these mechanisms help convert individual expertise into artifacts that the rest of the organization can use.

Inject Resilience Patterns Into Our Daily Work

We introduce tension into our systems by always seeking to reduce deployment lead times, increase test coverage, decrease test execution times and even by re-architecting if necessary to increase developer productivity or increase reliability.
We may also perform game day exercises where we rehearse large-scale failures, such as turning off entire data centers, or inject ever-larger-scale faults into the production environment, such as the famous Netflix Chaos Monkey, which randomly kills processes and compute servers in production to ensure that we're as resilient as we want to be.
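
To show the idea at toy scale, here is a small Python simulation of random fault injection. It is not how Chaos Monkey itself is implemented; the `Replica`, `Service` and `chaos_monkey` names are invented for this sketch, and everything runs in memory.

```python
import random


class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes each request to the first replica that is still healthy."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue             # fail over to the next replica
        raise RuntimeError("no healthy replicas left")


def chaos_monkey(service, rng):
    """Kill one randomly chosen live replica; return its name (or None)."""
    alive = [r for r in service.replicas if r.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim.name


rng = random.Random(42)              # seeded so the experiment is repeatable
service = Service(Replica(f"replica-{i}") for i in range(3))

victim = chaos_monkey(service, rng)
response = service.handle("GET /health")
print(f"killed {victim}; {response}")
```

The experiment passes if the request still succeeds after a replica is killed. Running the same kind of check on purpose, under controlled conditions, is what turns resilience from an assumption into something we have actually observed.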

Leaders Reinforce a Learning Culture

The leader's role is to create the conditions so their team can discover greatness in their daily work.
Creating greatness requires both leaders and workers, who are mutually dependent upon each other.
Leaders must elevate the value of learning and disciplined problem solving, an approach called the coaching kata.
- it mirrors the scientific method, where we explicitly state our True North goals such as "sustain zero accidents" or "double throughput within a year"
These strategic goals then inform the creation of iterative, shorter term goals, which are cascaded and then executed by establishing target conditions at the value stream or work center level (e.g. "reduce lead time by 10% within the next two weeks")
- Objectives and Key Results (OKR)

References

The DevOps Handbook