Building a mature DevOps practice: how to create a culture of experimentation


In our last post, Building a Mature DevOps Practice: Start with Why, we explored DevOps as a critical enabler of competitive advantage and the importance of developing a DevOps mindset across the organization. Here we'll expand on a few concepts we mentioned in that post: the culture of experimentation, operational visibility, and continuous integration/continuous delivery. Together, these three elements create a healthy DevOps culture that helps you get higher-quality applications to market faster and increase your competitive advantage.

Creating a culture of experimentation

In this new reality, software development and operations capabilities set the boundaries for success. The first step to creating a company-wide culture of experimentation is for executives, the business, developers, ops, and other key stakeholders to understand how changes in engineering productivity align with business outcomes.

Every product organization must continually make trade-offs between key performance indicators (KPIs): velocity, performance, availability, cost efficiency, and security. Being aware of the impact of those trade-offs, and having the freedom to make them in a blameless, transparent environment, encourages experimentation. And, as we saw in the first post in this series, innovation is the key to success, and success depends on hypothesis testing through experimentation.

To address, in a quantifiable manner, how changes in engineering productivity align with business outcomes, decision-makers should start by understanding how business performance and customer satisfaction relate to the KPIs mentioned above. Below is a brief snapshot of what each one means:

Velocity is the rate of change in the functionality of an application or piece of software. It can be measured in a number of different ways, including builds per day, CI runs per day, production deploys per day, or post-production bugs on release.
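One of the simplest of these measurements, production deploys per day, can be computed directly from deploy timestamps. A minimal sketch; the function name and sample data are illustrative:

```python
from datetime import date

# Hypothetical deploy log: one entry per production deploy.
deploys = [
    date(2023, 5, 1), date(2023, 5, 1),
    date(2023, 5, 2),
    date(2023, 5, 3), date(2023, 5, 3), date(2023, 5, 3),
]

def deploys_per_day(deploy_dates):
    """Average production deploys per calendar day in the observed window."""
    days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return len(deploy_dates) / days

print(deploys_per_day(deploys))  # 6 deploys over 3 days -> 2.0
```

The same shape of calculation applies to builds per day or CI runs per day; only the event source changes.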

Performance is defined as the perceived performance of the application to end users. It can be measured in terms of synchronous response times, data freshness, asynchronous delivery latency (messaging), and/or streaming quality.
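Perceived response times are usually reported as percentiles (e.g., p50 and p95) rather than averages, because a small number of slow requests dominates user experience. A minimal nearest-rank percentile sketch, with illustrative sample data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical synchronous response times in milliseconds.
latencies_ms = [120, 95, 210, 180, 90, 400, 130, 150, 110, 105]

print(percentile(latencies_ms, 50))  # 120 (median experience)
print(percentile(latencies_ms, 95))  # 400 (tail experience)
```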

Availability is the perceived readiness of an application to respond to user input or perform asynchronous actions as expected. It can be measured in percentage of successful responses to user requests, percentage of data processes over a specified time period, percentage of events successfully handled, and/or percentage of requests that do not encounter blocking or severely degraded functionality (bugs).
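The first of these measurements, percentage of successful responses, reduces to simple arithmetic over request counts. A minimal sketch with illustrative figures:

```python
def availability(total_requests, failed_requests):
    """Availability as the percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # no demand, nothing failed
    return 100.0 * (total_requests - failed_requests) / total_requests

# 1,000,000 requests with 120 failures over the reporting window.
print(round(availability(1_000_000, 120), 3))  # 99.988
```

The same ratio applies to data processed over a time period or events successfully handled; the denominator just changes.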

Cost efficiency is the total fixed and variable costs required to operate and deliver software capabilities to end users, including personnel, software, hardware, network, and other physical and/or logical costs. It is measured in capital costs, infrastructure costs, operations headcount, and/or cost per unit of value (per pageview, transaction, match, impression, message, interaction unit, etc.).
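Cost per unit of value is a straightforward ratio once the cost inputs and the unit of value are agreed on. A minimal sketch with illustrative numbers:

```python
def cost_per_unit(fixed_costs, variable_costs, units_delivered):
    """Blended cost per unit of value (per transaction, pageview, etc.)."""
    return (fixed_costs + variable_costs) / units_delivered

# Hypothetical month: $40k personnel + $10k infrastructure, 25M transactions.
print(cost_per_unit(40_000, 10_000, 25_000_000))  # 0.002 -> $0.002 per transaction
```

Tracking this number over time shows whether scaling is improving or eroding efficiency.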

Security (privacy and compliance) covers the control objectives, compliance regimes, and privacy protections that comprise an application's security posture. It is measured through data classification (what is stored, its sensitivity, the impact of a breach, and the required protections); secure software development lifecycle implementation (static analysis; penetration testing; multi-party verification; liability management for libraries, frameworks, the OS, and other infrastructure software; OWASP Top 10 compliance); secure operations best practices (isolation, least privilege, access controls and policy, separation of duties, etc.); and compliance regimes (PCI, HIPAA, SOC 2, FINRA, etc., including the level of compliance and current audit status).

Once the impact that these five DevOps KPIs have on business has been identified, companies can tailor the focus of their DevOps efforts to maximize the benefits to the organization and minimize the risks of failed or regressive actions.

Regardless of the focus area, it is important to experiment through lighthouse projects and to isolate transformative exercises, so that effectiveness can be learned and demonstrated on a small scale and confidence in DevOps can grow, while risks and timelines remain contained.

We believe that a healthy DevOps culture, when supported by comprehensive operational visibility in continuous delivery environments, forms the reliable base required for working towards competitive advantage. We've already covered the culture part in some detail; let's explore how OpsViz and CD fit into the equation.

Operational Visibility

Along with experimentation, DevOps is about creating a transparent culture of sharing and collaboration. Providing good operational visibility (OpsViz) to all stakeholders in your organization creates trust in DevOps, because it delivers insight (and actionable information) into the operational health of your platform.

The efficacy of your OpsViz system is the degree to which you can easily discover the operational characteristics of a system and its components, understand those characteristics at all levels and scales, and make meaningful predictions about the behavior of the system and its components under varying demand and scale conditions. OpsViz maturity can be a constraint or an enabler at every phase of the DevOps lifecycle, and it impacts both speed of development and the degree of operational excellence. Since what cannot be measured cannot be managed effectively, OpsViz is one of the key elements that underpin the effectiveness and overall value delivered by your DevOps efforts.

Good operational visibility into the entire stack enables a proactive stance towards platform management: it helps ensure your platform is healthy and running optimally, and it generates trust in the DevOps effort. Good OpsViz instrumentation yields other benefits as well. For example, a mature platform should be elastic, auto-scaling up and down in proportion to demand. On AWS, that is straightforward to implement with Auto Scaling groups.

If there is a failure, you want to be alerted, but you also want your failing platform components to auto-heal, for example by triggering an OpsWorks run that automatically provisions replacements for the nodes failing health checks. The repair process should be triggered automatically and should proceed with as little human interaction as possible. To achieve this, you can instrument your platform with CloudWatch and use the collected metrics to drive auto-scaling policies and trigger auto-healing events. Coupling CloudWatch monitoring with services like SNS allows you to receive alerts on critical events while simultaneously triggering other services to help remedy the situation quickly.
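The auto-healing pattern described above is, at its core, a reconciliation loop: compare the fleet's observed health to its desired size and provision replacements for anything failing. The following is a minimal, self-contained Python simulation of that loop; the node names and the `reconcile` function are hypothetical stand-ins for what CloudWatch health checks and an OpsWorks (or Auto Scaling group) provisioning run would do:

```python
def reconcile(nodes, desired_count, next_id):
    """One pass of an auto-heal loop: cull unhealthy nodes, provision replacements.

    `nodes` maps node id -> True (healthy) / False (failing health checks).
    Returns the healed fleet and the next unused node id.
    """
    healthy = {nid: ok for nid, ok in nodes.items() if ok}
    while len(healthy) < desired_count:
        healthy[f"node-{next_id}"] = True  # stand-in for a provisioning API call
        next_id += 1
    return healthy, next_id

# node-2 is failing its health checks; the loop replaces it with node-4.
fleet = {"node-1": True, "node-2": False, "node-3": True}
fleet, _ = reconcile(fleet, desired_count=3, next_id=4)
print(sorted(fleet))  # ['node-1', 'node-3', 'node-4']
```

In production, the loop would be triggered by a CloudWatch alarm rather than run on a schedule, but the control-loop shape is the same.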

At minimum, a good OpsViz stack should include and consider the following:

  • Granular and comprehensive application and infrastructure instrumentation (both logs and key events)
  • Session and transaction reconstruction or audits in distributed systems
  • Platform-wide and component-level SLA reporting on all KPIs (velocity, performance, availability, cost, security)
  • Dashboards and drill-down information availability for all stakeholders (not just Ops and Development teams)
  • Real-time single pane of glass visibility into the platform’s key characteristics
  • Incident escalation and resolution tracking
  • Event correlation with high level of granularity and breadth of supporting data
  • Security and compliance violation events with drill-down detail
  • Check execution and scheduling throughout all platform components
  • Time-series database scalable to millions of metrics and billions of datapoints
  • Log collection, parsing, metric extraction, search, archival and data mining
  • Non-repudiation/append-only audit logging
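As a small illustration of the log parsing and metric extraction items above, the following self-contained sketch pulls HTTP status-code counts out of access-log lines (the log format and sample lines are hypothetical):

```python
import re
from collections import Counter

# Match the request line and capture the HTTP status code that follows it.
LOG_LINE = re.compile(r'"\w+ \S+ HTTP/[\d.]+" (?P<status>\d{3})')

log = [
    '10.0.0.1 - - "GET /api/users HTTP/1.1" 200',
    '10.0.0.2 - - "GET /api/users HTTP/1.1" 500',
    '10.0.0.1 - - "POST /api/orders HTTP/1.1" 201',
]

status_counts = Counter(
    m.group("status") for line in log if (m := LOG_LINE.search(line))
)
print(status_counts["200"], status_counts["500"])  # 1 1
```

At scale this extraction is what feeds the time-series database and SLA dashboards listed above; tools differ, but the parse-then-aggregate shape is constant.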

Continuous Integration and Continuous Deployment

Continuous Integration is a state of maturity for a DevOps lifecycle in which all changes committed by a developer are built, deployed, and thoroughly tested by automated means, providing the lowest possible feedback turnaround time for the developer. This concept can and should be applied to every layer of the architecture, application and infrastructure alike, in order to minimize process latency and maximize velocity while maintaining or improving application quality and stability. At maturity, CI ensures that the application is in a releasable state at all times.

Continuous Deployment is a state of maturity that builds on the foundation laid down by a healthy and complete CI system. CD, in turn, enables new or updated software functionality to be deployed by a developer or business unit (modulo controls required for security or compliance purposes) to all environments, including production, without significant manual intervention by any other party. At maturity, this involves complete automation of the process for verifying and rolling out features; strict isolation of features and runtime control over their activation; segmentation of the end-user population; the ability to target or limit features to certain user segments over time; and a self-service interface for engaging in and managing these activities.
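Runtime feature activation with percentage rollout and user segmentation is commonly implemented with deterministic hashing, so each user lands in a stable bucket and a 10% rollout always reaches the same 10% of users. A minimal sketch, not tied to any particular feature-flag product (all names and parameters are illustrative):

```python
import hashlib

def flag_enabled(feature, user_id, rollout_percent, allowed_segments, user_segment):
    """Deterministic percentage rollout gated by user segment.

    Hashing (feature, user) keeps a user's bucket stable across requests,
    and distinct features roll out to independent slices of the population.
    """
    if user_segment not in allowed_segments:
        return False
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # uniform in 0..65535
    return bucket < rollout_percent / 100 * 65536

# Full rollout to the beta segment: beta users see it, others never do.
print(flag_enabled("new-checkout", "u1", 100, {"beta"}, "beta"))  # True
print(flag_enabled("new-checkout", "u1", 100, {"beta"}, "free"))  # False
```

Ramping `rollout_percent` from 0 to 100 over time, segment by segment, is what makes the "limit features to certain user segments" capability above operational.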

Continuous integration and continuous delivery techniques help you integrate code changes frequently. This iterative approach is inherently less risky than traditional, less frequent deployments: you are introducing smaller changes, and the more often you deploy, the better you get at it.

Be sure to check out our third post in this series where we look at what implementing DevOps best practices throughout the application lifecycle looks like in a mature DevOps organization.

Learn more about how Pythian can help your organization with DevOps.


Want to talk with an expert? Schedule a call with our team to get the conversation started.
