According to the 2019 State of DevOps Report, the highest-performing organisations practising DevOps get changes into production 106 times faster than the lowest: multiple deployments per day, versus deployments between one and six months apart. On top of that, the lowest-performing organisations have a change failure rate seven times higher: their changes are seven times more likely to fail and impact users.
In 2017, high-performing organisations were found to spend 21 percent less time on unplanned work and rework, and 44 percent more time on new work. That is a significant increase in time spent rolling out new features rather than fixing defects.
Put this together with the finding that publicly traded companies with high-performing IT teams saw 50 percent higher market capitalisation growth over three years, and you have some compelling reasons for shifting your culture towards a DevOps approach.
Black Pepper Software has used DevOps techniques successfully on multiple projects. This case study highlights some of the ways in which we have been able to help our clients realise the benefits.
We’re going to look at some of the successes and challenges we’ve had implementing DevOps in two different financial services organisations:
- The London Multi-Asset Exchange (LMAX)
- Legal & General Investment Management (LGIM)
We worked with LMAX to build the LMAX Exchange: the software that is at the core of their business.
We are in the process of helping LGIM to build a new internal fund management platform and to build a platform for analysing fund performance and risk, informing investment choices and supporting compliance processes.
Challenges and Objectives
While the organisations we’ve used to illustrate our successes with DevOps are in the same sector, they are very different: one was set up as a completely new business in a complex, highly regulated environment; the other is a long-established business with a legacy in both software and culture. Yet their core challenge was the same: how to build a complex software system in the most efficient way possible.
Amazon describes DevOps as “the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity”. Like Agile, DevOps is focused on building a collaborative culture and a learning mindset. There are therefore many aspects to the solutions we employed on these projects, and it helps to look at them in the context of a framework. CALMS (Culture, Automation, Lean, Measurement and Sharing), popularised by Jez Humble, co-author of The DevOps Handbook and Accelerate, can be used to assess whether an organisation is ready to adopt DevOps processes, or how well an organisation is implementing DevOps.
DevOps requires a culture of shared responsibility or at least a group of people devoted to establishing that culture with management approval and support.
The LMAX project had some ambitious goals: building a financial exchange capable of processing 6 million calculations per second, using commodity hardware and open-source software. The system would be operating in a regulated industry and would need to be available with little or no downtime. Regulation meant FCA audits, so features had to be thoroughly tested and traceable right from the original requirement to deployment. High availability meant a disaster recovery instance, and hence complex deployment processes. The high performance meant complex infrastructure, combining virtualisation and bare metal.
Probably the primary reasons for the project’s success were the people and the culture. The vision and goals for the project were stated at the outset and clearly communicated to everyone. The development and operations teams were exceptionally well integrated, almost to the point of being seamless; because they shared the same goals, they supported each other. If a problem arose, both teams would take ownership of finding a solution. In fact, the entire business was built this way: a clear mission, the minimum of politics, and an absence of silos.
At Black Pepper, our mantra is “Automate Everything!”
CALMS is possibly not quite as ambitious, suggesting that you automate as many manual tasks as possible, especially in continuous integration and test automation.
LMAX was implemented five years ago, as DevOps was just emerging, so we had to build much of the automation tooling ourselves. We were able to use CruiseControl to implement continuous integration pipelines, but had to write our own repository for binary artefacts, as the tools we take for granted today didn’t exist.
As is the Black Pepper way, we made extensive use of automated testing to increase throughput and coverage. It helped in tracing the implementation back to requirements, as well as validating that the requirements had been implemented correctly, which was a significant factor in the system achieving FCA compliance. LMAX estimated that automated testing saved over £100K in manual regression-testing effort in the first six months of the programme alone, savings which would be multiplied many times over the life of the system.
At the time we didn’t have the luxury of infrastructure as code, so creating multiple environments was done manually, but the team developed in-house tools to automate deployments as far as possible.
In the case of LGIM, we built a complete CI pipeline that, every time new code was committed, would:
- build it,
- unit test it,
- package it,
- run a security audit,
- check out Docker images,
- deploy it,
- run API tests,
- run UI tests,
- performance test it, and (assuming all tests passed)
- tag it as ‘ready for QA’.
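The shape of that pipeline can be sketched in a few lines of Python. This is purely illustrative, not the actual LGIM tooling: each stage is a hypothetical stand-in for the real build, test and deployment tools, and the fail-fast ordering is the point being shown.

```python
from typing import Callable, List, Tuple

# A pipeline is an ordered list of (name, stage) pairs, run fail-fast.
Stage = Tuple[str, Callable[[], bool]]

def run_pipeline(stages: List[Stage]) -> str:
    """Run stages in order; stop at the first failure."""
    for name, stage in stages:
        if not stage():
            return f"failed at: {name}"
    # Every stage passed, so the build can be handed over for manual QA.
    return "ready for QA"

# Stand-in stages that always succeed; real ones would shell out to the
# build, test and deployment tooling described above.
stages = [
    ("build", lambda: True),
    ("unit tests", lambda: True),
    ("package", lambda: True),
    ("security audit", lambda: True),
    ("deploy", lambda: True),
    ("API tests", lambda: True),
    ("UI tests", lambda: True),
    ("performance tests", lambda: True),
]

print(run_pipeline(stages))  # ready for QA
```

The useful property is that a red stage stops everything after it, so a package only ever reaches the QA tag by passing every earlier gate.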
Automating all these steps removed the dependence on human intervention. Requiring engineers to remember complex sequences of commands and options is error-prone and fragile (what happens if the engineer with the best knowledge of an area is ill?). If you’re going to document the deployment process to pass the ‘what-happens-if-I-get-hit-by-a-bus?’ test, why not document it as executable scripts, and use tooling to do the hard work? Automation reduces cognitive load and frees engineers to concentrate on the genuinely hard problems. Here, it made everything run more smoothly and consistently, and vastly improved the speed at which we could deliver features to the QA team for manual, exploratory testing. By automating tests at all levels, we could be confident that the package delivered was robust, and needed only the kind of testing that humans are particularly good at: exploring edge cases and quirks that software engineers don’t always consider.
Collect data on your processes, deployments, etc. to understand your current capabilities and where you can improve.
We tracked defect rates to make sure we weren’t adding new features at the expense of introducing new bugs. We had a process to link any broken test back to the commit that introduced the breakage, a process we automated as soon as we could. By doing so, we reduced the time to identify, and hence fix, any defects.
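Walking back from a broken test to the offending commit is, at heart, a binary search over commit history (the same idea `git bisect` automates). A minimal sketch, assuming an ordered commit list and a hypothetical `test_passes_at` predicate standing in for checking out and testing a revision:

```python
from typing import Callable, List

def first_bad_commit(commits: List[str],
                     test_passes_at: Callable[[str], bool]) -> str:
    """Binary-search an ordered commit list (oldest first) for the first
    commit at which the tests fail, assuming all later commits also fail."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if test_passes_at(commits[mid]):
            lo = mid + 1   # tests still pass here, breakage came later
        else:
            hi = mid       # this commit is bad; the first bad one is here or earlier
    return commits[lo]

# Hypothetical history in which the tests start failing at commit "d4".
history = ["a1", "b2", "c3", "d4", "e5"]
passes = {"a1": True, "b2": True, "c3": True, "d4": False, "e5": False}
print(first_bad_commit(history, lambda c: passes[c]))  # d4
```

Halving the search space each step is what makes this fast enough to run automatically on every red build.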
Performance was obviously a key metric, so we incorporated performance benchmarking into the CI pipeline. We made the results visible (also a Lean principle) and plotted trends, so we could track the impact of changes being made to the codebase and identify problems long before they reached production.
Build a culture of openness and sharing within and between teams to keep everyone working toward the same goals.
LMAX had an open culture where information was easily shared or deliberately made visible as a matter of course: large displays with build status, system health, performance test metrics, for example. People working in operations could walk up to a Kanban board and see the stories in development and their status, so they knew exactly what would be making its way into test and production environments in the near future. Developers could easily see the status of production systems and identify patterns that might indicate potential issues before they happened.
Whereas LMAX was a new organisation, built around a new product, LGIM is a long-established investment management company with legacy systems and some legacy team structure, including silos of responsibility. It also operates in a regulated industry, so there is an understandable aversion to risk, particularly where financial or sensitive data is concerned. Applying what we learned at LMAX, we worked with the team to demonstrate how it is possible to lower some of the barriers while still mitigating the risks. This allowed us to improve productivity through, for example, automated environment provisioning, which leads to greater consistency of configuration across environments.
Our project with LGIM included a third-party software supplier, whose product was being integrated into the system. We focused on building strong relationships and good communication with the supplier. When problems arose, as they inevitably do, we approached them with shared responsibility and a no-blame rule: all parties working together to identify the cause, wherever it lay, and resolve the issue to deliver a solution for the customer.
The LMAX team embraced DevOps from the outset (we were lucky enough to be able to work with Dave Farley, co-author of Continuous Delivery with Jez Humble). LMAX Group is now recognised as a global, high-growth financial technology company, built around the LMAX Exchange. They have won numerous business growth, excellence and innovation awards.
When we started working with LGIM, they were very much in the early stages of their DevOps journey. There was a legacy of manual, error-prone processes, with features taking weeks to make the journey from development into production. It took a very long time to get information into the hands of users, who should have been relying on it to make decisions potentially involving very large sums of money.
The automation we introduced meant that the time from code commit to deployment into the QA environment was reduced to a couple of hours, including unit tests, API tests and performance tests. Even with the necessary gates in place, features could reach production in 2-3 days, where previously it would have taken weeks. This enabled us to iterate quickly with users, getting feedback within hours and completed new functionality into their hands within a few days. Engagement from the user community increased rapidly because they could see their feedback having an effect, and they could ask for more and more new information to improve the quality of their decisions. They also saw greatly improved application reliability as a result of a huge increase in test coverage, which also enabled testers to focus on functional and exploratory tests.
While we can deploy the software to testing environments automatically, there are still manual steps needed to control deployment to production. Because we have introduced consistent, automated processes across the development stages, we have made deploying to production much more predictable and incident-free.
We have made information about the system state and operational data visible to development and operations staff. Metrics and alert monitoring are built in, allowing problems to be identified and resolved more quickly, and in some cases anticipated. We have made external service providers part of the team and included monitoring that will allow us to identify issues that may occur with their services as well as those in-house.
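The alerting described above can be reduced to threshold rules evaluated over recent metric samples. A hedged sketch, with a hypothetical error-rate metric and an illustrative rule that requires several consecutive breaches to damp one-off spikes:

```python
def should_alert(samples, threshold, breaches_required=3):
    """Alert only when the last `breaches_required` consecutive samples
    all exceed the threshold, so a single spike doesn't page anyone."""
    recent = samples[-breaches_required:]
    return len(recent) == breaches_required and all(s > threshold for s in recent)

error_rate = [0.2, 0.1, 1.5, 1.8, 2.1]              # hypothetical errors per second
print(should_alert(error_rate, threshold=1.0))      # True: three consecutive breaches
print(should_alert([0.2, 2.0, 0.3], threshold=1.0)) # False: an isolated spike
```

Real monitoring stacks express the same idea declaratively (a rule plus a duration), but the consecutive-breach logic is what separates actionable alerts from noise.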
We have handed over one of our LGIM projects to an internal team, who continue to use and improve upon the processes we introduced. The second DevOps project is a work in progress: we have made enormous strides in reducing the time between code being committed and reaching production, in process and test automation, and in at least lowering the walls between siloed parts of the organisation. There is still work to be done to achieve the unity of purpose that marks the LMAX team out as an example of how DevOps should be done.