Case Study: How to Improve the Reliability of the IT Delivery Process

Context: Companies must adapt to new IT compliance requirements set by European regulations

Our client in the insurance sector faced new European IT compliance requirements, which required them to make their software delivery process more reliable and to simplify the gathering of production delivery history.

We joined a newly established DevOps platform team tasked with this responsibility. The team inherited significant challenges from the existing setup.

Initial Challenges Observed: A poorly documented delivery pipeline with outdated IT compliance.

We discovered that the software delivery pipeline was built on bad practices. It had been managed by a single person using inconsistent methods and minimal documentation, leading to:

  • Pipeline troubleshooting difficulties: Identifying and resolving issues was time-consuming and error-prone. 
  • Compliance gaps: Cloud resources were not aligned with the company’s cloud compliance policies.
  • Limited developer involvement: Developers had little insight into or control over deployment processes, which were managed exclusively by a single DevOps engineer.
  • Lack of delivery validation: Verifying whether a production release had succeeded or had run into issues was a manual process.

We limited our scope to infrastructure and CI/CD pipelines, excluding the application code. 

Identified operational IT issues.

We identified the following key issues impacting deployment: 

  • Confusing, incorrectly named Git tags.
  • Production release of features/feature toggles that were not properly declared.
  • Difficulty rolling back releases.
  • Sensitive manual operations appearing in production pipelines, e.g. enabling the maintenance page.
  • Frequent git branch conflicts. 
  • A growing tolerance for failed jobs or jobs ending with warnings. 

The main root cause of these issues was the use of environment-specific Git branches, which involved several configuration files, rules, and variables and made the pipeline complex.

Solutions implemented to ensure IT process compliance, enhance modularity, and reduce costs. 

We optimised the deployment workflow by removing manual interventions from the CI/CD pipelines and by streamlining the infrastructure around active/non-active environments instead of blue/green. This eliminated human errors in release activation and enabled automated transitions.

Adopting GitHub Flow reduced branching complexity, while standardized quality gates ensured consistent code validation.

Continuous Deployment was accelerated by minimizing job overhead, cutting deployment times, and automating pre-implementation checks to meet regulatory and operational standards. Security was reinforced with OIDC authentication for controlled deployments.  

Streamlining Terraform Infrastructure 

The infrastructure codebase was rewritten to be modular, reusable, and policy-compliant: 

  • Easier management, thanks to a large reduction in the number of Terraform state files.
  • Less development effort, and deployed resources that comply with the company's cloud policies: self-managed resources were replaced with standardized Terraform modules provided by the company's cloud DevOps team.
  • Faster testing of infrastructure changes, thanks to the ability to run a local Terraform plan against the development environment. Previously, the huge number of variables passed to Terraform by the GitLab CI/CD pipeline made it difficult to run a plan without pushing the code. Running it locally reduces the time needed to validate fixes and features.

Eliminating Blue/Green confusion: simplified operations between active and inactive environments.

In a blue-green deployment model, two identical production environments exist: one active, running the current application version (blue), and one running the new application version (green). After successful validation, traffic is switched to the green environment, which then becomes the active one.

The existing process was prone to human error, as developers had to know which color (blue or green) was currently active and manually update the Terraform files to specify which release should be activated in the non-active environment. 

Our enhancements simplified the release activation process, removing the color association requirement and allowing multiple hidden releases for testing in a feature-branched environment.  

For Continuous Deployment (CD), the release promotion from “hidden” to “active” is moved into a pipeline job instead of a manual commit. With one click, the developer can run the manual activation job, which will automatically switch the traffic to the new version. 
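The promotion step can be pictured as a small state transition. The following is only a minimal sketch: the `ReleaseState` class, the `activate` function, and the release names are hypothetical, and keeping the previously active release as a hidden one (so that it can be re-activated later) is an assumption rather than a documented behaviour of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseState:
    """Which release serves traffic, and which ones are deployed but hidden."""
    active: str
    hidden: frozenset = frozenset()


def activate(state: ReleaseState, release: str) -> ReleaseState:
    """Promote a hidden release to active without referring to any colour."""
    if release == state.active:
        raise ValueError(f"{release} is already active")
    if release not in state.hidden:
        raise ValueError(f"{release} is not deployed as a hidden release")
    # Assumption: the previous active release stays deployed as a hidden one.
    new_hidden = (state.hidden - {release}) | {state.active}
    return ReleaseState(active=release, hidden=new_hidden)
```

Because the job only needs the name of the release to activate, nobody has to remember which environment is currently "blue" or "green".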

The CloudFront routing logic, which relies on Lambda@Edge and a release cookie, was refactored and now includes unit tests and SonarQube quality gates for improved reliability. See the article “How to setup blue-green deployment with aws lambda edge”.
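To illustrate cookie-based routing, here is a minimal sketch of a Lambda@Edge origin-request handler. It is not the client's implementation: the cookie name `release`, the release names, and the layout "one S3 path prefix per release" are all assumptions made for the example.

```python
from http.cookies import SimpleCookie
from typing import Optional

RELEASE_COOKIE = "release"                    # hypothetical cookie name
ACTIVE_RELEASE = "release-1"                  # hypothetical; would come from configuration
KNOWN_RELEASES = {"release-1", "release-2"}   # hypothetical deployed releases


def requested_release(headers: dict) -> Optional[str]:
    """Return the release named in the request cookies, if any."""
    # CloudFront lower-cases header names and stores each header as a list.
    for header in headers.get("cookie", []):
        cookie = SimpleCookie()
        cookie.load(header["value"])
        if RELEASE_COOKIE in cookie:
            return cookie[RELEASE_COOKIE].value
    return None


def handler(event, context):
    """Origin-request trigger: pick the S3 prefix that matches the release."""
    request = event["Records"][0]["cf"]["request"]
    release = requested_release(request["headers"])
    if release not in KNOWN_RELEASES:
        release = ACTIVE_RELEASE  # no cookie or unknown release: serve the active one
    # Assumption: each release is stored under its own prefix in the S3 origin.
    request["origin"]["s3"]["path"] = f"/{release}"
    return request
```

Keeping the cookie parsing in its own function is what makes this kind of logic easy to cover with unit tests.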

Continuous Integration

From Git Flow to GitHub Flow

The multi-branch strategy, inspired by Git Flow, led to confusion and conflicts when merging and deploying code. By eliminating environment-specific branches in the application and infrastructure repositories, in the spirit of GitHub Flow, we clarified the deployment process:

  • Variables for the triggered deployment job are filled in automatically from the trailers of the tag message.
  • A release can be moved from one environment to another without going through the code quality gates again.
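Git trailers are `Key: value` lines in the last paragraph of a commit or tag message. The sketch below shows one simplified way to turn them into job variables; the trailer names in the usage example are hypothetical, and Git's own parser (`git interpret-trailers --parse`) is more lenient than this one.

```python
import re

# A trailer line looks like "Some-Key: some value".
TRAILER = re.compile(r"^([A-Za-z0-9-]+):\s*(.+)$")


def parse_trailers(message: str) -> dict:
    """Extract trailers from the last paragraph of a tag message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a title alone carries no trailer block
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER.match(line.strip())
        if match is None:
            return {}  # the last paragraph is prose, not a trailer block
        trailers[match.group(1)] = match.group(2).strip()
    return trailers
```

For example, a tag message ending with the hypothetical lines `Target-Environment: staging` and `Release-Name: release-1-4-0` yields a dictionary with those two keys, which a pipeline could pass on as variables to the deployment job.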

Quality Gates 

We standardized quality gates using generic templates, ensuring all projects adhere to the same standards. This included SonarQube for code quality analysis, unit and integration tests for code validation, and dependency checks for vulnerabilities.

Continuous Deployment 

Reducing the number of jobs 

In GitLab CI/CD, when a new job is instantiated, several steps occur to prepare the environment and execute the job script: 

  1. GitLab runner initialization: Identifies the runner that will execute the job and assigns it a unique identifier. 
  2. Secrets resolution: Any secrets required for the job are fetched securely. 
  3. Executor preparation: The runner prepares the executor environment, including the container image and any required dependencies. 
  4. Docker authentication and image pull: The runner authenticates with the Docker registry and pulls the required container image. 
  5. Environment preparation: The execution environment is set up, including mounting volumes, setting up network configuration or initializing paths. 
  6. Source code retrieval: The project repository is cloned into the working directory and the specific commit is checked out.
  7. Artifact download: Artifacts from previous pipeline stages are downloaded.

When GitLab runners are busy, it can take up to 2 minutes for the container to be ready before script execution starts. By reducing the number of jobs in the pipeline, we reduced the deployment time by up to 14 minutes in container initialization time alone.
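As a back-of-the-envelope check, the saving is simply the number of removed jobs multiplied by the start-up time each one would have paid. The figure of seven jobs below is an assumption inferred from the two numbers above (14 minutes at 2 minutes per job); the exact count is not stated, and the calculation only holds for jobs that ran one after another.

```python
def startup_minutes_saved(jobs_removed: int, startup_minutes_per_job: float) -> float:
    """Container start-up time saved when the removed jobs ran sequentially."""
    return jobs_removed * startup_minutes_per_job


# Assumption: seven sequential jobs removed, each paying the worst-case 2 minutes.
print(startup_minutes_saved(jobs_removed=7, startup_minutes_per_job=2))  # prints 14
```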

Automating pre-implementation checks and delivery evidence 

The CD pipeline, triggered from the CI pipeline, now deploys artifacts identified by the commit short SHA, and the release name is obtained from the commit ref slug.
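In GitLab CI/CD these two values are available as the predefined variables `CI_COMMIT_SHORT_SHA` and `CI_COMMIT_REF_SLUG`. The sketch below reproduces the documented rules in Python purely to show what the values look like; the function names are hypothetical, and in a real job the variables would simply be read from the environment.

```python
import re


def ref_slug(ref_name: str) -> str:
    """Mimic CI_COMMIT_REF_SLUG: lowercase, non [a-z0-9] replaced by '-',
    cut to 63 characters, no leading or trailing '-'."""
    slug = re.sub(r"[^a-z0-9]", "-", ref_name.lower())
    return slug[:63].strip("-")


def short_sha(sha: str) -> str:
    """Mimic CI_COMMIT_SHORT_SHA: the first eight characters of the commit SHA."""
    return sha[:8]
```

A branch named `feature/New_Quote-Form` would therefore give the release name `feature-new-quote-form`, which is safe to use in URLs and resource names.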

This automation replaces manual commits to the Terraform code and introduces a pre-implementation check as a dedicated safety measure. The job performs a series of automated validations before the new version is promoted to the active one:

  • Is the hidden version to activate accessible when setting the release cookie? 
  • Is the currently active version different from the version to activate? 
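The two questions above can be expressed as a small function. This is a sketch under stated assumptions: `probe` is a hypothetical callable that requests the site with the release cookie set and returns the name of the release actually served (or `None` if the site could not be reached); how the served release is detected is not described here.

```python
from typing import Callable, List, Optional


def pre_implementation_check(
    candidate: str,
    active: str,
    probe: Callable[[str], Optional[str]],
) -> List[str]:
    """Return the list of failed checks; an empty list means activation may proceed."""
    failures = []
    # Check 1: the hidden version must be reachable through the release cookie.
    if probe(candidate) != candidate:
        failures.append(f"hidden release {candidate} is not accessible via the release cookie")
    # Check 2: activating the release that is already active would be a no-op.
    if candidate == active:
        failures.append(f"release {candidate} is already the active release")
    return failures
```

A pipeline job built this way would fail, and block the promotion, as soon as the returned list is not empty.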

Once deployed, we can gather evidence of what has been delivered to production. According to the Digital Operational Resilience Act (DORA) European regulation, this is a mandatory part of the change management process. As part of the post-implementation, we verify: 

  • Is the production still accessible? 
  • Does the release cookie returned in the response match the deployed release?
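The result of these checks can be kept as a structured record. The sketch below is illustrative only: the field names, the rule that HTTP 200 means "accessible", and the idea of storing the record as JSON are assumptions, not the format used in the project.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def post_implementation_evidence(
    deployed_release: str,
    commit_sha: str,
    status_code: int,
    cookie_release: Optional[str],
) -> dict:
    """Build a delivery-evidence record from the post-implementation checks."""
    checks = {
        # Assumption: an HTTP 200 on the production URL means "still accessible".
        "production_accessible": status_code == 200,
        "release_cookie_matches": cookie_release == deployed_release,
    }
    return {
        "release": deployed_release,
        "commit": commit_sha,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "checks": checks,
        "success": all(checks.values()),
    }


# Hypothetical values; a job could save this output as an artifact.
print(json.dumps(post_implementation_evidence("release-2", "1ecfd275", 200, "release-2"), indent=2))
```

Keeping one such record per production delivery gives a delivery history that can be queried later instead of being reconstructed by hand.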

Enhancing Security 

We improved security by implementing OIDC (OpenID Connect), an authentication protocol that allows secure and reliable identity verification, on GitLab jobs. This means that only users with the appropriate permissions, as defined by their roles, can initiate deployment processes.
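With OIDC, a GitLab job presents a short-lived ID token, and the receiving side decides whether to trust it based on the token's claims. The sketch below shows the kind of claim check a trust policy performs. It assumes the token signature has already been verified, and the project path, audience, and allowed pattern are hypothetical; only the general shape of the `sub` claim (`project_path:…:ref_type:…:ref:…`) follows GitLab's documentation.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: only tag pipelines of one project may deploy.
ALLOWED_SUBJECTS = ["project_path:insurance/webapp:ref_type:tag:ref:*"]


def may_deploy(claims: dict, expected_audience: str) -> bool:
    """Decide whether already-verified ID token claims are allowed to deploy."""
    audience = claims.get("aud")
    audiences = audience if isinstance(audience, list) else [audience]
    if expected_audience not in audiences:
        return False
    subject = claims.get("sub", "")
    return any(fnmatchcase(subject, pattern) for pattern in ALLOWED_SUBJECTS)
```

In practice this matching is usually configured on the cloud provider's side rather than written by hand, so the function is only a way to reason about what such a policy accepts or rejects.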

   

Next steps to assess the project's success and plan its future development.

A simplified process has been established, running compliant deployments faster and allowing rollback with ease. 

Our next step will be to implement a DevOps metrics platform based on the DevOps Research and Assessment (DORA) metrics. This will help us track:

  • Deployment frequency 
  • Lead time for changes 
  • Mean time to recovery 
  • Change failure rate
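As an illustration of how these four metrics could be derived from deployment records, here is a minimal sketch. The `Deployment` record and its fields are hypothetical, and the formulas are one common simplified reading of the metrics, not the design of the planned platform.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a period containing at least one deployment."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    failures = [d for d in deployments if d.failed]
    recoveries = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(lead_times) / 3600,
        "mean_time_to_recovery_hours": mean(recoveries) / 3600 if recoveries else None,
        "change_failure_rate": len(failures) / len(deployments),
    }
```

The delivery evidence already collected by the pipeline (release, commit, timestamps, check results) would be a natural source for such records.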

Lessons Learned 

This project provided valuable insights into modern DevOps practices: 

  • Understanding blue-green deployment strategy helped minimize downtime and reduce risks during deployments. 
  • Securing deployments with OIDC. 
  • Modular infrastructure codebase, compliant with company policies. 
  • Using AWS Lambda@Edge to route between application versions improved deployment flexibility.
  • Adopting GitHub Flow for continuous integration streamlined the development process. 
  • Standardized quality gates with SonarQube ensure consistent code quality. 
  • Automated delivery checks and evidence collection to meet compliance requirements and maintain a reliable delivery pipeline. 
