Major Incident Report covering the root cause of the incident that took place on 2 July 2024, the mitigation actions taken to prevent a recurrence, and the associated learning.
Availability Major Incident Report Form |
|
Incident Details |
|
Date of Incident: |
2-3 Jul 24 |
Location of Incident: |
IEG4 Platform |
Summary of Incident:
(State facts only and not opinions. Include details of staff involved and any contributing factors) |
The IEG4 platform is a cloud-native platform that uses a wide range of Microsoft Azure resources to deliver both platform-level services supporting multiple products and individual customer-specific services. At around 5pm on Tue 2 Jul 24, a member of the software development team initiated an Azure DevOps Deployment Pipeline Task using a specific Azure Resource Manager Template to deploy routine changes to a small set of resources within a large Resource Group in a specific MS Azure region. The task was mis-configured: incorrect parameter selection meant that resources not specifically identified in the Template were unintentionally impacted when the pipeline ran, and a large number of resources in the Resource Group became unavailable. These resources included key traffic management, load balancing and other networking elements, as well as specific web apps and storage accounts. This disrupted the platform and left many customers unable to access many services. Some customers were not impacted because they do not rely on Azure resources in this particular Resource Group. The nature of the mis-configuration required intervention by Microsoft Azure Support and by our own systems and development teams.
A small number of resources were recovered by Microsoft through their MS Azure Support services. The remainder were recovered by IEG4 staff in a measured and controlled manner, using scripts designed to recover both resources and configuration and focusing on the highest priority live services first. From around noon on 3 Jul 24, the restoration of the networking components enabled services to become gradually accessible to the customers who had been impacted, though some subcomponents of the main services, such as bank validation for certain forms, were brought back into service later that afternoon. We closed the major incident before 4:30pm on Wed 3 Jul 24 and asked customers to raise support tickets so that we could resolve customer-specific residual issues through the normal support process. |
|
The root cause analysis established that an incorrect configuration of a Pipeline task in MS Azure DevOps caused critical resources in a particular MS Azure Resource Group to become unavailable. This was an error on the part of the member of staff who initiated the task. For context, IEG4 generally uses a different tool for routine resource changes, but Azure DevOps is the market-leading and fully integrated tool for this purpose and we have been gradually adopting it for certain use cases. In the seven years before this incident, IEG4 had not experienced this type of system failure.
The Software Engineer concerned believed that the deployment pipeline task had been configured correctly, but this was not the case. Further training has been provided to the individual. In addition, the lessons learnt process will deliver more general training improvements and enhanced change management processes, as well as a requirement for detailed peer review of the task configuration before any future execution.
There were no data or security issues due to this incident. |
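The report does not name the parameter that was set incorrectly. One documented Azure Resource Manager behaviour that matches the description is the deployment mode: in Complete mode, resources that exist in the Resource Group but are absent from the template are deleted, whereas Incremental mode leaves them untouched. The sketch below models that difference in plain Python as an illustration only. It assumes the deployment mode was the parameter involved, does not call Azure, and all resource names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentPlan:
    create: frozenset
    update: frozenset
    delete: frozenset


def plan_deployment(existing, template, mode):
    """Model which resources a resource-group deployment would touch."""
    existing, template = frozenset(existing), frozenset(template)
    if mode not in ("Incremental", "Complete"):
        raise ValueError(f"unknown deployment mode: {mode!r}")
    # Complete mode removes anything in the group that the template omits;
    # Incremental mode leaves such resources untouched.
    delete = existing - template if mode == "Complete" else frozenset()
    return DeploymentPlan(
        create=template - existing,
        update=template & existing,
        delete=delete,
    )


# Hypothetical resource names, for illustration only.
existing = {"traffic-manager", "load-balancer", "web-app-a", "storage-a"}
template = {"web-app-a", "web-app-b"}

incremental = plan_deployment(existing, template, "Incremental")
complete = plan_deployment(existing, template, "Complete")

print("Incremental deletes:", sorted(incremental.delete))
print("Complete deletes:   ", sorted(complete.delete))
```

A check of this kind, run before a pipeline task executes, would give a peer reviewer a concrete list of resources at risk rather than relying on reading the task configuration alone.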
Classification: (See Information Security Incident Reporting and Management procedure for guidance) |
Incident – Availability |
Brief Description of Action Already Taken |
The following actions have been undertaken:
· Investigated the original issue and established root cause
· Restored services incrementally with some assistance from Microsoft Azure Support
· Implemented resource locks on key resources |
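As a rough illustration of what the resource locks listed above are intended to achieve, the sketch below models how a lock stops a planned deletion. Azure provides two lock levels, CanNotDelete and ReadOnly, and both prevent deletion of the locked resource. The function, the resource names and the choice of which resources are locked are hypothetical, and the code is a simplified model rather than a call to Azure.

```python
# Azure lock levels; both prevent deletion of the locked resource.
PROTECTING_LOCKS = {"CanNotDelete", "ReadOnly"}


def split_by_locks(planned_deletes, locks):
    """Split planned deletions into (allowed, blocked) given resource locks.

    `locks` maps a resource name to its lock level; unlocked resources
    are simply absent from the mapping.
    """
    allowed, blocked = [], []
    for name in sorted(planned_deletes):
        if locks.get(name) in PROTECTING_LOCKS:
            blocked.append(name)
        else:
            allowed.append(name)
    return allowed, blocked


# Hypothetical example: networking resources are locked, storage is not.
locks = {"traffic-manager": "CanNotDelete", "load-balancer": "CanNotDelete"}
planned = {"traffic-manager", "load-balancer", "storage-a"}

allowed, blocked = split_by_locks(planned, locks)
print("Would be deleted:", allowed)
print("Protected by lock:", blocked)
```

Locks only protect the resources they are applied to, so anything left unlocked remains exposed to the same kind of error.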
Actions Taken to Prevent a Recurrence |
The following actions have been undertaken to prevent a recurrence:
· Changed the permissions for software engineers for Azure DevOps Pipeline tasks to prevent these being run without peer review
· Initiated a review of the change management process to assess if further changes are needed
· Initiated a review of the decision to move from our existing tooling to MS Azure DevOps
· Increased the training requirements for all staff with access to Azure DevOps
· Provided specific training for the individual who mis-configured the pipeline task |
|
· Increased company-wide awareness related to the need for caution and peer review when undertaking any changes with the potential to impact live services
· Increased monitoring and planned further reviews of the resource lock policy for specific resources to ensure that there are no negative implications
· Reviewed the general processes and training requirements for new tool adoption in relation to the software development lifecycle and in particular automation tooling related to continuous integration and development
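The peer review requirement in the actions above can be expressed as a simple rule: a pipeline task may run only when at least one person other than the initiator has approved it. The sketch below is a minimal model of that rule in plain Python. The class, field names and user names are hypothetical, and it does not represent how the permission change was implemented in Azure DevOps.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A requested pipeline run and the approvals recorded against it."""
    initiator: str
    approvals: set = field(default_factory=set)


def may_run(change):
    """Allow a run only if someone other than the initiator has approved."""
    independent = change.approvals - {change.initiator}
    return bool(independent)


# Hypothetical users, for illustration only.
unreviewed = ChangeRequest(initiator="engineer-a")
self_approved = ChangeRequest(initiator="engineer-a", approvals={"engineer-a"})
peer_reviewed = ChangeRequest(initiator="engineer-a", approvals={"engineer-b"})

for label, change in [
    ("unreviewed", unreviewed),
    ("self-approved", self_approved),
    ("peer-reviewed", peer_reviewed),
]:
    print(label, "->", "allowed" if may_run(change) else "blocked")
```

Excluding the initiator from the approval set matters here because the engineer in this incident believed the configuration was correct, so a self-check would not have caught the error.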
|||||
Has the IG Lead Been Informed? |
Yes [x] No [ ] |
Has The SIRO Been Informed? |
Yes [x] No [ ] |
|||
Has Caldicott Guardian Been Informed? |
Yes [ ] No [x] |
Has Customer Been Informed? |
Yes [x] No [ ] |
|||
Details of Any Advice Provided to Customer. |
Customers were provided with regular updates on resolution progress during the incident by email, through a ‘Banner Notification’ message on the IEG4 Customer Portal Page and on the IEG4 Website. |
|||||
Reporter’s Details |
||||||
Name: |
Alan Powell |
Job Title: |
Head of Projects and Support |
|||
Contact No: |
Via email |
Email: |
||||
Information Governance Lead Follow Up (Investigations, Findings & Planned Actions) |
||||||
Root cause analysis has established that this incident was caused by an MS Azure DevOps Pipeline task being executed with incorrect parameters. Services were restored successfully but there was significant disruption for customers as a result of this error. Appropriate actions have been put in place to minimise the risks of recurrence. |
||||||
IG Lead Name: |
Peter Banahan |
Date: |
18 Jul 24 |
|||
Change History Record
Issue | Description of Change | Author | Date of Issue |
1.0 | Initial issue | Alan Powell | 18 Jul 24 |
|
|
|
|