Harvard Partners was engaged to provide Incident Management training after the company experienced a major data center outage. While there, we observed that the company was constantly battling smaller outages, which prevented it from tackling the bigger issues. Outages (incidents) typically involved more than 75 people, including senior executives. Communication was not effective. Staff were tired and frustrated and looking for management to help them “get out of the trenches.”
Harvard Partners quickly performed a production stability assessment to identify and prioritize the remediation efforts that would yield the greatest relief for the staff. We interviewed staff, held Rapid Envisioning Sessions (RES), analyzed Remedy tickets, and reviewed incident and change logs.
Our findings were reviewed with the division head, and a decision was made to expand the Incident Management process and add a Communications Officer as a way to reduce the number of people involved in each incident. This was followed by the creation of an Incident Review Committee and strengthening of the Change and Resource Management processes.
- Adding the Communications Officer provided managed, targeted communications to non-involved staff, executives, and customers.
- Dropped incident participation from more than 75 people to only those directly involved in each incident.
- Allowed staff to manage incidents more easily, with customers noticing improved communications and commending the company on delivering better service.
- Implemented an Incident Review meeting with a target of always keeping remediation projects at a 50% level (at any point in time, half of all incident remediation projects were finished), and began addressing root-cause issues and reducing the overall number of incidents.
- Harvard Partners was asked to stay on to provide data center capacity planning assistance.