How to take over a large-scale and complex legacy system

Project background

Project takeover from a client to a 黑料门 team is critical and has a lasting impact on subsequent speed and quality of delivery. From the end of October 2020 to the end of December 2020, our team (C Team) took over a major payment gateway system with a nearly 20-year history from the client's team.

C Team officially took over the project's daily operations and maintenance work in January 2021, becoming responsible for the day-to-day operation of the system, including 24/7 On-Call, and for new feature development.

The entire takeover process, including the challenges we faced and the experiments we carried out, taught us a great deal. By sharing the practices we set up (incremental goal setting, a service takeover template, the C4 model and others), we hope to guide other teams through this process.

Challenges of this takeover

Before we start exploring what we learnt, it might be useful to provide a little more context about the situation our team faced. Given the complexity and long history of this payment gateway, we expected to face several challenges from the beginning.

  1. Complex business domain and outdated engineering practices

    In terms of business, the payment domain is always complex, due to the many different functionalities that it supports. Lack of clear business and technical documentation worsens the situation.

    In terms of technology, there were more than 100 services and over 300 repositories in total. This led to a number of problems and challenges: severe service coupling; many services with no pipelines, no testing environment or even no source code; problem-solving that depended on manual changes to production databases; outdated operating systems and software package versions; etc.

  2. Aggressive timeline with large number of services

    There were 100+ services involved and the initial timeline to complete transition was only 30 working days at the demand of the client team, as their team members were about to leave for other projects.

  3. Lack of hands-on experience

    Ultimately, true understanding requires hands-on experience. Taking over a long-established internal system is tough: after all, the people with knowledge about it are the internal team. As outsiders, we not only faced a steep learning curve but also had to build up practical experience as we went.

  4. Remote collaboration

    Remote collaboration also proved challenging as we needed to maintain frequent communication with client team members located in Melbourne, three hours ahead of the C Team in Xi'an, China. Different first languages also made communication harder between the teams.

Our practices

Incremental goal setting

How do we generally measure the maintenance quality of a legacy project?

Immediate goal: at the very least, we need to run the business as usual (BAU). This means that our team should have the required knowledge and skills to handle online incidents and daily business work after the departure of the client team.

Long-term goal: start to make small continuous improvements. This means the team has enough business and technical context to build an improvement plan and action on it to deliver great value.

Based on this, our team divided the project into three stages. This incremental approach allowed us to build the plan and adjust based on feedback.

Figure 1: Three stages of handover, from inception to delivering value

1. Takeover period

Goals: acquiring as much experience and knowledge as possible from the client's team.

Activities: reduce On-Call pressure; frequent presentation of periodical achievements to clients.

Focus: business knowledge, basic information about system and services, manual process, knowledge and skills based on experience.

2. Practicing via doing period

Goals: turn the team into a unit and have the ability to solve common problems.

Activities: start to handle on-call.

Focus: identifying the project focus; pair programming; resolving end-to-end business issues; learning from previous incidents, etc.

3. Continuous Improvement period

Goals: each team member is capable of dealing with on-call independently; delivering more value to the customer.

Activities: team members take turns to perform on-call with exposure to a variety of problems.

Focus: team sharing; solving specific problems and spreading knowledge one-to-one; deliver value through continuous optimization.

Setting up a baseline through the service takeover template

To ensure that the details of each service takeover were covered and standardized, and to speed up the collection of useful information in a consistent form for each service, our team defined a service takeover template. This template covered all the necessary information for a service, such as core functions, code repository links and test coverage, as well as easily omitted content such as technical debts, pitfalls and whether there had been any online issues. This template served as a clear acceptance criterion for the takeover of each service.

Each service also had a separate page on Confluence with clear records and documentation that could be easily referenced by the team.

We ended up creating 109 documents recording the basic service information, which greatly helped in our subsequent maintenance work and allowed the client team's developers to keep a permanent record. This served as a powerful input for future improvements.
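As a minimal sketch of how such a template could double as an acceptance check, the record below lists the items mentioned above as fields. The class, field and function names are assumptions made for illustration; the team's actual template lived on Confluence and is not reproduced here.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ServiceTakeoverRecord:
    """One record per service. Field names are hypothetical.

    None means "not yet answered"; an empty list means "asked, nothing found".
    """
    name: str
    core_functions: Optional[List[str]] = None
    repository_links: Optional[List[str]] = None
    test_coverage_percent: Optional[float] = None
    technical_debts: Optional[List[str]] = None
    pitfalls: Optional[List[str]] = None
    past_online_issues: Optional[List[str]] = None


def unanswered_fields(record: ServiceTakeoverRecord) -> List[str]:
    """Return the names of template fields that are still None."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def is_accepted(record: ServiceTakeoverRecord) -> bool:
    """A service counts as taken over once every field has an answer."""
    return not unanswered_fields(record)


# Hypothetical service, partially filled in after a first handover session.
record = ServiceTakeoverRecord(
    name="example-service",
    core_functions=["routes payment requests"],
    repository_links=[],  # asked: no source code could be located
)
print(unanswered_fields(record))
print(is_accepted(record))
```

Distinguishing "not yet answered" from "nothing found" matters for a system like this one, where some services genuinely had no pipelines or source code.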

Using the C4 model to clarify system architecture

Ultimately, the service template simply records information. It's useless if it can't be used to understand and solve the real problems: business issues.

Therefore, after handing over an independent service or a series of services, we would use the C4 model, drawing two high-level C1 (system context) and C2 (container) diagrams to visualize the inputs, outputs and dependencies of each service. Experience showed that the drawing process itself helped the team better absorb fragmented knowledge. Images were also a more effective means of communication with clients given the language barrier.

Figure 2: System context diagram of the payment process system
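The dependency information behind such diagrams can also be kept as plain data. The sketch below assumes each service's outgoing calls have been recorded as a list and derives the upstream callers from it; the service names are invented and do not describe the real payment gateway.

```python
from collections import defaultdict
from typing import Dict, List


def upstream_of(calls: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a 'service -> services it calls' map into 'service -> its callers'."""
    upstream = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            upstream[callee].add(caller)
    # Sort callers so the output is stable and easy to compare between runs.
    return {service: sorted(callers) for service, callers in upstream.items()}


# Hypothetical downstream map collected during handover sessions.
calls = {
    "checkout-api": ["payment-router"],
    "payment-router": ["card-gateway", "fraud-check"],
    "batch-settlement": ["card-gateway"],
}

for service, callers in sorted(upstream_of(calls).items()):
    print(f"{service} is called by: {', '.join(callers)}")
```

Services that appear in `calls` but never as a callee (here `checkout-api` and `batch-settlement`) are entry points, which is a useful starting place for a system context diagram.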

In the next part, we'll share more of the practices we implemented to overcome the challenges of takeover.

Filling in gaps through internal discussion

Given our time-limited and task-heavy situation, we adopted the "1 plus 1" model: pairing one of our team members with a member of the client's team, then selecting one service that the client was relatively familiar with to take over. In an ideal situation, seven services could be handed over in parallel daily while obtaining basic information according to the service takeover template to maximize knowledge gain.

However, this model also brought some problems:

  1. Team members didn't know about the services that others had handed over and the relationship with their own services.

  2. Some aspects would be missed when two people handed over a service.

  3. During the service takeover process, it was still impossible for team members to get a comprehensive understanding of the entire system even with the upstream and downstream of the service listed.

To address these problems, we introduced daily internal discussions. Team members took turns sharing what they'd learned while other members provided feedback, ensuring that services were handed over more effectively. The handover sessions produced enough material for 3 hours of internal discussion each day. Because time was limited, we focused on a high-level understanding of the system instead of going down rabbit holes. Sharing knowledge this way also reduced the risk of a single point of failure, and pairing further helped to improve availability.

Visualizing takeover progress

It's important to show progress during the transition period from different perspectives:

  • Weekly takeover plan based on number of services to be handed over and the duration of takeover. Use it to show the weekly progress of each iteration and compare it to our initial projections. Risks and issues are also part of the scope.

Figure 3: Weekly takeover plan based on number of services to be handed over

  • Service architecture map. The team updated it daily to show progress, using green markers for completed services and gray markers for pending services. Piecing the map together visually even added a fun and motivating element to the otherwise tedious work of takeover.

Figure 4: The service architecture map
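Comparing weekly progress against the initial projection comes down to simple arithmetic. The sketch below assumes an even spread of services across the weeks, which is a simplification rather than the team's actual plan; the total of 109 services matches the number of documents produced, while the six-week span and the weekly completion counts are invented for illustration.

```python
from typing import List


def planned_burndown(total_services: int, weeks: int) -> List[int]:
    """Services remaining at the end of each week if work is spread evenly."""
    return [total_services - (total_services * week) // weeks
            for week in range(1, weeks + 1)]


def actual_burndown(total_services: int, completed_per_week: List[int]) -> List[int]:
    """Services remaining at the end of each week that has elapsed so far."""
    remaining = total_services
    result = []
    for completed in completed_per_week:
        remaining -= completed
        result.append(remaining)
    return result


def weeks_behind_plan(planned: List[int], actual: List[int]) -> List[int]:
    """1-based week numbers where more services remained than planned."""
    return [week for week, (p, a) in enumerate(zip(planned, actual), start=1)
            if a > p]


planned = planned_burndown(total_services=109, weeks=6)
actual = actual_burndown(total_services=109, completed_per_week=[15, 20, 19])
print(planned)                             # [91, 73, 55, 37, 19, 0]
print(actual)                              # [94, 74, 55]
print(weeks_behind_plan(planned, actual))  # [1, 2]
```

Weeks flagged as behind plan are natural candidates for the risks and issues section of the weekly update.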

Minimizing risks through communication

It was incredibly important to synchronize the takeover issues and risks with the project's main interface person, so we communicated with the client's Delivery Manager (DM) weekly. Our communication centered on the following:

  1. Takeover progress: weekly updates on completed and outstanding takeover work, based on the service list and burndown chart, to keep the client informed.

  2. Obstacles: clarifying the issues encountered in the current stage, such as missing accounts and permissions, so that the DM could help us coordinate resources and eliminate obstacles.

  3. Risks: emphasizing and documenting risks to the DM, such as the risk of delays. For high-risk issues, we invited our China Leadership Team and the client's Senior Leadership Team to assist us.

We not only communicated frequently with the DM, but also established a good relationship with the client team's main technicians and their L2/L3 Operations Support Team and Client Success.

Finally, we conducted retrospective sessions with the entire client team in each iteration, so that members of both teams could give feedback and share knowledge.

Increasing confidence through incident drills

Undoubtedly, 24/7 On-Call was a great challenge for the team. Our team felt stressed due to our lack of practical On-Call experience and of in-depth understanding of business implementation details. We found that rehearsing past incidents was an excellent learning tool to assess the impact of online incidents and learn to resolve them quickly.

  1. The organizer would select a representative incident from past online failures for simulation, such as an incident integrating services with other gateways.

  2. The team dedicated 2 hours to simulating an online incident, asking pertinent questions without relying on prior knowledge.

  3. The team was divided into two groups, both of which identified problems and proposed potential solutions.

  4. The organizer then reviewed and clarified the relevant knowledge points.

Adopting the above methods allowed us to quickly adapt to the rhythm of On-Call and allowed each team member to have first-hand experience as Primary On-Call.

Post takeover issues

Customize inception for legacy systems takeover

As we learned from the first stage of takeover, the Inception activities that we generally apply to the launches of new projects had very limited impact on the takeover of complex legacy systems.

Some activities, however, were of great help. "Hopes and Concerns" and RAID logs effectively identified key problems at the beginning of the project so that we could manage them in a targeted way. They were also useful tools for presenting our takeover plan and scope to the client, and they let us adjust the plan based on feedback.

Activities like Trade-off Sliders, Elevator Pitch, Stakeholder Mapping and Empathy Maps provided little value in this context. It's more important to have deep-dive sessions with the key stakeholders than with everyone.

End-to-end view during takeover

It's important to define clear roles and responsibilities without relying on assumptions. Takeover is also a good opportunity to review ways of working and identify improvement opportunities with other teams, e.g. the L2/L3 operations support team. It was important for the team to have an end-to-end view of the entire process, including collaboration with other teams, so that team members could:

  1. Understand the whole system

  2. Build good communication and relationship with the support team

  3. Have the opportunity to optimize and improve

Achievements

C Team's achievements since taking over daily operations and maintenance in January 2021 speak for themselves:

  • Significant decrease in the number of incidents (down from 11 to 3 per month between February and April) while availability increased to 100%.

Figure 5: Incident count pre- and post-takeover

  • Capable of providing 24/7 support, not only handling our own incidents, but also supporting other teams. The scope varied from user configuration/onboarding issues to complex performance issues.

  • Reduction of Mean Time to Recovery to an average of 3 hours.

  • Significant uplift in knowledge management, including 109 basic service information documents and 30+ architecture/business diagrams.

  • Operation time improvement, up to 11 hours shorter for some business critical manual operations.
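For readers who want to track the same recovery metric, the sketch below shows one common way to compute a mean time to recovery: the average of recovery time minus detection time across incidents. The article does not state how the team measured it, so this definition is an assumption, and the timestamps are invented for illustration.

```python
from datetime import datetime
from typing import List, Tuple


def mean_time_to_recovery_hours(incidents: List[Tuple[datetime, datetime]]) -> float:
    """Average of (recovered - detected) over all incidents, in hours."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum((recovered - detected).total_seconds()
                        for detected, recovered in incidents)
    return total_seconds / len(incidents) / 3600


# Hypothetical incidents as (detected, recovered) pairs; not real project data.
incidents = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 12, 0)),   # 2 hours
    (datetime(2021, 3, 9, 22, 0), datetime(2021, 3, 10, 2, 0)),   # 4 hours
    (datetime(2021, 3, 20, 8, 30), datetime(2021, 3, 20, 11, 30)),  # 3 hours
]
print(mean_time_to_recovery_hours(incidents))  # 3.0
```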

Grateful acknowledgement is made to Weibo Wang, Hao Gu, Shuo Gao, Mengyang Sun, Li Yan, Claire Boquet, May Ping Xu, Sichu Zhang, Kaifeng Zhang.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of 黑料门.
