The systems that drive productivity and profitability in the modern enterprise form an ecosystem that is both internal and external. These ecosystems grow to be quite large, comprising hundreds of applications, usually centred around primary applications like SAP, Workday and Oracle. The data that flows between these applications is a critical part of these ecosystems, and the connections between the apps are crucially important. Think of the applications as neurons in a company’s electronic brain and the integrations between them as the synapses. The number of applications (neurons) drives an even higher number of integrations (synapses), as each application exchanges information with many other applications.
Dispatch specializes in building these integrations to optimize workflow, streamline master data management, and seamlessly connect to vendor managed services. The IT ecosystems in most companies are becoming increasingly complex and sophisticated. This complexity increases the number of integrations and makes the security, reliability and robustness of each integration critically important.
We understand how vital high-quality integrations are to a business, so when Dispatch builds integrations, we take our DIVE protocol project management approach very seriously. This methodology consists of Discover, Innovate, Validate and Empower.
The team at Dispatch cares about the quality of integration projects because all of us have been clients in both IT and business roles and understand the realities of life in production. Integration projects can be high risk and failures can be highly visible and very expensive. The Empower phase of our DIVE protocol is focused on what happens during go-live and in production. Let’s dive into life in production.
Life in Production Stages
Life in production is composed of four stages: go-live, shakedown, stabilization, and monitor. Each stage has its own timeframe, level of stress on resources, risks, and actions.
Go-live: If you’ve ever been involved in a significant software change in an organization, this is a familiar step. The precursor to go-live is always months of hard work, which includes requirements gathering, planning, development iterations, test and remediation, change management, and training. Go-live consists of the steps necessary to move the systems that appear stable in pre-production into production, including configuration changes, deployment of software assets, security configuration changes, final training, and coordinated execution by both business and IT teams. Go-live is a high-stress, high-stakes time for all members of the team. In a complex system implementation, mistakes are inevitable, and their visibility is typically high.
Dispatch mitigates some of those stresses by creating and rehearsing a go-live script of all tasks that need to be completed. We also follow a strict four-eyes rule: two people paired together, working through the scripted steps of implementation. We always insist on smoke testing once live to ensure there aren’t any uncontrolled variables that could cause problems. Thorough testing and validation in representative test systems is important, but it is very difficult to replicate the behaviours of a complex system entirely in pre-production, so go-live monitoring for unexpected results is critical.
Shakedown: For all the developers who have been through an implementation project, this is also a very stressful period. Some implementations have legendary stories of the shakedown period. Months of work, from requirements to design to implementation, can be scarred by unanticipated problems found once in production. Shakedown begins once the smoke test has cleared and the implementation team gives the all clear that the migration to production is complete. The users are then released into the system and can start using the new features or functionality.
The acceptance of the changes by the users and stakeholders is most fragile during this period. Typically, shakedown takes between one and five days. Dispatch supports this period with hyper-care, which is when we are extra vigilant to monitor, detect, and resolve unexpected behaviours that weren’t caught in pre-production testing.
Users may be unfamiliar with the new features and may not know how to react to workflow changes. With teams already pushing the envelope of managing complexity, there is usually little patience with changes that make their lives harder or more complex. Change management, training and communication are essential for users during this period, and IT teams often underestimate the importance of additional training and reinforcement during shakedown. Our preference is that all stakeholders and user groups are engaged early in the project, well before go-live, so that they have time to understand what changes will occur and have a voice in defining the best workflow from their perspectives.
The most insidious risk during shakedown is related to data integrity. Once the machinery of the new software is turned on and is actively processing, reformatting, updating, saving, and removing data, undoing a rollout or performing a rollback is really hard. The window for a data rollback is typically measured in hours or even minutes, and a rollback could mean restoring systems from backups. Undoing a go-live is the last resort, as it effectively means data loss and signals that something went very wrong during test, validation, or go-live.
Effective triage should classify each issue by its impact, any available workarounds, and whether immediate action is required or can be deferred. Issues that could affect data quality should be given the highest priority, as data quality problems tend to amplify the longer a system is live and are the most difficult to remediate.
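As a rough illustration of the triage rule above, here is a minimal Python sketch. The `Issue` fields, the `Priority` levels, and the exact ordering of the checks are our own assumptions for the example, not a fixed Dispatch standard; the one rule taken directly from the text is that anything touching data quality goes to the top.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; a lower number means act sooner."""
    P1_IMMEDIATE = 1
    P2_SOON = 2
    P3_DEFER = 3


@dataclass
class Issue:
    """Hypothetical triage record for one issue found during shakedown."""
    title: str
    affects_data_quality: bool
    blocks_users: bool
    has_workaround: bool


def triage(issue: Issue) -> Priority:
    # Data quality problems amplify while the system is live, so they
    # always get the highest priority, workaround or not.
    if issue.affects_data_quality:
        return Priority.P1_IMMEDIATE
    # Users are blocked and there is no workaround: act immediately.
    if issue.blocks_users and not issue.has_workaround:
        return Priority.P1_IMMEDIATE
    # Users are blocked but can work around it: fix soon.
    if issue.blocks_users:
        return Priority.P2_SOON
    # Everything else can be deferred to stabilization or the backlog.
    return Priority.P3_DEFER
```

In this sketch a cosmetic irritant lands in `P3_DEFER`, while a duplicated-record bug is `P1_IMMEDIATE` even if users have not noticed it yet.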
Managing shakedown requires a complete understanding of how the entire new system works and interacts with existing or legacy applications. It requires clear lines of responsibility for triage and fast communication. Complex systems create complex behaviours, which can cause problems that are difficult to detect and resolve. We always recommend “instrumenting” any new system to collect data on behaviour and performance. Instrumenting helps efficiently detect and capture issues instead of having systems fail silently. Rapid and open communication facilitates good decision-making and problem-solving. People are notoriously poor problem-solvers when under high stress. A good project manager with an understanding of human psychology will help ensure that all stakeholders are on the same page and working together through this stage.
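To make “instrumenting” concrete, here is one possible sketch using only Python’s standard library. The decorator name `instrumented`, the logger name, and the `transform_records` step are hypothetical; the point is simply that every step records how long it took and whether it succeeded, and that errors are logged and re-raised rather than swallowed.

```python
import functools
import logging
import time

logger = logging.getLogger("integration")  # hypothetical logger name


def instrumented(step_name):
    """Log the duration and outcome of a step so it cannot fail silently."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                elapsed = time.perf_counter() - start
                # logger.exception records the traceback along with the message.
                logger.exception("step=%s status=failed seconds=%.3f",
                                 step_name, elapsed)
                raise  # re-raise so the caller still sees the failure
            elapsed = time.perf_counter() - start
            logger.info("step=%s status=ok seconds=%.3f", step_name, elapsed)
            return result
        return wrapper
    return decorator


@instrumented("transform_records")
def transform_records(records):
    # Hypothetical integration step: normalise identifiers.
    return [r.strip().upper() for r in records]
```

The log lines use a simple `key=value` layout so that they are easy to search or feed into whatever monitoring tool the operations team already uses.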
Stabilization: Once initial shakedown has occurred, the stabilization period ensures all stakeholders are satisfied with the outcome of the project rollout. The duration of stabilization can vary depending on the number of outstanding issues deferred from the project or discovered during shakedown. This part of life in production incorporates the voice of the system, the voice of the customer and the voice of operations into a stabilization plan. During stabilization, non-critical or low-priority issues that were postponed during shakedown are addressed. Sometimes these fixes involve minor development work, sometimes refined documentation & training. Often users provide input during this stage about how life in production is different than they expected and ask for tweaks or minor change requests to address irritants. Usually, these issues are addressed with software hotfixes.
Critical for success with the hotfix approach is regression testing, that is, retesting existing behaviour to confirm that already-working code isn’t negatively affected by the correction. You don’t want the cure to be worse than the ailment.
During stabilization, the stress level of all participants is starting to decline, but project leadership needs to be mindful of project exhaustion. Prioritization, again, is key, with clear lines of communication and alignment of triage between the stakeholders. An assessment of risk and reward will ensure that only essential hotfixes are completed. Other minor issues and change requests are best added to a backlog and completed in a regular development cycle.
Monitor: This part of the Empower phase has the longest cycle and often gets the least attention, yet without proper forethought it can become a vampire process, sucking time, attention, and compute cycles because the system never really stabilizes.
The monitor part of the cycle starts once there is consensus that the release is meeting stakeholder expectations, and the rate of change has slowed down. The project team has dispersed, and ownership has transitioned from the original developers to an operations team.
Ensuring that the operations team has everything they need to be successful is yet another transition that the project team has to consider carefully. The “support model” is often deferred in the project lifecycle until very late into delivery. Sometimes the transition to operations support is not as smooth as desired because the operations teams are left feeling that the project team just threw the new system over the wall to them, with inadequate training and guidance.
Dispatch works as the support team for many of our clients and is therefore quite opinionated about the best way to transition from the development team to operations. We consult the operations team during initial requirements gathering to ensure that their needs for manageability, recoverability, scalability, etc. are addressed in the initial design. One tool that we use to capture these considerations is the FMEA (Failure Mode and Effects Analysis) model. The FMEA is borrowed from the automotive and aerospace industries and is a document summarizing what kinds of failures may occur, how the system behaves during those events, and what the impacts would be in these scenarios. The FMEA is a beneficial tool for operations teams as a guide to help them address issues in production. A breakdown of the FMEA is a topic for another blog post, but we’ve found it quite useful in our projects to capture the voice of the operations team.
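To give a feel for what such a document captures, here is a small Python sketch of an FMEA as structured data. The three content fields mirror the description above (failure, system behaviour, impact); the `detection` field, the 1 to 5 `severity` scale, and all of the sample entries are assumptions made up for illustration, not an actual Dispatch template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    failure: str            # what can go wrong
    system_behaviour: str   # what the system does when it happens
    impact: str             # consequence for the business
    detection: str          # how operations would notice (assumed field)
    severity: int           # 1 (minor) to 5 (critical), assumed scale


# Hypothetical entries for an imaginary payroll file integration.
FMEA = [
    FailureMode(
        failure="Target system unavailable",
        system_behaviour="Delivery is retried, then the run is marked failed",
        impact="Downstream data is stale until the next successful run",
        detection="Failure notification email",
        severity=3,
    ),
    FailureMode(
        failure="File delivered with missing records",
        system_behaviour="Run reports success",
        impact="Incomplete data is loaded downstream",
        detection="Record count reconciliation",
        severity=5,
    ),
    FailureMode(
        failure="Service account credentials expired",
        system_behaviour="Run fails at authentication",
        impact="No data is delivered",
        detection="Failure notification email",
        severity=4,
    ),
]


def by_severity(entries):
    """Return entries with the most severe failure modes first."""
    return sorted(entries, key=lambda e: e.severity, reverse=True)
```

Sorting by severity gives the operations team a simple reading order: the failure modes that hurt most come first, along with how they would be noticed.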
One particular challenge with integrations in production is that an integration between systems typically doesn’t have a user interface and runs silently, which can make monitoring system health and issue detection problematic.
More often than not, email notifications are the primary channel for integrations to tell their owners what is taking place. With Workday, for example, a notification can be created to email a security group or an email address directly when there is a problem with an integration run or event. These notifications are intended to let operators know that something negative has taken place. Emails are a good start, but they often lack the context needed for root-cause problem-solving.
A poorly designed integration error handling system can also create false-positive email events, which spam the operations team with false failures that end up being ignored. Too many false-positive emails undermine the intention of the email alerts and hurt productivity.
On the other hand, integration failures may also go undetected. For example, data files may be delivered, but the records may be incomplete or corrupt. In these cases, the integration system would happily report that all is well. These false negatives can result in significant costs and negative business impacts. Our FMEA helps address these situations by anticipating these kinds of issues and building in detection mechanisms during design.
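One simple detection mechanism for this kind of false negative is to reconcile what was delivered against what was expected. The sketch below is only an illustration: the column names in `REQUIRED_FIELDS`, the CSV layout, and the idea that the sending system supplies an expected record count are all assumptions.

```python
import csv
import io

REQUIRED_FIELDS = ("employee_id", "pay_amount")  # hypothetical column names


def validate_delivery(csv_text, expected_count):
    """Return a list of problems found; an empty list means the file passed."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Check 1: did we receive as many records as the source said it sent?
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(rows)}")

    # Check 2: does every record carry the fields downstream systems need?
    # Line 1 is the header; assumes one record per line (no multi-line fields).
    for line_no, row in enumerate(rows, start=2):
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                problems.append(f"line {line_no}: missing {field}")
    return problems
```

A run that delivers a file but returns a non-empty list here should be treated as a failure, even though the transfer itself succeeded.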
The Dispatch approach to the Empower phase of the development process pays attention to the elements of life in production by keeping the end in mind and working backward. A successful project doesn’t end at go-live; it’s actually just beginning!
Contact us to learn more about our products and services.
About Dispatch Integration:
Dispatch Integration is a software development and professional services firm that develops, delivers, and manages advanced data integration and workflow automation solutions. We exist to help organizations effectively deal with the complex and ever-changing need to integrate data and optimize end to end workflows between cloud-based, mission-critical applications.
Read More from Dispatch Integration:
- Data Integration: Life in Production - April 29, 2020
- Integration Case: Workday payroll replacement project at a national fashion retailer - April 30, 2019