This deep-dive post was written by our CTO, Dominic. It's unashamedly technical and uses acronyms and jargon freely.
This post reviews a particular challenge we faced in delivering a brand-new music catalogue system for De Wolfe Music. Our remit was not just to produce a state-of-the-art web platform for their worldwide user base, but also to provide full catalogue management capabilities and systems integration for the back office. One challenge was to manage data synchronisation across multiple systems.
Any technologist will tell you that managing data duplicated across disparate systems is a hornets' nest of problems best avoided. But sometimes the unavoidable has to be faced! In our case, we had three systems involved: a music catalogue system, Salesforce and an internal accounting system.
Entities principally involved were accounts, contacts and products (to use Salesforce terminology). Some top-level rules:
Best case scenario: Whatever is changed in one system is seamlessly, reliably and immediately reflected in the other two as necessary.
“Two-phase commit,” I hear you murmur. This implies the following logical procedure for each CRUD change a system is considering: first, propose the change to the other systems, which validate it without applying it; then, only if every system confirms that it can apply the change, commit it everywhere; otherwise, abandon it everywhere.
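As a rough illustration of the idea (not our actual implementation – the `SystemClient` class and its methods are invented for the example), a coordinator-style two-phase commit over three systems might look like:

```python
class SystemClient:
    """Hypothetical stand-in for one of the three systems' endpoints."""

    def __init__(self, name):
        self.name = name
        self.committed = []

    def prepare(self, change):
        # Phase 1: validate the proposed CRUD change without applying it.
        return change.get("valid", True)

    def commit(self, change):
        # Phase 2: apply the change for real.
        self.committed.append(change)

    def abort(self, change):
        # Discard any provisional state; nothing to do in this sketch.
        pass


def two_phase_commit(systems, change):
    """Commit a change to all systems, or to none of them."""
    # Phase 1: ask every system whether it can accept the change.
    if not all(s.prepare(change) for s in systems):
        for s in systems:
            s.abort(change)
        return False
    # Phase 2: every system agreed, so apply the change everywhere.
    for s in systems:
        s.commit(change)
    return True
```

A change that any one participant rejects in phase 1 is applied nowhere, which is exactly the all-or-nothing guarantee the post goes on to explain we could not practically achieve.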
Sounds simple. But it’s not, for many practical reasons. For starters, none of the systems could reasonably be changed so that their UIs would operate without updating their native database prior to any asynchronous validity callout. And I’ve completely ignored the thorny issue of tracking key equivalents (i.e. we must record a cross-reference of entity IDs).
Given a limited budget, we settled for a pragmatic approach, which took the form of:
“We expect synchronisation failures. We will try to make them as infrequent and inconsequential as possible. We will detect and report them. We will have procedures to rectify them.”
Leaving out the inevitable complexities involved, here’s the summary of what we did:
First, we ensured that the validation rules for each system were as closely aligned as possible. This means that if a CRUD operation succeeds in one, it will very likely succeed in the others. Second, we had a “fire and forget” notification strategy. In other words, once a CRUD operation succeeded in the native system it dispatched a message to a central service without the need for further error processing – the system simply assumed that all would be well. Third, each system would read a queue of inbound CRUD changes, apply those changes (if it so chose) and be given the opportunity to report back if it could not.
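The “fire and forget” step can be sketched as follows. This is a minimal illustration, not our production code: the function and message field names are invented, and an in-memory queue stands in for the real message broker.

```python
import json
import queue

# Stand-in for the broker queue feeding the central management service.
outbound = queue.Queue()


def save_and_notify(db, record_id, fields, source="catalogue"):
    """Apply a CRUD change natively, then fire-and-forget a message.

    No confirmation is awaited: downstream failures are detected and
    reported by the central service, not by the originating system.
    """
    db[record_id] = fields                      # native update succeeds first
    message = {"source": source, "op": "upsert",
               "id": record_id, "data": fields}
    outbound.put(json.dumps(message))           # dispatch and move on
```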
Thus we made the implementation task for each system both feasible and simple, putting all the smarts into the central management service. Each system interacts with the central management service by reading and writing standardised messages to designated queues.
So what does our central management service do? Its primary task is to manage the cross-referencing of entity identifiers. Each message received will have at least one identifier specific to the entity to which the CRUD operation applies, but may have other identifiers for “joins” – like the account to which a contact belongs. The service maintains a cross-reference table of identifiers from all three systems and performs a translation on the incoming message before forwarding it to the inbound queues of the other systems. If the primary identifier cannot be translated, it’s interpreted as a new item.
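To make the translation step concrete, here is a hypothetical sketch. The `xref` table maps a `(source_system, native_id)` pair to the equivalent identifiers in each system; the structure and names are assumptions for illustration, not the actual schema we used.

```python
def translate(xref, message, target):
    """Rewrite a message's identifiers into the target system's terms.

    Returns (translated_message, is_new). If the primary identifier has
    no cross-reference entry, the item is treated as a new one.
    """
    key = (message["source"], message["id"])
    entry = xref.get(key)
    if entry is None or target not in entry:
        # Unknown primary identifier: forward as a create with no target id.
        return {**message, "id": None, "op": "create"}, True
    translated = {**message, "id": entry[target]}
    # Translate any join identifiers (e.g. a contact's parent account).
    joins = {}
    for name, native_id in message.get("joins", {}).items():
        mapped = xref.get((message["source"], native_id), {})
        joins[name] = mapped.get(target)
    if joins:
        translated["joins"] = joins
    return translated, False
```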
From each system’s perspective, it only knows about its own identifiers, and these are what it encodes in the messages it sends (together with other data content). If a system creates a new item, it has the responsibility of sending a message back to the central management service with the new identifier so that the service can record it as appropriate.
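Recording that reply on the central service is then a small update to the same cross-reference table. Again a sketch under the same assumed `xref` structure; `record_new_identifier` is an invented name:

```python
def record_new_identifier(xref, source, native_id, entity_key):
    """Record the id a system assigned to a newly created item.

    entity_key is the (system, id) pair the central service originally
    knew the entity by, i.e. the identifier of the originating system.
    """
    entry = xref.setdefault(entity_key, {entity_key[0]: entity_key[1]})
    entry[source] = native_id
```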
One little point of detail worth mentioning: each system must differentiate between a change made by applying a message from the queue and a change made by any other means (e.g. via its UI). Changes of the former kind must not dispatch a CRUD message back out, or else we’ll have a merry-go-round!
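One way to implement that suppression (a sketch, not necessarily how each system did it) is a flag marking the “currently applying an inbound message” state, so that the shared save path only emits outbound messages for user-originated changes:

```python
import threading

# Flag marking changes that originate from the inbound queue, so they
# are not re-dispatched (avoiding the merry-go-round).
applying_inbound = threading.local()


def save(db, outbox, record_id, fields):
    """Shared save path for both UI changes and inbound-queue changes."""
    db[record_id] = fields
    # Only user/UI-originated changes generate an outbound message.
    if not getattr(applying_inbound, "active", False):
        outbox.append({"id": record_id, "data": fields})


def apply_inbound(db, outbox, message):
    """Apply a queued change without echoing it back out."""
    applying_inbound.active = True
    try:
        save(db, outbox, message["id"], message["data"])
    finally:
        applying_inbound.active = False
```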
So what can possibly go wrong? Well, a few things. Principally, it’s about messages failing to get applied into a system, and these errors are collated and reported by the central service for easy diagnostics and recovery. It is also possible for a change made in one system to overwrite an almost-simultaneous change made in another on the same object. For a banking system this would be disastrous. In our situation, the business is happy to take the compromise on the basis that a) it is extremely unlikely to occur and b) it is not particularly bothersome if it does. Besides, even the two-phase commit procedure mentioned earlier would require enhancement to avoid this problem.
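The failure collation need not be elaborate; something along these lines is enough for diagnostics and manual recovery (an invented sketch, not our actual reporting code):

```python
from datetime import datetime, timezone

# Each system reports messages it could not apply; the central service
# collates them for diagnostics and rectification procedures.
failures = []


def report_failure(system, message, reason):
    """Record one message a system could not apply."""
    failures.append({
        "system": system,
        "message": message,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })


def failure_summary():
    """Count of unapplied messages per system, for a quick health view."""
    counts = {}
    for failure in failures:
        counts[failure["system"]] = counts.get(failure["system"], 0) + 1
    return counts
```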
The use of a persistent queuing system, such as RabbitMQ, provides great decoupling of system components and ensures system availability even if functional components have to be taken down from time to time – messages are simply queued up.
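That decoupling property is easy to see in miniature. Here an in-memory queue stands in for the broker (a real deployment would use RabbitMQ with durable queues and persistent messages): producers keep publishing while a consumer is offline, and the backlog is drained once it comes back.

```python
import queue

# Stand-in for a persistent broker queue; messages accumulate while the
# consuming component is down and are worked through on restart.
broker = queue.Queue()


def publish(message):
    """Producer side: enqueue regardless of consumer availability."""
    broker.put(message)


def drain(handler):
    """Consumer comes back online and works through the backlog."""
    processed = 0
    while not broker.empty():
        handler(broker.get())
        processed += 1
    return processed
```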
In summary, we implemented a design pattern that supports any number of systems and minimises the work required to introduce each one. The service is resilient and responsive, achieving near real-time synchronisation with appropriate error-handling support.