The launch we covered in the previous post was a major milestone, but not the final destination. In fact, it was a solid foundation for the many improvements made later - from small bugfixes and tuning, to supporting new business initiatives. The design choices we made - eventsourcing, statefulness, distributed system, etc. - affected all of those changes; most often making hard things easy, but sometimes making easy things complex.
The events, systems, designs, implementations, and other information listed here are not in any way connected with my current work and employer. They all took place in 2017-2019, when I was part of the Capacity and Demand team at Redmart - a retail grocery seller and delivery company. To the best of my knowledge, this post does not disclose any trade secret, pending or active patent, or other kind of protected intellectual property - it focuses on my (and my team’s) experience using the tools and techniques described. Most of it was already publicly shared via meetups, presentations, and knowledge transfer sessions.
Back to the series overview
- Journey through eventsourcing: Part 1 - problem background and analysis - June 27, 2020
- Journey through eventsourcing: Part 2 - designing a solution - July 14, 2020
- Journey through eventsourcing: Part 3.1 - implementation - August 1, 2020
- Journey through eventsourcing: Part 3.2 - implementation - August 14, 2020
- Journey through eventsourcing: Part 4 - Pre-flight checks and launch - November 30, 2020
- Journey through eventsourcing: Part 5 - Support and Evolution - February 15, 2021
TL;DR: schema evolution in eventsourcing systems is much more convoluted than in state-sourced ones. Before deciding between eventsourcing and state-sourcing, familiarize yourself (and the team) with the implications - they might influence the decision heavily.
Schema evolution is relatively trivial in a classical state-sourced system. When the persisted data changes its structure, the solution is to write and run a schema migration script. The script might be a set of instructions in a fancy DSL, a literal Python/Go/Ruby/JS/etc. script that reads, updates, and saves DB records in a loop, or just a SQL command to execute. Many languages, frameworks, and 3rd-party libraries exist to support that[1]. Shortly put: problem solved, nothing to see here, folks.
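To make the state-sourced case concrete, here is a minimal sketch of such a read-update-save migration. The schema and names (`CustomerV1`, splitting a `name` column into two fields) are hypothetical, and an in-memory list stands in for the database table a real script would page through:

```scala
// Hypothetical state-sourced migration: split a single `name` field
// into `firstName`/`lastName`. A real script would page through DB
// rows; an in-memory list stands in for the table here.
final case class CustomerV1(id: Long, name: String)
final case class CustomerV2(id: Long, firstName: String, lastName: String)

def migrate(row: CustomerV1): CustomerV2 = {
  val parts = row.name.split(" ", 2)
  CustomerV2(row.id, parts(0), if (parts.length > 1) parts(1) else "")
}

// The whole "migration script": read every row, transform, save back.
val table    = List(CustomerV1(1, "Ada Lovelace"), CustomerV1(2, "Plato"))
val migrated = table.map(migrate)
```

Once the script has run, the old representation is gone - which is exactly the property eventsourced systems cannot rely on.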
On the contrary, in eventsourcing, events are expected to stay valid indefinitely - the current version of the application code should be capable of handling any prior event. That makes the “just run the migration script” approach much harder and not always possible[2]. The eventsourcing community has come up with multiple solutions, such as in-application event adapters, copy-transform event updates, snapshot+discard techniques, and so on (one of my old presentations has a slide on it) - each having a different impact on the application itself and related systems[3].
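The in-application adapter idea can be sketched in a few lines of plain Scala. The event types and the `source` field are hypothetical; Akka Persistence formalizes the same idea with its `EventAdapter` interface, which is applied to every event as it is read from the journal:

```scala
// Sketch of an in-application event adapter with hypothetical events.
// V1 events lack a `source` field; the adapter upcasts them on read,
// so the rest of the app only ever sees the latest event version.
sealed trait OrderEvent
final case class OrderPlacedV1(orderId: String) extends OrderEvent
final case class OrderPlacedV2(orderId: String, source: String) extends OrderEvent

// Applied to every event read from the journal.
def fromJournal(event: OrderEvent): OrderEvent = event match {
  case OrderPlacedV1(id) => OrderPlacedV2(id, source = "unknown") // best-effort default
  case other             => other
}
```

Note the "best-effort default": as footnote [2] points out, the true value of a new attribute for an old event is often simply unknowable.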
In our case, we went with the in-app adapter approach - the one promoted by Akka Persistence[4]. All in all, it was an interesting exercise - writing and enabling the adapter was easy; however, one of the models changed three times in a couple of months, each change producing yet another event adapter. So we were on the brink of needing something more radical (I was exploring the snapshot+discard options), but then the data model finally stabilized.
Logistics optimization project
Image source: Wikimedia
TL;DR: eventsourcing makes it easy to implement CQRS, which in turn makes it easy to implement many things. One nice trick is to build a “private” query endpoint that provides hints to the command endpoints. It has many uses: querying a materialized view optimized for a particular lookup, accessing data usually not available to the command endpoint, etc.
One of the business initiatives our new system unlocked revolved around optimizing logistics efficiency. From a technical perspective, the change was to allocate customer orders into capacity pools more efficiently. Depending on which other orders were in the same capacity pool, both the financial and the capacity cost of fulfilling an order could vary significantly. That required capacity pool actors to inspect the other pools that could potentially fulfill the order.
A straightforward approach - poll the other actors about their “cost of fulfillment” - was possible but inefficient. Instead, my colleague came up with a different solution: these additional data requirements could be formulated as a simple index lookup if we could “slice” the data differently. Essentially, it meant building and maintaining a materialized view for a new projection of the system state. We had CQRS embedded deeply into our solution, so building another query endpoint was simple enough - in fact, we had multiple options for the particular implementation. Because of the issues with Distributed Data we had faced earlier, we went ahead with a Cassandra-based solution - with tables carefully designed around the query, allowing the data to be read from memory almost all the time.
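The "slice the data differently" idea can be illustrated with a small sketch. Everything here is hypothetical (the real projection lived in Cassandra, not in a map): a projection consumes the same events and maintains an index keyed by the query dimension, so the command side can answer "which pools could take this order?" with a single lookup instead of polling every pool actor:

```scala
// Hypothetical projection: maintain an index keyed by delivery zone,
// so "which pools in this zone have spare capacity?" is one lookup.
final case class CapacityChanged(poolId: String, zone: String, freeSlots: Int)

final class ZoneIndex {
  // zone -> (poolId -> free slots); an in-memory stand-in for the
  // Cassandra table designed around this exact query.
  private var index = Map.empty[String, Map[String, Int]]

  // The projection handler: applied to every relevant event.
  def apply(e: CapacityChanged): Unit = {
    val zonePools = index.getOrElse(e.zone, Map.empty[String, Int])
    index += (e.zone -> (zonePools + (e.poolId -> e.freeSlots)))
  }

  // The "private" query endpoint the command side consults for hints.
  def poolsWithCapacity(zone: String): Set[String] =
    index.getOrElse(zone, Map.empty[String, Int])
      .collect { case (id, free) if free > 0 => id }
      .toSet
}
```

The design choice mirrors the post: the view is derived purely from events the system already emits, so adding it did not touch the command-side model at all.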
TL;DR: integrating with the logistics systems of Redmart’s parent company imposed even stricter latency and availability requirements. The choice of technology allowed us to scale out with ease and migrate to more efficient networking - while keeping the correctness and consistency requirements intact.
This project had relatively little to do with eventsourcing but was the ultimate test of the entire solution and the technology choices we had made along the way. It touched both the “traditional” and the eventsourcing systems in our solution[5]. The requirements were simple: expose the existing API via a new “medium” (a proprietary solution on top of gRPC), achieve <100ms latency at the 99th percentile under 5x current traffic, and provide 99.95% availability[6].
Creating the gRPC endpoints was relatively straightforward - the choice to go with Akka HTTP played out quite well due to its design philosophy: it is a library to build HTTP interfaces, not a framework to structure the app. Thanks to that, we just had to add Akka gRPC alongside the existing HTTP endpoints and wire it to the application logic. It wasn’t just a configuration switch - some effort was necessary to “reverse-engineer” the DTOs we used in the REST+JSON endpoints into protobuf declarations - but it was still straightforward enough.
The initial gRPC implementation needed some more work down the road to meet the aggressive latency targets - essentially, my teammates had to build an analog of the “smallest-mailbox” routing strategy over the Akka gRPC client connections to achieve better client-side load balancing.
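The core of a "smallest-mailbox"-style balancer fits in a few lines. This is a sketch of the idea, not the actual code: names like `Channel` and `SmallestMailboxRouter` are illustrative, and a real implementation would wrap Akka gRPC client connections rather than plain objects:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for a client connection; tracks requests currently in flight.
final class Channel(val name: String) {
  val inFlight = new AtomicInteger(0)
}

// Route each call to the connection with the fewest in-flight requests -
// the client-side analog of Akka's smallest-mailbox routing.
final class SmallestMailboxRouter(channels: Vector[Channel]) {
  def pick(): Channel = channels.minBy(_.inFlight.get())

  def call[A](f: Channel => A): A = {
    val ch = pick()
    ch.inFlight.incrementAndGet()
    try f(ch) finally ch.inFlight.decrementAndGet()
  }
}
```

Compared to round-robin, this keeps a slow or saturated connection from accumulating a queue while its siblings sit idle - which is exactly what hurts tail latency.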
Reducing latency required quite a lot of tinkering and tweaking - although a “classical” system would need most of it as well. To name a few: tuning JVM garbage collection to cause fewer stop-the-world pauses, enabling Cassandra’s speculative execution, and aggressively caching everything we could (via Hazelcast Near Cache with some custom cache-warmup code).
One thing highly relevant to the technology we used was the move to a new Akka remoting backend - from the Netty-based TCP remoting (now known as “Classic remoting”) to Artery UDP remoting. While this was a large change that delivered tangible latency savings (~10-20ms, depending on the load), the code changes were small - mostly configuration and dependency changes.
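For a sense of how small the change was, here is a sketch of the configuration side of such a switch (hostnames and ports are illustrative; the actual settings depend on the Akka version and deployment):

```hocon
akka.remote {
  # Before: classic Netty-based TCP remoting, e.g.
  # classic.netty.tcp { hostname = "10.0.0.1", port = 2552 }

  # After: Artery over UDP
  artery {
    enabled = on
    transport = aeron-udp
    canonical.hostname = "10.0.0.1"   # illustrative
    canonical.port = 25520            # illustrative
  }
}
```

The rest of the migration was dependency changes plus re-testing - the actor-messaging code itself stayed untouched.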
Overall, the integration project was a major success - we integrated on time and with huge margins on the load and latency targets: in synthetic load tests, the system could sustain about 10x the target traffic (50x the actual traffic) while keeping a ~20% latency buffer (~80ms at the 99th percentile).
There are some other minor but noteworthy learnings, so I decided to put them here, at the end of the journey, in the last section.
Planning for 10x/20x/etc. throughput is not overkill: there was some minor debate about the throughput targets we should set. 10x seemed a little excessive, especially taking into account that growth was limited by real-world operations. Nevertheless, our bet on designing for an order of magnitude higher throughput paid off. During a spike of demand caused by COVID and operational constraints, the system regularly handled 40x traffic “in the wild” with no noticeable customer experience degradation. Failing at that time would have meant tremendous financial and reputational losses for the company.
“Time travel”: one of eventsourcing’s selling points is the ability to restore the system to any prior state - also known as “time travel.” Having such an ability sounds quite exciting for debugging and audit - but it is not that straightforward to achieve. The main obstacle is schema migration - some migration approaches destroy the “time continuum” and make it impossible to “time travel” beyond a certain point. Moreover, developers need to build some infrastructure to expose the time-traveled state alongside the current state - a separate set of endpoints or a dedicated service deployment is necessary. Simply put, “time travel” is not free and impacts other decisions heavily.
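The mechanics of "time travel" itself are just a replay cut off at a point in time - a fold over the events recorded up to the requested moment. The event and state types below are hypothetical:

```scala
// Sketch of "time travel" by replay: rebuild state by folding only the
// events recorded up to the requested point in time.
final case class Event(at: Long, delta: Int)   // e.g. capacity reserved/released
final case class PoolState(reserved: Int)

def stateAt(events: Seq[Event], time: Long): PoolState =
  events
    .filter(_.at <= time)                      // the "time travel" cut-off
    .foldLeft(PoolState(0))((s, e) => PoolState(s.reserved + e.delta))
```

The fold is the easy part; as noted above, the hard parts are keeping old events replayable across schema migrations and exposing the time-traveled state to users.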
Monitoring: one trick that helped us a lot was to enrich our liveness probe (is the system running?) with some usefulness information (is it doing the right job?). It was mostly a hack - the liveness endpoint was already integrated with many monitoring tools and dashboards and was mandatory for all services, while usefulness monitoring was not a thing[7]. By putting usefulness information into the liveness check, we made alerts report a somewhat higher problem impact than there actually was, but they notified us about problems much earlier. It was handy during the stabilization phase shortly after launch - there were cases when certain groups of actors would go down while the rest of the system (including the liveness probe) worked as expected - and such cases would have been hard to notice otherwise.
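The shape of such an enriched probe is simple; a sketch with hypothetical check names (the real checks would query actor health, journal writes, and the like):

```scala
// Sketch of folding "usefulness" checks into the liveness endpoint:
// the probe fails not only when the process is down, but also when the
// system stops doing useful work. Check names are hypothetical.
final case class Check(name: String, ok: Boolean)

// Returns an HTTP-style (status, body) pair for the probe response.
def healthReport(checks: Seq[Check]): (Int, String) = {
  val failed = checks.filterNot(_.ok)
  if (failed.isEmpty) (200, "OK")
  else (503, failed.map(c => s"FAILING: ${c.name}").mkString("\n"))
}
```

Because the probe returns a non-200 status when any usefulness check fails, every dashboard and alert already wired to the liveness endpoint picks up the degradation for free - the trade-off being the slightly overstated impact mentioned above.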
Eventsourcing systems come with a unique set of capabilities but also bring their own issues and concerns to address. These unique features make evolving an eventsourcing system a very different process compared to classical, state-sourced systems - many sophisticated things become simple, if not trivial (such as audit, system state provenance, derived data streams, etc.), but many simple things become much more convoluted (schema evolution, persistence, etc.).
My team experienced both the “good” and the “challenging” aspects - some evolution projects and business initiatives hit the sweet spot where the system design and technology choices allowed us to achieve the goals fast and without compromising the core system values. On the downside, some other changes required much more effort to implement compared to similar changes in state-sourced systems.
Here the journey ends. Looking back, we have covered the full lifecycle of a software system built on eventsourcing principles - from gathering the requirements and evaluating the goals and challenges against eventsourcing’s promises, to making concrete decisions regarding technology and implementation, to launching and evolving the system.
I hope that you have learned a lot (or at least a few things) and are better informed about the strengths of eventsourcing systems, as well as the challenges they pose. I hope that next time you are tasked to build a new system or service, you’ll consider eventsourcing as an option - and make an informed decision.
[2] For example, when adding a new event attribute, it is not always possible to “go back in time” and infer what the value of that attribute would have been for an old event. ↩︎
[3] In-app adapters generally work well for the app itself, but any derived data streams are left behind. Copy-transform caters to the derived data much better, but the application itself requires more changes and planning. Reasoning about the system is easier with snapshot+discard, but it erases history. Other techniques also have their pros and cons. ↩︎
[6] The initial requirements were <10ms max latency and 100% availability. We were able to negotiate these down to reasonable values given the deployment specifics and network infrastructure. ↩︎
[7] …or at least it was assumed that usefulness === liveness. ↩︎