Table of contents
This table is automatically updated when new posts are published.
- Journey through eventsourcing: Part 1 - problem background and analysis - June 27, 2020
- Journey through eventsourcing: Part 2 - designing a solution - July 14, 2020
- Journey through eventsourcing: Part 3.1 - implementation - August 1, 2020
- Journey through eventsourcing: Part 3.2 - implementation - August 14, 2020
- Journey through eventsourcing: Part 4 - Pre-flight checks and launch - November 30, 2020
Journey through eventsourcing: Part 1 - problem background and analysis
Eventsourcing is not a straightforward approach to building software systems. However, in the face of challenging non-functional requirements, it can offer a unique set of features that make seemingly contradictory goals achievable. In the case of my project, the main drivers towards eventsourcing were the tight latency budget and the strong(-ish) consistency requirements.
On top of that, eventsourcing has a set of unique “functional” features, such as “time-travel”, an audit log of events out of the box, and support for (re)building data streams or views from historical data. In our case, however, these were “nice to have” features and did not weigh heavily on the decision.
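To make those “functional” features concrete, here is a minimal sketch in Python. It is not taken from the project: the `Event` type, the account-balance domain, and the `apply` and `replay` helpers are all hypothetical, chosen only to show how an append-only log can serve as an audit trail, allow “time-travel”, and feed a newly built view.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """A hypothetical domain event: a change to an account balance."""
    sequence: int  # position in the log, starting at 1
    kind: str      # "deposited" or "withdrawn"
    amount: int


def apply(balance: int, event: Event) -> int:
    """Fold a single event into the current state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(log: List[Event], up_to: Optional[int] = None) -> int:
    """Rebuild state from the log; `up_to` stops replay at an earlier sequence number."""
    balance = 0
    for event in log:
        if up_to is not None and event.sequence > up_to:
            break
        balance = apply(balance, event)
    return balance


# The append-only log is the source of truth and doubles as the audit trail.
log = [
    Event(1, "deposited", 100),
    Event(2, "withdrawn", 30),
    Event(3, "deposited", 5),
]

current_balance = replay(log)        # state after all events
balance_at_2 = replay(log, up_to=2)  # "time-travel" to an earlier point

# A new view derived from the same history: total amount withdrawn.
total_withdrawn = sum(e.amount for e in log if e.kind == "withdrawn")
```

State is never stored directly in this sketch; it is always derived by folding events, which is why an earlier state or an entirely new view can be computed from the same log after the fact.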
Journey through eventsourcing: Part 2 - designing a solution
There is almost always more than one way to do something - systems architecture and design are no exception. Choosing a good architectural style and metaphor is similar to laying a building's foundation - choices made at this stage have an enormous impact on the quality, speed, and success of the project implementation. I strongly believe that allocating some resources to experimentation and design discussions during the early phases of the project is a good investment of time even in the most time-constrained situations.
Having spent a week or two investigating, experimenting with, and analyzing different architectural styles, my team and I came to the conclusion that eventsourcing offered the strongest foundation for achieving consistency, availability, scalability, performance, and other goals. Two of the key decisions were to use different consistency models for reads and writes, and to achieve acceptable availability levels by means of rapid, fully automatic recovery from failure.
Journey through eventsourcing: Part 3.2 - implementation
Building distributed systems is hard - it requires solving dozens of technical challenges not found in classical, single-machine computing. It is also very intellectually rewarding - for the same reason. While going distributed should not be the first design choice, sometimes a truly distributed system (as opposed to a replicated system - one that has multiple identical, independent, and non-communicating nodes) offers unique capabilities that can alleviate or completely remove other hard challenges, such as scaling, consistency, and concurrency.
The good thing is that one does not have to solve all the hard problems from scratch. There are tools and frameworks out there, such as Akka, ZooKeeper, Kafka, and so on, that solve the toughest problems and let developers handle the important ones. In this project, my team and I used Akka to handle the challenges brought by the stateful, distributed, and lock-free architecture we chose. Akka offered a variety of tools and techniques for development and laid a strong foundation for the project’s success.
Journey through eventsourcing: Part 4 - Pre-flight checks and launch
Launching a critical system, especially one based on a novel technology and/or design approach, is a risky operation - there are a lot of unknown unknowns that risk materializing into devastating outages, data corruption, and other serious issues. However, many of those unknowns can be found and mitigated before the system takes flight - via load tests, canary releases, A/B testing, or a “dark launch”.
Mitigating all the risks right away might not be possible - but it is not always necessary. Quite often, it is enough to bring their potential impact down to a manageable level - by reducing detection and recovery time, containing failures, or reducing the probability of failure. Such mitigations usually come as documentation, instructions, or checklists, but can also take the form of scripts, additional diagnostic or administration endpoints, or tighter monitoring.