This is the first issue of my newsletter on backend engineering. It covers topics such as distributed systems, databases, data engineering, system design, architecture and scalability, along with the latest tools and technologies in the space. To get the content delivered to your inbox, you can subscribe to the newsletter at the end of this post.
Distributed Task Scheduling with Akka, Kafka and Cassandra at PagerDuty
PagerDuty, in a series of blog posts (part 1, part 2 and part 3), discussed how they developed an open-source library with Akka, Kafka and Cassandra to solve the problem of task scheduling in a distributed infrastructure.
Their engineers needed to ensure that scheduled tasks run on time and in order across their infrastructure. The catch was that these tasks were arbitrary chunks of code (written to send an SMS, communicate with a database, etc.) that could be scheduled for any point in the future (one minute from now or one year from now).
If a task is scheduled to run a year from now, the infrastructure will likely have changed by then, which could make the task fail or behave unpredictably.
Initially, they used a solution called WorkQueue that leveraged an Apache Cassandra partitioned queue (which is an anti-pattern) to distribute tasks. The only way to execute tasks was to poll the Cassandra queue partitions, so the team had to ensure all the partitions were polled regularly to maintain task execution throughput. If a service instance polling the queue went down, replacing it required a complex sequence of steps. WorkQueue was also quite slow.
To tackle the issues, they developed a solution called the Scheduler, written in Scala. It uses Cassandra for task persistence, Apache Kafka to handle task queuing and partitioning and Akka to handle concurrency.
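To make that division of labour concrete, here is a toy, in-memory sketch in Python. It is not PagerDuty's Scala code and uses no Akka, Kafka or Cassandra: a dict stands in for the persistent task store, a list of heaps stands in for the partitioned queue, and names such as `ToyScheduler`, `schedule` and `run_due` are made up for illustration.

```python
import heapq
import itertools
import zlib
from typing import Callable, Dict, List, Tuple


class ToyScheduler:
    """In-memory stand-in: dict = task store, per-partition heaps = queue."""

    def __init__(self, num_partitions: int = 4) -> None:
        self.store: Dict[str, Callable[[], None]] = {}  # persistence stand-in
        self.partitions: List[List[Tuple[float, int, str]]] = [
            [] for _ in range(num_partitions)
        ]
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def partition_for(self, ordering_key: str) -> int:
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(ordering_key.encode("utf-8")) % len(self.partitions)

    def schedule(self, task_id: str, ordering_key: str, due_at: float,
                 action: Callable[[], None]) -> None:
        self.store[task_id] = action  # 1. persist the task
        heap = self.partitions[self.partition_for(ordering_key)]
        heapq.heappush(heap, (due_at, next(self._seq), task_id))  # 2. enqueue

    def run_due(self, partition: int, now: float) -> List[str]:
        """Run every task in one partition whose due time has passed."""
        heap = self.partitions[partition]
        ran: List[str] = []
        while heap and heap[0][0] <= now:
            _, _, task_id = heapq.heappop(heap)
            action = self.store.pop(task_id)  # remove from the store once taken
            action()
            ran.append(task_id)
        return ran


def scheduler_demo() -> List[str]:
    log: List[str] = []
    scheduler = ToyScheduler(num_partitions=4)
    # Scheduled out of order on purpose; all three share one ordering key.
    scheduler.schedule("t2", "customer-42", 20.0, lambda: log.append("send SMS"))
    scheduler.schedule("t1", "customer-42", 10.0, lambda: log.append("write to DB"))
    scheduler.schedule("t3", "customer-42", 99.0, lambda: log.append("too early"))
    partition = scheduler.partition_for("customer-42")
    scheduler.run_due(partition, now=30.0)
    return log
```

Calling `scheduler_demo()` runs the two tasks that are due in due-time order and leaves the third one queued. Hashing an ordering key to a partition is what lets one worker own a partition and keep related tasks in order, without every worker polling every partition.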
Redis Stack
Redis Stack bundles several Redis modules, such as RediSearch and RedisJSON, into a single product to simplify the developer experience of working with multiple Redis modules.
Redis Stack supports modern data models and data processing capabilities such as search, document, graph, time series and probabilistic data structures, all implemented as dedicated Redis modules. In addition, it provides a tool (RedisInsight) to visualize and optimize Redis data.
With Redis Stack, developers can index and query Redis data, perform full-text search, run aggregations and vector similarity searches, manage time-series data, leverage graph data models and manage JSON documents efficiently.
HarperDB: More than Just a Distributed Database
With HarperDB, devs can define their own API endpoints with custom functions without having to manage a backend server. The ability to use custom functions makes HarperDB a distributed application development platform rather than just a distributed database. Instead of residing on a dedicated backend server, the business logic moves into custom functions (similar to AWS Lambda functions).
This is along the lines of what Firebase offers. So what is different?
HarperDB is cloud platform agnostic. It can be deployed on the edge, on-prem or used as a managed service. It can run on devices as small as a single-board computer like the Raspberry Pi.
As opposed to traditional replication, it uses a pub-sub replication model to move data across instances within the network, ensuring we’re only moving the data we need.
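The difference from copying everything everywhere can be sketched in a few lines of Python. This is a conceptual toy, not HarperDB's API: the `Node` class and its `subscribe` and `write` methods are hypothetical names, and tables play the role of the topics that other instances subscribe to.

```python
from collections import defaultdict
from typing import Any, Dict, List


class Node:
    """One database instance holding tables of key -> record."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tables: Dict[str, Dict[str, Any]] = defaultdict(dict)
        # table name -> nodes that asked to receive changes to that table
        self.subscribers: Dict[str, List["Node"]] = defaultdict(list)

    def subscribe(self, publisher: "Node", table: str) -> None:
        """Ask `publisher` to forward future writes to `table` to this node."""
        publisher.subscribers[table].append(self)

    def write(self, table: str, key: str, record: Any) -> None:
        """Apply a local write, then publish it only to that table's subscribers."""
        self.tables[table][key] = record
        for node in self.subscribers[table]:
            node.apply_replicated(table, key, record)

    def apply_replicated(self, table: str, key: str, record: Any) -> None:
        # Replicated writes are applied locally and not re-published,
        # which keeps this toy free of replication loops.
        self.tables[table][key] = record


def replication_demo() -> Dict[str, bool]:
    cloud = Node("cloud")
    edge = Node("edge")
    edge.subscribe(cloud, "sensor_config")  # the edge node only wants this table
    cloud.write("sensor_config", "s1", {"interval_s": 5})
    cloud.write("billing", "inv-1", {"total": 120})  # never leaves the cloud node
    return {
        "edge_has_config": "s1" in edge.tables["sensor_config"],
        "edge_has_billing": "inv-1" in edge.tables["billing"],
    }
```

In `replication_demo()`, the edge node receives the `sensor_config` record it subscribed to, while the `billing` record stays on the cloud node. A real system would also need ordering, retries and conflict handling, which this sketch ignores.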
LSM Tree: Data Structure Powering Write Heavy Storage Engines
Most leading databases use the B-Tree data structure for storage. Under high-frequency writes, however, updating random nodes in the tree, combined with the tree's balancing operations, can become a bottleneck.
To overcome this, databases such as Cassandra and HBase use the Log-Structured Merge Tree (LSM tree for short) to handle high-frequency writes. Read on…
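A heavily simplified LSM tree in Python shows the idea: writes land in an in-memory table, full tables are flushed as immutable sorted runs, and reads check the newest data first. The class and method names here are illustrative assumptions, and lists in memory stand in for on-disk files. Real storage engines add a commit log, deletes via tombstones, Bloom filters and background compaction strategies that this sketch leaves out.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, str] = {}
        self.memtable_limit = memtable_limit
        # Each SSTable is an immutable list of (key, value) pairs sorted by key.
        # Newer tables are appended at the end of this list.
        self.sstables: List[List[Tuple[str, str]]] = []

    def put(self, key: str, value: str) -> None:
        # Writes only touch the in-memory table; no tree is rebalanced.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as one sorted, immutable run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run first
            i = bisect.bisect_left(table, (key,))  # binary search on sorted keys
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping only the newest value per key."""
        merged: Dict[str, str] = {}
        for table in self.sstables:  # oldest to newest, so newer values win
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


def lsm_demo() -> Tuple[int, Optional[str], List[List[Tuple[str, str]]]]:
    db = TinyLSM(memtable_limit=2)
    db.put("a", "1")
    db.put("b", "2")  # memtable is full, flushed as the first run
    db.put("a", "3")
    db.put("c", "4")  # flushed as the second run
    runs_before = len(db.sstables)
    value = db.get("a")  # found in the newest run
    db.compact()
    return runs_before, value, db.sstables
```

The trade-off is visible even at this size: `put` is cheap because it never modifies existing sorted data, while `get` may have to look through several runs until `compact` merges them.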
Directus – Open Source Data Platform
Directus is an open-source data platform that helps us better visualize the data stored in our SQL databases. Existing database tools like MySQL Workbench, phpMyAdmin and the like help visualize data, but they cater more to technical folks with extensive knowledge of relational databases and SQL.
Directus is a data platform that sits on top of a SQL database (mirroring the content and the schema), providing a data toolkit for engineers as well as business people. Once configured, we immediately get a dynamic API (REST and GraphQL) and a no-code app to manage and view our data. There is no need to write a backend solely to fetch data for the UI. Also, since the data is mirrored, the original data stays unaltered.
If you found the content interesting, consider subscribing to my newsletter to get the content delivered right to your inbox and share it with your network.

Shivang