State of Backend #1- Distributed Task Scheduling with Akka, Kafka and Cassandra
This is the first issue of the newsletter I’ve started for insights into the backend engineering space. It covers topics like distributed systems, databases, data engineering, system design, architecture and scalability, along with the latest tools and technologies in the space. To get the content delivered to your inbox, you can subscribe to the newsletter at the end of this post.
Distributed Task Scheduling with Akka, Kafka and Cassandra at PagerDuty
In a series of blog posts (part 1, part 2 and part 3), PagerDuty discussed how they developed an open-source library with Akka, Kafka and Cassandra to solve the problem of task scheduling in a distributed infrastructure.
Their engineers needed to ensure that scheduled tasks run on time and in order across their infrastructure. The tasks were arbitrary chunks of code (sending an SMS, communicating with a database, etc.) that could be scheduled for any point in the future, from one minute to one year out.
If a task is scheduled a year out, the infrastructure may well change before it runs, which could cause the task to fail or behave unpredictably.
Initially, they used a solution called WorkQueue that distributed tasks through a partitioned queue built on Apache Cassandra (which is an anti-pattern). The only way to execute tasks was to poll the Cassandra queue partitions, so the team had to ensure every partition was polled regularly to maintain task execution throughput. If a service instance polling the queue went down, replacing it required a complex sequence of steps. WorkQueue was also quite slow.
To tackle the issues, they developed a solution called the Scheduler, written in Scala. It uses Cassandra for task persistence, Apache Kafka to handle task queuing and partitioning and Akka to handle concurrency.
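To make that division of labour concrete, here is a minimal in-memory sketch in Python. It is not PagerDuty's Scala API: the names `Task`, `ToyScheduler`, `schedule` and `run_due` are hypothetical. It assumes each task carries an ordering key that is hashed to a partition (roughly the role Kafka plays) and that each partition keeps its pending tasks sorted by due time (roughly what the Cassandra-backed persistence provides). Concurrency, which Akka handles in the real library, is left out.

```python
import heapq
import itertools
import zlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Task:
    due_at: float   # when the task should run (seconds on any clock)
    seq: int        # tie-breaker: equal due times run in insertion order
    ordering_key: str = field(compare=False)
    action: Callable[[], str] = field(compare=False)


class ToyScheduler:
    """Hypothetical stand-in: one min-heap of pending tasks per partition."""

    def __init__(self, num_partitions: int = 4) -> None:
        self.partitions: List[List[Task]] = [[] for _ in range(num_partitions)]
        self._seq = itertools.count()

    def partition_for(self, ordering_key: str) -> int:
        # A stable hash sends every task with the same key to the same partition.
        return zlib.crc32(ordering_key.encode("utf-8")) % len(self.partitions)

    def schedule(self, ordering_key: str, due_at: float,
                 action: Callable[[], str]) -> int:
        partition = self.partition_for(ordering_key)
        task = Task(due_at, next(self._seq), ordering_key, action)
        heapq.heappush(self.partitions[partition], task)
        return partition

    def run_due(self, now: float) -> List[str]:
        # Run every task whose due time has passed, earliest first per partition.
        results = []
        for queue in self.partitions:
            while queue and queue[0].due_at <= now:
                results.append(heapq.heappop(queue).action())
        return results


scheduler = ToyScheduler()
scheduler.schedule("customer-42", due_at=60.0, action=lambda: "send SMS")
scheduler.schedule("customer-42", due_at=30.0, action=lambda: "update database")
print(scheduler.run_due(now=45.0))  # ['update database']
```

Because both tasks share an ordering key, they land in the same partition and run in due-time order; the task due at 60 stays queued until a later `run_due` call.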
Redis Stack supports modern data models and data processing capabilities such as search, document, graph, time series, and probabilistic data structures, all implemented as dedicated Redis modules. It also provides a tool to visualize and optimize Redis data.
With Redis Stack, developers can index and query Redis data, perform full-text search, run aggregations and advanced vector similarity searches, manage time-series data, leverage graph data models and manage JSON documents efficiently.
HarperDB: More than Just a Distributed Database
With HarperDB, devs can define their own API endpoints with custom functions without having to manage a backend server. The ability to use custom functions makes HarperDB a distributed application development platform rather than just a distributed database. So, instead of residing on a dedicated backend server, the business logic moves into custom functions (similar to AWS Lambda functions).
This is something along the lines of what Firebase offers. What is different?
HarperDB is cloud-platform agnostic. It can be deployed on the edge, on-prem or used as a managed service. It can be deployed on devices as small as single-board computers like the Raspberry Pi.
As opposed to traditional replication, it uses a pub-sub replication model to move data across instances within the network, ensuring we’re only moving the data we need.
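The pub-sub idea can be illustrated with a small in-memory Python sketch. This is not HarperDB's API: `Node`, `subscribe` and `write` are hypothetical names, and the sketch assumes subscriptions are set up per table, so a node only receives writes for the tables it has subscribed to.

```python
from collections import defaultdict
from typing import Dict, List


class Node:
    """Hypothetical database instance that publishes writes per table."""

    def __init__(self, name: str) -> None:
        self.name = name
        # table name -> {record key -> record}
        self.tables: Dict[str, Dict[str, dict]] = defaultdict(dict)
        # table name -> nodes that asked to receive writes for that table
        self._subscribers: Dict[str, List["Node"]] = defaultdict(list)

    def subscribe(self, publisher: "Node", table: str) -> None:
        # Ask `publisher` to forward every write it makes to `table`.
        publisher._subscribers[table].append(self)

    def write(self, table: str, key: str, record: dict) -> None:
        self.tables[table][key] = record
        # Only subscribers of this table receive a copy of the record.
        for subscriber in self._subscribers[table]:
            subscriber.tables[table][key] = dict(record)


cloud = Node("cloud")
edge = Node("edge")
edge.subscribe(cloud, "sensors")

cloud.write("sensors", "s1", {"temp": 21})
cloud.write("billing", "b1", {"amount": 10})
print(sorted(edge.tables))  # ['sensors']
```

The edge node never sees the `billing` table, which is the point of moving only the data each instance needs.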
LSM Tree: Data Structure Powering Write Heavy Storage Engines
Most of the leading databases leverage the B-Tree data structure for storage. But with high-frequency writes, updating random nodes in the tree, along with the rebalancing those updates trigger, can become a bottleneck.
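An LSM tree avoids in-place random updates by buffering writes in memory and periodically flushing them to disk as sorted, immutable runs. Below is a minimal Python sketch of that write path. The class name `TinyLSM` and the `memtable_limit` parameter are illustrative assumptions, segments are kept in memory rather than on disk, and compaction and the write-ahead log are omitted, so this does not describe any particular storage engine.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus sorted, immutable segments."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Dict[str, str] = {}
        # Each segment is a list of (key, value) pairs sorted by key; oldest first.
        self.segments: List[List[Tuple[str, str]]] = []

    def put(self, key: str, value: str) -> None:
        # Writes only touch the memtable, so no random access into older data.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Persist the memtable as one sorted run (sequential I/O on a real disk).
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # Search segments newest first so the latest write for a key wins.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")      # memtable is full, so it is flushed to a segment
db.put("a", "3")      # newer value for "a" lives in the memtable
print(db.get("a"))    # 3
print(db.get("b"))    # 2
```

The trade-off is on the read side: a lookup may have to check several segments, which is why real engines add compaction to merge runs over time.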
Directus – Open Source Data Platform
Directus is an open-source data platform that helps us better visualize the data stored in our SQL databases. Existing database tools like MySQL Workbench and phpMyAdmin also help visualize data, but they cater more to technical users with extensive knowledge of relational databases and SQL.
Directus is a data platform that sits on top of a SQL database (mirroring the content and the schema), providing a data toolkit for engineers as well as business people. Once configured, we immediately get a dynamic API (REST and GraphQL) and a no-code app to manage and view our data. No need to write any backend solely to fetch the data to the UI. Also, since the data is mirrored, the original data stays unaltered.
If you found the content interesting, consider subscribing to my newsletter to get the content delivered right to your inbox and share it with your network.
> Spotify Engineering: From Live to Recording
> Ingesting LIVE video streams at a global scale at Twitch
> ConvertKit spent $64,944 on AWS in August to support 25,000 customers.
> Read how Storytel engineering computes customer consumption of books transitioning from batch processing to streaming bookmarks data with Apache Beam and Google Cloud.
> How Pokémon Go scales to millions of requests per second.
> Insight into how Grab built a high-performance ad server.