Backend Engineering, Distributed Systems, Newsletter

State of Backend #1 - Distributed Task Scheduling with Akka, Kafka and Cassandra

This is the first issue of the newsletter I’ve started, offering insights into the backend engineering space. It covers topics like distributed systems, databases, data engineering, system design, architecture and scalability, along with the latest tools and technologies in the space. To get the content delivered to your inbox, you can subscribe to the newsletter at the end of this post.

Distributed Task Scheduling with Akka, Kafka and Cassandra at PagerDuty

PagerDuty, in a three-part series of blog posts (Part 1, Part 2 and Part 3), discussed how they developed an open-source library with Akka, Kafka and Cassandra to solve the problem of task scheduling in a distributed infrastructure.

Their engineers needed to ensure that scheduled tasks run on time and in order across their infrastructure. The difficulty was that these tasks were arbitrary chunks of code (written to send an SMS, communicate with a database, etc.) that could be scheduled for arbitrary times (one minute from now or one year from now).

If a task is scheduled to run one year from now, the infrastructure may well have changed by the time it is due. The task could then fail or behave unpredictably.

Initially, they used a solution called WorkQueue that distributed tasks through a partitioned queue built on Apache Cassandra (using Cassandra as a queue is an anti-pattern). The only way to execute tasks was to poll the Cassandra queue partitions, so the team had to ensure every partition was polled regularly to maintain task execution throughput. If a service instance polling the queue went down, replacing it required a complex sequence of steps. WorkQueue was also quite slow.

To tackle these issues, they developed a solution called the Scheduler, written in Scala. It uses Cassandra for task persistence, Apache Kafka for task queuing and partitioning, and Akka for concurrency.
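
To make this division of labour concrete, here is a minimal in-memory sketch in Python. It is not PagerDuty's Scala implementation and uses none of the real libraries: a dict stands in for durable persistence, hashing an ordering key onto a fixed number of partitions stands in for Kafka-style partitioning, and a per-partition min-heap keeps tasks ordered by due time. All names (`Task`, `ToyScheduler`, `schedule`, `run_due`) are hypothetical.

```python
import heapq
import zlib
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    due_at: float                              # when the task should run (epoch seconds)
    task_id: str = field(compare=False)
    ordering_key: str = field(compare=False)   # tasks sharing a key share a partition
    payload: str = field(compare=False)


class ToyScheduler:
    """In-memory stand-in for illustration; every name here is hypothetical."""

    def __init__(self, num_partitions: int = 4):
        self.num_partitions = num_partitions
        self.store = {}                                     # stand-in for durable persistence
        self.queues = [[] for _ in range(num_partitions)]   # one min-heap per partition

    def partition_for(self, ordering_key: str) -> int:
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(ordering_key.encode()) % self.num_partitions

    def schedule(self, task: Task) -> int:
        # Persist first, then enqueue on the partition chosen by the ordering key.
        self.store[task.task_id] = task
        partition = self.partition_for(task.ordering_key)
        heapq.heappush(self.queues[partition], task)
        return partition

    def run_due(self, partition: int, now: float) -> list:
        """Pop every task on this partition that is due at `now`, earliest first."""
        executed = []
        queue = self.queues[partition]
        while queue and queue[0].due_at <= now:
            task = heapq.heappop(queue)
            executed.append(task.task_id)   # a real worker would run the task here
            del self.store[task.task_id]    # mark the task as completed
        return executed
```

In this sketch, two tasks scheduled with the same ordering key land on the same partition, and `run_due` returns them in due-time order regardless of the order in which they were scheduled.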

Redis Stack

Redis Stack bundles several Redis modules, such as RediSearch and RedisJSON, into a single product to simplify the developer experience of working with multiple modules.

Redis Stack supports modern data models and data processing capabilities such as search, document, graph, time series and probabilistic data structures, all implemented as dedicated Redis modules. It also provides a tool to visualize and optimize Redis data.

With Redis Stack, developers can index and query Redis data, perform full-text search, run aggregations and vector similarity searches, manage time-series data, use graph data models and manage JSON documents.

HarperDB: More than Just a Distributed Database

With HarperDB, devs can define their own API endpoints with custom functions without having to manage a backend server. Custom functions make HarperDB a distributed application development platform rather than just a distributed database. Instead of residing on a dedicated backend server, the business logic moves into a custom function (similar to AWS Lambda functions).

This is along the lines of what Firebase offers. So what is different?

HarperDB is cloud platform agnostic. It can be deployed on the edge, on-prem or used as a managed service. It can run on devices as small as single-board computers like the Raspberry Pi.

As opposed to traditional replication, it uses a pub-sub replication model to move data across instances within the network, ensuring that only the data that is needed gets moved.
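
The general idea of pub-sub replication can be sketched in a few lines of Python. This is a toy illustration of the concept under my own assumptions, not HarperDB's API or wire protocol: each node subscribes to specific tables on another node, and a write is forwarded only to the nodes that subscribed to that table. The `Node` class and its methods are hypothetical.

```python
from collections import defaultdict


class Node:
    """A toy database instance holding tables of {key: row}. Hypothetical API."""

    def __init__(self, name: str):
        self.name = name
        self.tables = defaultdict(dict)
        self.subscribers = defaultdict(list)   # table name -> nodes subscribed to it

    def subscribe(self, publisher: "Node", table: str) -> None:
        # Ask `publisher` to forward future writes on `table` to this node.
        publisher.subscribers[table].append(self)

    def write(self, table: str, key: str, row: dict) -> None:
        self.tables[table][key] = row
        # Publish only to the nodes that asked for this table.
        for node in self.subscribers[table]:
            node.tables[table][key] = dict(row)


cloud = Node("cloud")
edge = Node("edge")
edge.subscribe(cloud, "sensor_readings")

cloud.write("sensor_readings", "r1", {"temp": 21})   # replicated to the edge node
cloud.write("billing", "b1", {"amount": 10})         # stays on the cloud node
```

After these two writes, the edge node holds the `sensor_readings` row but never receives anything from `billing`, which is the selective movement of data the paragraph above describes.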

LSM Tree: Data Structure Powering Write Heavy Storage Engines

Most leading databases use the B-Tree data structure for storage. With high-frequency writes, however, updates touch random nodes in the tree and trigger rebalancing, which can become a bottleneck.

To overcome this, databases such as Cassandra and HBase use the Log-Structured Merge Tree, or LSM tree for short, to handle high-frequency writes.
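
A toy version of the idea, sketched in Python under simplifying assumptions (everything lives in memory, there is no write-ahead log, and deletes are not handled): writes go into an in-memory memtable, full memtables are flushed as sorted immutable runs, reads check the newest data first, and compaction merges runs. The `ToyLSM` class is hypothetical and is not how Cassandra or HBase are implemented.

```python
import bisect


class ToyLSM:
    """Toy LSM tree: an in-memory memtable plus sorted, immutable runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.memtable = {}    # recent writes, absorbed without touching older data
        self.sstables = []    # sorted [(key, value), ...] runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted run (a sequential write on disk).
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search runs newest first so that the latest write wins.
        for run in reversed(self.sstables):
            keys = [k for k, _ in run]   # rebuilt per lookup for simplicity
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in self.sstables:   # oldest first, so newer values overwrite older
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []
```

The point of the design is that `put` never modifies an existing sorted run in place. Older data is only rewritten in bulk during `compact`, which trades some read and background work for cheap writes.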

Directus – Open Source Data Platform

Directus is an open-source data platform that helps us better visualize the data stored in our SQL databases. Existing database tools like MySQL Workbench and phpMyAdmin also help visualize data, but they cater more to technical users with extensive knowledge of relational databases and SQL.

Directus sits on top of a SQL database (mirroring the content and the schema) and provides a data toolkit for engineers as well as business people. Once configured, we immediately get a dynamic API (REST and GraphQL) and a no-code app to manage and view our data. There is no need to write a backend solely to fetch data for the UI. Also, since the data is mirrored, the original data stays unaltered.

If you found the content interesting, consider subscribing to my newsletter to get it delivered right to your inbox, and share it with your network.

Tags #PagerDuty


Shivang

Hello World! I am Shivang, I write about the architectures of large-scale internet services, scalability, distributed systems, databases, data engineering and backend engineering in general. You can connect with me on LinkedIn & Twitter.
