How Razorpay handled significant transaction bursts during events like IPL
Razorpay is India’s leading payment-gateway service that offers a suite of products to businesses to accept, process and disburse payments enabling them to establish an online presence.
They dealt with an interesting problem: the problem of sudden transaction bursts during events such as the Indian premier cricket league. Businesses, especially in food delivery and gaming, offer flash sales and various offers to their customers minutes before and during the cricket match, which caused the sudden transaction burst.
The company’s infrastructure is hosted on AWS and configured to autoscale. But to do that, the autoscaler of the cluster and the nodes have to kick in. And this took 3-4 minutes. By this time, the system used to go down due to the heavy traffic influx.
During the IPL 2019 season, their system went down 3 days in a row for small durations. They were forced to rate-limit their customers to ensure their system was up and running.
After the season, they reviewed their architecture thoroughly and zeroed in on a few key issues:
They had no throttling implemented. This averted them from having control over the traffic hitting their servers.
The monitoring alerts were configured in a way that they got triggered after the issue occurred. There was a significant time lag between the issue occurrence and the alert.
The latency of the API calls to different banks went significantly high during the event. This created back-pressure on their services. Also, there was no automatic traffic routing to the banking systems with low latency.
Their backend is written in PHP, which is not scalable with databases since the programming language doesn’t support connection pooling natively. There were a lot of idle MySQL DB connections that ate up resources.
The MySQL servers had a lot of idle connections due to the high latency response from the banks. Since PHP does not support connection pooling natively, the DB connections couldn’t be used efficiently.
This became a bottleneck. To fix the issue, ProxySQL was added as the interceptor between the application and the MySQL servers.
ProxySQL deployed on Kubernetes not only provided an efficient pooling mechanism, but it enabled the persistence tier to be more highly available and scalable.
Rate-limiting and throttling
Rate-limiting and throttling were implemented to safeguard their system against a deluge of requests also DDoS attacks.
Initially, they had a very basic rate-limiting implemented on their application server using the leaky bucket algorithm. They upgraded their rate-limiting implementation to an Nginx-based proxy server having its dedicated cache.
The algorithm was changed from leaky bucket to fixed window, which proved more efficient. Several different algorithms can be leveraged to implement rate-limiting. Each has their use case.
Moreover, just implementing rate-limiting on the backend doesn’t stop the clients from sending the requests. In this scenario, the bandwidth is continually consumed as well as the rate-limiting logic has to constantly run on the backend consuming additional compute resources.
If we have control over the client, we need to implement rate-throttling on it to reduce the rate at which it sends the requests to the backend as and when it starts receiving errors in response.
If you are a developer and find it hard to cope with constant changes in technology. You are sick and tired of it. You are looking for ways to jump off that endless upskilling treadmill staying relevant and hireable.
You might want to check out my ebook, DEVELOPER’S ROADMAP TO EXCELLENCE AND BUILDING YOUR OWN THING, where I share with you the roadmap and techniques that I follow to keep my sanity in this ever-changing world of software development without killing myself. In it, you’ll find actionable advice and critical points that will enable you to make informed career decisions and accelerate your career at MACH speed.
Smart routing of requests
To deal with the challenge of high latency bank system responses during the traffic burst, a machine learning-based system was written to smartly route requests to other available bank systems.
Originally this routing was done manually, but when the peak load got to 200 to 1500 requests per second, manually routing the requests wasn’t viable anymore.
The machine learning system consumed payment success and failure events to predict in real-time where the payment requests should be directed. The model provided a probability of success with each banking gateway improving the payment success rate by 60%.
Besides these major changes, several other minor tweaks were made in the application and the infrastructure to help the system scale, such as code optimization, infrastructure automation, moving from synchronous to asynchronous processing, and so on.
Source: IPL: Razorpay’s second innings
Learn to design distributed systems
Learn to design distributed systems from Educative.io, check out the below courses:
Educative.io is a platform that helps software developers level up on in-demand technologies & prepare for their interviews via interactive text-based courses with embedded coding environments. They have over 975,000 learners on their platform.
The links are affiliate links. If you buy the course or a subscription, I get a cut without you paying anything extra.
If you liked the article, share it on the web. You can follow scaleyourapp.com on social media, links below, to stay notified of the new content published. I am Shivang, you can read about me here!
> Spotify Engineering: From Live to Recording
> Ingesting LIVE video streams at a global scale at Twitch
> $64,944 spent on AWS, to support 25,000 customers, in August by ConvertKit.
> Read how Storytel engineering computes customer consumption of books transitioning from batch processing to streaming bookmarks data with Apache Beam and Google Cloud.
> How Pokemon Go scales to millions of requests per second?
> Insight into how Grab built a high-performance ad server.
SUBSCRIBE TO MY NEWSLETTER to be notified of new additions to the list. Fortnight/monthly emails.
Looking for developer, software architect jobs? Try Jooble. Jooble is a job search engine created for a single purpose: To help you find the job of your dreams!!
- State of Backend #2 – Disney+ Hotstar Replaced Redis and Elasticsearch with ScyllaDB. Here’s Why.
- State of Backend #1- Distributed Task Scheduling with Akka, Kafka and Cassandra
- Live Video Streaming Infrastructure at Twitch
- Web Application Architecture Explained With Designing a Real-World Service
- Wide-column, Column-oriented and Column Family Databases – A Deep Dive with Bigtable and Cassandra