Best Courses, Books, Research Papers & Repos To Learn Software Architecture, System Design and Distributed Systems
In this article, I’ve put together a list of resources that I believe are super helpful in building a solid foundation in software architecture and designing large-scale distributed systems like Facebook, YouTube, Gmail, Uber and such.
I’ll start with the courses and then will move on to talk about the books.
Affiliate Disclaimer: Some resources stated in this article contain affiliate links. That means if you find these resources helpful and worthy of spending your money on, and you buy them, I get a small cut without you paying anything extra.
I recommend these resources to you because I think the content they offer is pretty good and these will assist you big time in upskilling yourself, enabling you to soar in your career.
Featured Platforms/Courses
CodeCrafters
CodeCrafters lets you build tools like Redis, Docker, Git and more from the bare bones. With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with the community and then finally navigate the official source code to see how it’s done.
Zero to Mastering Software Architecture
Zero to Mastering Software Architecture is a learning path of three courses that I have authored to teach you, step by step, software architecture, cloud infrastructure and distributed system design.
This learning path offers you a structured learning experience, taking you from no knowledge of the domain to being able to design web-scale distributed systems like YouTube, Netflix, ESPN and the like.
MongoDB University
Free MongoDB courses: practice your skills with hands-on labs and quizzes, and earn MongoDB certification. Learn in your programming language of choice with Node, Python, C#, PHP and Java developer courses.
Neo4j GraphAcademy
Master Neo4j (a graph database) with free, hands-on courses. Learn how to read from and write to Neo4j, including the more advanced Cypher functionality, APOC, and everything in between.
The Platform includes Neo4j Graph Data Science – the leading enterprise-ready analytics workspace for graph data – the graph visualization and exploration tool Bloom, the Cypher query language, and numerous tools, integrations and connectors to help developers and data scientists build graph-based solutions with ease.
GitHub Repo
CDN Up & Running
With this repo, you can understand how CDNs work by coding one from scratch. The CDN the authors build uses Nginx, Lua, Docker, Docker Compose, Prometheus, Grafana, and wrk.
They start with a single backend service and expand from there to a multi-node, latency simulated, observable, and testable CDN. In each section, there are discussions regarding the challenges and trade-offs of building/managing/operating a CDN.
Research Papers
Efficiently Archiving Photos under Storage Constraints
This paper addresses the data storage problem in the context of image data (photos) by proposing which photos to archive to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo for a usage pattern, the similarity between different photos, as well as policy requirements of what photos must be retained.
Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System
This paper presents an at-scale, near-realtime reboot monitoring framework built on multiple state-of-the-art data infrastructures. It combines machine learning-based anomaly detection with automated root cause analysis across hundreds of server attribute combinations, to ensure the continuous availability of the hardware in large-scale internet services that run on a fleet of distributed servers.
A Design Framework for Highly Concurrent Systems
This paper presents a general-purpose design framework for building highly concurrent systems, based on three design components — tasks, queues, and thread pools — which encapsulate the concurrency, performance, fault isolation, and software engineering benefits of both threads and events.
It also contains a discussion on a set of design patterns that can be applied to map an application onto an implementation using these components.
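To make the three components concrete, here is a minimal sketch of how tasks, queues and thread pools can fit together. It is not code from the paper; the `Stage` class, the `run_pipeline` helper and the two-stage layout are my own illustrative assumptions.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


class Stage:
    """One stage (hypothetical name): an input queue drained by a small thread pool."""

    def __init__(self, handler, workers=2, downstream=None):
        self.inbox = queue.Queue()      # tasks wait here until a worker is free
        self.handler = handler          # function applied to every task
        self.downstream = downstream    # optional next stage to forward results to
        self.threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]
        for t in self.threads:
            t.start()

    def submit(self, task):
        self.inbox.put(task)

    def _run(self):
        while True:
            task = self.inbox.get()
            if task is _STOP:
                break
            result = self.handler(task)
            if self.downstream is not None:
                self.downstream.submit(result)

    def close(self):
        # One sentinel per worker; they are queued behind all pending tasks.
        for _ in self.threads:
            self.inbox.put(_STOP)
        for t in self.threads:
            t.join()


def run_pipeline(words):
    """Upper-case each word in one stage and collect the results in a second stage."""
    results = []
    lock = threading.Lock()

    def collect(item):
        with lock:
            results.append(item)

    sink = Stage(collect, workers=1)
    upper = Stage(str.upper, workers=2, downstream=sink)
    for w in words:
        upper.submit(w)
    upper.close()  # every task is processed and forwarded before this returns
    sink.close()
    return sorted(results)
```

Each stage owns its queue and its threads, so a slow stage only backs up its own queue, which is roughly the fault isolation benefit the paper talks about.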
The Tail at Scale
This article outlines some of the common causes of high-latency episodes in large online services and describes techniques that reduce their severity or mitigate their impact on whole-system performance. In many cases, tail-tolerant techniques can take advantage of resources already deployed to achieve fault tolerance, resulting in low additional overhead. The authors show that these techniques allow system utilization to be driven higher without lengthening the latency tail, avoiding wasteful over-provisioning.
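Hedged requests are one of the tail-tolerant techniques the article discusses: send the request to one replica and, if it has not answered after a short delay, send the same request to another replica and use whichever reply arrives first. Below is a rough sketch of the idea. The `hedged_call` function, its `hedge_after` parameter and the replicas-as-callables shape are assumptions of mine for illustration, not an API from the article.

```python
import concurrent.futures as cf


def hedged_call(replicas, request, hedge_after=0.05):
    """Return the first reply from `replicas` (a list of callables).

    A new replica is tried only when the ones already contacted have not
    answered within `hedge_after` seconds (a hypothetical tuning knob).
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        pending = set()
        for replica in replicas:
            pending.add(pool.submit(replica, request))
            done, pending = cf.wait(
                pending, timeout=hedge_after, return_when=cf.FIRST_COMPLETED
            )
            if done:
                return next(iter(done)).result()
        # Every replica has been contacted; block until one of them answers.
        done, _ = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not wait for the slower replicas; their replies are discarded.
        pool.shutdown(wait=False)
```

The extra load stays small as long as `hedge_after` is set high enough that most requests finish before the hedge fires, which is why the delay matters more than the number of replicas.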
Books
Building Secure & Reliable Systems – Google SRE
This ebook provides insights about system design, implementation, and maintenance from practitioners who specialize in security and reliability. It targets folks who design, implement and maintain systems.
Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. The book encourages us to think about the fundamentals of reliability and security from the very beginning of the development process and to integrate those principles early in the system lifecycle.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Designing Data-Intensive Applications by Martin Kleppmann is one of the best sellers in the domain of designing large-scale applications. This book helps you understand the pros & cons of picking different technologies for processing and storing data in your application. It discusses the fundamentals of data processing and also takes a deep dive into concepts like scalability, high availability, consistency, reliability, different kinds of databases, distributed systems and more.
If you work on the backend, deal with databases when developing mobile apps, web apps and such, or want to understand how to make data systems scalable, this book will help you big time in developing a good foundation in large-scale system design.
The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise
The Art of Scalability is written by industry consultants who educate you on how to scale products and services for different requirements. The authors discuss case studies from their consulting practice, giving readers insights into cloud transitions, NoSQL, DevOps, business metrics, measuring availability, capacity, load and performance and more. The insights and recommendations of the authors reflect more than thirty years of experience at companies including eBay, Visa, Salesforce and Apple.
Web Scalability For Startup Engineers
This book discusses core concepts and best practices for developing scalable applications in a startup environment. It describes how infrastructure and software architecture blend together when building scalable systems. The book also contains diagrams and real-world examples to help understand the concepts better.
Readers of this book will learn the key principles of software design for scalable systems, concurrency and throughput, designing APIs, implementing caching, how to leverage asynchronous processing, messaging, event-driven architecture and more.
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
Data streaming, both in real time and in batches, is a key component in modern web applications. This book helps readers understand the underlying architecture and fundamentals of streaming systems, starting from an introduction to how data processing streams function. It is a practical guide with real-world examples for software developers, data engineers and data scientists on how to work with streaming data in a conceptual and platform-agnostic way.
Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale
Architecting Modern Data Platforms contains in-depth information on big data technologies. It takes a practical approach to educate the reader on how to build big data infrastructure both on-premises and in the cloud.
It walks you through the different component layers in a modern data platform and also covers concepts like high availability, disaster recovery, deployment, operations, security and more.
Database Internals: A Deep Dive Into How Distributed Data Systems Work
Database Internals, as the title says, takes a deep dive into how distributed data systems work. This book is a practical guide to the concepts behind modern databases and the internals of their storage engines. You’ll understand how storage is organized and how the data is distributed across the system.
The book talks about storage engines, explaining concepts like storage classification and B-Tree based & immutable log-structured storage engines with their respective use cases. It covers how database files are organized to build efficient storage using components such as the page cache, buffer pool & write-ahead log. You’ll learn how nodes and processes work in conjunction with each other in distributed systems, how data consistency models work and so on.
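To give a feel for one of these components, here is a toy sketch of a write-ahead log. It is not taken from the book; the `TinyKV` class and its JSON-lines log format are assumptions I made up for illustration. Every change is appended to the log and flushed to disk before the in-memory state is updated, so the state can be rebuilt after a crash by replaying the log.

```python
import json
import os


class TinyKV:
    """Toy in-memory key-value store made durable with a write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # rebuild state from whatever the log already holds
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Log first, force it to disk, and only then touch the in-memory state.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self.log.close()
```

A real storage engine would also checksum log records, handle a half-written final record and truncate the log after checkpoints; the sketch leaves all of that out.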
Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services
Designing Distributed Systems discusses patterns used in the development of reliable distributed systems. The author, who is the director of engineering at Microsoft Azure, explains how we can adapt existing software design patterns for designing and building reliable distributed applications. System engineers and application developers will learn how they can improve the quality of their systems using the patterns discussed in the book.
The book also touches upon the distributed system patterns for large-scale batch data processing involving work queues, event-based processing and coordinated workflows.
Building Microservices: Designing Fine-Grained Systems
This book educates the reader on the techniques of modeling, integrating, testing, deploying and monitoring a microservice. All the concepts are discussed with the help of an example of a fictional company.
The book discusses key concepts & challenges involved in scaling the microservices architecture, managing security with the user-to-service and service-to-service models, dealing with complexities of testing and monitoring distributed services, deploying microservices through continuous integration, splitting monolithic codebases into microservices and more.
Microservice Architecture
Microservice Architecture discusses the right way to approach microservices architecture. It discusses technologies and methodologies involved in building microservices from the ground up along with the experiences of large-scale services that have adopted microservices architecture.
The book is split into three parts that discuss:
- How microservices work & what it means to build a system using the microservices architecture.
- A design-based approach for implementing the microservices architecture.
- Best practices on how to handle the challenges of introducing the microservices architecture in your organization.
Site Reliability Engineering – How Google Runs Production Systems
The site reliability engineering book discusses the entire application deployment lifecycle that includes building, deploying, monitoring and maintaining the services at Google. Readers will learn the principles and practices that enable Google engineers to make their services more scalable, reliable and efficient.
The book is split into four parts. The first part gives an introduction to site reliability engineering (SRE) and how it differs from traditional IT practices. The next two parts talk about the patterns and behavior involved in the day-to-day work of an SRE when building and operating large-scale distributed computing systems. The last part touches upon Google’s best practices for running its infrastructure.
You can read the book online here.
This list of software engineering resources will be continually updated as I find new quality resources in the domain.
Shivang