Real-Time Messaging at Slack - Biweekly Engineering - Episode 18
What is a Distributed System? - How Slack implemented a real-time messaging system - Large-scale event storage at Datadog
Hola! How is it going dear readers? Glad to be back with the 18th episode of Biweekly Engineering!
Today, we go back to the basics of distributed systems, along with two blog posts from Slack and Datadog. Let’s enjoy!
From the top of Berlin Cathedral
What is a Distributed System?
As you probably know, I have my own course on distributed systems. And it’s now part of Educative’s Become a Distributed Systems Professional learning path.
Guess what, this article is the one where it all started. This is the first article I ever read on this topic!
The great thing about the article is that the author first discusses what is not a distributed system, and builds the subsequent discussion on top of that. And we end up with the following definition:
A distributed system is nothing more than multiple entities that talk to one another in some way, while also performing their own operations.
After the definition, the author also discusses the concept of a node. We all learnt about nodes in Graph Theory, and here there is actually quite a bit of similarity.
Lastly, the author shows us how clocks are critical in distributed systems, and how every node operates on its own clock. In fact, clock synchronisation between nodes is a complex problem in the distributed systems world.
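Because nodes can't rely on perfectly synchronised wall clocks, distributed systems often fall back on logical clocks to order events. As a minimal illustration (not from the article itself), here is a sketch of a Lamport clock, the classic logical-clock scheme:

```python
class LamportClock:
    """A logical clock: a per-node counter that orders events
    without requiring synchronised wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances our own counter.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # On receipt, jump ahead of the sender's timestamp if needed,
        # so causally later events always get larger timestamps.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A sends a message to node B.
a, b = LamportClock(), LamportClock()
t = a.send()      # A's clock becomes 1
b.receive(t)      # B's clock jumps to max(0, 1) + 1 = 2
```

The key property: if event X causally precedes event Y, X's timestamp is smaller, even though neither node ever looked at a wall clock.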
If you have little to no idea of distributed systems, this article is a great place to start. Highly recommended!
How Slack implemented its real-time messaging system
For sure a very common system design interview question, but today we see how it is done in real life! It’s super cool to see that Slack Engineering has an article on their real-time messaging system - one of their core products.
The beautifully written article starts by introducing the different server types to the readers. Then we learn how a Slack client boots up and connects to the Slack backend to come online and send and receive messages. Very briefly, upon starting, a Slack client-
Fetches user data (like auth token and websocket setup information) from the webapp backend.
Establishes a websocket connection with the nearest Gateway Server.
The Gateway Server fetches user information (like channels the user belongs to) from the webapp backend.
The Gateway Server also subscribes to all Channel Servers that hold channel-specific data and are responsible for handling messages for those channels.
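The boot flow above can be sketched as follows. This is an illustrative toy model, not Slack's actual code - names like `fetch_user_data`, `WebappBackend`, and `GatewayServer` are hypothetical stand-ins for the components the article describes:

```python
class WebappBackend:
    """Stand-in for the webapp backend that hands out boot data."""

    def fetch_user_data(self, user_id):
        # Returns the auth token, websocket endpoint, and the
        # channels the user belongs to (all made-up values).
        return {"token": f"tok-{user_id}",
                "ws_url": "wss://gateway.example",
                "channels": ["C1", "C2"]}

class ChannelServer:
    """Owns channel-specific data; tracks which gateways subscribed."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, gateway):
        self.subscribers.add(gateway)

class GatewayServer:
    """Terminates client websockets and subscribes to Channel Servers."""

    def __init__(self, backend, channel_servers):
        self.backend = backend
        self.channel_servers = channel_servers  # channel_id -> ChannelServer

    def connect(self, user_id):
        # On websocket connect, fetch the user's channels and subscribe
        # to every Channel Server that owns one of those channels.
        info = self.backend.fetch_user_data(user_id)
        for ch in info["channels"]:
            self.channel_servers[ch].subscribe(self)
        return info

backend = WebappBackend()
servers = {"C1": ChannelServer(), "C2": ChannelServer()}
gateway = GatewayServer(backend, servers)
gateway.connect("alice")
```

The point of the indirection: the client holds one websocket to one Gateway Server, while the Gateway Server maintains the many channel subscriptions on its behalf.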
Upon establishing the connection, now the Slack client is capable of sending and receiving messages. The process briefly looks like the following:
A client sends a message which gets routed to the specific Channel Server responsible for handling any messages on that channel.
The Channel Server then sends the message to the Gateway Servers that have already subscribed to the specific channel where a message is being sent to.
The Gateway Server sends the message to all the clients that belong to the channel over the websocket connections.
The Channel Server can also forward the message to any Gateway Servers in a different region.
So in short, a client subscribes to channels in the Gateway Servers. And a Gateway Server subscribes to channels in the Channel Servers on behalf of the clients. Channel Servers receive messages from the webapp backend and broadcast them to the Gateway Servers, which in turn broadcast them to the participants in a channel.
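The fan-out path in the steps above can be sketched like this. Again, a hypothetical toy model under the assumptions of the summary - the class and method names are illustrative, not Slack's:

```python
from collections import defaultdict

class Gateway:
    """Holds client websockets and pushes messages down to them."""

    def __init__(self, name):
        self.name = name
        self.clients = defaultdict(set)   # channel_id -> connected clients
        self.delivered = []               # (client, channel, text), for the demo

    def deliver(self, channel_id, text):
        # Push the message over each open websocket in that channel.
        for client in self.clients[channel_id]:
            self.delivered.append((client, channel_id, text))

class ChannelServer:
    """Broadcasts each channel's messages to all subscribed gateways."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # channel_id -> gateways

    def subscribe(self, channel_id, gateway):
        self.subscriptions[channel_id].add(gateway)

    def publish(self, channel_id, text):
        # Fan out to every subscribed Gateway Server - including ones
        # in other regions - which then fan out to their clients.
        for gw in self.subscriptions[channel_id]:
            gw.deliver(channel_id, text)

cs = ChannelServer()
gw_us, gw_eu = Gateway("us-east"), Gateway("eu-west")
gw_us.clients["C1"] = {"alice"}
gw_eu.clients["C1"] = {"bob"}
cs.subscribe("C1", gw_us)
cs.subscribe("C1", gw_eu)
cs.publish("C1", "hello team")
```

This two-level pub/sub keeps per-channel fan-out in one place (the Channel Server) while each Gateway Server only worries about its own websockets.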
If this intrigues you, don’t forget to go through the article!
Husky - the large-scale event store at Datadog
Datadog is a company that provides observability products to tech companies. It has been providing metric aggregation and visualisation services for a long time now, and a few years ago, the company launched Datadog Logs - its log management platform.
And this brought in the need for revamping their event storage platform. So how did they do it?
The initial version of the Datadog Logs system was simple - a bunch of search clusters connected to corresponding Kafka clusters. Events were read from Kafka directly, sharded, and indexed. The nodes in a cluster were responsible both for storing the data and for distributing it among themselves within the cluster. Also, the search clusters were multi-tenant, meaning each node stored data for many different systems and customers.
The greatest bottleneck of this first version was that if a node went down, it affected all the tenants, and could possibly affect the whole cluster. This is bad.
The next version of the system decoupled storage and sharding. Events would be read from the Kafka cluster by a new type of node called shard routers. These shard routers partition the data and write it back to a new Kafka cluster. Storage nodes read from the new Kafka cluster, consuming only the shards that are assigned to them. In this design, the team also added a query engine for querying the data from the storage nodes.
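The shard-router idea can be sketched roughly as below. This is a simplified illustration of the pattern, not Datadog's implementation - the shard count, the tenant-based key, and all names are assumptions:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(event):
    # Deterministic shard assignment, here by tenant: a tenant's events
    # always land in the same shard, without being pinned to any one
    # physical storage node.
    key = event["tenant_id"].encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

def route(events):
    # A shard router reads events from the upstream Kafka cluster and
    # groups them by shard, as if producing to per-shard partitions of
    # a second Kafka cluster that storage nodes consume selectively.
    shards = {i: [] for i in range(NUM_SHARDS)}
    for event in events:
        shards[shard_for(event)].append(event)
    return shards

events = [{"tenant_id": "acme", "msg": "error"},
          {"tenant_id": "acme", "msg": "warn"},
          {"tenant_id": "globex", "msg": "info"}]
shards = route(events)
```

Because the shard assignment lives in the routers (and in Kafka) rather than in the storage nodes themselves, a storage node failure no longer drags the sharding logic - or other tenants' nodes - down with it.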
But interestingly, Datadog Logs went through hyper-growth, which drove the team to revamp the system, and create a brand new architecture to support the expected higher scale. This is why they created Husky.
How did Datadog finally implement Husky, the new generation log management platform? Don’t forget to learn more from the article!
And that’s a wrap for today! Thanks a lot for reading, and hopefully, you are going to learn something new from today’s episode. See you in the next one!