Highly Scalable Media Service at Canva - Biweekly Engineering - Episode 3
From Canva, Shopify, and Twitch
Welcome to the third issue of Biweekly Engineering Blogs! I hope everyone is having a great time during the FIFA World Cup season and still managing to find some time to go through the articles shared in our last email. We will discuss some techniques for consistently reading complicated engineering articles in a future issue. For now, let's begin!
Today we have articles from Canva, Shopify, and Twitch.
Spiral stairs in the Vatican Museum
How Canva scaled its media service to serve 50 million uploads per day
#canva #mysql #dynamodb #datamigration
Canva is a well-known design platform that hosts billions of media items such as stock photos and graphics. It's a safe haven for designers to ideate and create beautiful designs. The platform currently has 100 million monthly active users and a staggering 25 billion pieces of user-uploaded media.
It's standard practice to store media files in an object store while keeping the metadata about those files in a database. The metadata generally contains a URL pointing to the actual media, from which it can be downloaded. To store and manage this metadata, Canva has a media service, which was initially built on top of MySQL. But, as tends to happen with MySQL at high scale, problems started to show up. Several steps were taken to remedy the issues, such as removing foreign key constraints, denormalising some tables, reducing metadata updates, and sharding the database, but the team knew it wouldn't be enough.
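To make the pattern concrete, here is a rough sketch of what such a metadata record might look like. The field names are illustrative assumptions, not Canva's actual schema.

```go
package media

import "time"

// MediaMetadata is a hypothetical record kept in the metadata database.
// The binary file itself lives in an object store; the record only keeps a
// pointer to it plus the attributes the media service needs to query.
type MediaMetadata struct {
	ID         string    // primary key of the media item
	OwnerID    string    // user who uploaded the media
	ObjectURL  string    // location of the file in the object store
	MimeType   string    // e.g. "image/png"
	SizeBytes  int64     // size of the stored object
	CreatedAt  time.Time // upload time
	ModifiedAt time.Time // last metadata update
}
```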
After some thorough investigation, the team decided to replace the MySQL layer with DynamoDB. As expected, they would also need to migrate to DynamoDB without disrupting the operations of the service.
The article shows how complicated migration tasks can be at high scale, especially when you decide to do it live. Briefly, the process was as follows -
For every create or update request for metadata on the media service, update the MySQL database as usual. After that, send a message to a high-priority Amazon SQS queue, which is consumed by a worker instance. Upon receiving the message, the worker reads the state from the MySQL master and writes the data to DynamoDB.
For every read request, send a message to SQS the same way as before, but this time to a low-priority queue. The workers pick messages from the low-priority queue only when the high-priority queue is exhausted.
What was the benefit of this approach? It ensured that recently read or written data was migrated to DynamoDB first. For the older data, a scanning process published messages to the same low-priority queue, to be picked up by the workers in the same way. A minimal sketch of such a worker loop is shown below.
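Here's what that worker loop could look like, as a minimal sketch. The Queue, MetadataStore, and DynamoTable interfaces are hypothetical stand-ins for the real SQS, MySQL, and DynamoDB clients; the code only illustrates the priority logic described above, not Canva's actual implementation.

```go
package migration

import (
	"context"
	"time"
)

// Message identifies a media record whose metadata should be copied.
type Message struct{ MediaID string }

// Queue abstracts an SQS queue (hypothetical interface for this sketch).
type Queue interface {
	Receive(ctx context.Context, max int) ([]Message, error)
}

// MetadataStore abstracts reads from the MySQL master.
type MetadataStore interface {
	Get(ctx context.Context, mediaID string) (map[string]string, error)
}

// DynamoTable abstracts writes to the DynamoDB table.
type DynamoTable interface {
	Put(ctx context.Context, item map[string]string) error
}

// Worker drains the high-priority queue first; only when it is empty does it
// fall back to the low-priority queue, so recently touched media is migrated first.
type Worker struct {
	HighPriority, LowPriority Queue
	Source                    MetadataStore
	Target                    DynamoTable
}

func (w *Worker) Run(ctx context.Context) error {
	for {
		msgs, err := w.HighPriority.Receive(ctx, 10)
		if err != nil {
			return err
		}
		if len(msgs) == 0 {
			// High-priority queue exhausted: pick up backfill work instead.
			if msgs, err = w.LowPriority.Receive(ctx, 10); err != nil {
				return err
			}
		}
		if len(msgs) == 0 {
			time.Sleep(time.Second) // nothing to do right now
			continue
		}
		for _, m := range msgs {
			// Read the latest state from MySQL and mirror it into DynamoDB.
			item, err := w.Source.Get(ctx, m.MediaID)
			if err != nil {
				return err
			}
			if err := w.Target.Put(ctx, item); err != nil {
				return err
			}
		}
	}
}
```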
Server Sent Events (SSE) to scale a real-time system at Shopify
#shopify #realtime #flink #bigdata
Shopify has a system that provides a live dashboard - Black Friday Cyber Monday (BFCM) Live Map. The idea is to showcase real-time sales data, like total sales, number of orders, number of unique shoppers, trending products, etc.
Given Shopify's scale, it's an interesting problem to solve - to show live sales data at scale in real-time. So how did they go about it?
The design in 2021 was based on WebSocket servers. An Apache Flink pipeline processed data from Kafka sales topics as well as Parquet files in Google Cloud Storage. The data was then fed into a multi-component Golang-based system called Cricket, built on top of WebSocket, Redis, and MySQL. Cricket pushed the data into a Redis-based mailbox system, from which web clients would poll every 10 seconds to fetch live data.
This system worked, but it had room for improvement in terms of data latency. For example, trending products could take minutes to show up on the live dashboard.
In 2022, Shopify rebuilt the BFCM system on Server-Sent Events (SSE). SSE is a server-push model where data flows from the server to the client uni-directionally, whereas WebSocket provides bidirectional communication. This effort reduced the data latency significantly and simplified the system.
A good example for us to learn which communication protocol makes sense for which use cases.
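To get a feel for how simple the server side of SSE can be, here's a minimal Go sketch of an endpoint that pushes dashboard updates to the browser. It's a generic illustration, not Shopify's code, and the payload is made up.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// liveSales streams updates to the client over a single long-lived HTTP
// response using the Server-Sent Events wire format ("data: ...\n\n").
func liveSales(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-r.Context().Done():
			return // client disconnected
		case t := <-ticker.C:
			// In a real system this payload would come from the
			// stream-processing pipeline rather than being generated here.
			fmt.Fprintf(w, "data: {\"totalSales\": %d}\n\n", t.Unix())
			flusher.Flush()
		}
	}
}

func main() {
	http.HandleFunc("/live", liveSales)
	http.ListenAndServe(":8080", nil)
}
```

On the client side, the browser's built-in EventSource API can subscribe to /live and receive each event as it arrives, with no polling at all.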
How Twitch optimised live video streaming
#twitch #livestream #optimisation
To facilitate live video streaming with high quality and low latency, Twitch runs a specific type of server called a point of presence (PoP). These servers are hosted in different geographic regions around the world. When users stream live video, a nearby PoP receives the stream, which is then sent over Twitch's private backbone network to an origin data centre. The origin hosts the resources for computationally intensive video processing, after which the live streams are delivered to viewers.
Initially, Twitch had one origin that received and processed streams from the PoPs around the globe. Over the years, as the company grew, the team had to build multiple origins to meet the needs of an increasing number of Twitch creators and viewers. But this came with a new set of problems.
Each PoP runs HAProxy, an open-source load balancer, to forward streams to one of the origins. HAProxy's routing rules are largely static, which makes routing hard to optimise once there are multiple origins.
Since origins sit in different geographical locations, the utilisation of an origin depends on the time of day. At a given moment, one origin might be doing heavy computation while another sits mostly idle, which is not an efficient use of resources.
Moreover, each origin had a different amount of resources available, so it was not straightforward to write HAProxy rules that would pick the most suitable origin for a stream. On top of that, sudden spikes in traffic needed special care.
Overall, the team at Twitch saw the need to replace HAProxy with their own solution, and thus the homegrown Intelligest was born.
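The article doesn't detail Intelligest's internals, but the kind of decision it has to make can be sketched roughly: pick an origin based on live utilisation and available capacity instead of static rules. The function below is purely illustrative, and fields like Capacity and CurrentLoad are assumptions standing in for whatever telemetry such a system would actually collect.

```go
package routing

import "errors"

// Origin describes one origin data centre as seen by the router.
// The fields are hypothetical stand-ins for live telemetry.
type Origin struct {
	Name        string
	Capacity    float64 // total video-processing capacity, in arbitrary units
	CurrentLoad float64 // capacity currently in use
}

// PickOrigin chooses the origin with the most free headroom, the kind of
// dynamic decision that is hard to express with static HAProxy rules.
func PickOrigin(origins []Origin) (Origin, error) {
	var best Origin
	bestHeadroom := -1.0
	for _, o := range origins {
		headroom := o.Capacity - o.CurrentLoad
		if headroom > bestHeadroom {
			best, bestHeadroom = o, headroom
		}
	}
	if bestHeadroom <= 0 {
		return Origin{}, errors.New("no origin has spare capacity")
	}
	return best, nil
}
```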
That's all for today. See you all in the next issue! 👋