
Analytics at Scale: How Uber Analyses App Crashes | Biweekly Engineering - Episode 24

App crashes analytics at Uber | How Gojek built its push notification system | Media search through machine learning at Netflix

Greetings, dear subscribers! Another day, and another episode of your favourite Biweekly Engineering is here!

As usual, today we have three more engineering articles from around the internet, this time from Uber, Gojek, and Netflix.

Ready to dive in? Let’s go!

Somewhere in the beautiful Oslo Public Library

Healthline: Uber’s Real-Time Analytics for App Crashes

Uber has an internal tool called Healthline.

What is Healthline exactly? 

Healthline is a crash/exception logging, monitoring, analysis, and alerting tool built for all of Uber.

Basically, Healthline is a platform where errors from all of Uber’s other platforms (including mobile apps and microservices) are analysed and visualised in real time, as they happen.

At Uber’s scale, this is a big deal. Imagine how much data the platform has to deal with!

Fortunately, we get some idea from the article. At peak, the platform processes more than 1.5 million errors and app crashes per second, which results in 36TB of data per day.

This is a solid example of a system built for massively high scale!
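
To put those numbers in perspective, here’s a quick back-of-envelope calculation (my own arithmetic, not from the article):

```python
# Back-of-envelope check of the scale quoted above. The figures (1.5M
# errors/second at peak, 36 TB/day) come from the article; everything
# derived here is rough arithmetic.
peak_events_per_sec = 1_500_000
bytes_per_day = 36 * 10**12          # 36 TB, using decimal terabytes

seconds_per_day = 24 * 60 * 60
avg_ingest = bytes_per_day / seconds_per_day
print(f"Average ingest: ~{avg_ingest / 10**6:.0f} MB/s")   # ~417 MB/s

# If the peak event rate were sustained all day, each event would average:
avg_event_size = bytes_per_day / (peak_events_per_sec * seconds_per_day)
print(f"~{avg_event_size:.0f} bytes/event")                 # ~278 bytes
```

Even averaged over a full day, that’s hundreds of megabytes of crash data ingested per second, around the clock.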

While the article discusses the context and the architecture, its crux is Apache Pinot.

So what is Apache Pinot? From the official website:

Realtime distributed OLAP datastore, designed to answer OLAP queries with low latency

OLAP stands for Online Analytical Processing. It simply means processing data to generate insights from it. The challenge is that both the data and the insights generated from it can be humongous. This is why we need specialised databases like Apache Pinot to efficiently store analytical data and serve it to consumers.
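
To make this concrete, here is a minimal sketch of what an OLAP-style query against Pinot can look like from Python, assuming the open-source pinotdb DB-API client. The table and column names (crash_events, appVersion, eventTimeMillis) are invented for illustration; this is not Healthline’s actual schema.

```python
from pinotdb import connect  # open-source Pinot DB-API client: pip install pinotdb

# Connect to a Pinot broker (localhost and port 8099 are placeholder values).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# A typical low-latency OLAP query: top crashing app versions over the
# last 5 minutes, aggregated on the fly.
curs.execute("""
    SELECT appVersion, COUNT(*) AS crashes
    FROM crash_events
    WHERE eventTimeMillis > ago('PT5M')
    GROUP BY appVersion
    ORDER BY crashes DESC
    LIMIT 10
""")
for app_version, crashes in curs:
    print(app_version, crashes)
```

The point is that aggregations like this come back with low latency even over huge event streams, which is exactly what a real-time alerting pipeline needs.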

I personally found the article a tremendous example of a system with Apache Pinot and real-time data analytics at its core. As you can imagine, when crashes and errors spike somewhere in the wild west of Uber’s systems, bells have to be rung (that is, on-call engineers have to be alerted). Without a real-time system, such a responsive mechanism would be impossible. And this is one of the many reasons why, over the years, we have seen the rise of real-time systems and OLAP databases.

I also recently shared a talk on LinkedIn that gives a high-level overview of how three popular OLAP databases (ClickHouse, Druid, and Pinot) compare with one another.

Push Notifications in Real Life: A Gojek Case Study

Design a push notification system: a famous system design interview question we all know about. You have probably learnt about it in your favourite system design courses or tutorials. Let’s take a look at how Gojek designed its push notification system, which handles a million push notifications an hour.

Firstly, the article sets up the context by discussing the challenges Gojek faces:

  • Multiple apps - Gojek has multiple apps for its different products.

  • Multiple push notification providers - Since Gojek supports both Android and iOS, multiple providers have to be supported: Firebase Cloud Messaging (FCM), the now-deprecated Google Cloud Messaging (GCM), and the Apple Push Notification service (APNs).

  • Multiple devices - A user can have more than one device while also having more than one app in a device.

  • Multiple services - Gojek has a microservice-based architecture which means multiple services would want to send notifications and the system has to support all of them.

The article discusses, in good detail, the architecture of the notification system that solves all the challenges above. To give a high-level summary (a minimal sketch follows the list):

  • Notification server - The consumer-facing service that receives push notification requests from consumers and creates a job in the job queue for actually sending the notifications.

  • Token store - A service that stores data like user_id and device_token, which the notification server can use to create a push notification job.

  • Job queue - All the push notification jobs created by the notification server are put in this queue. Gojek has a separate queue for each application and push notification provider type.

  • Notification workers - The actual workers that pick up jobs from the queue and make API calls to push notification providers like FCM or APNs to deliver the notification to user devices.
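
Here is a minimal sketch of that flow in Python. All names here (PushJob, send_fcm, send_apns) are invented for illustration; Gojek’s actual implementation is of course far more involved (persistent queues, retries, batching, and so on).

```python
import queue
from dataclasses import dataclass

# Minimal sketch of the queue-per-(app, provider) design described above.

@dataclass
class PushJob:
    device_token: str
    title: str
    body: str

# One job queue per (app, provider) pair, as the article describes.
job_queues = {
    ("driver_app", "fcm"): queue.Queue(),
    ("driver_app", "apns"): queue.Queue(),
}

def send_fcm(job: PushJob) -> None:
    # Stand-in for a real call to the FCM HTTP API.
    print(f"FCM -> {job.device_token}: {job.title}")

def send_apns(job: PushJob) -> None:
    # Stand-in for a real call to the APNs HTTP/2 API.
    print(f"APNs -> {job.device_token}: {job.title}")

PROVIDERS = {"fcm": send_fcm, "apns": send_apns}

def notification_server(app: str, provider: str, job: PushJob) -> None:
    """Consumer-facing entry point: enqueue a job instead of sending inline."""
    job_queues[(app, provider)].put(job)

def notification_worker(app: str, provider: str) -> None:
    """Drain one queue and call the matching provider's API."""
    q = job_queues[(app, provider)]
    while not q.empty():
        PROVIDERS[provider](q.get())

notification_server("driver_app", "fcm", PushJob("token-123", "New order", "Pickup nearby"))
notification_worker("driver_app", "fcm")
```

Note the queue per (app, provider) pair, mirroring the article: a slow or failing provider then only backs up its own queue instead of stalling every notification.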

If you think about it, the architecture is pretty similar to what we have learnt in our system design training.

Media Search at Netflix

Netflix relies heavily on machine learning to enable creators and editors to search through and generate content. The article is a great example of how Netflix leverages ML to support media search at scale.

Note that the article doesn’t discuss the ML models or data strategies; rather, it discusses how such models are integrated into a system and used at scale.

The use cases discussed in the article give us a glimpse of how machine learning saves Netflix hours and days of tedious work. Briefly,

  • Dialogue search - In a media file, editors might need to search for a specific or catchy dialogue by an actor.

  • Visual search - Here, editors need to search from visuals. An example could be searching for red race cars across a media catalog.

  • Reverse shot search - In this case, editors provide a frame or clip and ask for similar shots across the library.

In the end, the team designed an end-to-end system with a few major components (a toy sketch of the searcher side follows the list):

  • API interfaces receive requests from clients.

  • Search gateway receives requests from the interface layer, transforms each request to match the internal representation of the various forms of data, and routes it to the searchers. Note that the platform needs to support searching across video, text, images, etc., which means input data has to be transformed into the correct formats.

  • Searchers host the different kinds of resultant data, query them, and serve the results.

  • ML platform (managed by separate teams) is where the models are built, executed, and results are streamed for consumption.

  • Indexers consume the streams and index the results so that searchers can query them.
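
To give a feel for the searcher side, here is a toy sketch of a reverse-shot-style search over precomputed embeddings. Everything here (the embedding size, the in-memory index, cosine similarity) is an illustrative assumption, not Netflix’s actual implementation.

```python
import numpy as np

# Toy sketch: index precomputed shot embeddings (the "results" the ML
# platform streams out) and answer a reverse-shot-style query with
# cosine similarity.

rng = np.random.default_rng(0)

# Indexer output: shot_id -> embedding produced by some ML model.
index = {f"shot_{i}": rng.normal(size=128) for i in range(1_000)}
shot_ids = list(index)
matrix = np.stack([index[s] for s in shot_ids])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # normalise once

def reverse_shot_search(query_embedding: np.ndarray, top_k: int = 5) -> list[str]:
    """Return the shot ids whose embeddings are most similar to the query clip's."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = matrix @ q                      # cosine similarity against the whole index
    best = np.argsort(scores)[::-1][:top_k]
    return [shot_ids[i] for i in best]

# Query with the embedding of some input frame or clip (random here).
print(reverse_shot_search(rng.normal(size=128)))
```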

The above is a very brief summary of the architecture. I would say the article’s discussion is a bit high-level, but it’s enough for us to get an idea of the overall architecture.

That’s all for today. Hopefully you have found the articles useful, and as always, learnt something new. Remember, for us software engineers, there is hardly any better method for learning than reading quality engineering blogs. Let’s make it a habit!

See you next year! 🙌
