Technical Incident Management - Biweekly Engineering - Episode 8

With an article on how to break up a monolith.

Welcome back, my dear subscribers! Hope life has been treating you well!

I am back with the 8th episode of the newsletter. Today, we have the following topics to discuss -

  • My experience of technical incident management in the industry.

  • And a post from the Pragmatic Engineer newsletter.

Hope you will enjoy this episode and learn something new.

Let’s begin!

The view of beautiful Lisbon from the top of Castelo de S. Jorge

Technical incident management at scale

#incidentmanagement #oncall #pagerduty #datadog #graphite

Context

Things go wrong all the time. That’s the hard truth of life.

The larger a system gets, the more chances there are that parts of it will break now and then. If your business is booming and a broken or partially broken system means lost money or a bad user experience, you cannot help but stay alert and respond immediately when anything goes wrong. Money matters, and you cannot afford to lose customers due to a bad user experience.

Guess what, even if your system is built by the best engineers in the world following the best software engineering practices known to humankind, it will break now or in the near future. That’s the hard truth of a developer’s life.

So what do you do, knowing that your system, or a part of it, will surely fail? The answer, in short, is incident management.

How to conduct incident management

Successful companies with large businesses serving users 24/7 need to take incident management very seriously. There is really no room for slacking here.

In my experience, I have seen the following pattern for incident management in systems with many services in operation -

  • Each service emits metrics to an external system where the metrics are persisted. It’s generally a time-series database that stores values against timestamps (a small Python sketch of this step appears right after this list).

  • Another system is used to query the persisted data and visualise it on some UI. This system periodically queries the data and updates the visualisations accordingly.

  • A dashboard is created for better visualisation where a bunch of metrics related to the service are put together.

  • Since this system queries the metrics periodically, rules can be set up in it to publish alerts to an alerting system.

  • The alerting system, upon receiving an alert, can trigger the actual alert through communication channels to notify system owners and developers.
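
To make the first step concrete, here is a minimal sketch of a service pushing a data point to a time-series store. It assumes a Graphite setup with Carbon’s plaintext listener on its default port 2003; the host name and metric name are made up for illustration.

```python
import socket
import time
from typing import Optional

# Assumed Carbon (Graphite) plaintext listener; the host is hypothetical,
# port 2003 is Carbon's default for the plaintext protocol.
CARBON_HOST = "graphite.internal.example.com"
CARBON_PORT = 2003

def emit_metric(path: str, value: float, timestamp: Optional[int] = None) -> None:
    """Send one data point using Graphite's plaintext protocol:
    "<metric.path> <value> <unix-timestamp>" followed by a newline."""
    ts = timestamp if timestamp is not None else int(time.time())
    line = f"{path} {value} {ts}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))

# Example: the service reports how many requests failed in the last minute.
emit_metric("payments.api.errors_per_minute", 7)
```

In practice, services usually go through a metrics client or an aggregator rather than raw sockets, but the underlying idea stays the same: a named metric, a value, and a timestamp.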

An incident management architecture

The above might sound a bit vague. Let me be more specific with the following diagram.

A high-level incident management flow in a considerably large system

Let’s assume you want to design an architecture to build well-automated incident management for your company. Based on my experience, the above diagram shows a possible solution.

  • Service - a critical service in your system that requires tracking. If anything goes wrong in this service, you want to mitigate it as quickly as possible.

  • Graphite - a popular system that supports storing metrics data in time-series format. It’s important that all the metrics are stored against a timestamp, because that’s how you can track how your service behaves over time. All the different system metrics from the service are persisted in Graphite.

  • Datadog - a well-established monitoring system. Datadog will query Graphite to fetch metrics and serve visualisations. Also, rules will be set up in Datadog to publish alerts so that owners of the service can be notified.

  • PagerDuty - Datadog publishes the alerts to PagerDuty, a widely-used alerting service. PagerDuty receives the alert from Datadog and raises the actual alert. PagerDuty has integration support for instant-messaging services and telecommunication systems. For example, in our case, as depicted in the image above, PagerDuty sends a message to the appropriate Slack channel and calls the respective phone number to notify about the issue observed in the service (a small sketch of raising such an alert programmatically follows this list).
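
And for the alerting leg, this is roughly what raising an alert looks like in code. It is a minimal sketch against PagerDuty’s Events API v2, assuming a hypothetical routing key; in the setup above you would rarely write this yourself, since the Datadog integration with PagerDuty typically makes this kind of call for you when a monitor trips.

```python
import requests  # third-party HTTP client (pip install requests)

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder; comes from the PagerDuty service integration

def trigger_alert(summary: str, source: str, severity: str = "critical") -> str:
    """Open an incident on the PagerDuty service behind ROUTING_KEY
    and return the dedup key PagerDuty assigns to it."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # what the on-call engineer will see
            "source": source,      # which system observed the problem
            "severity": severity,  # critical / error / warning / info
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=5)
    response.raise_for_status()
    return response.json().get("dedup_key", "")

# Example: a monitor threshold was breached.
trigger_alert("payments-service error rate above 5% for 5 minutes", "datadog-monitor")
```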

How responsibilities are handled

One obvious question is: who receives the call?

Companies have on-call policies in place where at least one engineer is available round-the-clock to respond to the notifications sent by the alerting system. In our example, the idea can briefly be broken down as follows -

  • The team, typically the team that owns the service, sets up on-call policies on PagerDuty. For example, in a team of 5 members, each member can be marked as the on-call engineer for a week in a round-robin manner (a toy sketch of such a rotation appears after this list).

  • The on-call schedules, along with the engineers’ phone numbers and Slack usernames, are also configured on PagerDuty.

  • During a shift, the on-call engineer is the first line of defence against any issue in the system. The engineer is expected to respond immediately when they get notified.

  • When an on-call engineer receives a notification, they have to acknowledge the alert. It just requires clicking a button to let PagerDuty know that the engineer is now actively working on the issue.

  • Escalation policies are also configured on PagerDuty. This means that if an on-call engineer fails to respond to a call, PagerDuty escalates the alert to someone else, typically the manager of the owning team.

  • There could also be multiple levels of escalation policies based on the criticality of the issue. PagerDuty will continue notifying everyone in the chain until receiving an acknowledgement from someone.
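
To make the rotation and escalation ideas concrete, here is a toy sketch in Python. It is purely illustrative, not PagerDuty’s actual configuration format; the names, timeouts, and weekly rotation rule are all made up.

```python
from datetime import date

# Hypothetical 5-person team; each member covers one week, round-robin.
ON_CALL_ROTATION = ["alice", "bob", "carol", "dave", "erin"]

# Hypothetical escalation chain: if the current level does not acknowledge
# within the timeout, the alert moves on to the next level.
ESCALATION_CHAIN = [
    {"notify": "on-call engineer", "ack_timeout_minutes": 5},
    {"notify": "secondary engineer", "ack_timeout_minutes": 10},
    {"notify": "engineering manager", "ack_timeout_minutes": 15},
]

def on_call_for(day: date) -> str:
    """Return who is on call during the ISO week containing `day`."""
    week_number = day.isocalendar()[1]  # ISO week of the year (1-53)
    return ON_CALL_ROTATION[week_number % len(ON_CALL_ROTATION)]

print(on_call_for(date(2023, 6, 12)))  # prints whichever name owns that week
```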

Don’t assume that it’s solely the on-call engineer’s responsibility to find a resolution for the issue in the service. Based on the criticality and difficulty, on-call engineers can always ask for help by manually triggering alerts for the relevant engineers on PagerDuty.

The most crucial expectation from an on-call engineer (or a group of engineers) is to resolve the issue as quickly as possible by taking whatever steps are required. It is not a must to find the root cause immediately.

For example, assume that your service started facing out-of-memory errors after a recent deployment. Now, if the on-call engineer has quickly identified the deployment that is causing the issue, they are expected to revert the changes. It is not mandatory to immediately find the root cause, i.e. the specific part of the code that was responsible for the OOM errors.

Post-incident remedies

Now the big question comes - what happens after an incident is resolved?

After an on-call engineer has responded and eventually mitigated the issue, the incident management practices do not conclude there. It is super crucial to take a few further steps to reduce such incidents in the future. At a bare minimum, you can expect the following in companies with a good engineering culture -

  • Based on the criticality of the issue, a postmortem or RFO (reason-for-outage) doc is prepared by the owning team.

  • In the doc, no fingers are pointed at anyone. It is very important to avoid blaming a specific person for the issue, and the language of the RFO doc should not be aggressive.

  • The doc is also reviewed by the other stakeholders. It is common to have regular RFO review meetings.

  • In the RFO doc, some specific questions are answered. For instance -

    • What was the reason for the issue or outage?

    • What steps were taken to mitigate it?

    • What could have been done better?

    • What was the timeline of the outage?

    • What was the business impact?

    • What steps have been taken to prevent the same issue from occurring in the future? What steps will be taken going forward?

As we can see, the idea is to put down as much detailed information as possible in a doc and share the knowledge with the relevant teams so that future occurrences of the same issue can be reduced.

Designing a well-architected system is hard. Protecting a well-architected system from failures is harder. That’s where good incident management practices make the difference.

Your humble author

This is it for today’s discussion on incident management in tech companies. Thanks for reading! I expected the discussion to be shorter. Sorry for making it too long! 😁 

Breaking up a monolith - the Khan Academy example

#thepragmaticengineer #monolith #microservices #go #python

The next topic we have is a very well-written article from the Pragmatic Engineer newsletter by Gergely Orosz.

In the previous issue, I shared the story of scale at LinkedIn, where we saw how LinkedIn moved from a monolith to a service-oriented architecture. Here is a detailed analysis from Gergely of how Khan Academy, a non-profit edtech company, did the same.

Khan Academy started with a monolith written in Python. Eventually, they decided to break it up into services. Not only that, they also rewrote the services in a different programming language - Go.

The article discusses in detail how the company moved from the monolith to services in three and a half years. It was a large project involving one hundred engineers. The best part of the article is that Gergely personally talked to two engineers who were directly involved in the project and shares their thoughts on the migration.

Stories like this are precious. I personally enjoyed the post a lot and I cannot recommend it enough.

That would be the end of today’s discussion. Hope you have had a good time reading today’s issue. Thanks for reading, and take care! 👋 
