How Code Search Works at GitHub - Biweekly Engineering - Episode 9

And an introduction to Event-Driven Architecture

Welcome back to the 9th episode of Biweekly Engineering!

(Yes, I have decided to call it “episode” instead of “issue” - sounds better, right?)

Today, we have a complex yet fascinating blog post from GitHub to share, along with an introductory article on event-driven systems. Why don’t we jump right into it?

The bottom view of the Eiffel Tower - a hard-to-miss landmark in Paris

Code search at scale - a GitHub case study

#github #search #microservices #kafka #distributedsystems #scalability

The ability to search through code is a crucial feature for developers, and GitHub, a popular code hosting platform, released the public beta of its code search in November 2022. Given the platform's vast scale, how could GitHub build a feature that lets users search across such an enormous number of repositories? Fortunately, GitHub described its code search system in detail in a well-written article on its engineering blog.

To be more specific, GitHub currently hosts 200 million repositories, of which 45 million were included in the beta release. At a high level, the code search system consists of three parts:

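  1. Storing: The index lives in a search engine that GitHub built specifically for code rather than in a general-purpose text search product. It is sharded by Git blob object ID, which spreads the data evenly across shards and avoids storing identical content more than once.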

  2. Indexing: All the code in the 45 million repositories has to be indexed, and the initial pass over that much code takes a long time. GitHub distributes the indexing work to speed up the process.

  3. Querying: The code search feature is served by a service called Blackbird, which distributes queries to all the shards.
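To make the sharding and fan-out idea concrete, here is a minimal sketch in Python. It is not GitHub's implementation: the class and function names, the shard count, and the brute-force substring scan are all assumptions made for illustration (a real code search engine would use a proper index instead of scanning content). The only part taken from Git itself is how a blob object ID is computed.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def blob_id(content: bytes) -> str:
    """Git blob object ID: SHA-1 over a 'blob <size>' header, a NUL byte, and the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def shard_for(oid: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the object ID; hashes spread evenly across shards."""
    return int(oid, 16) % num_shards


class ShardedIndex:
    """Toy index: each shard maps blob ID -> content plus the paths that share it."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def index(self, path: str, content: bytes) -> str:
        oid = blob_id(content)
        shard = self.shards[shard_for(oid, len(self.shards))]
        # Identical content always hashes to the same ID, so it is stored once.
        entry = shard.setdefault(oid, {"content": content, "paths": set()})
        entry["paths"].add(path)
        return oid

    def search(self, needle: bytes) -> list:
        """Scatter the query to every shard and gather the matching paths."""
        results = []
        for shard in self.shards:
            for entry in shard.values():
                if needle in entry["content"]:  # stand-in for a real index lookup
                    results.extend(entry["paths"])
        return sorted(results)


index = ShardedIndex()
index.index("repo1/util.py", b"def add(a, b): return a + b\n")
index.index("repo2/util.py", b"def add(a, b): return a + b\n")  # duplicate content
print(index.search(b"def add"))
```

Because the shard is chosen from the content hash, the two identical files above end up as a single stored blob with two paths attached, and a query still has to visit every shard since any shard may hold a match.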

It is essential to keep in mind that code search is fundamentally different from text search and requires special care. Code search involves searching through programming languages and code structures, whereas text search involves searching through plain text. GitHub has developed a sophisticated system that addresses these differences and provides a reliable code search feature to its users.

Personally, I found the article complex but entertaining. It uses some terms that I was not familiar with. If it is the same for you, looking them up to learn more is highly recommended!

Introduction to Event-Driven Architecture (EDA)

#eventdrivenarchitecture #eda #kafka #messaging

In the past, Event-Driven Architecture (EDA) was a relatively unknown concept. However, in the last fifteen years or so, there have been significant advancements in computational capabilities that have enabled us to handle massive amounts of data through reading, writing, and processing. As a result, EDA has become a reality.

EDA is a system where every action is initiated by an event. There are three primary actors in an EDA system: producers, brokers, and consumers.

  • Producers - responsible for generating the events and sending them out to the brokers.

  • Brokers - act as intermediaries that receive events from producers and distribute them to consumers.

  • Consumers - as the name implies, consume events from the broker's queue.
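The three roles can be sketched with a tiny in-memory broker in Python. This is only an illustration under stated assumptions: the `Broker` class, its method names, and the "orders" topic are hypothetical, and a production system would use a real broker such as Kafka rather than a queue inside one process.

```python
from collections import defaultdict, deque


class Broker:
    """Toy broker: queues events per topic and hands them to subscribed consumers."""

    def __init__(self):
        self._queues = defaultdict(deque)
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Called by producers; the event is only queued, not processed yet."""
        self._queues[topic].append(event)

    def dispatch(self):
        """Deliver all queued events and return the number of deliveries made."""
        delivered = 0
        # Copy the items so handlers may publish to new topics while we iterate.
        for topic, queue in list(self._queues.items()):
            while queue:
                event = queue.popleft()
                for handler in self._subscribers[topic]:
                    handler(event)
                    delivered += 1
        return delivered


broker = Broker()
emails, invoices = [], []

# Two independent consumers of the same event stream.
broker.subscribe("orders", lambda e: emails.append(f"email for order {e['id']}"))
broker.subscribe("orders", lambda e: invoices.append(f"invoice for order {e['id']}"))

# The producer knows only the topic name, not who consumes the event.
broker.publish("orders", {"id": 1})
broker.dispatch()
print(emails, invoices)
```

The producer never calls the consumers directly, so a new consumer can be added with one more `subscribe` call and no change to the producer.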

EDA has several advantages over traditional approaches to system design. One of the most significant benefits is that it allows for greater flexibility and scalability. Because events are processed asynchronously, different components of the system can operate independently, making it easier to add or remove functionality without disrupting the system as a whole. This also makes it possible to scale the system horizontally by adding more instances of the components as needed.

Another advantage of EDA is that it can help reduce coupling between different components of the system. In a traditional system, components are often tightly coupled, meaning that changes to one component can have a ripple effect throughout the system. With EDA, components are loosely coupled, which makes it easier to modify or replace components without affecting the rest of the system.

One thing I personally liked about the article is that the author correctly points out:

EDA is not a silver bullet. It does not eliminate the notion of coupling altogether — otherwise, components in the system would no longer function collectively.

If you are looking to gain an understanding of the principles behind EDA, this article is an excellent starting point.

Introducing my own course!

As the third topic for today’s episode, I have some good news to share. I now have my own course on Educative.io, the well-known educational platform for software engineers!

The course is structured in a way that ensures that the basics are in place for the learners. I begin with the definition of distributed systems and explain why we need them. Then I discuss core concepts like fault-tolerance, availability, scalability, etc. Since data is a fundamental part of distributed systems, the next chapter covers concepts such as replication, partitioning, and consistency.

The following chapters focus on communication in distributed systems, data processing, and a few popular architectural patterns. Finally, I touch on two real-life systems: Apache Spark and Apache Druid.

The course is aimed at learners who have little to no knowledge of distributed systems. Whether you are a recent graduate, starting your career, or looking to switch your focus to backend engineering, this course will help you build a strong foundation. For others, the course can certainly help you brush up on your knowledge.

This marks the end of today’s episode (to be honest, I am enjoying the word “episode” more than “issue”, which sounded problematic). Thanks for reading, and see you next time. Till then, adios! 🤝
