Koo’s data platform — part 1: Apache Kafka and NiFi

By Phaneesh Gururaj

Data, as they say, is core to the success of any organisation, especially for a product like Koo. We capture various data points in real time, which helps us understand our users' journeys better. With the growth we have been seeing, there is a continuous challenge to scale our data platform from both an analysis and a streaming perspective. When we decided to build our data platform, we took a holistic approach, keeping the following architecture goals in mind.

  • Scalable ingestion pipelines.
  • Low cost data storage.
  • Easy to query and bring in external data as necessary.
  • Open Source.
  • Access Controls.
  • Store snapshots of processed data.

In Part 1 of this blog series, we cover Kafka and NiFi.

We evaluated a few common patterns built on top of the Apache stack: Kafka, NiFi, Hudi, Parquet and Spark.

Here are some stats about the size of our data:

  • > 500 GB of application logs every day.
  • > 20 GB of structured critical user data per day: profile, journey, critical actions.
  • > 3M impressions per day.

Kafka

We needed a battle-tested message queuing system, and Kafka was the ideal choice. For our scale, we needed something that is horizontally scalable and fault tolerant.

The concept of topics in Kafka helped us efficiently layer our ingestion architecture, as there are multiple producers of events. Broadly, we capture the following (a small producer sketch follows the list):

  • [Koos read by users → Impressions]
  • [Koos liked, re-kooed → Reactions]
  • [People follow / unfollow → Network]
  • [Visits → Profile, Image, Koo-Details, Screen]
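As a rough illustration of this layering, the sketch below publishes each event category to its own topic. The topic names, payload fields and the kafka-python client are assumptions for the example, not our exact production setup.

```python
# A minimal sketch (assumed topic names and payloads), using the kafka-python client.
import json
from kafka import KafkaProducer

# Serialise event dicts to JSON bytes before they hit the broker.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Hypothetical mapping of event categories to topics.
TOPIC_BY_EVENT = {
    "impression": "impressions",   # Koos read by users
    "reaction": "reactions",       # likes, re-koos
    "network": "network",          # follow / unfollow
    "visit": "visits",             # profile, image, koo-details, screen
}

def publish(event_type: str, payload: dict) -> None:
    """Route an event to its topic so downstream consumers stay decoupled."""
    topic = TOPIC_BY_EVENT[event_type]
    producer.send(topic, value=payload)

publish("impression", {"user_id": 42, "koo_id": 1001, "ts": "2021-12-01T10:00:00Z"})
producer.flush()  # block until buffered events are delivered
```

Keeping each category on its own topic lets downstream consumers (NiFi flows, ML jobs) scale and fail independently.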

NiFi

To orchestrate the various ETLs from the producers and chain the data pipeline together, NiFi is a good candidate. Its various built-in connectors come in quite handy while stitching together transformations and pipelines.

A few characteristics of NiFi that are super important for us:

  • Runtime flow management is possible.
  • Dynamic prioritisation.
  • Data Provenance → Tracking the data path.
  • Effectively manage back pressure and scale processors.

The graph below shows the chart for our impressions (critical analytical data that powers many of our ML pipelines). As the data flows through Kafka and finally settles in S3 (a rough sketch of this path follows the list):

  • a bunch of transformations are applied
  • data is at times pulled from other sources for quick reference
  • data is sent to the next processor in the pipeline depending on certain conditions
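To make that concrete, here is a rough Python sketch of those three steps for a single impression record. The enrichment lookup, routing condition, bucket and key layout are illustrative assumptions, not our actual NiFi flow.

```python
# Illustrative sketch of the impression path: transform, enrich, route, land in S3.
# Bucket name, key layout and the lookup function are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

def lookup_language(user_id: int) -> str:
    """Stand-in for pulling reference data from another source (e.g. a user service)."""
    return "kannada"

def process_impression(raw: dict) -> None:
    # 1. Apply transformations: keep only the fields we care about.
    record = {
        "user_id": raw["user_id"],
        "koo_id": raw["koo_id"],
        "ts": raw["ts"],
    }
    # 2. Enrich with data pulled from another source.
    record["language"] = lookup_language(raw["user_id"])
    # 3. Route on a condition; here, only enriched records land in S3.
    if record["language"]:
        key = f"impressions/dt={record['ts'][:10]}/{record['koo_id']}-{record['user_id']}.json"
        s3.put_object(Bucket="koo-data-lake", Key=key, Body=json.dumps(record).encode("utf-8"))
```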

Our NiFi use case demands a lot of merging → more details here. As you can see from the graph above, the rpm of merging is quite high. At times, we trigger some flows to accelerate the pipelines by provisioning more resources for certain requests compared to the steady state. This is a big advantage that helps us manage resources without disturbing the infra.

Kafka and NiFi form a potent combination for setting up data pipelines efficiently. The horizontally scalable nature of both technologies is a critical aspect as well. To understand NiFi better, one needs to understand FlowFiles and Processors in a bit more depth.

FlowFiles — A FlowFile is the basic processing entity in Apache NiFi. It has content and attributes, which are used by NiFi processors to process data. The content normally holds the data fetched from the source system.

Processors — A processor forms the basic building block for creating an Apache NiFi dataflow. Processors provide an interface through which NiFi provides access to a FlowFile, its attributes and its content.
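For intuition only, the toy model below mirrors those two concepts in Python: a FlowFile is content plus attributes, and a processor takes FlowFiles in and emits FlowFiles out. This is a mental model, not NiFi's actual Java API.

```python
# Toy model of NiFi's FlowFile / Processor concepts (not the real NiFi API).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FlowFile:
    content: bytes                                              # data fetched from the source system
    attributes: Dict[str, str] = field(default_factory=dict)    # metadata used for routing

def uppercase_processor(flowfiles: List[FlowFile]) -> List[FlowFile]:
    """A 'processor': reads each FlowFile's content and attributes, emits new FlowFiles."""
    out = []
    for ff in flowfiles:
        out.append(FlowFile(
            content=ff.content.upper(),
            attributes={**ff.attributes, "transformed": "true"},
        ))
    return out

batch = [FlowFile(content=b"koo impression event", attributes={"source": "kafka"})]
print(uppercase_processor(batch)[0].attributes)   # {'source': 'kafka', 'transformed': 'true'}
```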

Scenario — Sudden spikes during notification events or certain Koos going viral

NiFi handles these sudden spikes really well, and we can fine-tune back pressure to suit our needs.

P1 → P2 → P3

Eg: We have 3 processors: P1, P2 and P3. The back pressure on the connection into P3 is configured to, let us say, 10K FlowFiles. These are soft limits and can be tuned depending on the data path and time. If P2 produces a burst of, say, 1M FlowFiles, the queue feeding P3 starts to fill up; once it crosses the threshold, P2's scheduler is paused until P3 drains the queue back below the limit. In this way, we are able to absorb sudden spikes as well.
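The snippet below simulates that pause/resume behaviour with an in-memory queue; the 10K threshold matches the example above, while the burst and batch sizes are made up. In NiFi itself, this is the back pressure object threshold configured on the connection.

```python
# Simulation of NiFi-style back pressure between P2 and P3 (illustrative only).
from collections import deque

BACK_PRESSURE_THRESHOLD = 10_000   # soft limit on the P2 -> P3 queue
p2_to_p3 = deque()

def p2_scheduler_allowed() -> bool:
    """P2 is only scheduled while the downstream queue is under the threshold."""
    return len(p2_to_p3) < BACK_PRESSURE_THRESHOLD

def run_p2(burst: int) -> int:
    """P2 produces FlowFiles until back pressure kicks in; returns how many were emitted."""
    emitted = 0
    while emitted < burst and p2_scheduler_allowed():
        p2_to_p3.append({"flowfile": emitted})
        emitted += 1
    return emitted

def run_p3(batch_size: int) -> None:
    """P3 drains the queue; once it drops below the threshold, P2 can run again."""
    for _ in range(min(batch_size, len(p2_to_p3))):
        p2_to_p3.popleft()

print(run_p2(1_000_000))        # 10000 -> P2 is paused once the queue hits the threshold
run_p3(5_000)
print(p2_scheduler_allowed())   # True -> queue dropped below the limit, P2 resumes
```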

Parameter Context

Eg: At Koo, we receive media in the form of audio, video and images. We have a message object that contains some metadata about a media object (which could be an image, video or audio), and parameter context is a good fit for this use case. When this message object hits the NiFi pipeline, the context is resolved first and the appropriate projection and chaining is established. Different pipelines can be built for audio, image and video, and some common properties can be extracted just once and passed on. Parameter context inheritance is available in the latest version of NiFi, 1.15.0, and is quite efficient. As an engineer, one can design well laid out pipelines, similar to how one writes efficient code.
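As a loose code analogy, the sketch below does what the parameter-context-driven flow does: common parameters are resolved once and the message is then chained into a media-specific pipeline. The parameter names and pipelines are hypothetical; in NiFi this is configured through Parameter Contexts and processors rather than written as code.

```python
# Hypothetical sketch of context-driven routing by media type (audio / image / video).
from typing import Callable, Dict

# Shared parameters, resolved once per environment -- the role a Parameter Context plays.
COMMON_PARAMS = {"cdn_base_url": "https://cdn.example.com", "max_size_mb": 50}

def audio_pipeline(msg: dict, params: dict) -> str:
    return f"audio -> transcode {params['cdn_base_url']}/{msg['media_id']}"

def image_pipeline(msg: dict, params: dict) -> str:
    return f"image -> thumbnail {params['cdn_base_url']}/{msg['media_id']}"

def video_pipeline(msg: dict, params: dict) -> str:
    return f"video -> chunk {params['cdn_base_url']}/{msg['media_id']}"

PIPELINE_BY_MEDIA: Dict[str, Callable[[dict, dict], str]] = {
    "audio": audio_pipeline,
    "image": image_pipeline,
    "video": video_pipeline,
}

def handle(message: dict) -> str:
    """Pick the pipeline from the message's media type; common params are passed along once."""
    media_type = message["metadata"]["media_type"]
    return PIPELINE_BY_MEDIA[media_type](message, COMMON_PARAMS)

print(handle({"media_id": "abc123", "metadata": {"media_type": "video"}}))
```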

Summary

NiFi is a great data pipeline builder. For a data engineer keen to set up strong pipelines and manage them efficiently, it is a great addition to their repertoire. We are hiring and building our data platform team; if you would like to join, please share your profile at ta@kooapp.com.

In the next part of this blog series, we will go deeper into our S3 architecture, partition strategy and some use cases.
