Welcome to the first-ever post on The Flow, a blog by the Datazoom Engineers. We want to use this space to dive deep into our technology and highlight the unique ways in which our Engineers code and design our cutting-edge #AdaptiveVideoLogistics platform.
Transparency is highly valued at Datazoom. We firmly believe that everyone who uses our technology, or is simply interested in it, should be able to understand how it works without an extensive coding or technical background. At the same time, we don’t want to take away from the highly nuanced and complex workings of what powers our platform.
Datazoom Engineers specialize in building simple solutions to complicated data management challenges. We’ll start today by providing a high-level overview of Apache Kafka, Datazoom’s technological backbone.
Kafka is a publish-subscribe messaging framework for transmitting data. One of the many innovative contributions to the software world to emerge from LinkedIn, Kafka was open-sourced in 2011 and donated to the Apache Software Foundation. It is a powerful, fault-tolerant, distributed messaging framework that allows Datazoom’s technology to process a high volume of data with sub-second latency. With Kafka, Datazoom can capture, standardize, and route data generated across the video-tech stack into a more robust dataset that serves as a premium “data fuel”.
There are many tutorials about Kafka on the web if you want to go deeper. In this post, we want to provide a high-level overview of some of the key Kafka components that Datazoom employs today:
- Producers: Datazoom uses beacons to capture data from different sources. Beacons hand this information off to Producers, which package it into messages and send them into Kafka.
- Consumers: AKA Datazoom “Connectors”. These subscribe to incoming Kafka messages, transform them, and send them to 3rd-party destinations.
- Brokers: These are the servers which manage the messaging between producers and consumers. Brokers do not keep track of cluster membership and coordination on their own, so they are managed by ZooKeeper.
- Partitions: A topic is split into partitions, and the number of partitions determines how many messages can be processed in parallel on that topic. It is recommended to keep the number of partitions equal to the total number of consumer threads, so that no thread sits idle and none is overloaded. Partitions are distributed across multiple brokers for fault tolerance.
- Streams: This is real-time processing of a continuous stream of data. Datazoom uses the Kafka Streams library to transform incoming data, then aggregate it into different time-window slices for faster reporting.
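To make the idea of “time-window slices” concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams library and is not Datazoom’s production code; the event names, timestamps, and the 60-second window size are assumptions chosen for illustration. It shows the basic idea of a fixed, non-overlapping (“tumbling”) window: each event is assigned to the window its timestamp falls in, and events are counted per window.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, event_type) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, event_type in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, event_type)] += 1
    return dict(counts)


# Hypothetical (timestamp_in_seconds, event_type) pairs from a video player.
events = [
    (0, "play"),
    (12, "play"),
    (59, "pause"),
    (60, "play"),
    (61, "play"),
    (125, "pause"),
]

counts = tumbling_window_counts(events, window_seconds=60)
for (window_start, event_type), total in sorted(counts.items()):
    print(f"window starting at {window_start}s: {event_type} x{total}")
```

Because the aggregation is already done per window, a report only has to read a handful of pre-computed totals rather than scan every raw event.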
Now, let’s reconsider and enhance our definition of Kafka. Producers publish data; Consumers subscribe to data (think of it like an email newsletter). In this light, we can appreciate how Producers and Consumers are linked, and better understand what constitutes Datazoom Collectors and Connectors.
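The publish/subscribe relationship can be sketched with a toy, in-memory model. This is plain Python, not the real Kafka client API: the `Topic` and `Consumer` classes, the session keys, and the use of CRC32 to pick a partition are all assumptions made for illustration. The sketch shows two things: messages with the same key always land in the same partition, and each subscriber keeps its own read position, so every subscriber receives every message (just as every newsletter subscriber gets their own copy).

```python
import zlib


class Topic:
    """A toy, in-memory stand-in for a Kafka topic (illustration only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Hash the key so the same key always maps to the same partition.
        # CRC32 is a stand-in here, not Kafka's actual partitioner.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class Consumer:
    """A toy subscriber that remembers how far it has read in each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        new_messages = []
        for i, log in enumerate(self.topic.partitions):
            new_messages.extend(log[self.offsets[i]:])
            self.offsets[i] = len(log)  # Advance past everything just read.
        return new_messages


topic = Topic(num_partitions=3)
analytics = Consumer(topic)  # hypothetical connector
storage = Consumer(topic)    # hypothetical connector

first = topic.publish("session-42", "play")
second = topic.publish("session-42", "pause")
topic.publish("session-7", "play")

analytics_messages = analytics.poll()
storage_messages = storage.poll()
print(len(analytics_messages), len(storage_messages))
```

Keeping a separate offset per subscriber is what lets one published message feed several destinations without the producer knowing anything about them.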
The Data Broker
Now that we have a basis for understanding how Kafka works from a technical perspective, let’s add another layer: how Kafka facilitates real-time data routing. We know the Broker is a facilitator between the producer and the consumer, and that arranging multiple brokers creates a Kafka Cluster.
Now, a single broker in a given Cluster is not isolated from the other brokers. In fact, they are kept in sync with one another: data passing through one broker is mirrored on other brokers within the cluster, based on the replication factor of each topic. Larger clusters increase throughput (the volume of messages that can be moved per unit of time) and improve the odds that every message is delivered successfully. This is especially useful if there are multiple subscribers to a given message. In short: more servers means better data transmission, especially if you have many systems which use the same critical data.
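As a rough illustration of how a replication factor spreads copies of each partition across a cluster, here is a simplified round-robin placement in plain Python. It is not Kafka’s actual assignment algorithm, and the broker names, partition count, and replication factor are assumptions for the example.

```python
from collections import Counter


def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, wrapping around."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
    return assignment


# Hypothetical four-broker cluster hosting a four-partition topic.
brokers = ["broker-1", "broker-2", "broker-3", "broker-4"]
assignment = assign_replicas(num_partitions=4, brokers=brokers, replication_factor=3)

# How many partition copies each broker ends up holding.
replicas_per_broker = Counter(b for replicas in assignment.values() for b in replicas)

for partition, replicas in assignment.items():
    print(f"partition {partition}: {replicas}")
```

In this example every partition lives on three different brokers and the copies are spread evenly, so no single server holds the only copy of any data.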
Through this scheme, Kafka Clusters amass data collected from around the world and process it in an order based on the priority of a message to a consumer. This all occurs in less than a second and is the “spice” which gives Datazoom its low-latency data standardization capabilities. The destinations for the data emerging from the Cluster (analytics, advertising, audience, storage, and even internal solutions) all benefit from this clean, accurate, standardized data.
The multi-broker & multi-partition setup is also useful for scalability. As the volume of data transmitted between producers and consumers grows, spreading the load across multiple brokers makes it easier to keep message delivery reliable. If one broker goes down, much like in a parallel circuit, the “power” does not go out. At Datazoom, we employ a replication factor of 3 on partitions, so each partition is stored on three brokers. At the end of the day, it is Kafka’s scalability and throughput-enhancing qualities which converge to produce something absolutely essential when streaming video: a real-time feedback loop.
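The “parallel circuit” behavior can be sketched in a few lines of plain Python. This is a simplified model, not how Kafka itself elects leaders (it ignores details such as in-sync replica tracking), and the broker names and the placement of replicas are assumptions. Each partition has three replicas; when brokers fail, the first surviving replica takes over, and a partition only becomes unavailable when all three of its brokers are down.

```python
def elect_leaders(assignment, failed_brokers):
    """Pick the first surviving replica of each partition, or None if all are down."""
    failed = set(failed_brokers)
    leaders = {}
    for partition, replicas in assignment.items():
        live = [broker for broker in replicas if broker not in failed]
        leaders[partition] = live[0] if live else None
    return leaders


# Hypothetical placement: four partitions, replication factor 3, four brokers.
assignment = {
    0: ["b1", "b2", "b3"],
    1: ["b2", "b3", "b4"],
    2: ["b3", "b4", "b1"],
    3: ["b4", "b1", "b2"],
}

one_down = elect_leaders(assignment, failed_brokers=["b1"])
two_down = elect_leaders(assignment, failed_brokers=["b1", "b2"])
three_down = elect_leaders(assignment, failed_brokers=["b1", "b2", "b3"])

print(one_down)    # every partition still has a leader
print(two_down)    # every partition still has a leader
print(three_down)  # partition 0 has lost all three of its replicas
```

With a replication factor of 3, any two brokers can fail and every partition in this model still has a copy to serve from.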
We want to take some time to explore some of the ways Datazoom enhances Kafka’s powerful streaming infrastructure. Docker and Kubernetes are two technologies we use to set up containers and ensure high availability. In our next article, we’ll share a little bit of the “how”.
Michael Skariah is the CTO of Datazoom. He’s head-honcho of the Datazoom Engineers. When he isn’t leading the techies behind the world’s first Adaptive Video Logistics Platform, you can catch him spending time with his family and volunteering.