At EXADS, real-time events are an important data source for our SaaS platform. Every impression, click, action, bid, etc. has to be tracked and stored appropriately. Building a system capable of handling tens of billions of such events daily is not a trivial task.
Events generated by our systems have to be processed and transported to different databases to support our main use cases.
A few years ago we started to design and build a new system to replace an old monolithic application. The old system updated our databases using rsync to move logs across the different servers.
The new solution had to be built with a clear set of goals in mind.
To achieve those goals, we evaluated the possibility of using a cloud-based solution, but since we already own and manage our infrastructure, adding the new system to our existing setup was the most cost-effective option.
The next step was to find an open source event streaming platform that could suit our needs. Apache Kafka has evolved considerably over the last few years, proving itself to be a reliable solution for heterogeneous data streaming while providing outstanding throughput and performance; it met our requirements. Finally, its robust ecosystem, especially its connectors and client libraries, future-proofs the system and facilitates our ongoing migration to a microservices architecture.
Our web application currently serves the right ad for each request in less than 10 milliseconds, and once an ad is selected, a log entry is created in JSON format. Adding another library or external communication to the request path could negatively impact this impressive accomplishment.
To avoid any impact on our web servers, we decided to keep the JSON output and create a microservice running locally which collects all those messages and sends them to Kafka. This solution eliminates the need to wait for a reception acknowledgment from Kafka before the next message: the response is sent back immediately and the message is added to a buffer. This greatly reduces the response time and improves batching and the system's overall availability. Finally, messages are sent using Avro binary encoding, reducing their size.
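The buffering idea can be sketched in a few lines. This is a minimal illustration, not our actual implementation: the class name, parameters, and the `send_batch` callback (which would wrap a real Kafka producer) are all hypothetical.

```python
import json
import queue


class LocalCollector:
    """Sketch of a local collector: the web server hands over a JSON log
    line and gets an immediate acknowledgement; messages accumulate in a
    buffer and are later forwarded to Kafka in batches."""

    def __init__(self, send_batch, batch_size=100):
        self.send_batch = send_batch  # hypothetical Kafka producer wrapper
        self.batch_size = batch_size
        self.buffer = queue.Queue()

    def accept(self, json_line: str) -> bool:
        """Called by the web server: enqueue and acknowledge immediately,
        without waiting for Kafka."""
        self.buffer.put(json.loads(json_line))
        return True

    def flush(self) -> int:
        """Drain up to batch_size messages and forward them in one batch."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.buffer.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.send_batch(batch)
        return len(batch)
```

In production this flush would run on a timer or size threshold in a background thread, so the request path never blocks on the broker.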
From the start, we specified a single, unified format for the events we generate and process. This has greatly reduced the complexity of consuming events and kept consumers and producers aligned.
Every kind of event has a standard envelope and payload. The envelope contains the context of the event (time, server, event type, etc.), while the payload is the event data in the same shape as in our previous monolithic system. This enabled us to implement the new system quickly, without significant changes to the producers or to the storage of those events.
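For illustration, an event under this split might look like the following. The field names here are assumptions chosen for the example, not our actual schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical event: the envelope carries context, the payload carries
# the event data unchanged from the old monolithic format.
event = {
    "envelope": {
        "timestamp": datetime(2022, 9, 21, tzinfo=timezone.utc).isoformat(),
        "server": "web-01",
        "type": "impression",
    },
    "payload": {
        "campaign_id": 42,  # assumed payload fields, for illustration only
        "zone_id": 7,
    },
}

encoded = json.dumps(event)
```

Because consumers only need to understand the envelope to route an event, the payload can evolve independently.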
Kafka works in conjunction with ZooKeeper to form a complete Kafka Cluster — with ZooKeeper providing the distributed clustering services, and Kafka handling the actual data streams and connectivity to clients.
Our Infrastructure Team was already used to managing ZooKeeper clusters, which facilitated the deployment of Kafka.
As usual with our systems, we created a production and a development environment. The creation of both clusters was straightforward and scalable as needed.
Moreover, AKHQ was installed to simplify Kafka management. AKHQ provides a graphical user interface for Kafka to manage topics, topic data, consumer groups, the schema registry, and Kafka Connect, among other things.
At the beginning of this article we discussed how storing all those events was critical for different needs of the company, and how different solutions serve those different needs.
We decided to create a specific ingestor for each of the databases we use, so that each one ingests only the parts of an event that are relevant to its use case.
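The per-ingestor idea amounts to field projection: each ingestor declares the parts of an event it cares about and discards the rest. The sketch below is a simplified assumption of how this could look; the field sets and event shape are hypothetical.

```python
# Hypothetical field sets: each ingestor keeps only what its target
# database needs (names are assumptions, not the real schema).
CLICKHOUSE_FIELDS = {"timestamp", "type", "campaign_id", "zone_id"}
BALANCES_FIELDS = {"type", "campaign_id", "cost"}


def project(event: dict, fields: set) -> dict:
    """Flatten an envelope/payload event and keep only the keys a
    given ingestor needs."""
    flat = {**event.get("envelope", {}), **event.get("payload", {})}
    return {k: v for k, v in flat.items() if k in fields}
```

Keeping this projection in the consumer, rather than in the producer, means a new database can be added later without touching the event format.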
Our Admin Panel relies on ClickHouse to provide real-time data statistics to our clients. A new tool, ClickHouse Data Consumer, was developed to consume events from Kafka (while keeping backward compatibility with reading events from a JSON file) and ingest them into the various ClickHouse clusters (development and production).
Keeping track of balances is another key functionality. Another consumer was created, Balances Data Consumer, which updates user and campaign balances stored in MariaDB. This consumer only collects events that affect balances. All those events are then grouped together and the balances for each campaign are updated regularly.
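The grouping step can be sketched as a simple aggregation: sum the cost of balance-affecting events per campaign, then apply one update per campaign instead of one per event. The event shape, field names, and the set of balance-affecting types below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical set of event types that spend budget.
BALANCE_EVENTS = {"click", "action"}


def aggregate_balances(events):
    """Group balance-affecting events and sum their cost per campaign,
    so MariaDB receives one periodic update per campaign."""
    deltas = defaultdict(float)
    for ev in events:
        if ev["type"] in BALANCE_EVENTS:
            deltas[ev["campaign_id"]] += ev["cost"]
    return dict(deltas)
```

Batching updates this way keeps write pressure on MariaDB proportional to the number of active campaigns rather than the number of raw events.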
Finally, one last consumer was developed, HDFS Data Consumer, to upload the events to Hadoop. This cluster is used for data analytics and requires storing all the events we generate.
It was paramount to build an event processing system that could fulfil our current and future needs, paying special attention to its efficiency and scalability. Choosing the right building blocks is essential for this kind of task, and Apache Kafka has proven to be an excellent solution for this enterprise.
There is still a lot we want to achieve, and we are already discussing our next steps. We are analysing how we could split the events into different messages and use a schema registry to store the different versions of the messages. This will enable us to be more flexible when it comes to new releases and facilitate new deployments.
Business Intelligence Director