Realtime data, from MVP to the big time

Bruce Pannaman
Busuu Tech
Oct 11, 2017


Intro

Over the last year at busuu, I have been trying to push the frontier of how we use and process our data: from simple business reporting towards using it to give users a better experience. Knowing how our users engage with a feature, or running A/B tests, is very useful for improving the product, increasing monetisation or highlighting what to work on next. But in the age of personalised Spotify playlists, Amazon purchase recommendations and tailored Netflix binge sessions, we can do better.

At a basic level busuu is similar to each of these players: we all deliver valuable content to our users to consume or buy, and users have developed a taste, even a thirst, for personalisation of their online experience.

From the start, the key has been to get the basics right: no matter what, the old paradigm of “throw rubbish in and you will get rubbish out” holds true for any data process or infrastructure. So we set out to engineer a framework and a scalable environment in which data could be accepted and processed in real time through streams and event processors, allowing us to use data in real time to generate insights, inform users, and provide analytics to our third-party tools.

I spoke at Snowplow Analytics London about my plans for this stack and the state of affairs at the MVP stage (link here — https://www.youtube.com/watch?v=wSmTC1JDSyc&t=852s). In this blog post I would like to talk about the results: both the wins of this project and the lessons I learned along the way.

My Hypotheses

  • A centralised system for collecting, validating, processing, distributing and storing data will result in better data quality.
  • Real time data will provide better machine learning predictions for use within a user’s session.
  • Tracking everything once and distributing the data to third parties after processing will reduce SDKs & reduce frontend team dev time spent on tracking.

Implementation

Due to the experimental nature of this project, I used far more managed services from AWS than our company usually would. This reduced the dev time needed from the DevOps team and allowed me to move more quickly. Below is a basic diagram of what the stack ended up looking like.

AWS Serverless Architecture

One of the positive lessons from this project was how trying to be as serverless as possible resulted in more flexibility, better scalability and reduced cost. Since Ken Fromm brought this idea to the fore in 2012, systems engineers have been trying to build in the flexibility of serverless apps and stacks whilst reducing vendor lock-in and retaining control. What we have found is that the deeper the abstraction, the more you can concentrate on the code and the less you have to worry about the hardware.
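
To make that concrete, here is a minimal sketch of what the compute side of a serverless stack can look like: an AWS Lambda handler consuming a batch of Kinesis records in Python. The event shape is the standard Kinesis-to-Lambda payload; the processing step is just a placeholder, not our actual enrichment logic.

```python
import base64
import json

def handler(event, context):
    """Minimal AWS Lambda handler for a Kinesis trigger.

    Lambda delivers a batch of records and each payload arrives
    base64-encoded. The print below stands in for whatever processing
    or fan-out your stack actually does.
    """
    processed = 0
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        doc = json.loads(payload)  # assuming JSON payloads for this sketch
        # Placeholder: replace with real validation / enrichment / forwarding.
        print(doc.get("event_name"))
        processed += 1
    return {"records_processed": processed}
```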

That being said, this project surfaced some significant findings when it comes to adopting serverless methodologies.

Pros

  • You only pay for what you use, so costs are potentially lower in production, depending on how your traffic is distributed over time.
  • Stability is much better. Because Amazon deals with all of your hardware, and sometimes uses entirely different hardware for each instantiation of the code, hardware problems and availability are handled for you, and you can concentrate on your code.

Cons

  • We have found that logging becomes a massive issue when dealing with AWS Lambda specifically. Amazon’s managed logging service, CloudWatch, becomes ridiculously expensive at high data volumes. To solve this we set up a syslog server for debugging and built our own testing suite to check the status of the stack (see the sketch after this list).
  • Vendor lock-in is a key concern for your CTO on a project like this, and admittedly a lot of off-the-shelf AWS architecture is used here. However, because this worry is so common, libraries that deploy your code to any of the major cloud providers are reaching maturity. A good example is Serverless — https://github.com/serverless/serverless. This deployment framework can switch cloud provider with one change to your config file, meaning once again you can concentrate on your code and outsource the hardware to Amazon, Microsoft or Google if you want.
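
As a rough illustration of the logging workaround mentioned in the first con, the sketch below ships log lines to a self-hosted syslog server using Python’s standard-library SysLogHandler instead of CloudWatch. The hostname and port are placeholders for your own syslog endpoint.

```python
import logging
from logging.handlers import SysLogHandler

# Placeholder endpoint for the self-hosted syslog server described above.
SYSLOG_HOST = "syslog.internal.example.com"
SYSLOG_PORT = 514

logger = logging.getLogger("realtime-pipeline")
logger.setLevel(logging.INFO)
# SysLogHandler uses UDP by default, so an unreachable syslog box will not block the code.
logger.addHandler(SysLogHandler(address=(SYSLOG_HOST, SYSLOG_PORT)))

def log_event(stage, message):
    """Send a simple key=value log line to the central syslog server."""
    logger.info("stage=%s message=%s", stage, message)

log_event("enrich", "schema validation passed")
```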

Snowplow

The core concept of this infrastructure came from Snowplow Analytics’ open-source event pipeline, which we really liked for its built-in schema validation, data enrichment and data collection SDKs. Snowplow’s instructions made it easy to select which of their components we needed, and their code built first time without a problem. As long as you are careful with your schema and configuration file syntax, there are very few errors in production. Try the snowplow-mini repo; it is a useful tool for spinning up quick integration environments — https://github.com/snowplow/snowplow-mini.
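
For a sense of what the collection side looks like, here is a rough sketch of sending a custom, schema-validated event into a collector with the Snowplow Python tracker. The collector host, app id and schema URI are placeholders, and the tracker’s constructor signatures have varied across SDK versions, so treat this as a sketch rather than copy-paste code.

```python
# Sketch only: constructor signatures differ between snowplow_tracker versions.
from snowplow_tracker import Emitter, Tracker, SelfDescribingJson

emitter = Emitter("collector.example.com")       # placeholder collector endpoint
tracker = Tracker(emitter, app_id="busuu-web")   # app_id is illustrative

# Hypothetical self-describing event; the schema URI would point at one of
# the JSON Schemas the validator checks events against.
tracker.track_self_describing_event(SelfDescribingJson(
    "iglu:com.example/lesson_completed/jsonschema/1-0-0",
    {"lesson_id": 1234, "language": "es", "score": 0.92},
))
```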

This setup was not without issues, however. The enricher and validator are highly parallelised, which puts a lot of strain on wherever your schemas are hosted, since every event has to be validated against them. The Snowplow setup guide suggests hosting your validation schemas on S3, but if you host them locally on an Apache web server the performance increase is phenomenal (see below).

Events processed per minute increased once we stored our configuration file locally
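
The validation step itself is easy to picture with a toy example. The sketch below fetches a JSON Schema over HTTP, as it would be served from a local Apache box rather than S3, and validates a single event against it; the URL and event fields are made up, and this mimics only the validation step, not Snowplow’s actual enricher.

```python
import requests
from jsonschema import ValidationError, validate

# Hypothetical schema served by a local Apache box instead of S3, which is
# the change behind the throughput jump in the chart above.
SCHEMA_URL = "http://schemas.internal.example.com/lesson_completed/1-0-0.json"
schema = requests.get(SCHEMA_URL, timeout=2).json()

event = {"lesson_id": 1234, "language": "es", "score": 0.92}

try:
    validate(instance=event, schema=schema)  # raises if the event breaks the schema
except ValidationError as err:
    # In the real pipeline a failing event would be routed to a bad-rows stream.
    print(f"rejected: {err.message}")
```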

To ensure realtime processing, high parallelism is the way to chomp through all that data quickly enough. We achieve this in two ways:

  • Many EC2 instances within an autoscaling group
  • Many workers (processes) within each of these boxes managed by supervisor

This takes care of the processing power, but the good news continues: AWS Kinesis (via its client library) has lease handling built in, which not only ensures that each record is processed by exactly one of the attached workers, safeguarding against duplication, but also load-balances the data across the workers automatically. This means that as long as the workers are up and running and have the resources they need, Kinesis sorts everything else out.
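
For reference, this is what reading from the stream looks like at the lowest level: a bare boto3 polling consumer for a single shard. The production workers go through the Kinesis client library, which leases shards, checkpoints progress and load-balances across workers; this toy loop does none of that, and the stream and shard names are placeholders.

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # region is illustrative

# Single-shard toy consumer. The real workers use the Kinesis client library,
# which leases shards to workers and checkpoints progress for them.
iterator = kinesis.get_shard_iterator(
    StreamName="enriched-events",          # placeholder stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
    for record in resp["Records"]:
        payload = record["Data"].decode("utf-8")
        # Placeholder: parse the event and fan it out to downstream consumers.
        print(payload[:80])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)                        # stay within per-shard read limits
```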

How did my Hypotheses work out?

A centralised system for collecting, validating, processing, distributing and storing data will result in better data quality.

As soon as the production system was fully online and implemented by each of the data-creating platforms, this was the first thing we noticed as a company. The extra data brought in by this stack has given us new insights into our users and how they use our platform. It has been enhanced with extra metadata such as screen sizes, user agents and a dynamic number of event-specific parameters per event type, to name just a few. Once the number of users sending events to the stack was nearing its maximum (we had to wait for apps to update, for example), the number of events output to our data warehouse matched our expectations to a satisfactory degree, and we were quickly ready to use the data.

Real time data will provide better machine learning predictions for use within a user’s session.

Having good realtime data has allowed us to cache and update the user features that feed our prediction models. Ensuring that more features are available when needed means the predictions use the most up-to-date data, and the precision and accuracy of our models have exceeded 90%. Sessionisation is our next step, and we aim to build it in a way that helps every department give our users the best possible experience, even in their first session.
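
As an illustration of how cached features can stay fresh within a session, here is a hypothetical sketch using Redis as the feature store. The key layout, feature names and TTL are inventions for the example, not our actual schema.

```python
import redis

# Hypothetical feature cache; host, key layout and TTL are illustrative only.
cache = redis.Redis(host="features.internal.example.com", port=6379)

def update_user_features(user_id, event):
    """Fold a single realtime event into the user's cached features."""
    key = f"features:{user_id}"
    cache.hincrby(key, "events_this_session", 1)
    if event.get("event_name") == "lesson_completed":
        cache.hincrby(key, "lessons_completed", 1)
    cache.hset(key, "last_event_name", event.get("event_name", "unknown"))
    cache.expire(key, 60 * 60 * 24)  # keep the features around for a day

def load_user_features(user_id):
    """Read the cached features back for the prediction models."""
    raw = cache.hgetall(f"features:{user_id}")
    return {k.decode(): v.decode() for k, v in raw.items()}
```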

Tracking everything once and distributing the data to third parties after processing will reduce SDKs & reduce frontend team dev time spent on tracking.

This was not necessarily the case, as most of the third-party tools integrated into the apps have other purposes, such as push messaging, chat or device-level marketing attribution, so most of the SDKs have had to stay. However, these are becoming out-of-the-box functionalities rather than internal systems that our frontend developers have to maintain. The tracking data that gives these third-party tools insight into what to do is delivered to them server-side after validation and enrichment, ensuring better data quality and higher effectiveness.
