The journey to advanced analytics
Advanced Analytics is a feature designed to help Sendbird users analyze and gain insight into the usage behavior of their applications. It is currently offered as a premium feature and provides metrics around the messages, channels, and users of a given application, with segmentations on various custom types.
Advanced Analytics was not called Advanced Analytics in the beginning. The original implementation of these analytical metrics was based on actual database entries representing entities such as messages, channels, and users. When an application created users, created channels, or sent messages, these entries were created or updated in the relational database.
If an application user was deleted or deactivated, that user's entry was updated to reflect the action. To provide analytics on this data, a cron job ran regularly, fetching these entries, performing aggregations (such as simple summation, depending on the metric type), and updating separate database tables with the results. These metrics were then served to the customer via our customer-facing dashboard and platform API on a per-metric basis.
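The original cron job's aggregation can be sketched as a simple summation over database rows. This is a minimal illustration, assuming hypothetical field names; it is not Sendbird's actual schema or code.

```python
from collections import Counter
from datetime import date

# Hypothetical rows as the cron job might have fetched them from the
# relational database; the field names are illustrative assumptions.
messages = [
    {"app_id": "app-1", "custom_type": "promo", "created_at": date(2020, 3, 1)},
    {"app_id": "app-1", "custom_type": "chat",  "created_at": date(2020, 3, 1)},
    {"app_id": "app-1", "custom_type": "chat",  "created_at": date(2020, 3, 2)},
]

def aggregate_daily_message_counts(rows):
    """Simple summation per (app, day, custom type) over current DB rows."""
    counts = Counter()
    for row in rows:
        counts[(row["app_id"], row["created_at"], row["custom_type"])] += 1
    return counts

daily = aggregate_daily_message_counts(messages)
```

Note that a count computed this way reflects only the rows that exist at query time; a deleted message row silently disappears from the total, which is one reason historical correctness was hard to guarantee with this approach.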
Problems with the original approach
There were several problems with this approach. With a growing number of use cases and further analytics we wanted to add, the implementation was neither extensible nor scalable. There were also concerns about data quality: the database entries representing the entities did not capture historical changes, which made backfill scenarios impossible and, in turn, prevented us from providing correct metrics to customers.
If a new metric was to be supported, a new platform API had to be built to serve it, since the existing APIs were served per metric. This in turn meant a schema change or a new database table to track the metric in our internal database. Not only did this lead to more engineering work, it also made for a poor customer experience, as customers had to onboard with a new API every time a metric was added.
Another problem was that we wanted to provide metrics at varying cadences, such as monthly, and again, this would have meant more changes to the database. And as our customer base and usage grew, we foresaw that the cron job would not scale to handle the load and would burden our relational database.
Creating a new pipeline
Given these limitations, we made this a high priority and decided to build an entirely new pipeline and offer it as a whole new feature. The new pipeline was based on logs created at the point in time of user activity, and this data would be used to compute the original metrics. Instead of providing separate APIs, a single platform API would look up a single database table for all metric types.
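The "one table, one API" idea can be sketched as follows. This is a simplified illustration using SQLite; the table layout, column names, and sample values are assumptions for the sketch, not Sendbird's actual schema.

```python
import sqlite3

# A single metrics table keyed by (app, metric type, date). Adding a new
# metric means adding rows, not a new table or a new API endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        app_id      TEXT,
        metric_type TEXT,   -- e.g. 'messages', 'channels', 'users'
        date        TEXT,
        value       INTEGER,
        PRIMARY KEY (app_id, metric_type, date)
    )
""")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("app-1", "messages", "2020-03-01", 1200),
        ("app-1", "channels", "2020-03-01", 35),
    ],
)

def get_metric(app_id, metric_type, start, end):
    """One lookup path serves every metric type."""
    cur = conn.execute(
        "SELECT date, value FROM metrics "
        "WHERE app_id = ? AND metric_type = ? AND date BETWEEN ? AND ?",
        (app_id, metric_type, start, end),
    )
    return cur.fetchall()
```

Because the metric type is a column value rather than part of the schema, new metrics land in the same table and are served by the same API without any customer-facing change.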
To reach the performance we needed, we made major optimizations. We had previously formatted our logs as JSON, but this format was not optimal for consuming the data and extracting what was needed. Our first approach was to use AWS Athena, which we had readily available, but given its limits on concurrency and on resource allocation between production and ad-hoc queries, we sought a different processing engine. As a result, the metric computation job was built as a Spark job (using AWS Glue), whose distributed processing can handle the growing data size as well as more complex logic in metric computation.
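The shape of a log-based metric computation can be shown with a small stand-in. The production job runs as distributed Spark on AWS Glue; plain Python is used here only to illustrate the logic, and the event names and fields are illustrative assumptions.

```python
from collections import defaultdict

# Point-in-time activity logs (illustrative fields). Unlike mutable DB
# rows, these records are immutable, so recomputation and backfill over
# any date range yield consistent results.
logs = [
    {"app_id": "app-1", "event": "message_sent", "user_id": "u1", "date": "2020-03-01"},
    {"app_id": "app-1", "event": "message_sent", "user_id": "u1", "date": "2020-03-01"},
    {"app_id": "app-1", "event": "message_sent", "user_id": "u2", "date": "2020-03-01"},
]

def daily_message_senders(events):
    """Count distinct message senders per (app, day) from activity logs."""
    senders = defaultdict(set)
    for e in events:
        if e["event"] == "message_sent":
            senders[(e["app_id"], e["date"])].add(e["user_id"])
    return {key: len(users) for key, users in senders.items()}
```

In Spark this is essentially a groupBy over (app, date) with a distinct count, which parallelizes naturally across partitions as the data grows.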
Also, with logs in JSON format, the amount of data that needed to be read and scanned was unnecessarily large, and queries took far too long. After some research, we switched our logs to the Parquet format, and thanks to its columnar layout, the amount of data scanned decreased by 65% on average. On top of that, compacting many small log files into larger chunks, which reduced the number of files read per partition, improved performance by another ~70%, along with a significant S3 cost reduction.
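The compaction step can be illustrated with a toy example: many small files in a partition are merged into one larger file so that readers open far fewer objects. The real pipeline does this for Parquet files on S3; JSON-lines files in a local directory stand in here, and all names are assumptions for the sketch.

```python
import json
import os
import tempfile

def compact_partition(partition_dir, out_name="compacted.jsonl"):
    """Merge every small .jsonl file in a partition into a single file."""
    records = []
    small_files = sorted(
        f for f in os.listdir(partition_dir)
        if f.endswith(".jsonl") and f != out_name
    )
    for name in small_files:
        path = os.path.join(partition_dir, name)
        with open(path) as fh:
            records.extend(json.loads(line) for line in fh)
        os.remove(path)  # small files are replaced by the compacted one
    out_path = os.path.join(partition_dir, out_name)
    with open(out_path, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    return out_path, len(records)

# Demo: write three tiny one-record files, then compact them into one.
demo_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo_dir, f"part-{i}.jsonl"), "w") as fh:
        fh.write(json.dumps({"event": "message_sent", "seq": i}) + "\n")

path, n = compact_partition(demo_dir)
```

The win comes from per-file overhead: each file read is a separate request (a separate S3 GET in production), so collapsing many tiny files into a few large ones cuts both latency and request costs.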
As a result of this new pipeline, customers had less error-prone metrics readily available through a single API, and there was room to support more extensive types of metrics. In addition to the daily metrics, monthly metrics were provided so customers could see the bigger picture and patterns of their application usage.
By reducing the amount of data to be read and converting the input log files into a format suited to the metric processing job, we gained the performance needed to calculate metrics for all applications, not just those with the feature enabled. That way, whenever customers opt in to the feature, metrics are available retroactively, going back to the feature's launch. In March 2020, the original analytics metrics were completely replaced by metrics computed from this pipeline, and this was the birth of Advanced Analytics version 1.0.
Soon after Advanced Analytics version 1.0, we prioritized adding a new metric called message_viewers, on the rationale that it must be available alongside message_senders for customers to get a full picture of the users actively using their application. With the addition of this new metric, we launched Advanced Analytics version 1.5.
The future of Advanced Analytics
As our customer base grows every year, we continue to improve Advanced Analytics. With the current time resolution of metrics being daily and monthly, we are considering how to provide metrics at a more granular level and more frequently. We have also realized that the use cases of Sendbird applications vary across the spectrum, so the analytics customers want can differ widely. Providing customized advanced analytics is another challenge we aim to solve, helping our customers gain more detailed insight and make business decisions accordingly.