How we do BI at Cytora - Part II

Xiao Liu
Engineering at Cytora
6 min read · Mar 2, 2022

--

About the authors:

Xiao Liu is a Senior Software Engineer and the backend tech lead on Cytora's Underwriting Productivity team. He led the Business Intelligence team at Cytora before joining Underwriting Productivity.

Nicole Gillett is a former Software Engineer at Cytora, where she was part of the Underwriting Productivity team as well as Business Intelligence.

1. Recap

In the previous blog post we described the BI architecture we use at Cytora and gave some examples of how BI has made a difference to the lives of Cytorians and our customers. We also touched on topics like billing, and on a case study of how BI has been pivotal to one of our products, Underwriter Productivity.

In this blog post, we will take a deeper look at the Serverless technology that underpins our infrastructure and walk through the challenges we faced and the technical decisions we made on our BI journey.

2. Why Serverless

Serverless and managed cloud resources bring many benefits, especially in a start-up environment with limited developer resources and a strong focus on fast feature delivery: they require far less infrastructure maintenance, so development teams can spend more of their time on features. That is why, when we first designed Cytora's BI infrastructure, we chose a Serverless architecture built on managed resources wherever possible.

Concretely, the BI ingestion service runs on AWS Lambda for compute, BI topics and queues are backed by SNS and SQS, and Google Cloud Storage and BigQuery back the BI data lake. With the service in production, we rarely have to worry about a downed container or a failed database instance.

Another benefit of AWS Lambda is that a non-constant workload can yield significant cost savings. For example, if your customers are based in one region rather than spread globally, the service barely receives any requests during the night. AWS Lambda and other Serverless compute resources are billed per request duration, which means you don't pay for the time when no requests are made to the service, unlike long-running Kubernetes clusters and VM instances. For long stretches of time, we pay nothing at all for Serverless compute.
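
As a rough, back-of-the-envelope illustration using AWS's published on-demand prices at the time of writing (about $0.0000166667 per GB-second of compute plus $0.20 per million requests): a function with 512 MB of memory serving one million 200 ms requests a month works out to roughly 100,000 GB-seconds, or about $1.87 in total, and exactly nothing during the hours when no requests arrive.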

3. Challenges of BI
3.1 Single source of truth

One of the challenges of BI is having a single source of truth that we can refer to throughout the BI lifecycle. For data analytics, the old adage applies: "rubbish in, rubbish out", so data quality is always on our minds. That's why we have been using a Protobuf definition as the single source of truth for generating and validating data and for creating BigQuery tables and Looker data models. A single source of truth ensures that we collect the same data at every point in the BI flow, and it gives users of the data confidence that we are capturing the correct data points.

For example, if we start recording an extra data point, we need to populate a new field at every level of the BI infrastructure. Validating the definition at each level against the single source of truth ensures data consistency. We also keep a record of previous versions of the schema definition, using major and minor version numbers to make the history clear for developers. Separating versions avoids situations where the new field silently lacks historical data points.
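
As a hedged, hypothetical illustration of such a change (the message and field names below are ours, not a real Cytora schema), adding a data point means appending a field under a fresh tag number, so every message written under the old version remains valid:

// v1.1.0 of a hypothetical schema: user_agent is new. Messages produced
// under v1.0.0 simply leave it unset, so historical rows remain valid, and
// the version bump records exactly when the field started carrying data.
message RequestReceived {
  int32 status_code = 1;
  // New in v1.1.0. Never reuse or renumber an existing tag.
  string user_agent = 2;
}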

We use a Protobuf definition as our single source of truth, namespaced by service and stored in our BI message definition store. Golang or Python code is generated from the definition and used to compose BI messages in individual services. The BI ingestion service uses the Protobuf definition to validate incoming messages against the defined schema. In the BI data lake, BigQuery tables are created from the schema generated from the definition, and in BI analytics the schema is used to build data models. This single source of truth ensures that every BI message arrives in good shape.

In detail, a service team first creates a BI message definition as a Protobuf file, then uses "protoc-gen-go" to generate Go files with struct declarations. When a message arrives at the BI ingestion service, the service validates it against the Protobuf definition. CI automatically creates the BigQuery tables from the same definition using "protoc-gen-bq-schema".
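
As a hedged sketch of what that looks like from a service's point of view (the import path, topic ARN, and message type are illustrative placeholders, not Cytora's actual code), a service composes the generated struct, marshals it, and publishes it to the BI topic:

package main

import (
    "context"
    "encoding/base64"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sns"
    "google.golang.org/protobuf/proto"

    // Hypothetical import path for the protoc-gen-go output of a
    // definition like the one shown below.
    bipb "example.com/cytora/bi/gen/platformhellogo/v1"
)

func main() {
    ctx := context.Background()

    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatalf("load AWS config: %v", err)
    }
    client := sns.NewFromConfig(cfg)

    // Compose the BI message from the generated struct.
    msg := &bipb.RequestCompleted{StatusCode: 200}

    // Protobuf wire format is binary, so base64-encode it for the SNS body.
    raw, err := proto.Marshal(msg)
    if err != nil {
        log.Fatalf("marshal BI message: %v", err)
    }

    if _, err := client.Publish(ctx, &sns.PublishInput{
        TopicArn: aws.String("arn:aws:sns:eu-west-1:123456789012:bi-events"), // placeholder
        Message:  aws.String(base64.StdEncoding.EncodeToString(raw)),
    }); err != nil {
        log.Fatalf("publish BI message: %v", err)
    }
}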

Below is an example of our Protobuf definition for our example service "platform-hello-go". Since it is an example service, we capture only the status code of each request; in a real production service we capture much more information, depending on business needs.

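The original post showed the definition as an image; below is a hedged reconstruction of what it looked like, following the conventions of "protoc-gen-bq-schema" (the exact package, option, and field names are our guesses rather than Cytora's real schema):

// Hedged reconstruction; identifiers are illustrative, not Cytora's own.
syntax = "proto3";

package bi.platform_hello_go.v1;

import "bq_table.proto"; // ships with protoc-gen-bq-schema

// Hypothetical module path for the generated Go code.
option go_package = "example.com/cytora/bi/gen/platformhellogo/v1;platformhellogov1";

// RequestCompleted is emitted once per handled request. As an example
// service, platform-hello-go captures only the response status code.
message RequestCompleted {
  option (gen_bq_schema.bigquery_opts).table_name = "platform_hello_go_request_completed";

  // HTTP status code returned to the caller.
  int32 status_code = 1;
}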

3.2 Large volumes of data

Our BI infrastructure handles millions of messages each month. Thanks to our fully elastic Serverless architecture, the end-to-end journey of a single message usually takes less than a second. However, in any queue-based architecture where producers add events to a queue and a consumer processes them, it is crucial that the consumer's capacity keeps up with the queue's growth. Otherwise you end up with an ever-growing queue that will never be fully processed.
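
To make that concrete with illustrative numbers: if producers enqueue 100 messages per second and a single invocation takes 50 ms per message, the consumer needs a sustained concurrency of at least 100 × 0.05 = 5 just to keep pace; anything less and the backlog grows without bound.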

Simply put, we need to make our message processor fast enough. One optimisation we lean on heavily is, as AWS puts it in its Lambda best-practices documentation, to "take advantage of execution environment reuse to improve the performance of your function". By initialising clients and database connections outside the Lambda handler, and caching files in /tmp, we reuse them across invocations of the same instance and avoid repeating expensive API calls. We also noticed that a Lambda instance persists for much longer than we had expected, which makes this reuse all the more valuable.
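
A minimal sketch of that pattern in Go, assuming a hypothetical GCP project ID and using a BigQuery client as a stand-in for an expensive-to-create dependency (the real ingestion service's clients may differ):

package main

import (
    "context"
    "log"

    "cloud.google.com/go/bigquery"
    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
)

// Created outside the handler, so it survives for the lifetime of the
// execution environment and is reused across invocations; we pay the
// connection setup cost only on cold start.
var bq *bigquery.Client

func init() {
    var err error
    bq, err = bigquery.NewClient(context.Background(), "my-gcp-project") // placeholder
    if err != nil {
        log.Fatalf("create BigQuery client: %v", err)
    }
}

func handler(ctx context.Context, ev events.SQSEvent) error {
    for _, record := range ev.Records {
        // Decode and validate record.Body here, then write it to the data
        // lake with the long-lived client (e.g. via bq.Dataset and Table).
        log.Printf("processing message %s", record.MessageId)
    }
    return nil
}

func main() {
    lambda.Start(handler)
}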

Another observation is that although Lambda only lets you configure a function's memory size, more memory also brings more CPU capacity and therefore faster event processing. Since a higher memory allocation costs more per second, it is worth benchmarking your Lambda function at different memory sizes to find the most efficient, cost-effective setting. With such experiments, it is sometimes possible to process events faster and pay a smaller bill.

Last but not least, since we use SQS as the event source, it is essential to set the queue's visibility timeout larger than the Lambda function's execution timeout or expected execution time (usually about twice as large). Otherwise, the visibility timeout can expire while the Lambda is still processing an event, at which point SQS makes the event visible again and it gets processed multiple times. This creates considerable overhead for processing and de-duplicating events.
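
For illustration, a hedged Terraform sketch of that relationship (names and values are placeholders, and the function's runtime, handler, role, and package are omitted for brevity):

resource "aws_sqs_queue" "bi_events" {
  name = "bi-events"
  # Twice the consumer's timeout, so in-flight messages do not become
  # visible again while the Lambda is still working on them.
  visibility_timeout_seconds = 120
}

resource "aws_lambda_function" "bi_ingestion" {
  function_name = "bi-ingestion"
  timeout       = 60
  # ... runtime, handler, role, and deployment package omitted ...
}

resource "aws_lambda_event_source_mapping" "bi_ingestion_sqs" {
  event_source_arn = aws_sqs_queue.bi_events.arn
  function_name    = aws_lambda_function.bi_ingestion.arn
}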

These are the three major optimisations in our BI infrastructure. By applying some of the Lambda best practices that AWS suggests, we have been able to build a BI pipeline with both high throughput and low latency.

4. Future work
4.1 Storage class optimisation and cost-saving

We store every single message from every service across Cytora. This is already a huge amount of data, and it keeps growing rapidly. One optimisation we have been planning is to use tiered storage classes for our data, moving old, rarely accessed data into a colder but cheaper storage class and saving cost. AWS already offers Intelligent-Tiering, and GCP provides Object Lifecycle Management that can be used to achieve the same goal.
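
As a hedged sketch of the GCP side (the bucket name and age threshold are placeholders), a lifecycle rule that moves objects older than a year to Coldline looks like this:

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 365}
      }
    ]
  }
}

It would be applied to a bucket with, for example, gsutil lifecycle set lifecycle.json gs://bi-data-lake.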

4.2 Self-service BI integration and analytics

Looking further ahead, we will be focusing on self-service BI integration and analytics. The BI team should spend its time maintaining the BI infrastructure and tooling to improve the developer experience, rather than helping individual teams integrate with BI and generate analytics. BI should be self-service, and data should be broadly accessible internally.

To deliver these objectives, we need to provide comprehensive self-service documentation together with self-service-focused tooling such as a BI CLI. At the same time, the BI team will regularly host BI workshops and lightning talks to share knowledge, encourage adoption, and train teams across the company in BI capabilities.

5. Conclusions

We hope you have enjoyed this adventure through the architecture of BI at Cytora! We have covered some of the challenges we faced during the design and shared some of our tried-and-tested solutions. Key techniques we touched on include using Protobuf definitions as a single source of truth and optimising an event-driven, Serverless architecture. We also looked at some of the users of BI, both external customers and internal Cytorians. We hope we have inspired you to try something new with your own BI architecture!
