About CMCRC-SIRCA

CMCRC-SIRCA provides unrivalled expertise in data management, mining and visualisation. We conduct research and develop technology that improves the fairness and efficiency of global markets. This significantly improves the health and wealth of market participants.

Our clients

CMCRC’s clients include market operators, regulators and participants in various domains, including stock markets, derivatives markets, energy markets, cryptocurrency markets and health insurance markets. They include NASDAQ, JSE, Lorica Health, Thomson Reuters and the NSW Government.

CMCRC-SIRCA specialises in measuring and improving market efficiency and integrity, and produces research in these areas. Our clients use our research results to analyse their markets.

The challenge

To generate our research outputs, CMCRC-SIRCA obtains high volumes of data. For example, for stock markets, we receive trading data from 100+ global markets from data vendors including Thomson Reuters and Morningstar, going back 10+ years. New data is ingested on a daily basis, and 100+ market quality metrics must be calculated and the results displayed on our website in a timely manner. We also receive private data sets from clients on an ad hoc basis.

The solution

CMCRC-SIRCA has built the Market Quality Dashboard (MQD), a web-based product suite that allows our clients to visualise, compare and customise different market quality metrics. It is a general-purpose suite of tools that can be used for visualising broad date ranges and different markets, developing and running custom metrics, and bulk downloading results for offline analysis. We also generate custom case study reports, using Jupyter Notebooks, for clients with very specific mandates.

The MQD application consists of two main components:

  1. ETL processing backend
  2. Web application

The entire application is hosted in AWS. All of our infrastructure is managed as code in CloudFormation templates. We use a continuous delivery approach that automatically deploys changes to our code or infrastructure after they have been code reviewed and tested.

ETL processing backend

The main complexity of the MQD is the ETL process. We ingest data from a variety of data providers. The data ranges from high-frequency intraday data, such as stock market trading data made available in real time or at the close of trading, to low-frequency data, such as company financials published on a quarterly or semi-annual basis.

This data can be provided as ad hoc one-off dumps from clients (either copied to an S3 bucket, or using AWS Snowball for larger dumps), but more commonly it is downloaded on an ongoing basis from the data provider’s API, FTP server or S3 bucket. The input data is first downloaded to an S3 bucket. We then convert it to an internal binary format called Blueshift, a flat file format that has been optimised for our specific data access patterns. At the same time, we normalise the data and apply various market-specific rules using a rules engine. The converted Blueshift data is stored in a separate S3 bucket. We then calculate 100+ end-of-day and intraday market quality metrics for each instrument. These metrics are written as Jupyter notebooks and are mostly implemented in Python.
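As an illustration of the normalisation step, the following is a minimal sketch of how market-specific rules might be expressed as condition and transform pairs. It is not the MQD rules engine: the `Trade` record, the `Rule` type and both example rules are hypothetical and exist only to show the pattern.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Trade:
    """Hypothetical input record; real vendor feeds carry many more fields."""
    market: str
    symbol: str
    price: float
    currency: str


@dataclass(frozen=True)
class Rule:
    """A rule pairs a condition with a transform applied when it holds."""
    applies: Callable[[Trade], bool]
    transform: Callable[[Trade], Trade]


# Hypothetical rules, for illustration only.
RULES: List[Rule] = [
    # Normalise every symbol to trimmed upper case.
    Rule(lambda t: True,
         lambda t: replace(t, symbol=t.symbol.strip().upper())),
    # Convert prices quoted in pence (GBX) to pounds (GBP).
    Rule(lambda t: t.currency == "GBX",
         lambda t: replace(t, price=t.price / 100, currency="GBP")),
]


def normalise(trade: Trade, rules: List[Rule] = RULES) -> Trade:
    """Apply each matching rule in order and return the normalised record."""
    for rule in rules:
        if rule.applies(trade):
            trade = rule.transform(trade)
    return trade


raw = Trade(market="LSE", symbol=" vod ", price=250.0, currency="GBX")
clean = normalise(raw)
```

Keeping rules as data rather than hard-coded branches means a new market can be supported by adding entries to the list, without touching the conversion code.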

The metrics are implemented by our team of researchers, who are domain experts working in the field of financial research. The metric results are then written out to an AWS RDS database. This ETL workflow of downloading, converting and generating results is coordinated and monitored via a tool we have developed in-house called Workflow Engine.
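The coordination described above can be pictured as running stages in dependency order and recording a status for each. The sketch below is an assumption-laden illustration, not the Workflow Engine itself: the stage names, the `PIPELINE` graph and the `run_pipeline` function are hypothetical, and the stages are plain Python callables rather than containerised jobs. It uses the standard library `graphlib` module, so it needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Hypothetical stage graph: each stage maps to the stages it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "download": set(),
    "convert": {"download"},
    "metrics": {"convert"},
    "load_results": {"metrics"},
}


def run_pipeline(stages: Dict[str, Callable[[], None]],
                 graph: Dict[str, Set[str]] = PIPELINE) -> Dict[str, str]:
    """Run stages in dependency order, stopping at the first failure.

    Returns a status per attempted stage so a monitor can report progress.
    """
    statuses: Dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        try:
            stages[name]()
        except Exception as exc:
            statuses[name] = f"failed: {exc}"
            break  # downstream stages depend on this one, so do not run them
        statuses[name] = "succeeded"
    return statuses


def failing_convert() -> None:
    raise RuntimeError("bad input file")


ok_run = run_pipeline({name: (lambda: None) for name in PIPELINE})
bad_run = run_pipeline({
    "download": lambda: None,
    "convert": failing_convert,
    "metrics": lambda: None,
    "load_results": lambda: None,
})
```

In the failing run, `metrics` and `load_results` are never attempted, which mirrors the idea that results should only be written once every upstream stage has completed.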

The jobs run inside Docker containers on an ECS cluster, and the Workflow Engine is responsible for scheduling and monitoring the jobs via the ECS API. The jobs have been designed to be idempotent, so a job interrupted when a Spot instance is reclaimed can simply be rerun; this lets us run the ECS cluster on a Spot Fleet to reduce costs. The Workflow Engine also generates historical statistics on the memory and CPU usage of each job, which it uses to estimate the resources to reserve for future jobs and to increase our cluster utilisation.
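One plausible way to turn historical usage into a reservation is to take a high percentile of past peaks and add some headroom. The sketch below assumes that approach; the source does not say which statistic the Workflow Engine uses. The function names, the 95th percentile, the 20% headroom and the 512 MiB default are all hypothetical, and the calculation uses integer arithmetic throughout to avoid floating-point rounding surprises.

```python
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[int], pct: int) -> int:
    """Return the nearest-rank percentile of a non-empty sequence of integers."""
    ordered = sorted(samples)
    # Ceiling of pct * n / 100, computed with integers only.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def estimate_reservation(history_mib: Sequence[int],
                         pct: int = 95,
                         headroom_pct: int = 20,
                         default_mib: int = 512) -> int:
    """Estimate the memory (MiB) to reserve for the next run of a job.

    history_mib holds the peak memory of previous runs. With no history,
    fall back to a default so that new jobs can still be scheduled.
    """
    if not history_mib:
        return default_mib
    base = nearest_rank_percentile(history_mib, pct)
    # Add headroom and round up to a whole MiB.
    return (base * (100 + headroom_pct) + 99) // 100


# Hypothetical peak memory readings (MiB) from five earlier runs of one job.
history = [100, 120, 110, 400, 130]
reservation = estimate_reservation(history)
```

A high percentile guards against the occasional heavy run (the 400 MiB outlier here), while a lower percentile would pack the cluster more tightly at the cost of more out-of-memory retries. Because the jobs are idempotent, such retries are safe, which makes the trade-off tunable.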

Web application

The main MQD web application is a Django application which runs on an ECS cluster behind an ELB in order to achieve scalability, availability and zero-downtime blue-green deployments. The static content for the website (images, stylesheets, etc.) is hosted in a public S3 bucket sitting behind a CloudFront CDN. User management and authentication are handled by AWS Cognito.

Clients also have the ability to code and run their own metrics. We have a JupyterHub instance running, which allows users to launch their own Jupyter notebook Docker instances running on an auto-scaling ECS cluster. Users are able to read in Blueshift data to perform their analysis in Jupyter and we provide a library of off-the-shelf market quality metrics in the form of a mounted read-only EFS volume.

The results

We currently host over a petabyte of data from a range of sectors and markets across the globe. The Workflow Engine currently runs two million ETL jobs per month, allowing us to meet our SLAs for ingesting and processing data in a timely manner. Using AWS, we are able to rapidly scale out our infrastructure to meet ad hoc requests and spikes in demand, and scale in to reduce costs.

Technology stack

  • AWS CloudFormation
  • AWS CloudFront
  • AWS Cognito
  • AWS EC2 Spot Fleet
  • AWS ECS
  • AWS EFS
  • AWS ELB
  • AWS RDS
  • AWS S3
  • AWS Snowball
  • Blueshift
  • Jupyter + JupyterHub
  • Workflow Engine