Caeli: The Sky Is Not the Limit in Measuring Particulate Matter in the Atmosphere

Measuring by Satellite

Caeli is a startup dedicated to providing insight into air quality with a view from above. Satellites orbiting our planet provide Caeli’s end-users with chronological and (near) real-time information. Satellite imagery can be a cheaper and more readily available tool for measuring the molecular composition of our atmosphere than ground-based monitoring. Generating maps of pollutants such as nitrogen dioxide (NO2), ammonia (NH3), methane (CH4) and ozone (O3) can help the public and government understand how changes to the atmosphere may affect health or influence the climate.

A Scalable Architecture Designed for Time and Place

Our first step was to design an architecture that could process and store large influxes of data quickly. Scalability was crucial, considering the inevitable increase in data to be processed and stored, and the obvious choice for scalability was the cloud. In this case, the Amazon Web Services (AWS) cloud platform provided the best data storage options. Within AWS, we created a database and a data pipeline that collects the acquired data for NO2, a gas that can form particulate nitrates, and writes it into that database.
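
To make the pipeline concrete, here is a minimal sketch of one ingestion step, assuming a PostgreSQL-compatible database on AWS (for example Amazon RDS) and the psycopg2 driver; the table name, columns and connection details are our own illustration, not Caeli’s actual schema.

```python
# Minimal sketch of one pipeline step: writing retrieved NO2 records into a
# PostgreSQL-compatible database on AWS (e.g. Amazon RDS).
# Table name, column names and connection details are hypothetical.
import psycopg2


def write_no2_records(records, dsn):
    """Insert (measured_at, latitude, longitude, no2_value) tuples into the database."""
    insert_sql = """
        INSERT INTO no2_measurements (measured_at, latitude, longitude, no2_value)
        VALUES (%s, %s, %s, %s)
    """
    with psycopg2.connect(dsn) as conn:          # commits on successful exit
        with conn.cursor() as cur:
            cur.executemany(insert_sql, records)


# Example usage with dummy values:
# write_no2_records(
#     [("2021-01-15T10:30:00Z", 52.37, 4.90, 2.1e-5)],
#     dsn="host=<rds-endpoint> dbname=caeli user=pipeline password=<secret>",
# )
```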

Very simplified, Caeli’s data looks like this:
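
The table below is a purely illustrative stand-in with hypothetical values and simplified columns, not Caeli’s actual data.

```
measured_at (UTC)      latitude   longitude   NO2 value
2021-01-15 10:30:00    52.37      4.90        2.1e-5
2021-01-15 10:30:00    51.92      4.48        1.8e-5
2021-01-16 10:12:00    52.37      4.90        2.4e-5
```

Each record couples a timestamp to X and Y coordinates and a measured value, and new records are appended in chronological order.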

When Caeli retrieves information from their own database, they often want to filter it by time and location: for example, data from Amsterdam during January 2021. Filtering by time is not a problem because the data is stored chronologically (ascending); the database system roughly ‘knows’ in which rows the January 2021 records are found.

However, it becomes more complicated when you also want to filter the data by location. The records are not stored in order of their X and Y coordinates, and only about one record in a million in the database actually falls within Amsterdam. Checking every row would be very inefficient, so the challenge was to find an architecture that can filter efficiently across multiple dimensions.
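
The specific technique Caeli ended up with is not spelled out here, but the general idea can be sketched: give every record both a time key and a coarse spatial key, and partition the storage on those keys so that a query for Amsterdam in January 2021 only touches the matching partitions instead of scanning every row. The snippet below is a hypothetical illustration using pandas and Parquet partition pruning; the column names, grid resolution and paths are assumptions, not Caeli’s implementation.

```python
# Hypothetical illustration: partition records by month and by a coarse
# lat/lon grid cell, so a combined time + location query only reads the
# partitions it needs instead of scanning the whole dataset.
import pandas as pd


def add_partition_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a month key and a coarse spatial key from each record."""
    df = df.copy()
    df["month"] = df["measured_at"].dt.strftime("%Y-%m")
    # ~0.5 degree grid cells: coarse, but enough to isolate a city region.
    df["cell"] = (
        (df["latitude"] * 2).round().astype(int).astype(str)
        + "_"
        + (df["longitude"] * 2).round().astype(int).astype(str)
    )
    return df


# Writing: lay the dataset out as Parquet directories per (month, cell).
# add_partition_keys(raw_df).to_parquet("no2_dataset", partition_cols=["month", "cell"])

# Reading: only the partitions matching the filters are opened, so the
# "Amsterdam, January 2021" query reads a small fraction of the dataset.
# amsterdam_jan = pd.read_parquet(
#     "no2_dataset",
#     filters=[("month", "==", "2021-01"), ("cell", "==", "105_10")],
# )
```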

Rob: applying my knowledge and skills at Caeli

I made intensive use of many of the tools and techniques that the Young Mavericks traineeship introduced to me. Both Amazon Web Services (AWS) and the Hadoop ecosystem, which together were the focus of this assignment, were practiced extensively during the traineeship.

The training prepared me to provide the best solution for Caeli’s data management. On the one hand, Caeli receives a working end product that meets its requirements for precision and accessibility of information: nitrogen dioxide (NO2) data is now automatically collected and stored in a scalable database where records can be efficiently filtered by both date-time and location. On the other hand, clear documentation of the processes and transfer protocols allows Caeli to manage this product itself and to reuse it for other atmospheric pollutants.

“The project helped us to migrate from an on-premise environment to the cloud environment. Young Mavericks and Rob put us on the right track so that we were able to migrate our environment to an AWS environment. As a result, our NO2 data has become available and we have been able to take a step to scale up to other products and to be ready for other countries.” – Martin and Tim from Caeli

ELT, ETL and Data Pipelines: loading data in an automated manner and how to do it yourself

My name is Don and I am something between a data engineer and a data scientist. I automate repetitive tasks, generate insights from data, manage projects and often take on an advisory role in these processes. I especially enjoy the creative process required to solve problems using data.


Data Pipeline Implementation: how to do it yourself

These instructions build on what has been discussed in ELT, ETL and Data Pipelines. In that guide, we discussed the problems that arise when a company stores and uses data. In response to those problems, we introduced the concept of Data Pipelines, which helps a company become more aware of its data loading steps and incorporate those steps as effectively as possible to create a Data-driven Culture. We also discussed some specific tooling that can be used to properly deploy Data Pipelines.

Now that we understand the concepts behind Data Pipelines, we will apply them to implement a functioning Data Pipeline. Just like most of our data engineering processes, we follow a step-by-step plan and provide an implementation strategy for each step.
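
As a hedged sketch of what such a step-by-step plan can look like in code, here is a minimal extract-transform-load skeleton; the step names, example record and print-based sink are placeholders of our own, not the implementation from the repository.

```python
# Skeleton of a simple three-step Data Pipeline: extract, transform, load.
# Source, transformation and destination are placeholders that show how a
# step-by-step plan translates into separate, composable functions.
from typing import Iterable, Iterator


def extract() -> Iterator[dict]:
    """Step 1: collect raw records from the source system."""
    yield {"measured_at": "2021-01-15T10:30:00Z", "latitude": 52.37,
           "longitude": 4.90, "value": 2.1e-5}


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Step 2: clean and enrich the raw records."""
    for record in records:
        record["month"] = record["measured_at"][:7]  # derive a partition key
        yield record


def load(records: Iterable[dict]) -> None:
    """Step 3: write the transformed records to the target storage."""
    for record in records:
        print("writing", record)  # replace with a real database or file sink


def run_pipeline() -> None:
    """Run the steps in order; each step can be tested and swapped independently."""
    load(transform(extract()))


if __name__ == "__main__":
    run_pipeline()
```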

Hopefully this step-by-step plan, together with the implementation methods, will give you a solid foundation for constructing your own Data Pipeline. You can find the complete code on our GitLab.
