The project focused on the analysis of a dense network of urban measurements: 1) assessing strategies for converting raw voltages to concentrations; 2) collecting information on emissions from disparate sources and formatting it for comparison to observations; 3) developing simple and complex models for comparing predicted and observed concentrations; and 4) creating visualizations that make the data publicly accessible.

The Problem

BEACO2N is a recently developed approach to scientific sensor networks. The data, which will soon be made available to the public, will enable scientists, educators, and students to track greenhouse gas pollution in their area. This project serves to calibrate datasets for NO, CO, CO2, NO2 and O3 and ultimately to operationalize sensor calibration for public access. A primary method for filtering and classifying the data is to look at EPA locations, each of which hosts multiple sensors, and to form clusters of each sensor type within 70 km.

In addition, our project addresses the trade-off between cheap and expensive sensors by bridging the gap in sensor quality. Improving sensor performance and finding structural errors in sensor outputs can reduce the cost of purchasing sensor equipment. Each node measures CO2, a major contributor to climate change, along with gases such as NO2 and O3, which serve as indicators of overall air quality in a specific area. These nodes are also useful for tracing the origins of emissions and for creating clusters that allow us to track locations where higher levels of NO, CO, CO2, NO2 and O3 are being emitted.

Thus, “success” for us means calibrating the data and speeding up public access to it. Several issues that arise from the BEACO2N nodes themselves must be addressed. These include noise from external factors such as temperature and humidity (factors unrelated to air quality or to tampering with or corruption of the data) and the use of corrupted reference data. Stress-testing cross-sensitivity metrics helped ensure the validity of the data in the calibration process. Sensor degradation was another problem: sensors tend to fail gradually over time, so mitigating this would improve the accuracy of the data. Reference data can itself be corrupted, and identifying these inaccuracies is essential because we otherwise assume full trust in the reference dataset for visualization and analysis. Besides errors internal to the nodes, external disturbances such as physical intervention matter as well: unintentionally bumping into, shifting, or changing the deployment of a BEACO2N node may shift and corrupt the data. Beyond developing techniques that directly address these issues, we must also develop and test models to gauge how well these techniques work across the entire network of BEACO2N sensors.

Ultimately, we aim to use a broad network of cheaper sensors in areas that require less air quality monitoring and to reserve high-quality, expensive sensors for areas in need of significant air quality monitoring. Ideally, we achieve the same degree of accuracy while addressing the economics of network deployment and, as mentioned earlier, successfully calibrating the data.

The Data

The data used in this project comes from the Berkeley Environmental Air-quality & CO2 Network (BEACO2N). The network takes a new approach to observing atmospheric gases over an urban area: it uses a high-density network of instruments, each serving as a node, rather than a low-density network of highly sensitive equipment. Each node in the BEACO2N network produces moderate-quality data, but the network effect of the entire dataset produces an accurate representation of real-time pollutant concentrations. Every node in the network measures carbon dioxide, nitrogen oxides, ozone, and carbon monoxide as indicators of overall air quality in an area.

We manage the noise discussed earlier through a number of flags that mark data as null or nonsensical, for example when the auxiliary value for a gas is recorded as higher than the working value. These flags increased the accuracy of our initial multilinear regression models: R-squared values improved significantly in nodes where the flags removed invalid data points from our data sets.
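A flag of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the column names `co_wrk` and `co_aux`, the flag labels, and the sample readings are hypothetical and are not taken from the BEACO2N pipeline.

```python
import math

def flag_reading(working, auxiliary):
    """Return a flag label for one working/auxiliary pair, or None if it looks valid."""
    for value in (working, auxiliary):
        # Treat missing values and NaN as null readings.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return "null"
    # An auxiliary value above the working value is treated as nonsensical.
    if auxiliary > working:
        return "aux_exceeds_working"
    return None

# Hypothetical CO readings (column names are assumptions).
readings = [
    {"co_wrk": 0.52, "co_aux": 0.31},
    {"co_wrk": None, "co_aux": 0.30},
    {"co_wrk": 0.28, "co_aux": 0.35},
    {"co_wrk": float("nan"), "co_aux": 0.30},
]

flags = [flag_reading(r["co_wrk"], r["co_aux"]) for r in readings]
valid = [r for r, f in zip(readings, flags) if f is None]
```

Keeping the flag label alongside each reading, rather than silently dropping rows, makes it possible to report how many points each rule removed per node.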

Our solution has been multifold. We first built multilinear regression models over specific 3- to 6-month periods using EPA sensors as reference nodes. Using these models, we performed cross-site testing with different nodes and reference locations and discovered the importance of humidity, temperature, and deployment numbers in our models. We cleaned the data by filtering out the most extreme values at both ends to account for calibration periods. This increased the R-squared values for some BEACO2N nodes, sometimes by 8-10%.
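The trimming of extreme values can be sketched as a percentile filter. This is a minimal illustration, not the project's actual filter: the 1st and 99th percentile cut-offs, the function name `trim_extremes`, and the sample values are assumptions chosen for the example.

```python
import statistics

def trim_extremes(values, lower_pct=1, upper_pct=99):
    """Keep only values between the given percentiles (inclusive)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if low <= v <= high]

# Hypothetical signal: ordinary values plus two extreme calibration spikes.
signal = list(range(1, 101)) + [10_000, -10_000]
kept = trim_extremes(signal)
```

Cut-offs this lenient remove only the most extreme points, which limits the risk of discarding real pollution events along with calibration artifacts.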

We then moved away from linear modelling and focused on sensitivity testing with plume detection techniques, both to test the validity of our data and to observe general trends in gas values (e.g. CO values). This involved correcting and offsetting our input values to achieve relative calibration across our data.

[Figure]

We also worked on clustering by EPA locations and their sensors in order to see how location affected the data.

[Figure: Clustering of EPA Locations]
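Grouping nodes around EPA reference locations can be sketched with a great-circle distance check. The 70 km radius is the one mentioned in the problem statement; the site names, node names, and coordinates below are hypothetical, and the grouping rule (every node within the radius joins that site's cluster) is an assumption about how such clusters might be formed.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def cluster_by_reference(nodes, references, radius_km=70.0):
    """Map each reference site to the list of nodes within radius_km of it."""
    clusters = {name: [] for name in references}
    for node_name, (nlat, nlon) in nodes.items():
        for ref_name, (rlat, rlon) in references.items():
            if haversine_km(nlat, nlon, rlat, rlon) <= radius_km:
                clusters[ref_name].append(node_name)
    return clusters

# Hypothetical coordinates (latitude, longitude).
references = {"epa_oakland": (37.80, -122.27)}
nodes = {"node_a": (37.87, -122.27), "node_b": (38.58, -121.49)}

clusters = cluster_by_reference(nodes, references)
```

Here `node_a` is a few kilometres from the reference site and joins its cluster, while `node_b` is well beyond 70 km and is left out.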


Across the different objectives within our project, we obtained a variety of results with varying degrees of success.

In the early stages of the project, our exploration consisted primarily of constructing multilinear regression models across a wide range of BEACO2N node sites, with data sets covering different periods of time. The high correlation values from our linear regression modeling of the NO, CO, CO2, NO2 and O3 datasets gave us confidence in the validity of the data as signals dependent on their respective pollutants.

[Figure: CO time series showing the reference data (co, orange) and the raw BEACO2N node data (co_wrk-aux, blue)]

As an example, we will proceed with a data set from a BEACO2N deployment between January 2018 and July 2018. The visualization above shows this data set: the orange time series (co) is the reference data, and the blue time series (co_wrk-aux) is the raw BEACO2N node data.

[Figure: model prediction (pred_y) alongside the reference data (co)]

From here, we fit a multilinear regression model between the BEACO2N input for CO (co_wrk-aux) and the reference data for CO (co). The second visualization above applies the fitted model to our BEACO2N data and shows the prediction (pred_y) alongside the reference data (co). To produce our test metrics, root mean squared error (RMSE) and R-squared, we apply a 70/30 train/test split. To produce our k-fold cross-validation RMSE, we use 4 splits. This procedure was applied to NO, CO, CO2, NO2 and O3, and across different datasets covering different locations and periods of time.
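The evaluation procedure can be sketched as follows. This is a NumPy-only illustration on synthetic data, not the project's code: the use of temperature and humidity as extra predictors, the coefficients used to generate the data, the random seeds, and all helper names are assumptions. Libraries such as scikit-learn provide equivalent split and k-fold utilities.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def kfold_rmse(X, y, n_splits=4, seed=0):
    """Mean RMSE over n_splits shuffled folds, each used once as the test set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        coef = fit_linear(X[train_idx], y[train_idx])
        scores.append(rmse(y[test_idx], predict(coef, X[test_idx])))
    return float(np.mean(scores))

# Synthetic stand-in for one node's data (all values are made up).
rng = np.random.default_rng(42)
n = 400
co_wrk_aux = rng.uniform(0.1, 0.6, n)
temperature = rng.uniform(5, 30, n)
humidity = rng.uniform(20, 90, n)
X = np.column_stack([co_wrk_aux, temperature, humidity])
co = 2.0 * co_wrk_aux + 0.01 * temperature - 0.002 * humidity + rng.normal(0, 0.02, n)

# 70/30 train/test split on shuffled indices.
order = rng.permutation(n)
cut = int(0.7 * n)
train_idx, test_idx = order[:cut], order[cut:]

coef = fit_linear(X[train_idx], co[train_idx])
pred_y = predict(coef, X[test_idx])
test_rmse = rmse(co[test_idx], pred_y)
test_r2 = r_squared(co[test_idx], pred_y)
cv_rmse = kfold_rmse(X, co, n_splits=4)
```

Reporting both the held-out test RMSE and the 4-fold RMSE gives a check against a single lucky or unlucky split.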

With regard to direct forms of filtering, we experimented with filtering via quartiles, z-scores, and correlation over subsets in search of a general method for flagging outliers. A common concern across all three methods was that, given their broad nature, a method might occasionally prune too much data and leave a sparse dataset; the fixed thresholds applied during filtering should therefore err on the side of leniency. In the current implementation of the BEACO2N calibration process, any value with a z-score above 2 is flagged as an outlier and removed before the calibration phase.
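The z-score rule can be sketched as follows. The threshold of 2 comes from the text; treating the rule as two-sided (using the absolute z-score), the use of the population standard deviation, and the sample values are assumptions made for this illustration.

```python
import statistics

def zscore_outlier_mask(values, threshold=2.0):
    """Return a list of booleans, True where |z-score| exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant series has no outliers.
        return [False] * len(values)
    return [abs((v - mean) / stdev) > threshold for v in values]

# Hypothetical sensor signal with one spike.
signal = [0.30, 0.31, 0.29, 0.32, 0.30, 0.31, 0.29, 0.30, 0.95, 0.31]
mask = zscore_outlier_mask(signal)
cleaned = [v for v, is_outlier in zip(signal, mask) if not is_outlier]
```

Because a large spike inflates the standard deviation itself, a fixed z-score threshold tends to be conservative on short series, which is consistent with the preference for lenient filtering.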

With respect to the impact of temperature and humidity on the quality of our data, the primary concern was that sensors could perform poorly at low temperatures or high humidity. Our objective was therefore to first identify whether such a relationship existed. One approach that we implemented was to fit a multilinear regression model to a chosen data set, calculate the squared error of each prediction, and match these values with the corresponding temperature and humidity values. We then plotted squared error against each variable to see whether a relationship was apparent. The following two figures show the results of this procedure for one example NO data set with respect to humidity and temperature.

[Figure: squared error versus humidity for the example NO data set]
[Figure: squared error versus temperature for the example NO data set]

As the two visualizations show, high squared-error values do not skew towards high humidity, since they appear equally at high and low humidity values; likewise, high squared-error values do not skew towards low temperature, since they appear equally at high and low temperature values.

This procedure was repeated on additional randomly selected BEACO2N node data sets, and the conclusion was consistent. Given these results, we concluded that temperature and humidity are currently not of significant concern as indicators of data quality under the initial hypothesis.
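The squared-error check can be sketched numerically. The project inspected plots; the Pearson correlation used here is a stand-in summary added for illustration, not the authors' method. The data are synthetic, with noise generated independently of temperature and humidity, and all variable names and coefficients are assumptions.

```python
import numpy as np

# Synthetic stand-in for one NO data set (all values are made up).
rng = np.random.default_rng(0)
n = 1000
no_wrk_aux = rng.uniform(0.1, 0.6, n)
temperature = rng.uniform(5, 30, n)
humidity = rng.uniform(20, 90, n)
no_ref = 3.0 * no_wrk_aux + rng.normal(0, 0.05, n)

# Fit a least-squares line with an intercept and compute per-point squared error.
A = np.column_stack([np.ones(n), no_wrk_aux])
coef, *_ = np.linalg.lstsq(A, no_ref, rcond=None)
squared_error = (no_ref - A @ coef) ** 2

# Correlation of squared error with each environmental variable.
corr_humidity = float(np.corrcoef(squared_error, humidity)[0, 1])
corr_temperature = float(np.corrcoef(squared_error, temperature)[0, 1])
```

Correlations near zero are what one would expect if, as concluded above, large errors are not concentrated at high humidity or low temperature. A linear correlation can miss non-monotonic patterns, so it complements the plots rather than replacing them.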

Yin Yin Teo

Brandon Zhong

Nishant Mishra

Spring 2020
Physical Science/Engineering