Air Quality
Wed 20 July 2016
Contents
Cleaning
Exploratory Data Analysis
Regression
The data is from a gas sensor array that was placed at ground level in a highly polluted section of an unnamed Italian city.
Several sensors, all based on various metal-oxide films, were used to detect the different pollutants [1](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3264469/).
From these sensors, the question is whether a model can be built to predict the true level of carbon monoxide using sensors attuned to other compounds. This would allow for more robust installations of these sensor arrays that remain versatile enough to detect the concentration of a specific compound. Additionally, if a relationship is found, it could be used to determine whether a sensor's readings are within an acceptable range.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style='white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
air_quality = pd.read_csv('AirQualityUCI/AirQualityUCI.csv', delimiter=';') # ';' was delimiter rather than ','
air_quality.head(2)
air_quality.info()
Just from looking at the first few rows of the dataframe, a few issues stand out. Some columns record decimals in the European style, with commas instead of periods, causing them to appear as objects rather than numeric values (float64). The last two columns contain no information.
The info function is useful in that it shows the number of entries, how many contain a value, and the type of each value. It is handy for finding NaNs, but it has weaknesses: if nonexistent values were filled with a sentinel number (e.g. -9999), they will not be counted as missing. To be certain the data is as desired, it is worth running a few more checks to confirm that each column is in the correct format and contains relevant information.
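As a quick illustration of that weakness (a toy column, not the air-quality data; the -9999 sentinel here is hypothetical):

```python
import pandas as pd

# A toy column where missing readings were recorded as -9999
toy = pd.DataFrame({'reading': [1.2, -9999, 3.4, -9999]})

# info()/count() treat the sentinel as a valid value...
print(toy['reading'].count())            # reports 4 non-null entries

# ...so sentinel values must be hunted down explicitly
print((toy['reading'] == -9999).sum())   # 2 sentinel entries
```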
From the dtypes listed, the Unnamed: 15 and Unnamed: 16 columns can be dropped. This leaves the following data for analysis:
# | Name | Explanation | Data Type
---|---|---|---
1 | Date | Date of Measurement (DD/MM/YYYY) | Datetime
2 | Time | Time Measurement Was Taken | Datetime
3 | CO(GT) | Ground Truth Carbon Monoxide Concentration | Float
4 | PT08.S1(CO) | Tin Oxide Hourly Average Sensor Response - CO Targeted | Float
5 | NMHC(GT) | Ground Truth Hourly Average Non-Methanic Hydrocarbons (NMHC) | Float
6 | C6H6(GT) | Ground Truth Benzene | Float
7 | PT08.S2(NMHC) | Titania Sensor Response - NMHC Targeted | Float
8 | NOx(GT) | Ground Truth Nitrogen Oxides | Float
9 | PT08.S3(NOx) | Tungsten Oxide Response - Nitrogen Oxides Targeted | Float
10 | NO2(GT) | Ground Truth Nitrogen Dioxide | Float
11 | PT08.S4(NO2) | Tungsten Oxide Response - Nitrogen Dioxide Targeted | Float
12 | PT08.S5(O3) | Indium Oxide Response - Ozone Targeted | Float
13 | T | Temperature (°C) | Float
14 | RH | Relative Humidity | Float
15 | AH | Absolute Humidity | Float
Ground truth specifies that the reading is the true value, taken from a certified reference analyzer placed in the vicinity of the sensor array. The units for all ground truth readings are $\frac{mg}{m^3}$.
Measurements were taken each hour. The experiment ran for a little more than a year, from March 10th, 2004 to April 4th, 2005.
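Since the Date column uses the DD/MM/YYYY format, any datetime parsing needs `dayfirst=True`; a sketch on literal strings rather than the full file:

```python
import pandas as pd

# DD/MM/YYYY strings like those in the Date column
dates = pd.to_datetime(pd.Series(['10/03/2004', '04/04/2005']), dayfirst=True)
print(dates.dt.strftime('%Y-%m-%d').tolist())  # ['2004-03-10', '2005-04-04']
```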
Non-existent values were given the value of -200. These will be replaced by NaNs for further analysis, and dropped accordingly.
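A minimal sketch of that replacement on a toy frame (the column names and values here are illustrative, not drawn from the dataset):

```python
import pandas as pd
import numpy as np

# Toy readings where -200 marks a missing value
toy = pd.DataFrame({'CO': [2.6, -200.0, 1.3], 'NOx': [166.0, -200.0, -200.0]})

# Once the sentinel becomes NaN, pandas' missing-data tools apply
toy = toy.replace(-200.0, np.nan)
print(toy.isnull().sum())   # CO: 1, NOx: 2
print(len(toy.dropna()))    # 1 fully observed row remains
```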
Below, a mask is created to drop the unnamed columns from the dataframe. The returned dataframe is then assigned to a new variable so the original does not need to be reassigned.
mask = [x for x in air_quality.columns if x not in ["Unnamed: 15", "Unnamed: 16"]]
air = air_quality[mask].copy() # Copy avoids SettingWithCopyWarning on later assignment
Now all rows that are entirely NaN can be dropped, and values of -200 replaced with NaN.
for column in air.columns:
    # Commas are present in strings; replace with periods before converting to numeric
    air[column] = air[column].apply(lambda x: x.replace(',', '.') if type(x) == str else x)

air = air.dropna(how='all')

for column in air.columns:
    # Sentinel -200 may appear as a string ('-200', '-200.0') or a number
    air[column] = air[column].apply(lambda x: np.nan
                                    if x in ('-200', '-200.0', -200)
                                    else x)
Another mask is needed to specify which columns should be converted to numeric values and which should not. The Time and Date columns are not of great concern for this analysis; otherwise they would be converted to datetime.
to_nums =[x for x in air.columns if x not in ['Date', 'Time']]
air[to_nums] = air[to_nums].applymap(float) # Apply float to all needed columns
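An equivalent, somewhat more defensive route is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN instead of raising. Sketched here on toy strings, not the actual columns:

```python
import pandas as pd

# Toy strings in the European comma-decimal style
raw = pd.Series(['2,6', '1,3', '-200,0', 'not a number'])

# Swap the decimal comma for a period, then coerce; unparseable entries become NaN
nums = pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')
print(nums.tolist())  # [2.6, 1.3, -200.0, nan]
```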
air.describe()
After all of the cleaning has been successfully completed it is possible to move on to the analysis portion of the project.
air_ = air[[column for column in air.columns
if u'(GT)' not in column or column == u"CO(GT)"]] # New dataframe with columns for analysis
air_ = air_.dropna(how='any')
Exploratory Data Analysis
For the desired analysis, the ground-truth readings for all of the chemical species except carbon monoxide need to be removed.
To hone things down further, a correlation plot can show how related the variables are to each other.
sns.heatmap(air_.corr())
plt.show()
It would appear that there is a strong correlation between the carbon monoxide levels and the sensors for all of the available species: all are positively correlated except the nitrogen oxide species, which are negatively correlated. Temperature and absolute humidity have a weak correlation, and relative humidity does not appear to have any effect on the carbon monoxide levels.
A pairplot with a regression focus helps confirm the assumption.
sensors = [sensor for sensor in air_.columns if sensor not in ['Date', 'Time', 'T', 'RH', 'CO(GT)', 'PT08.S1(CO)']]
sns.pairplot(air_.dropna(how='any'), kind='reg')
plt.show()
Regression
From the plots above, the NOx species appear to have a nonlinear relationship with carbon monoxide; the other chemical sensors have a linear relationship with the carbon monoxide ground truth.
This information can be used to construct the linear models, which can then be verified with a train-test split: introducing unseen data to the model shows how resilient it is to overfitting.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in newer versions
from sklearn.model_selection import KFold
Y = air_['CO(GT)']
X = air_[sensors]
X.columns
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=11)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
plt.figure(figsize=(8,8))
plt.scatter(y_pred, y_test, marker='+')
plt.xlabel("Predicted CO(GT)")
plt.xlim(0,10)
plt.ylim(0,10)
plt.ylabel("True CO(GT)")
plt.show()
lin_reg.score(X_test, y_test)
This is a satisfactory $r^2$ value for this situation. In the physical sciences, an $r^2$ in the 0.75 to 0.9 range is typically acceptable, because measurement noise introduces an irreducible error that no regression model can explain. An $r^2$ value above 0.9 would instead be a sign of overfitting: if new data were introduced, the model's predictions would likely degrade.
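To make the irreducible-error point concrete, here is a small simulation on synthetic data (not the sensor readings): even a correctly specified linear model cannot reach $r^2 = 1$ when the target carries measurement noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_sim = rng.rand(500, 2)
# True linear signal plus irreducible measurement noise
y_sim = X_sim @ np.array([3.0, -1.0]) + rng.normal(0, 0.5, 500)

# Cross-validated r^2 sits below 1 no matter how well the model fits
scores = cross_val_score(LinearRegression(), X_sim, y_sim, cv=5)
print(round(scores.mean(), 2))
```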
Using sklearn is useful for visualizing the outcome of the model versus the truth, but in this case it is also useful to perform the same analysis in another library, statsmodels, to get a clearer picture of the influence of each variable upon the output.
import statsmodels.api as sm
X = sm.add_constant(X) # Add intercept
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())
From the work done above, it would be appropriate to use the other sensors either to disregard outliers caused by a suspicious response, or to build sensor arrays with a limited set of sensors. This would be very useful in urban settings where pollutants originate from common recurring sources.