connectivity between different values. Finally, the outputs were also displayed on the Indian map.
1.3.2 System Specifications
The main hardware requirement is a CPU with high computational power, such as an Intel i5/i7 processor or equivalent. The system must also meet the primary and secondary memory requirements: 4–8 GB of RAM and approximately 50 GB of HDD space. The main software requirements are any open-source operating system and the Python language with packages and libraries such as scikit-learn, pandas, numpy, matplotlib, bokeh, flask, tensorflow, and theano.
1.3.3 Algorithms
1. K-Means Clustering: The K-means algorithm takes a set of input values and, based on the parameter k, groups the values into k clusters. The division is based on a similarity index: data values with close similarity indices are grouped into one cluster, and the remaining values are grouped into other clusters [11]. Distance measures such as Euclidean, Manhattan, and Minkowski are some of the similarity indices used for clustering. We used clustering because our dataset values had to be divided into classes. We chose six classes, viz., Good, Satisfactory, Moderately Polluted, Poor, Very Poor, and Severe.
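As a minimal sketch of this clustering step, the following groups pollutant readings into six clusters with scikit-learn. The data is synthetic and the column choice (PM2.5, PM10, NO2) is an assumption made for illustration; note also that scikit-learn's `KMeans` supports Euclidean distance only, so the other distance measures named above would need a different implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pollutant readings (assumed columns: PM2.5, PM10, NO2).
# These values are illustrative only and are not taken from the OpenAQ data.
rng = np.random.default_rng(0)
centers = np.array(
    [[20, 40, 30], [50, 80, 60], [80, 150, 90],
     [100, 300, 200], [200, 400, 300], [300, 500, 420]],
    dtype=float,
)
X = np.vstack([c + rng.normal(scale=5.0, size=(50, 3)) for c in centers])

# Standardize so that no single pollutant dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# k = 6 matches the six air quality classes named in the text.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Number of readings assigned to each cluster.
print(np.bincount(labels))
```

The cluster indices returned here are arbitrary; mapping them to named classes such as Good or Severe is a separate labeling step.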
2. Support Vector Machine Algorithm (SVMA) for Prediction: SVMAs are long-established, effective algorithms for solving classification and regression problems. An SVM is a supervised learning model used to analyze patterns in data. In an SVM, a linear model is used to handle non-linear class boundaries: a kernel implicitly maps the input into a high-dimensional feature space in which the classes can be separated linearly. Kernel selection is therefore an integral part of SVMs. Several kernels exist; we used the linear and Radial Basis Function (RBF) kernels in our experiments, and the outputs are discussed under results. The two major kinds of SVM considered are therefore linear and non-linear modeling.
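The difference between the two kernels can be illustrated with a small sketch using scikit-learn's `SVC`. The two-class "moons" dataset below is synthetic and chosen only because its class boundary is non-linear; the hyperparameters (`C=1.0`, `gamma="scale"`) are library defaults, not values reported in this chapter.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a curved class boundary (illustrative only).
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one SVM per kernel and record its accuracy on the held-out split.
scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

print(scores)
```

On data of this shape the RBF kernel can follow the curved boundary, whereas the linear kernel is restricted to a straight separating line.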
3. Recurrent Neural Network LSTM Algorithm (LSTM-RNN): Conventional neural networks such as Feed Forward Neural Networks (FFNNs) differ from Recurrent Neural Networks (RNNs): an FFNN is trained on labeled data, and inputs are only fed forward through the network until the prediction error is minimized.
RNNs differ from FFNNs in that the output received at stage t − 1 influences the output received at stage t. An RNN has two inputs: the present input value and the recent past value. Both are used to compute the new output.
Figure 1.2 shows the simple form of an RNN. The hidden state (ht), itself a non-linear transformation, is computed from a combination of the input value (It) and the recent past hidden state (ht−1). The figure shows that the output is computed from the current hidden state ht. The output Ot depends on the probability pt, which is computed using the softmax function. Softmax is computed only in the last layer of RNN-based classification, before the final result is obtained.
Figure 1.2 Basic steps of recurrent neural network.
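The recurrence described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the model used in this work: tanh is assumed as the non-linearity, the weight names (`W_x`, `W_h`, `W_o`) and bias names are hypothetical, and the weights are random instead of trained.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(inputs, W_x, W_h, W_o, b_h, b_o):
    """Run a basic RNN over a sequence and classify from the last hidden state."""
    h = np.zeros(W_h.shape[0])  # initial hidden state h_0
    for x_t in inputs:
        # h_t depends on the present input I_t and the previous hidden state h_{t-1}.
        h = np.tanh(W_x @ x_t + W_h @ h + b_h)
    # Softmax is applied once, at the final layer, to obtain class probabilities p_t.
    p = softmax(W_o @ h + b_o)
    return h, p

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 3, 4, 6  # sizes chosen for illustration
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_o = rng.normal(scale=0.5, size=(n_classes, n_hidden))
b_h = np.zeros(n_hidden)
b_o = np.zeros(n_classes)

inputs = rng.normal(size=(5, n_in))  # a sequence of five time steps
h_T, p_T = rnn_forward(inputs, W_x, W_h, W_o, b_h, b_o)
print(h_T.shape, p_T.sum())
```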
Because a basic RNN suffers from two gradient problems, vanishing gradients and exploding gradients, two modifications to the basic RNN have been introduced: LSTM and the Gated Recurrent Unit (GRU). Both provide gates that control the impact of the multiplying factor chiefly responsible for the increase (explosion) of the gradient when the factor is larger than one, or its decrease (vanishing) when the factor is less than one. LSTM has been used in our work [12].
Because we were trying to predict future values from present and past pollution data, which formed a time series with lags, LSTM suited our use case. LSTM learns from the historical data not only to classify but also to process the results and predict future scores without being affected by these gradient problems.
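To make the role of lags concrete, the sketch below turns a univariate series into the (samples, time steps, features) windows that LSTM layers conventionally expect. The helper name `make_lagged_windows`, the lag count, and the sample readings are all hypothetical; the chapter does not state the window length that was used.

```python
import numpy as np

def make_lagged_windows(series, n_lags):
    """Build (X, y) pairs where each target is predicted from the n_lags values before it."""
    series = np.asarray(series, dtype=float)
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    # Row i holds series[i : i + n_lags]; its target is series[i + n_lags].
    X = np.stack([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    # Add a trailing feature axis: (samples, time steps, 1 feature).
    return X[..., np.newaxis], y

# Hypothetical hourly PM2.5 readings for one place.
pm25 = [10, 12, 15, 14, 18, 21, 19]
X, y = make_lagged_windows(pm25, n_lags=3)
print(X.shape, y.shape)  # (4, 3, 1) (4,)
```

Each row of `X` holds three consecutive past readings, and the matching entry of `y` is the reading that immediately follows them.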
1.3.4 Control Flow
In terms of control flow, the working of our model can be explained with respect to the training model and the testing model:
1. Training Model: As the first step, the data is fetched from the OpenAQ Open Data Community and pre-processed to remove noise. The cleaned world data is passed to K-means clustering. Before setting the number of clusters, we measured the Silhouette coefficient to determine the optimal number of clusters. Separately, the cleaned single-place data for each place is passed to an LSTM. The outputs of the world-data clustering and of the single-place LSTM training are evaluated using MAE and RMSE values. The clustered world data is then assigned labels using the AQI table, and the labeled data is split into training data and testing data. The SVM is trained with the parameter values as input and air quality as output. Finally, 10-fold cross-validation was performed, and performance was measured using the confusion matrix, precision, and recall.
2. Testing Model: During testing, new data was fetched using the API and passed to the LSTM of the respective place, which predicted future values of all parameters. These predictions were passed as input to the SVM, which predicted the air quality and assigned the AQI.
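The evaluation steps in this flow can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the data is synthetic, the candidate range for k (2 to 8) is arbitrary, cluster indices stand in for AQI labels, and the forecast values used for MAE and RMSE are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (confusion_matrix, mean_absolute_error,
                             mean_squared_error, precision_score,
                             recall_score, silhouette_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned pollutant readings.
X, _ = make_blobs(n_samples=600, centers=6, n_features=4,
                  cluster_std=1.0, random_state=0)

# 1) Choose the number of clusters by the highest Silhouette coefficient.
sil = {}
for k in range(2, 9):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels_k)
best_k = max(sil, key=sil.get)

# 2) Use the cluster assignments as class labels (stand-in for AQI labeling).
y = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# 3) 10-fold cross-validation of an SVM, scored with a confusion matrix,
#    macro-averaged precision, and macro-averaged recall.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(SVC(kernel="rbf", gamma="scale"), X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
precision = precision_score(y, y_pred, average="macro")
recall = recall_score(y, y_pred, average="macro")

# 4) MAE and RMSE for a forecast (hypothetical values, not model output).
actual = np.array([42.0, 55.0, 61.0, 48.0])
forecast = np.array([40.0, 58.0, 60.0, 50.0])
mae = mean_absolute_error(actual, forecast)
rmse = float(np.sqrt(mean_squared_error(actual, forecast)))

print(best_k, precision, recall, mae, rmse)
```

Because `cross_val_predict` returns one out-of-fold prediction per sample, the confusion matrix summarizes all ten folds at once.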
1.4 Results and Discussions
The open data is provided by the OpenAQ organization [13], whose aim is to help people fight air pollution by providing open data and open-source tools. OpenAQ aggregates data obtained from government bodies and research groups. The OpenAQ API was used to fetch the latest data into a data frame, which was then saved in .csv format for computations. Figure 1.3 shows a screenshot of the data fetched on 6th June, 2020 for Visakhapatnam, India.
Table 1.1 Range of AQI categories (concentrations in µg/m3, except CO in mg/m3).
AQI category (range) | PM10 (24hr) | PM2.5 (24hr) | NO2 (24hr) | O3 (8hr) | CO (8hr) | SO2 (24hr) | NH3 (24hr) | Pb (24hr) |
Good (0–50) | 0–50 | 0–30 | 0–40 | 0–50 | 0–1.0 | 0–40 | 0–200 | 0–0.5 |
Satisfactory (51–100) | 51–100 | 31–60 | 41–80 | 51–100 | 1.1–2.0 | 41–80 | 201–400 | 0.5–1.0 |
Moderately polluted (101–200) | 101–250 |