Group of authors

Machine Learning Algorithms and Applications



connectivity between different values. Finally, the outputs were also displayed on the Indian map.

      1.3.2 System Specifications

      The main hardware requirement is a CPU with high computational power, such as an Intel i5/i7 processor or equivalent. The system must also meet the primary and secondary memory requirements: 4–8 GB of RAM and approximately 50 GB of HDD space. The software requirements are any open-source operating system and Python, with scikit-learn and other packages and libraries such as pandas, numpy, matplotlib, bokeh, flask, tensorflow, and theano.

      1.3.3 Algorithms

       1. K-Means Clustering: The K-means algorithm takes a set of input values and, based on the parameter k, partitions them into k clusters. The partition is based on a similarity index: values that are close to one another under the index are grouped into the same cluster, and dissimilar values fall into different clusters [11]. Distance measures such as Euclidean, Manhattan, and Minkowski are commonly used as similarity indices for clustering. We used clustering because our dataset values had to be divided into classes. We chose six classes, viz., Good, Satisfactory, Moderately Polluted, Poor, Very Poor, and Severe.

       2. Support Vector Machine Algorithm (SVMA) for Prediction: SVMs are a long-established and effective family of algorithms for solving classification and regression problems. An SVM is a supervised learning model used to analyze patterns in data. A linear model is fitted in a feature space in which non-linear class boundaries become linear; the kernel function performs this mapping implicitly, typically into a higher-dimensional feature space. Kernel selection is therefore an integral part of SVMs. Several kernels exist; we used the linear and Radial Basis Function (RBF) kernels in our experiments, so the two kinds of SVM considered are linear and non-linear modeling. The outputs are discussed under results.

       3. Recurrent Neural Network LSTM Algorithm (LSTM-RNN): Conventional neural networks such as Feed Forward Neural Networks (FFNNs) differ from Recurrent Neural Networks (RNNs): an FFNN is trained on labeled data, and inputs are fed forward through the network until the prediction error is minimized.

      RNNs are different from FFNNs because the output at stage t − 1 affects the output at stage t. An RNN therefore has two inputs: the present input value and the recent past value. Both are used to compute the new output.

Schematic illustration of basic steps of recurrent neural network.
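      The recurrence described above, in which the result at stage t − 1 feeds into stage t, can be illustrated with a single scalar unit. This is a minimal sketch and not the authors' network: the weights `w_x`, `w_h`, and `b`, the tanh activation, and the input sequence are all assumptions chosen for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent update combining the present input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Apply rnn_step along a sequence and return every intermediate state."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Illustrative sequence: one non-zero input followed by zeros.
states = run_rnn([1.0, 0.0, 0.0])
# The later states stay non-zero although their inputs are zero,
# because each state carries information over from the previous stage.
```

      A feed-forward unit given the same zero inputs would produce the same output at every step; the carried-over state is what lets the recurrent unit remember the earlier input.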

      Because we were trying to predict future values from present and past pollution data that formed a time series with lags, LSTM suited our use case. LSTM learns from the historical data not only to classify but also to process the results and predict future scores, without being affected by gradient problems such as vanishing gradients.
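      Predicting from lagged time-series values usually means reshaping the series into pairs of (recent readings, next reading) before training. The sketch below shows one common way to do this; the helper name `make_lagged_windows`, the lag count, and the readings are hypothetical and are not taken from the authors' pipeline.

```python
def make_lagged_windows(series, n_lags):
    """Pair each run of n_lags consecutive readings with the reading that follows it."""
    inputs, targets = [], []
    for i in range(len(series) - n_lags):
        inputs.append(series[i:i + n_lags])
        targets.append(series[i + n_lags])
    return inputs, targets

# Hypothetical hourly PM2.5 readings, invented for illustration.
pm25 = [40, 42, 47, 55, 60, 58]
X, y = make_lagged_windows(pm25, n_lags=3)
# X == [[40, 42, 47], [42, 47, 55], [47, 55, 60]]
# y == [55, 60, 58]
```

      Each row of `X` would be one input sequence for a sequence model, and the matching entry of `y` the value it should learn to predict.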
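      The K-means procedure from item 1 of the list above can be sketched in plain Python. This is a from-scratch illustration of the standard assign-and-update loop with Euclidean distance, not the authors' code: seeding the centroids with the first k points, the two-feature readings, and k = 2 are assumptions made to keep the example small.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    """Basic K-means; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical (PM2.5, PM10) readings, invented for illustration only.
readings = [(12.0, 30.0), (180.0, 320.0), (15.0, 35.0), (175.0, 310.0), (14.0, 28.0)]
labels, centroids = kmeans(readings, k=2)
# labels == [0, 1, 0, 1, 0]: low and high readings end up in separate clusters
```

      Swapping `euclidean` for a Manhattan or Minkowski distance changes only the assignment step, which is why the choice of similarity index can be treated as a parameter of the method.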
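      The two kernels named in item 2 of the list above have simple closed forms, shown here as a small sketch. The function names and the value of `gamma` are assumptions for illustration; a library implementation such as scikit-learn's would normally be used in practice.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: the ordinary dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2); gamma=0.5 is an arbitrary choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a = (1.0, 2.0)
b = (3.0, 4.0)
print(linear_kernel(a, b))  # 11.0
print(rbf_kernel(a, a))     # 1.0, identical points are maximally similar
print(rbf_kernel(a, b))     # exp(-4), roughly 0.018
```

      The RBF value depends only on the distance between the two points, so nearby points score close to 1 and distant points close to 0. This is what allows an SVM with this kernel to draw non-linear boundaries in the original input space.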

      1.3.4 Control Flow

      In terms of control flow, the working of our model can be explained with respect to training model and testing model:

       1. Training Model: As the first step, the data is fetched from the OpenAQ Open Data Community and pre-processed to remove noise. The cleaned world data is passed to K-means clustering; before fixing the number of clusters, we measured the Silhouette coefficient to determine the optimal number. In parallel, the cleaned single-place data for each place is passed to an LSTM. The outputs of the world-data clustering and of the single-place LSTM training are evaluated using MAE and RMSE values. The clustered world data is also assigned labels using the AQI table, and the labeled data is then split into training and testing data. The SVM is trained with the parameter values as input and air quality as output. Finally, 10-fold cross-validation is performed, and performance is measured using the confusion matrix, precision, and recall.

       2. Testing Model: During testing, new data is fetched using the API and passed to the LSTM for the respective place, which predicts future values of all parameters. These predictions are passed as input to the SVM, which predicts the air quality; the corresponding AQI is then assigned.
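      The evaluation measures named in the training model (MAE, RMSE, precision, and recall) follow directly from their standard definitions, as the sketch below shows. The sample values and labels are invented for illustration and are not results from the study.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def precision_recall(actual, predicted, positive):
    """Precision and recall for one class, counted from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical forecast check: the absolute errors are 2, 3, and 0.
observed = [50.0, 60.0, 70.0]
forecast = [52.0, 57.0, 70.0]
forecast_mae = mae(observed, forecast)    # 5/3
forecast_rmse = rmse(observed, forecast)  # sqrt(13/3)

# Hypothetical class labels for four samples.
true_labels = ["Good", "Poor", "Good", "Poor"]
pred_labels = ["Good", "Good", "Good", "Poor"]
precision, recall = precision_recall(true_labels, pred_labels, "Good")
# precision == 2/3 (one "Poor" sample was called "Good"), recall == 1.0
```

      RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether forecast errors are evenly spread or dominated by a few large misses.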



AQI category (range)            PM10 (24hr)  PM2.5 (24hr)  NO2 (24hr)  O3 (8hr)  CO (8hr)  SO2 (24hr)  NH3 (24hr)  Pb (24hr)
Good (0–50)                     0–50         0–30          0–40        0–50      0–1.0     0–40        0–200       0–0.5
Satisfactory (51–100)           51–100       31–60         41–80       51–100    1.1–2.0   41–80       201–400     0.5–1.0
Moderately polluted (101–200)   101–250