
      We first briefly discuss the general machine learning framework and basic machine learning methodology in Section 2. We then discuss feedforward neural networks and backpropagation in Section 3. In Section 4, we explore convolutional neural networks (CNNs), the type of architecture typically used in computer vision. In Section 5, we discuss autoencoders, unsupervised models that learn latent features without labels. In Section 6, we discuss recurrent neural networks (RNNs), which can handle sequence data.

      2.1 Introduction

      Machine learning methods are grouped into two main categories, based on what they aim to achieve. The first category is known as supervised learning. In supervised learning, each observation in a dataset comes attached with a label. The label, similar to a response variable, may represent a particular class the observation belongs to (categorical response) or an output value (real‐valued response). In either case, the ultimate goal is to make inferences on possibly unlabeled observations outside of the given dataset. Prediction and classification are both problems that fall into the supervised learning category. The second category is known as unsupervised learning. In unsupervised learning, the data come without labels, and the goal is to find a pattern within the data at hand. Unsupervised learning encompasses the problems of clustering, density estimation, and dimension reduction.

      2.2 Supervised Learning

      Here, we state the problem of supervised learning explicitly. We have a set of training data $\boldsymbol{X} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$, where $\boldsymbol{x}_i \in \mathbb{R}^p$ for all $i$, and a corresponding set of labels $\boldsymbol{y} = (y_1, \ldots, y_n)$, which can represent either a category membership or a real-valued response. We aim to construct a function $\delta: \mathbb{R}^p \to \mathbb{R}$ that maps each input $\boldsymbol{x}_i$ to a predicted label $\hat{y}_i$. A given supervised learning method $\mathcal{M}$ chooses a particular form $\delta = \delta(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$, where $\boldsymbol{\theta}_{\mathcal{M}}$ is a vector of parameters based on $\mathcal{M}$.
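      To make the notation concrete, the sketch below instantiates one common choice of $\delta$, a linear map $\delta(\boldsymbol{x}, \boldsymbol{\theta}) = \boldsymbol{x}^{\top}\boldsymbol{\theta}$, in Python with NumPy. The dimensions, synthetic data, and function names are illustrative assumptions, not values from the text.

import numpy as np

# Illustrative training set: n = 100 observations x_i in R^p with p = 3,
# and one real-valued label y_i per observation (generated synthetically here).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # rows are x_1, ..., x_n
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def delta(X, theta):
    # One possible form of delta: a linear map from R^p to R.
    # theta plays the role of theta_M, the parameter vector chosen by the method M.
    return X @ theta

theta = np.zeros(3)        # some parameter value
y_hat = delta(X, theta)    # predicted labels \hat{y}_i for every x_i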

      We wish to choose $\delta(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ to minimize an error function $E(\delta, \boldsymbol{y})$. The error function is most commonly taken to be the sum of squared errors, in which case the goal is to choose an optimal $\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ such that

\[
\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}}) \;=\; \arg\min_{\delta} E(\delta, \boldsymbol{y}) \;=\; \arg\min_{\delta} \sum_{i=1}^{n} \ell\bigl(\delta(\boldsymbol{x}_i, \boldsymbol{\theta}_{\mathcal{M}}), y_i\bigr)
\]

      where $\ell$ can be any loss function that evaluates the distance between $\delta(\boldsymbol{x}_i, \boldsymbol{\theta}_{\mathcal{M}})$ and $y_i$, such as cross-entropy loss and square loss.
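      As a small illustration of this objective, the sketch below codes the sum-of-squared-errors form of $E(\delta, \boldsymbol{y})$ with $\ell$ taken to be the squared loss, again assuming the linear form of $\delta$ used above; the function names and the tiny dataset are illustrative choices, not part of the text.

import numpy as np

def squared_loss(pred, label):
    # l(delta(x_i, theta), y_i) = (delta(x_i, theta) - y_i)^2
    return (pred - label) ** 2

def error(theta, X, y, loss=squared_loss):
    # E(delta, y): the sum of the per-observation losses,
    # with delta(x_i, theta) = x_i^T theta (a linear model).
    return np.sum(loss(X @ theta, y))

# Tiny worked example: two observations in R^2 and theta = 0,
# so each prediction is 0 and E = 1^2 + (-1)^2 = 2.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
print(error(np.zeros(2), X, y))    # 2.0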

      2.3 Gradient Descent

      The form of the function $\delta$ will usually be fairly complex, so attempting to find $\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ via direct differentiation will not be feasible. Instead, we use gradient descent to minimize the error function.

      Gradient descent is a general optimization algorithm that can be used to find the minimizer of any given function. We pick an arbitrary starting point, and then at each time point, we take a small step in the direction of the negative gradient, that is, the direction in which the function decreases most steeply.
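      To make the procedure concrete, the sketch below applies gradient descent to the sum-of-squared-errors objective from above, assuming the linear $\delta(\boldsymbol{x}, \boldsymbol{\theta}) = \boldsymbol{x}^{\top}\boldsymbol{\theta}$ so that the gradient has a simple closed form; the learning rate, step count, and synthetic data are illustrative assumptions rather than values from the text.

import numpy as np

def gradient_descent(X, y, lr=1e-3, n_steps=500):
    # Minimize E(theta) = sum_i (x_i^T theta - y_i)^2 by gradient descent.
    # At each step we move a small distance lr in the direction of the
    # negative gradient of E with respect to theta.
    theta = np.zeros(X.shape[1])           # arbitrary starting point
    for _ in range(n_steps):
        residuals = X @ theta - y          # delta(x_i, theta) - y_i for all i
        grad = 2.0 * X.T @ residuals       # gradient of the squared-error sum
        theta -= lr * grad                 # step against the gradient
    return theta

# Synthetic check: data generated from true parameters (1.5, -2.0) plus noise,
# so the returned theta should be close to those values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.05 * rng.normal(size=200)
print(gradient_descent(X, y))              # approximately [ 1.5  -2. ]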