
      We first briefly discuss the general machine learning framework and basic machine learning methodology in Section 2. We then discuss feedforward neural networks and backpropagation in Section 3. In Section 4, we explore convolutional neural networks (CNNs), the type of architecture typically used in computer vision. In Section 5, we discuss autoencoders, unsupervised models that learn latent features without labels. In Section 6, we discuss recurrent neural networks (RNNs), which can handle sequence data.

      2.1 Introduction

      Machine learning methods are grouped into two main categories, based on what they aim to achieve. The first category is known as supervised learning. In supervised learning, each observation in a dataset comes attached with a label. The label, similar to a response variable, may represent a particular class the observation belongs to (categorical response) or an output value (real‐valued response). In either case, the ultimate goal is to make inferences on possibly unlabeled observations outside of the given dataset. Prediction and classification are both problems that fall into the supervised learning category. The second category is known as unsupervised learning. In unsupervised learning, the data come without labels, and the goal is to find a pattern within the data at hand. Unsupervised learning encompasses the problems of clustering, density estimation, and dimension reduction.

      2.2 Supervised Learning

      Here, we state the problem of supervised learning explicitly. We have a set of training data $\boldsymbol{X} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$, where $\boldsymbol{x}_i \in \mathbb{R}^p$ for all $i$, and a corresponding set of labels $\boldsymbol{y} = (y_1, \ldots, y_n)$, which can represent either a category membership or a real-valued response. We aim to construct a function $\delta: \mathbb{R}^p \to \mathbb{R}$ that maps each input $\boldsymbol{x}_i$ to a predicted label $\hat{y}_i$. A given supervised learning method $\mathcal{M}$ chooses a particular form $\delta = \delta(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$, where $\boldsymbol{\theta}_{\mathcal{M}}$ is a vector of parameters based on $\mathcal{M}$.
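      To make the notation concrete, the sketch below instantiates one common choice of $\delta$, a linear map $\delta(\boldsymbol{x}, \boldsymbol{\theta}) = \boldsymbol{x}^{\top}\boldsymbol{\theta}$, in Python with NumPy. The dimensions, synthetic data, and function names are illustrative assumptions, not values from the text.

import numpy as np

# Illustrative training set: n = 100 observations x_i in R^p with p = 3,
# and one real-valued label y_i per observation (generated synthetically here).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # rows are x_1, ..., x_n
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def delta(X, theta):
    # One possible form of delta: a linear map from R^p to R.
    # theta plays the role of theta_M, the parameter vector chosen by the method M.
    return X @ theta

theta = np.zeros(3)        # some parameter value
y_hat = delta(X, theta)    # predicted labels \hat{y}_i for every x_i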

      We wish to choose $\delta(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ to minimize an error function $E(\delta, \boldsymbol{y})$. The error function is most commonly taken to be the sum of squared errors, in which case the goal is to choose an optimal $\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ such that

\[
\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}}) \;=\; \arg\min_{\delta} E(\delta, \boldsymbol{y}) \;=\; \arg\min_{\delta} \sum_{i=1}^{n} \ell\bigl(\delta(\boldsymbol{x}_i, \boldsymbol{\theta}_{\mathcal{M}}), y_i\bigr)
\]

      where $\ell$ can be any loss function that evaluates the distance between $\delta(\boldsymbol{x}_i, \boldsymbol{\theta}_{\mathcal{M}})$ and $y_i$, such as cross-entropy loss and square loss.
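      As a small illustration of this objective, the sketch below codes the sum-of-squared-errors form of $E(\delta, \boldsymbol{y})$ with $\ell$ taken to be the squared loss, again assuming the linear form of $\delta$ used above; the function names and the tiny dataset are illustrative choices, not part of the text.

import numpy as np

def squared_loss(pred, label):
    # l(delta(x_i, theta), y_i) = (delta(x_i, theta) - y_i)^2
    return (pred - label) ** 2

def error(theta, X, y, loss=squared_loss):
    # E(delta, y): the sum of the per-observation losses,
    # with delta(x_i, theta) = x_i^T theta (a linear model).
    return np.sum(loss(X @ theta, y))

# Tiny worked example: two observations in R^2 and theta = 0,
# so each prediction is 0 and E = 1^2 + (-1)^2 = 2.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
print(error(np.zeros(2), X, y))    # 2.0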

      2.3 Gradient Descent

      The form of the function $\delta$ will usually be fairly complex, so attempting to find $\delta^*(\boldsymbol{X}, \boldsymbol{\theta}_{\mathcal{M}})$ via direct differentiation will not be feasible. Instead, we use gradient descent to minimize the error function.

      Gradient descent is a general optimization algorithm that can be used to find the minimizer of any given function. We pick an arbitrary starting point, and then at each time point, we take a small step in the direction of the negative gradient, that is, the direction in which the function decreases most steeply.
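      To make the procedure concrete, the sketch below applies gradient descent to the sum-of-squared-errors objective from above, assuming the linear $\delta(\boldsymbol{x}, \boldsymbol{\theta}) = \boldsymbol{x}^{\top}\boldsymbol{\theta}$ so that the gradient has a simple closed form; the learning rate, step count, and synthetic data are illustrative assumptions rather than values from the text.

import numpy as np

def gradient_descent(X, y, lr=1e-3, n_steps=500):
    # Minimize E(theta) = sum_i (x_i^T theta - y_i)^2 by gradient descent.
    # At each step we move a small distance lr in the direction of the
    # negative gradient of E with respect to theta.
    theta = np.zeros(X.shape[1])           # arbitrary starting point
    for _ in range(n_steps):
        residuals = X @ theta - y          # delta(x_i, theta) - y_i for all i
        grad = 2.0 * X.T @ residuals       # gradient of the squared-error sum
        theta -= lr * grad                 # step against the gradient
    return theta

# Synthetic check: data generated from true parameters (1.5, -2.0) plus noise,
# so the returned theta should be close to those values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.05 * rng.normal(size=200)
print(gradient_descent(X, y))              # approximately [ 1.5  -2. ]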