Computational Statistics in Data Science

Part II Simulation‐Based Methods

5 Monte Carlo Simulation: Are We There Yet?

Dootika Vats¹, James M. Flegal², and Galin L. Jones³

¹Indian Institute of Technology Kanpur, Kanpur, India

²University of California, Riverside, CA, USA

³University of Minnesota, Twin‐Cities Minneapolis, MN, USA

1 Introduction

Monte Carlo simulation methods generate observations from a chosen distribution in an effort to estimate unknowns of that distribution. A rich variety of methods fall under this characterization, including classical Monte Carlo simulation, Markov chain Monte Carlo (MCMC), importance sampling, and quasi‐Monte Carlo.

Consider a distribution $F$ defined on a $d$‐dimensional space $\mathcal{X}$, and suppose that $\theta \in \mathbb{R}^p$ are features of interest of $F$. Specifically, $\theta$ may be a combination of quantiles, means, and variances associated with $F$. Samples $X_1, \dots, X_n$ are obtained via simulation either approximately or exactly from $F$, and a consistent estimator of $\theta$, $\hat{\theta}$, is constructed so that, as $n \to \infty$,

(1) $\hat{\theta}(X_1, \dots, X_n) \xrightarrow{\text{a.s.}} \theta$

Thus, even when $F$ is a complicated distribution, Monte Carlo simulation allows for estimation of features of $F$. Throughout, we assume that either independent and identically distributed (IID) samples or MCMC samples from $F$ can be obtained efficiently; see Refs [1–5] for various techniques.
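The convergence in (1) can be seen numerically. The following is a minimal sketch, not taken from the chapter: it assumes $F$ is the Exp(1) distribution (chosen only because its mean, 1, and median, $\log 2$, are known exactly) and uses the sample mean and sample median as the estimators. The function name `monte_carlo_estimates` and the seed are hypothetical.

```python
import math
import random
import statistics


def monte_carlo_estimates(n, rng):
    """Estimate the mean and median of Exp(1) from n IID draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    return statistics.fmean(draws), statistics.median(draws)


rng = random.Random(2024)  # fixed seed so the run is reproducible
true_mean, true_median = 1.0, math.log(2.0)

# Both estimation errors should shrink as the simulation size n grows.
for n in (100, 10_000, 100_000):
    est_mean, est_median = monte_carlo_estimates(n, rng)
    print(f"n={n:>7,d}  "
          f"mean error={abs(est_mean - true_mean):.4f}  "
          f"median error={abs(est_median - true_median):.4f}")
```

The errors are random, so they need not decrease monotonically from one line to the next; the almost sure convergence in (1) only guarantees that they vanish in the limit.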

The foundation of Monte Carlo simulation methods rests on asymptotic convergence as indicated by (1). When enough samples are obtained, $\hat{\theta} \approx \theta$, and simulation can be terminated with reasonable confidence. For many estimators, an asymptotic sampling distribution is available in order to ascertain the variability in estimation via a central limit theorem (CLT) or application of the delta method on a CLT. Section 2 introduces estimators of $\theta$, while Section 3 discusses sampling distributions of these estimators for IID and MCMC sampling.
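For IID sampling and a mean-type estimator, the CLT gives a simple way to quantify the variability mentioned above. The sketch below is an illustration under stated assumptions rather than the chapter's own code: it takes $\theta = E[X^2]$ with $X \sim N(0, 1)$ (so $\theta = 1$), estimates the Monte Carlo standard error by the sample standard deviation divided by $\sqrt{n}$, and forms the usual large-sample normal interval. The helper name `clt_interval` is hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist


def clt_interval(draws, alpha=0.05):
    """Large-sample (1 - alpha) interval for the mean of IID draws."""
    n = len(draws)
    estimate = statistics.fmean(draws)
    # Monte Carlo standard error: sample standard deviation over sqrt(n).
    std_error = statistics.stdev(draws) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    return estimate, estimate - z * std_error, estimate + z * std_error


rng = random.Random(11)  # fixed seed so the run is reproducible
draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(50_000)]  # g(X) = X^2
estimate, lower, upper = clt_interval(draws)
print(f"estimate={estimate:.4f}, 95% interval=({lower:.4f}, {upper:.4f})")
```

This interval is only valid for IID draws. With MCMC output the sample standard deviation ignores serial correlation, so a different variance estimator is needed.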

Although Monte Carlo simulation relies on large‐sample frequentist statistics, it is fundamentally different in two ways. First, data is generated by a computer, and so often there is little cost to obtaining further samples. Thus, the reliance on asymptotics is reasonable. Second, data is obtained sequentially, so determining when to terminate the simulation can be based on the samples already obtained. As this implies a random simulation time, additional safeguards are necessary to ensure asymptotic validity. This has led to the study of sequential stopping rules, which we present in Section 5.
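To make the idea of a random simulation time concrete, here is a minimal sketch of one common kind of sequential rule, a fixed-width rule for IID sampling: keep simulating until the half-width of the large-sample interval falls below a tolerance `eps`. This is an assumption made for illustration and is not necessarily the rule developed in Section 5. The minimum simulation size `min_n` is one example of a safeguard against stopping too early on a poor variance estimate; all names and default values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def run_until_precise(draw, eps, alpha=0.05, min_n=1_000,
                      check_every=500, max_n=1_000_000):
    """Sample until the CLT interval half-width is at most eps."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squares
    half_width = math.inf
    while n < max_n:
        x = draw()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Only check the rule after min_n draws, and then periodically.
        if n >= min_n and n % check_every == 0:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= eps:
                break
    return mean, half_width, n


rng = random.Random(3)  # fixed seed so the run is reproducible
mean, half_width, n = run_until_precise(
    lambda: rng.gauss(0.0, 1.0) ** 2, eps=0.02)
print(f"stopped at n={n}, estimate={mean:.4f}, half-width={half_width:.4f}")
```

In this example the variance of $X^2$ is 2, so the rule should stop near $n \approx 1.96^2 \times 2 / 0.02^2 \approx 19{,}000$, although the exact stopping time depends on the draws.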

Sequential stopping rules rely on estimating the limiting Monte Carlo variance–covariance matrix (when $p = 1$, this is the standard error of $\hat{\theta}$). This is a particularly challenging problem in MCMC due to serial correlation in the samples. We discuss these challenges in Section 4 and present estimators appropriate for large simulation sizes.
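The effect of serial correlation can be illustrated with a small sketch. It assumes a stationary AR(1) process with autocorrelation 0.5 and unit-variance innovations as a stand-in for MCMC output, and it uses non-overlapping batch means, one widely used estimator of the limiting variance, with batch size $\lfloor \sqrt{n} \rfloor$. Both choices are assumptions for illustration and are not taken from this excerpt. For this process the limiting variance of the sample mean is $1 / (1 - 0.5)^2 = 4$, while the marginal variance is only $1 / (1 - 0.5^2) \approx 1.33$.

```python
import math
import random
import statistics


def ar1_chain(n, rho, rng):
    """Simulate x_t = rho * x_{t-1} + e_t with e_t ~ N(0, 1), from x_0 = 0."""
    x, chain = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def batch_means_variance(chain, batch_size=None):
    """Batch means estimate of the limiting variance of the sample mean."""
    n = len(chain)
    b = batch_size or math.isqrt(n)  # batch size
    a = n // b                       # number of full batches
    used = chain[n - a * b:]         # drop the earliest leftover draws
    overall = statistics.fmean(used)
    batch_avgs = [statistics.fmean(used[k * b:(k + 1) * b]) for k in range(a)]
    return b * sum((m - overall) ** 2 for m in batch_avgs) / (a - 1)


rng = random.Random(7)  # fixed seed so the run is reproducible
chain = ar1_chain(100_000, rho=0.5, rng=rng)
naive = statistics.variance(chain)   # ignores serial correlation
bm = batch_means_variance(chain)     # accounts for it
print(f"naive variance={naive:.2f}, batch means variance={bm:.2f}")
print(f"Monte Carlo standard error={math.sqrt(bm / len(chain)):.4f}")
```

Using the naive variance here would understate the Monte Carlo standard error by roughly a factor of $\sqrt{3}$, which is why a stopping rule built on it could terminate the simulation too soon.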

Across a variety of examples in Section 7, we conclude that the simulation size required for reliable estimation is often higher than what is commonly used by practitioners (see also Refs [6, 7]). Given modern computational power, the recommended strategies can easily be adopted in most estimation problems. We conclude the introduction with an example illustrating the need for careful sample size calculations.

Example 1. Consider IID draws $X_1, \dots, X_m \sim N(\theta, \sigma^2)$. An estimate of $\theta$ is $\bar{X} = m^{-1} \sum_{i=1}^{m} X_i$, and $\sigma^2$ is estimated with the sample variance, $s^2$. Let $z_u$ be the $u$th quantile of a standard normal distribution, for $0 < u < 1$. A large‐sample $(1 - \alpha)100\%$ confidence interval for $\theta$ is

$\bar{X} \pm z_{\ldots}$