Jimmy Lin

Data-Intensive Text Processing with MapReduce




      


       Synthesis Lectures on Human Language Technologies

       Editor

       Graeme Hirst, University of Toronto

       Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is on important new techniques, on new applications, and on topics that combine two or more HLT subfields.

       Data-Intensive Text Processing with MapReduce

       Jimmy Lin and Chris Dyer

       2010

       Semantic Role Labeling

       Martha Palmer, Daniel Gildea, and Nianwen Xue

       2010

       Spoken Dialogue Systems

       Kristiina Jokinen and Michael McTear

       2009

       Introduction to Chinese Natural Language Processing

       Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang

       2009

       Introduction to Linguistic Annotation and Text Analytics

       Graham Wilcock

       2009

       Dependency Parsing

       Sandra Kübler, Ryan McDonald, and Joakim Nivre

       2009

       Statistical Language Models for Information Retrieval

       ChengXiang Zhai

       2008

      Copyright © 2010 by Morgan & Claypool

      All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

      Data-Intensive Text Processing with MapReduce

      Jimmy Lin and Chris Dyer

       www.morganclaypool.com

      ISBN: 9781608453429 paperback

      ISBN: 9781608453436 ebook

      DOI 10.2200/S00274ED1V01Y201006HLT007

      A Publication in the Morgan & Claypool Publishers series

       SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES

      Lecture #7

      Series Editor: Graeme Hirst, University of Toronto

      Series ISSN

      Synthesis Lectures on Human Language Technologies

      Print 1947-4040 Electronic 1947-4059

       Data-Intensive Text Processing with MapReduce

      Jimmy Lin and Chris Dyer

      University of Maryland

       SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #7


       ABSTRACT

      Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book aims to help the reader “think in MapReduce” and also discusses the limitations of the programming model.

       KEYWORDS

      Hadoop, parallel and distributed programming, algorithm design, text processing, natural language processing, information retrieval, machine learning

       Contents

       Acknowledgments

       1 Introduction

       1.1 Computing in the Clouds

       1.2 Big Ideas

       1.3 Why Is This Different?

       1.4 What This Book Is Not

       2 MapReduce Basics

       2.1 Functional Programming Roots

       2.2 Mappers and Reducers

       2.3 The Execution Framework

       2.4 Partitioners and Combiners

       2.5 The Distributed File System

       2.6 Hadoop Cluster Architecture

       2.7 Summary

       3 MapReduce Algorithm Design

       3.1 Local Aggregation

       3.1.1 Combiners and In-Mapper Combining

       3.1.2 Algorithmic Correctness with Local Aggregation

       3.2 Pairs and Stripes

       3.3 Computing Relative Frequencies

       3.4 Secondary Sorting