Data-Intensive Text Processing with MapReduce

This book focuses on the design of MapReduce algorithms, with an emphasis on common text processing algorithms in natural language processing, information retrieval, and machine learning.

Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened up exciting new opportunities in business, science, and computing applications. Processing the massive amounts of data needed for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
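To make the programming model concrete, here is a minimal single-process sketch of word counting in the MapReduce style. It is an illustration only, not code from the book and not the Hadoop API: the names `mapper`, `reducer`, and `run_mapreduce` are hypothetical, and the grouping step stands in for the shuffle and sort that a real execution framework performs across machines.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for word in text.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all partial counts that share the same key."""
    yield word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Toy in-memory driver (an assumption of this sketch, not a real framework).

    It runs every mapper, groups intermediate values by key, then calls the
    reducer once per key in sorted key order.
    """
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)  # stand-in for the distributed shuffle

    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    docs = {"d1": "a rose is a rose", "d2": "is it a rose"}
    print(run_mapreduce(docs, mapper, reducer))
    # prints {'a': 3, 'is': 2, 'it': 1, 'rose': 3}
```

The programmer writes only `mapper` and `reducer`. In a real deployment, everything inside `run_mapreduce` (distributing input, grouping by key, scheduling, and recovering from failures) is the execution framework's job.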

This book focuses on the design of MapReduce algorithms, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. The book is intended not only to help the reader “think in MapReduce” but also to discuss the limitations of the programming model.

Table of contents:

Acknowledgments
Introduction
  Computing in the Clouds
  Big Ideas
  Why Is This Different?
  What This Book Is Not
MapReduce Basics
  Functional Programming Roots
  Mappers and Reducers
  The Execution Framework
  Partitioners and Combiners
  The Distributed File System
  Hadoop Cluster Architecture
  Summary
MapReduce Algorithm Design
  Local Aggregation
    Combiners and In-Mapper Combining
    Algorithmic Correctness with Local Aggregation
  Pairs and Stripes
  Computing Relative Frequencies
  Secondary Sorting
  Relational Joins
    Reduce-Side Join
    Map-Side Join
    Memory-Backed Join
  Summary
Inverted Indexing for Text Retrieval
  Web Crawling
  Inverted Indexes
  Inverted Indexing: Baseline Implementation
  Inverted Indexing: Revised Implementation
  Index Compression
    Byte-Aligned and Word-Aligned Codes
    Bit-Aligned Codes
    Postings Compression
  What About Retrieval?
  Summary and Additional Readings
Graph Algorithms
  Graph Representations
  Parallel Breadth-First Search
  PageRank
  Issues with Graph Processing
  Summary and Additional Readings
EM Algorithms for Text Processing
  Expectation Maximization
    Maximum Likelihood Estimation
    A Latent Variable Marble Game
    MLE with Latent Variables
    Expectation Maximization
    An EM Example
  Hidden Markov Models
    Three Questions for Hidden Markov Models
    The Forward Algorithm
    The Viterbi Algorithm
    Parameter Estimation for HMMs
    Forward-Backward Training: Summary
  EM in MapReduce
    HMM Training in MapReduce
  Case Study: Word Alignment for Statistical Machine Translation
    Statistical Phrase-Based Translation
    Brief Digression: Language Modeling with MapReduce
    Word Alignment
    Experiments
  EM-Like Algorithms
    Gradient-Based Optimization and Log-Linear Models
  Summary and Additional Readings
Closing Remarks
  Limitations of MapReduce
  Alternative Computing Paradigms
  MapReduce and Beyond
Bibliography
Authors’ Biographies

Author(s): Jimmy Lin, Chris Dyer

Series: Synthesis Lectures on Human Language Technologies (series editor: Graeme Hirst)

Publisher: Morgan and Claypool Publishers, Year: 2010

ISBN: 1608453421, 9781608453429

