**Data-Intensive Text Processing with MapReduce**

Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened up exciting new opportunities in business, science, and computing applications. Processing the massive amounts of data required for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is both a programming model for expressing distributed computations over massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, from scheduling to synchronization to fault tolerance.
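To make the programming model concrete, here is a minimal single-machine sketch of word counting expressed as a mapper and a reducer. It is an illustration only, not code from the book and not Hadoop's API: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, and the in-memory grouping step merely stands in for the shuffle-and-sort that a real execution framework performs across a cluster.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Mapper (hypothetical name): emit (word, 1) for every token in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reducer (hypothetical name): sum all partial counts for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver; a real framework distributes each phase."""
    # Map phase, followed by grouping of intermediate values by key.
    # This grouping is what the framework's shuffle-and-sort step does.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: each key is processed together with all of its values.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


documents = [
    ("d1", "a rose is a rose"),
    ("d2", "a daisy is not a rose"),
]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

The algorithm designer writes only the two small functions; everything inside `run_mapreduce` corresponds to work that the execution framework would take over on a cluster.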

This book focuses on the design of MapReduce algorithms, with an emphasis on common text processing algorithms in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to common problems in a variety of problem domains. This book is intended not only to help the reader “think in MapReduce”; it also discusses the limitations of the programming model.

**Table of contents:**

Acknowledgments

Introduction

Computing in the Clouds

Big Ideas

Why Is This Different?

What This Book Is Not

MapReduce Basics

Functional Programming Roots

Mappers and Reducers

The Execution Framework

Partitioners and Combiners

The Distributed File System

Hadoop Cluster Architecture

Summary

MapReduce Algorithm Design

Local Aggregation

Combiners and In-Mapper Combining

Algorithmic Correctness with Local Aggregation

Pairs and Stripes

Computing Relative Frequencies

Secondary Sorting

Relational Joins

Reduce-Side Join

Map-Side Join

Memory-Backed Join

Summary

Inverted Indexing for Text Retrieval

Web Crawling

Inverted Indexes

Inverted Indexing: Baseline Implementation

Inverted Indexing: Revised Implementation

Index Compression

Byte-Aligned and Word-Aligned Codes

Bit-Aligned Codes

Postings Compression

What About Retrieval?

Summary and Additional Readings

Graph Algorithms

Graph Representations

Parallel Breadth-First Search

PageRank

Issues with Graph Processing

Summary and Additional Readings

EM Algorithms for Text Processing

Expectation Maximization

Maximum Likelihood Estimation

A Latent Variable Marble Game

MLE with Latent Variables

Expectation Maximization

An EM Example

Hidden Markov Models

Three Questions for Hidden Markov Models

The Forward Algorithm

The Viterbi Algorithm

Parameter Estimation for HMMs

Forward-Backward Training: Summary

EM in MapReduce

HMM Training in MapReduce

Case Study: Word Alignment for Statistical Machine Translation

Statistical Phrase-Based Translation

Brief Digression: Language Modeling with MapReduce

Word Alignment

Experiments

EM-Like Algorithms

Gradient-Based Optimization and Log-Linear Models

Summary and Additional Readings

Closing Remarks

Limitations of MapReduce

Alternative Computing Paradigms

MapReduce and Beyond

Bibliography

Authors’ Biographies

**Data-Intensive Text Processing with MapReduce**

Author(s): Jimmy Lin, Chris Dyer

Series: Synthesis Lectures on Human Language Technologies (series editor: Graeme Hirst)

Publisher: Morgan and Claypool Publishers, Year: 2010

ISBN: 1608453421, 9781608453429
