Hadoop Operations: A Guide for Developers and Administrators

Name: Hadoop Operations: A Guide for Developers and Administrators
Price: 6.75 USD
Availability: InStock
ISBN: 9781449327057

ISBN-10: 1449327052

Edition: 2012

Authors: Eric Sammer

List price: $39.99

Description:

If you’ve been tasked with the job of maintaining large and complex Hadoop clusters, or are about to be, this book is a must. You’ll learn the particulars of Hadoop operations, from planning, installing, and configuring the system to providing ongoing maintenance. Hadoop is being adopted by more and more Fortune 500 companies, and the demand for operations-specific material has skyrocketed. This book, written by Eric Sammer, Principal Solution Architect at Cloudera, is the definitive operations guide for administrators. Developers who want to improve MapReduce jobs by learning how Hadoop works in large production environments will also benefit. Application administrators responsible for the…

Book details

Copyright year: 2012
Publisher: O'Reilly Media, Incorporated
Publication date: 10/12/2012
Binding: Paperback
Pages: 298
Size: 6.97" wide x 9.17" long x 0.71" tall
Weight: 1.034
Language: English

Eric Sammer is currently a Principal Solution Architect at Cloudera where he helps customers plan, deploy, develop for, and use Hadoop and the related projects at scale. His background is in the development and operations of distributed, highly concurrent, data ingest and processing systems. He's been involved in the open source community and has contributed to a large number of projects over the last decade.



Preface
Introduction
HDFS
Goals and Motivation
Design
Daemons
Reading and Writing Data
The Read Path
The Write Path
Managing Filesystem Metadata
Namenode High Availability
Namenode Federation
Access and Integration
Command-Line Tools
FUSE
REST Support

MapReduce
The Stages of MapReduce
Introducing Hadoop MapReduce
Daemons
When It All Goes Wrong
YARN

Planning a Hadoop Cluster
Picking a Distribution and Version of Hadoop
Apache Hadoop
Cloudera's Distribution Including Apache Hadoop
Versions and Features
What Should I Use?
Hardware Selection
Master Hardware Selection
Worker Hardware Selection
Cluster Sizing
Blades, SANs, and Virtualization
Operating System Selection and Preparation
Deployment Layout
Software
Hostnames, DNS, and Identification
Users, Groups, and Privileges
Kernel Tuning
vm.swappiness
vm.overcommit_memory
Disk Configuration
Choosing a Filesystem
Mount Options
Network Design
Network Usage in Hadoop: A Review
1 Gb versus 10 Gb Networks
Typical Network Topologies

Installation and Configuration
Installing Hadoop
Apache Hadoop
CDH
Configuration: An Overview
The Hadoop XML Configuration Files
Environment Variables and Shell Scripts
Logging Configuration
HDFS
Identification and Location
Optimization and Tuning
Formatting the Namenode
Creating a /tmp Directory
Namenode High Availability
Fencing Options
Basic Configuration
Automatic Failover Configuration
Format and Bootstrap the Namenodes
Namenode Federation
MapReduce
Identification and Location
Optimization and Tuning
Rack Topology
Security

Identity, Authentication, and Authorization
Identity
Kerberos and Hadoop
Kerberos: A Refresher
Kerberos Support in Hadoop
Authorization
HDFS
MapReduce
Other Tools and Systems
Tying It Together

Resource Management
What Is Resource Management?
HDFS Quotas
MapReduce Schedulers
The FIFO Scheduler
The Fair Scheduler
The Capacity Scheduler
The Future

Cluster Maintenance
Managing Hadoop Processes
Starting and Stopping Processes with Init Scripts
Starting and Stopping Processes Manually
HDFS Maintenance Tasks
Adding a Datanode
Decommissioning a Datanode
Checking Filesystem Integrity with fsck
Balancing HDFS Block Data
Dealing with a Failed Disk
MapReduce Maintenance Tasks
Adding a Tasktracker
Decommissioning a Tasktracker
Killing a MapReduce Job
Killing a MapReduce Task
Dealing with a Blacklisted Tasktracker

Troubleshooting
Differential Diagnosis Applied to Systems
Common Failures and Problems
Humans (You)
Misconfiguration
Hardware Failure
Resource Exhaustion
Host Identification and Naming
Network Partitions
"Is the Computer Plugged In?"
E-SPORE
Treatment and Care
War Stories
A Mystery Bottleneck
There's No Place Like 127.0.0.1

Monitoring
An Overview
Hadoop Metrics
Apache Hadoop 0.20.0 and CDH3 (metrics1)
Apache Hadoop 0.20.203 and Later, and CDH4 (metrics2)
What about SNMP?
Health Monitoring
Host-Level Checks
All Hadoop Processes
HDFS Checks
MapReduce Checks

Backup and Recovery
Data Backup
Distributed Copy (distcp)
Parallel Data Ingestion
Namenode Metadata
Appendix: Deprecated Configuration Properties
Index