Spidering Hacks 100 Industrial-Strength Tips and Tools

ISBN-10: 0596005776

ISBN-13: 9780596005771

Edition: 2003

Authors: Kevin Hemenway, Tara Calishain, Morbus Iff

List price: $29.99

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side by side without the constraints of a browser, then Spidering Hacks is for you. Spidering Hacks takes you to the next level in Internet data retrieval, beyond search engines, by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no…

Book details

Copyright year: 2003
Publisher: O'Reilly Media, Incorporated
Publication date: 11/7/2003
Binding: Paperback
Pages: 424
Size: 5.75" wide x 8.75" long x 1.00" tall
Weight: 1.298
Language: English

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how to use them effectively in business situations.

Table of contents

Credits
Preface
Walking Softly
A Crash Course in Spidering and Scraping
Best Practices for You and Your Spider
Anatomy of an HTML Page
Registering Your Spider
Preempting Discovery
Keeping Your Spider Out of Sticky Situations
Finding the Patterns of Identifiers
Assembling a Toolbox
Perl Modules
Resources You May Find Helpful
Installing Perl Modules
Simply Fetching with LWP::Simple
More Involved Requests with LWP::UserAgent
Adding HTTP Headers to Your Request
Posting Form Data with LWP
Authentication, Cookies, and Proxies
Handling Relative and Absolute URLs
Secured Access and Browser Attributes
Respecting Your Scrapee's Bandwidth
Respecting robots.txt
Adding Progress Bars to Your Scripts
Scraping with HTML::TreeBuilder
Parsing with HTML::TokeParser
WWW::Mechanize 101
Scraping with WWW::Mechanize
In Praise of Regular Expressions
Painless RSS with Template::Extract
A Quick Introduction to XPath
Downloading with curl and wget
More Advanced wget Techniques
Using Pipes to Chain Commands
Running Multiple Utilities at Once
Utilizing the Web Scraping Proxy
Being Warned When Things Go Wrong
Being Adaptive to Site Redesigns
Collecting Media Files
Detective Case Study: Newgrounds
Detective Case Study: iFilm
Downloading Movies from the Library of Congress
Downloading Images from Webshots
Downloading Comics with dailystrips
Archiving Your Favorite Webcams
News Wallpaper for Your Site
Saving Only POP3 Email Attachments
Downloading MP3s from a Playlist
Downloading from Usenet with nget
Gleaning Data from Databases
Archiving Yahoo! Groups Messages with yahoo2mbox
Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Gleaning Buzz from Yahoo!
Spidering the Yahoo! Catalog
Tracking Additions to Yahoo!
Scattersearch with Yahoo! and Google
Yahoo! Directory Mindshare in Google
Weblog-Free Google Results
Spidering, Google, and Multiple Domains
Scraping Amazon.com Product Reviews
Receive an Email Alert for Newly Added Amazon.com Reviews
Scraping Amazon.com Customer Advice
Publishing Amazon.com Associates Statistics
Sorting Amazon.com Recommendations by Rating
Related Amazon.com Products with Alexa
Scraping Alexa's Competitive Data with Java
Finding Album Information with FreeDB and Amazon.com
Expanding Your Musical Tastes
Saving Daily Horoscopes to Your iPod
Graphing Data with RRDTOOL
Stocking Up on Financial Quotes
Super Author Searching
Mapping O'Reilly Best Sellers to Library Popularity
Using All Consuming to Get Book Lists
Tracking Packages with FedEx
Checking Blogs for New Comments
Aggregating RSS and Posting Changes
Using the Link Cosmos of Technorati
Finding Related RSS Feeds
Automatically Finding Blogs of Interest
Scraping TV Listings
What's Your Visitor's Weather Like?
Trendspotting with Geotargeting
Getting the Best Travel Route by Train
Geographic Distance and Back Again
Super Word Lookup
Word Associations with Lexical Freenet
Reformatting Bugtraq Reports
Keeping Tabs on the Web via Email
Publish IE's Favorites to Your Web Site
Spidering GameStop.com Game Prices
Bargain Hunting with PHP
Aggregating Multiple Search Engine Results
Robot Karaoke
Searching the Better Business Bureau
Searching for Health Inspections
Filtering for the Naughties
Maintaining Your Collections
Using cron to Automate Tasks
Scheduling Tasks Without cron
Mirroring Web Sites with wget and rsync
Accumulating Search Results Over Time
Giving Back to the World
Using XML::RSS to Repurpose Data
Placing RSS Headlines on Your Site
Making Your Resources Scrapable with Regular Expressions
Making Your Resources Scrapable with a REST Interface
Making Your Resources Scrapable with XML-RPC
Creating an IM Interface
Going Beyond the Book
Index