Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

ISBN-10: 1593273975

ISBN-13: 9781593273972

Edition: 2nd (2011)

Authors: Michael Schrenk

List price: $41.95

Book details

List price: $41.95
Edition: 2nd
Copyright year: 2011
Publisher: No Starch Press, Incorporated
Publication date: 3/15/2012
Binding: Paperback
Pages: 392
Size: 7.00" wide x 9.00" long x 0.97" tall
Weight: 1.628

Michael Schrenk develops webbots and spiders for clients across North America. He has written for Computerworld and Web Techniques magazines and has taught college courses on web usability and Internet marketing. He is also an occasional speaker at DEFCON.

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
About the Website
About the Code
Requirements
A Disclaimer (This Is Important)
Fundamental Concepts and Techniques
What's in It for You?
Uncovering the Internet's True Potential
What's in It for Developers?
What's in It for Business Leaders?
Final Thoughts
Ideas for Webbot Projects
Inspiration from Browser Limitations
A Few Crazy Ideas to Get You Started
Final Thoughts
Downloading Web Pages
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Introducing PHP/CURL
Installing PHP/CURL
LIB_http
Final Thoughts
Basic Parsing Techniques
Content Is Mixed with Markup
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Useful PHP Functions
Final Thoughts
Advanced Parsing with Regular Expressions
Pattern Matching, the Key to Regular Expressions
PHP Regular Expression Types
Learning Patterns Through Examples
Regular Expressions of Particular Interest to Webbot Developers
When Regular Expressions Are (or Aren't) the Right Parsing Tool
Final Thoughts
Automating Form Submission
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Unpredictable Forms
Analyzing a Form
Final Thoughts
Managing Large Amounts of Data
Organizing Data
Making Data Smaller
Thumbnailing Images
Final Thoughts
Projects
Price-Monitoring Webbots
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
Image-Capturing Webbots
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Further Exploration
Final Thoughts
Link-Verification Webbots
Creating the Link-Verification Webbot
Running the Webbot
Further Exploration
Search-Ranking Webbots
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Final Thoughts
Further Exploration
Aggregation Webbots
Choosing Data Sources for Webbots
Example Aggregation Webbot
Adding Filtering to Your Aggregation Webbot
Further Exploration
FTP Webbots
Example FTP Webbot
PHP and FTP
Further Exploration
Webbots That Read Email
The POP3 Protocol
Executing POP3 Commands with a Webbot
Further Exploration
Webbots That Send Email
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Writing a Webbot That Sends Email Notifications
Further Exploration
Converting a Website into a Function
Writing a Function Interface
Final Thoughts
Advanced Technical Considerations
Spiders
How Spiders Work
Example Spider
LIB_simple_spider
Experimenting with the Spider
Adding the Payload
Further Exploration
Procurement Webbots and Snipers
Procurement Webbot Theory
Sniper Theory
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
Webbots and Cryptography
Designing Webbots That Use Encryption
A Quick Overview of Web Encryption
Final Thoughts
Authentication
What Is Authentication?
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Final Thoughts
Advanced Cookie Management
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Further Exploration
Scheduling Webbots and Spiders
Preparing Your Webbots to Run as Scheduled Tasks
The Windows XP Task Scheduler
The Windows 7 Task Scheduler
Non-calendar-based Triggers
Final Thoughts
Scraping Difficult Websites with Browser Macros
Barriers to Effective Web Scraping
Overcoming Webscraping Barriers with Browser Macros
Final Thoughts
Hacking iMacros
Hacking iMacros for Added Functionality
Further Exploration
Deployment and Scaling
One-to-Many Environment
One-to-One Environment
Many-to-Many Environment
Many-to-One Environment
Scaling and Denial-of-Service Attacks
Creating Multiple Instances of a Webbot
Managing a Botnet
Further Exploration
Larger Considerations
Designing Stealthy Webbots and Spiders
Why Design a Stealthy Webbot?
Stealth Means Simulating Human Patterns
Final Thoughts
Proxies
What Is a Proxy?
Proxies in the Virtual World
Why Webbot Developers Use Proxies
Using a Proxy Server
Types of Proxy Servers
Final Thoughts
Writing Fault-Tolerant Webbots
Types of Webbot Fault Tolerance
Error Handlers
Further Exploration
Designing Webbot-Friendly Websites
Optimizing Web Pages for Search Engine Spiders
Web Design Techniques That Hinder Search Engine Spiders
Designing Data-Only Interfaces
Final Thoughts
Killing Spiders
Asking Nicely
Building Speed Bumps
Setting Traps
Final Thoughts
Keeping Webbots out of Trouble
It's All About Respect
Copyright
Trespass to Chattels
Internet Law
Final Thoughts
PHP/CURL Reference
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
Executing the PHP/CURL Command
Closing PHP/CURL Sessions
Status Codes
HTTP Codes
NNTP Codes
SMS Gateways
Sending Text Messages
Reading Text Messages
A Sampling of Text Message Email Addresses