| |
| |
About the Author | |
| |
| |
About the Technical Reviewer | |
| |
| |
Acknowledgments | |
| |
| |
Introduction | |
| |
| |
Old-School Client-Server Technology | |
| |
| |
The Problem with Browsers | |
| |
| |
What to Expect from This Book | |
| |
| |
About the Website | |
| |
| |
About the Code | |
| |
| |
Requirements | |
| |
| |
A Disclaimer (This Is Important) | |
| |
| |
Fundamental Concepts and Techniques | |
| |
| |
| |
What's in It for You? | |
| |
| |
| |
Uncovering the Internet's True Potential | |
| |
| |
| |
What's in It for Developers? | |
| |
| |
| |
What's in It for Business Leaders? | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Ideas for Webbot Projects | |
| |
| |
| |
Inspiration from Browser Limitations | |
| |
| |
| |
A Few Crazy Ideas to Get You Started | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Downloading Web Pages | |
| |
| |
| |
Think About Files, Not Web Pages | |
| |
| |
| |
Downloading Files with PHP's Built-in Functions | |
| |
| |
| |
Introducing PHP/CURL | |
| |
| |
| |
Installing PHP/CURL | |
| |
| |
| |
LIB_http | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Basic Parsing Techniques | |
| |
| |
| |
Content Is Mixed with Markup | |
| |
| |
| |
Parsing Poorly Written HTML | |
| |
| |
| |
Standard Parse Routines | |
| |
| |
| |
Using LIB_parse | |
| |
| |
| |
Useful PHP Functions | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Advanced Parsing with Regular Expressions | |
| |
| |
| |
Pattern Matching, the Key to Regular Expressions | |
| |
| |
| |
PHP Regular Expression Types | |
| |
| |
| |
Learning Patterns Through Examples | |
| |
| |
| |
Regular Expressions of Particular Interest to Webbot Developers | |
| |
| |
| |
When Regular Expressions Are (or Aren't) the Right Parsing Tool | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Automating Form Submission | |
| |
| |
| |
Reverse Engineering Form Interfaces | |
| |
| |
| |
Form Handlers, Data Fields, Methods, and Event Triggers | |
| |
| |
| |
Unpredictable Forms | |
| |
| |
| |
Analyzing a Form | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Managing Large Amounts of Data | |
| |
| |
| |
Organizing Data | |
| |
| |
| |
Making Data Smaller | |
| |
| |
| |
Thumbnailing Images | |
| |
| |
| |
Final Thoughts | |
| |
| |
Projects | |
| |
| |
| |
Price-Monitoring Webbots | |
| |
| |
| |
The Target | |
| |
| |
| |
Designing the Parsing Script | |
| |
| |
| |
Initialization and Downloading the Target | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Image-Capturing Webbots | |
| |
| |
| |
Example Image-Capturing Webbot | |
| |
| |
| |
Creating the Image-Capturing Webbot | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Link-Verification Webbots | |
| |
| |
| |
Creating the Link-Verification Webbot | |
| |
| |
| |
Running the Webbot | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Search-Ranking Webbots | |
| |
| |
| |
Description of a Search Result Page | |
| |
| |
| |
What the Search-Ranking Webbot Does | |
| |
| |
| |
Running the Search-Ranking Webbot | |
| |
| |
| |
How the Search-Ranking Webbot Works | |
| |
| |
| |
The Search-Ranking Webbot Script | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Aggregation Webbots | |
| |
| |
| |
Choosing Data Sources for Webbots | |
| |
| |
| |
Example Aggregation Webbot | |
| |
| |
| |
Adding Filtering to Your Aggregation Webbot | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
FTP Webbots | |
| |
| |
| |
Example FTP Webbot | |
| |
| |
| |
PHP and FTP | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Webbots That Read Email | |
| |
| |
| |
The POP3 Protocol | |
| |
| |
| |
Executing POP3 Commands with a Webbot | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Webbots That Send Email | |
| |
| |
| |
Email, Webbots, and Spam | |
| |
| |
| |
Sending Mail with SMTP and PHP | |
| |
| |
| |
Writing a Webbot That Sends Email Notifications | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Converting a Website into a Function | |
| |
| |
| |
Writing a Function Interface | |
| |
| |
| |
Final Thoughts | |
| |
| |
Advanced Technical Considerations | |
| |
| |
| |
Spiders | |
| |
| |
| |
How Spiders Work | |
| |
| |
| |
Example Spider | |
| |
| |
| |
LIB_simple_spider | |
| |
| |
| |
Experimenting with the Spider | |
| |
| |
| |
Adding the Payload | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Procurement Webbots and Snipers | |
| |
| |
| |
Procurement Webbot Theory | |
| |
| |
| |
Sniper Theory | |
| |
| |
| |
Testing Your Own Webbots and Snipers | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Webbots and Cryptography | |
| |
| |
| |
Designing Webbots That Use Encryption | |
| |
| |
| |
A Quick Overview of Web Encryption | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Authentication | |
| |
| |
| |
What Is Authentication? | |
| |
| |
| |
Example Scripts and Practice Pages | |
| |
| |
| |
Basic Authentication | |
| |
| |
| |
Session Authentication | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Advanced Cookie Management | |
| |
| |
| |
How Cookies Work | |
| |
| |
| |
PHP/CURL and Cookies | |
| |
| |
| |
How Cookies Challenge Webbot Design | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Scheduling Webbots and Spiders | |
| |
| |
| |
Preparing Your Webbots to Run as Scheduled Tasks | |
| |
| |
| |
The Windows XP Task Scheduler | |
| |
| |
| |
The Windows 7 Task Scheduler | |
| |
| |
| |
Non-Calendar-Based Triggers | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Scraping Difficult Websites with Browser Macros | |
| |
| |
| |
Barriers to Effective Web Scraping | |
| |
| |
| |
Overcoming Webscraping Barriers with Browser Macros | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Hacking iMacros | |
| |
| |
| |
Hacking iMacros for Added Functionality | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Deployment and Scaling | |
| |
| |
| |
One-to-Many Environment | |
| |
| |
| |
One-to-One Environment | |
| |
| |
| |
Many-to-Many Environment | |
| |
| |
| |
Many-to-One Environment | |
| |
| |
| |
Scaling and Denial-of-Service Attacks | |
| |
| |
| |
Creating Multiple Instances of a Webbot | |
| |
| |
| |
Managing a Botnet | |
| |
| |
| |
Further Exploration | |
| |
| |
Larger Considerations | |
| |
| |
| |
Designing Stealthy Webbots and Spiders | |
| |
| |
| |
Why Design a Stealthy Webbot? | |
| |
| |
| |
Stealth Means Simulating Human Patterns | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Proxies | |
| |
| |
| |
What Is a Proxy? | |
| |
| |
| |
Proxies in the Virtual World | |
| |
| |
| |
Why Webbot Developers Use Proxies | |
| |
| |
| |
Using a Proxy Server | |
| |
| |
| |
Types of Proxy Servers | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Writing Fault-Tolerant Webbots | |
| |
| |
| |
Types of Webbot Fault Tolerance | |
| |
| |
| |
Error Handlers | |
| |
| |
| |
Further Exploration | |
| |
| |
| |
Designing Webbot-Friendly Websites | |
| |
| |
| |
Optimizing Web Pages for Search Engine Spiders | |
| |
| |
| |
Web Design Techniques That Hinder Search Engine Spiders | |
| |
| |
| |
Designing Data-Only Interfaces | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Killing Spiders | |
| |
| |
| |
Asking Nicely | |
| |
| |
| |
Building Speed Bumps | |
| |
| |
| |
Setting Traps | |
| |
| |
| |
Final Thoughts | |
| |
| |
| |
Keeping Webbots out of Trouble | |
| |
| |
| |
It's All About Respect | |
| |
| |
| |
Copyright | |
| |
| |
| |
Trespass to Chattels | |
| |
| |
| |
Internet Law | |
| |
| |
| |
Final Thoughts | |
| |
| |
PHP/CURL Reference | |
| |
| |
Creating a Minimal PHP/CURL Session | |
| |
| |
Initiating PHP/CURL Sessions | |
| |
| |
Setting PHP/CURL Options | |
| |
| |
Executing the PHP/CURL Command | |
| |
| |
Closing PHP/CURL Sessions | |
| |
| |
Status Codes | |
| |
| |
HTTP Codes | |
| |
| |
NNTP Codes | |
| |
| |
SMS Gateways | |
| |
| |
Sending Text Messages | |
| |
| |
Reading Text Messages | |
| |
| |
A Sampling of Text Message Email Addresses | |