Foreword

Preface

Acknowledgments

Notation

Introduction
  Challenges
  Goals
  Overview and Structure of the Argument
    Theory
    Methods
    Algorithms
  Summary

Text Classification
  Learning Task
    Binary Setting
    Multi-Class Setting
    Multi-Label Setting
  Representing Text
    Word Level
    Sub-Word Level
    Multi-Word Level
    Semantic Level
  Feature Selection
    Feature Subset Selection
    Feature Construction
  Term Weighting
  Conventional Learning Methods
    Naive Bayes Classifier
    Rocchio Algorithm
    k-Nearest Neighbors
    Decision Tree Classifier
    Other Methods
  Performance Measures
    Error Rate and Asymmetric Cost
    Precision and Recall
    Precision/Recall Breakeven Point and Fβ-Measure
    Micro- and Macro-Averaging
  Experimental Setup
    Test Collections
    Design Choices

Support Vector Machines
  Linear Hard-Margin SVMs
  Soft-Margin SVMs
  Non-Linear SVMs
  Asymmetric Misclassification Cost
  Other Maximum-Margin Methods
  Further Work and Further Information

Part I: Theory

A Statistical Learning Model of Text Classification for SVMs
  Properties of Text-Classification Tasks
    High-Dimensional Feature Space
    Sparse Document Vectors
    Heterogeneous Use of Terms
    High Level of Redundancy
    Frequency Distribution of Words and Zipf's Law
  A Discriminative Model of Text Classification
    Step 1: Bounding the Expected Error Based on the Margin
    Step 2: Homogeneous TCat-Concepts as a Model of Text-Classification Tasks
    Step 3: Learnability of TCat-Concepts
  Comparing the Theoretical Model with Experimental Results
  Sensitivity Analysis: Difficult and Easy Learning Tasks
    Influence of Occurrence Frequency
    Discriminative Power of Term Sets
    Level of Redundancy
  Noisy TCat-Concepts
  Limitations of the Model and Open Questions
  Related Work
  Summary and Conclusions

Efficient Performance Estimators for SVMs
  Generic Performance Estimators
    Training Error
    Hold-Out Testing
    Bootstrap and Jackknife
    Cross-Validation and Leave-One-Out
  ξα-Estimators
    Error Rate
    Recall, Precision, and F1
  Fast Leave-One-Out Estimation
  Experiments
    How Large are Bias and Variance of the ξα-Estimators?
    What is the Influence of the Training Set Size?
    How Large is the Efficiency Improvement for Exact Leave-One-Out?
  Summary and Conclusions

Part II: Methods

Inductive Text Classification
  Learning Task
  Automatic Model and Parameter Selection
    Leave-One-Out Estimator of the PRBEP
    ξα-Estimator of the PRBEP
    Model-Selection Algorithm
  Experiments
    Word Weighting, Stemming and Stopword Removal
    Trading Off Training Error vs. Complexity
    Non-Linear Classification Rules
    Comparison with Conventional Methods
  Related Work
  Summary and Conclusions

Transductive Text Classification
  Learning Task
  Transductive Support Vector Machines
  What Makes TSVMs Well Suited for Text Classification?
    An Intuitive Example
    Transductive Learning of TCat-Concepts
  Experiments
  Constraints on the Transductive Hyperplane
  Relation to Other Approaches Using Unlabeled Data
    Probabilistic Approaches Using EM
    Co-Training
    Other Work on Transduction
  Summary and Conclusions

Part III: Algorithms

Training Inductive Support Vector Machines
  Problem and Approach
  General Decomposition Algorithm
  Selecting a Good Working Set
    Convergence
    How to Compute the Working Set
  Shrinking: Reducing the Number of Variables
  Efficient Implementation
    Termination Criteria
    Computing the Gradient and the Termination Criteria Efficiently
    What are the Computational Resources Needed in each Iteration?
    Caching Kernel Evaluations
    How to Solve the QP on the Working Set
  Related Work
  Experiments
    Training Times for Reuters, WebKB, and Ohsumed
    How does Training Time Scale with the Number of Training Examples?
    What is the Influence of the Working-Set-Selection Strategy?
    What is the Influence of Caching?
    What is the Influence of Shrinking?
  Summary and Conclusions

Training Transductive Support Vector Machines
  Problem and Approach
  The TSVM Algorithm
  Analysis of the Algorithm
    How does the Algorithm Work?
    Convergence
  Experiments
    Does the Algorithm Effectively Maximize Margin?
    Training Times for Reuters, WebKB, and Ohsumed
    How does Training Time Scale with the Number of Training Examples?
    How does Training Time Scale with the Number of Test Examples?
  Related Work
  Summary and Conclusions

Conclusions
  Open Questions

Bibliography

Appendices
  SVM-Light Commands and Options

Index