Practical text mining with Perl

Where to find it

Information & Library Science Library

Call Number
QA76.9.D343 .B45 2008
Status
Available

Summary

Provides readers with the methods, algorithms, and means to perform text mining tasks.

This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives--statistics, data mining, linguistics, and information retrieval--and provides readers with the means to successfully complete text mining tasks on their own.

The book begins with an introduction to regular expressions, a text pattern methodology, and quantitative text summaries, all of which are fundamental tools for analyzing text. Then, it builds upon this foundation to explore:

  • Probability and texts, including the bag-of-words model
  • Information retrieval techniques such as the TF-IDF similarity measure
  • Concordance lines and corpus linguistics
  • Multivariate techniques such as correlation, principal components analysis, and clustering
  • Perl modules, German, and permutation tests

Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format.

Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.

Contents

  • Preface
  • Acknowledgments
  • 1 Introduction
  • 1.1 Overview of this Book
  • 1.2 Text Mining and Related Fields
  • 1.2.1 Chapter 2 Pattern Matching
  • 1.2.2 Chapter 3 Data Structures
  • 1.2.3 Chapter 4 Probability
  • 1.2.4 Chapter 5 Information Retrieval
  • 1.2.5 Chapter 6 Corpus Linguistics
  • 1.2.6 Chapter 7 Multivariate Statistics
  • 1.2.7 Chapter 8 Clustering
  • 1.2.8 Chapter 9 Three Additional Topics
  • 1.3 Advice for Reading this Book
  • 2 Text Patterns
  • 2.1 Introduction
  • 2.2 Regular Expressions
  • 2.2.1 First Regex: Finding the Word "Cat"
  • 2.2.2 Character Ranges and Finding Telephone Numbers
  • 2.2.3 Testing Regexes with Perl
  • 2.3 Finding Words in a Text
  • 2.3.1 Regex Summary
  • 2.3.2 Nineteenth Century Literature
  • 2.3.3 Perl Variables and the Function split
  • 2.3.4 Match Variables
  • 2.4 Decomposing Poe's "The Tell-Tale Heart" into Words
  • 2.4.1 Dashes and String Substitutions
  • 2.4.2 Hyphens
  • 2.4.3 Apostrophes
  • 2.5 A Simple Concordance
  • 2.5.1 Command Line Arguments
  • 2.5.2 Writing to Files
  • 2.6 First Attempt at Extracting Sentences
  • 2.6.1 Sentence Segmentation Preliminaries
  • 2.6.2 Sentence Segmentation for "A Christmas Carol"
  • 2.6.3 Leftmost Greediness and Sentence Segmentation
  • 2.7 Regex Odds and Ends
  • 2.7.1 Match Variables and Backreferences
  • 2.7.2 Regular Expression Operators and Their Output
  • 2.7.3 Lookaround
  • 2.8 References
  • Problems
  • 3 Quantitative Text Summaries
  • 3.1 Introduction
  • 3.2 Scalars, Interpolation and Context in Perl
  • 3.3 Arrays and Context in Perl
  • 3.4 Word Length Application
  • 3.5 Arrays and Functions
  • 3.5.1 Adding and Removing Entries from Arrays
  • 3.5.2 Selecting Subsets of an Array
  • 3.5.3 Sorting an Array
  • 3.6 Hashes
  • 3.6.1 Using a Hash
  • 3.7 Two Text Applications
  • 3.7.1 Zipf's Law
  • 3.7.2 Perl for Word Games
  • 3.7.2.1 An Aid to Crossword Puzzles
  • 3.7.2.2 Word Anagrams
  • 3.7.2.3 Finding Words in a Set of Letters
  • 3.8 Complex Data Structures
  • 3.8.1 References and Pointers
  • 3.8.2 Arrays of Arrays and Beyond
  • 3.8.3 Application: Comparing the Words in Two Poe Stories
  • 3.9 References
  • 3.10 First Transition
  • Problems
  • 4 Probability and Texts
  • 4.1 Introduction
  • 4.2 Probability
  • 4.2.1 Probability and Coin Flipping
  • 4.2.2 Probabilities and Texts
  • 4.2.2.1 Estimating Letter Probabilities
  • 4.2.2.2 Estimating Letter Bigram Probabilities
  • 4.3 Conditional Probability
  • 4.3.1 Independence
  • 4.4 Mean and Variance of Random Variables
  • 4.4.1 Sampling and Error Estimates
  • 4.5 The Bag-of-Words Model
  • 4.6 The Effect of Sample Size
  • 4.6.1 Tokens vs. Types
  • 4.7 References
  • Problems
  • 5 Applying Information Retrieval to Text Mining
  • 5.1 Introduction
  • 5.2 Text Counts and Vectors
  • 5.2.1 Counting Words with Perl
  • 5.2.2 Pronouns
  • 5.3 Text Counts and Vectors
  • 5.3.1 Vectors and Angles
  • 5.3.2 Computing Angles between Vectors
  • 5.3.2.1 Subroutines in Perl
  • 5.3.2.2 Computing the Angle between Vectors
  • 5.4 The Term-Document Matrix
  • 5.5 Matrix Multiplication
  • 5.5.1 A Text Application of Matrix Multiplication
  • 5.6 Functions of Counts
  • 5.7 Document Similarity
  • 5.7.1 Inverse Document Frequency
  • 5.7.2 Poe Story Angles Revisited
  • 5.8 References
  • Problems
  • 6 Concordance Lines and Corpus Linguistics
  • 6.1 Introduction
  • 6.2 Sampling
  • 6.2.1 Statistical Survey Sampling
  • 6.2.2 Text Sampling
  • 6.3 Corpus as Baseline
  • 6.3.1 Function vs. Content Words
  • 6.4 Concordancing
  • 6.4.1 Sorting Concordance Lines
  • 6.4.1.1 Code for Sorting Concordance Lines
  • 6.4.2 Application: Word Usage
  • 6.4 
