
You're reading from Mastering Text Mining with R

Product type Book

Published in Dec 2016

Publisher Packt

ISBN-13 9781783551811

Pages 258 pages

Edition 1st Edition

Languages

Concepts

Data Mining

Author (1):

KUMAR ASHISH

Table of Contents (15) Chapters

Mastering Text Mining with R

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

1. Statistical Linguistics with R

2. Processing Text

3. Categorizing and Tagging Text

4. Dimensionality Reduction

5. Text Summarization and Clustering

6. Text Classification

7. Entity Recognition

Index

A

antiword tool
- download link / Microsoft Word documents

B

Bayes' formula
- for conditional probability / Bayes' formula for conditional probability
bias-variance decomposition / Bias-variance decomposition
bias-variance trade-off / Bias–variance trade-off and learning curve
binomial distribution / Binomial distribution
bootstrapping methods / Bootstrap

C

canonical correspondence analysis (CCA)
- about / Canonical correspondence analysis
- Pearson's Chi-squared test / Multiple correspondence analysis
caret package
- reference / Stratified
chunks
- reference / Chunk tags
co-occurrences
- extracting / Extracting co-occurrences
- surface co-occurrence / Surface Co-occurrence
- textual co-occurrence / Textual co-occurrence
- syntactic co-occurrence / Syntactic co-occurrence
- in document / Co-occurrence in a document
collocations / N-gram models
compound probabilities theorem / Theorem of compound probabilities
concept similarity
- about / Concept similarity
- path length / Path length
- Resnik similarity / Resnik similarity
- Lin similarity / Lin similarity
- Jiang-Conrath distance / Jiang-Conrath distance
conditional probability
- about / Conditional probability
- Bayes' formula / Bayes' formula for conditional probability
confusion matrix
- about / Confusion matrix
Correlated topic model (CTM)
- about / Correlated topic model
- model selection / Model selection
- R Package, for topic modeling / R Package for topic modeling
correspondence analysis
- about / Correspondence analysis
- canonical correspondence analysis (CCA) / Canonical correspondence analysis
- multiple correspondence analysis / Multiple correspondence analysis
cross-validation
- about / Cross validation
cumulative distribution function / Cumulative distribution function

D

Degree of Reading Power (DRP) / Automated readability index
dimensionality
- limitation / The curse of dimensionality
- distance concentration / Distance concentration and computational infeasibility
- computational infeasibility / Distance concentration and computational infeasibility
dimensionality reduction
- about / Dimensionality reduction
- principal component analysis (PCA) / Principal component analysis
- R, using for principal component analysis (PCA) / Using R for PCA
- reconstruction error / Reconstruction error
discrete random variables
- about / Discrete random variables
- continuous random variables / Continuous random variables
diverse sources
- text, accessing / Accessing text from diverse sources
document clustering
- about / Document clustering
document term matrix
- about / Document term matrix
- inverse document frequency / Inverse document frequency
- words similarity / Words similarity and edit-distance functions
- edit-distance functions / Words similarity and edit-distance functions
- Euclidean distance / Euclidean distance
- cosine similarity / Cosine similarity
- Levenshtein distance / Levenshtein distance
- Damerau-Levenshtein distance / Damerau-Levenshtein distance
- Hamming distance / Hamming distance
- Gunning fog index / Gunning fog index

E

.exe from
- download link / Synonymy and similarity
Easy Listening Formula (ELF) / Automated readability index
elements
- entities / Feature extraction
- attributes / Feature extraction
- events / Feature extraction
entity extraction
- about / Entity extraction
- rule-based approach / The rule-based approach
- machine learning / Machine learning
Extensible Markup Language (XML) / XML

F

10-fold cross-validation / k-Fold
feature extraction
- about / Feature extraction
- synonymy / Synonymy and similarity
- similarity / Synonymy and similarity
- multiwords / Multiwords, negation, and antonymy
- negation / Multiwords, negation, and antonymy
- antonymy / Multiwords, negation, and antonymy
- concept similarity / Concept similarity
feature selection, for text clustering
- about / Feature selection for text clustering
- mutual information, using / Mutual information
- statistic Chi Square feature selection / Statistic Chi Square feature selection
- frequency-based feature selection / Frequency-based feature selection
file system
- about / File system
- PDF documents / PDF documents
- Microsoft Word documents / Microsoft Word documents
- Hyper Text Markup Language (HTML) / HTML
- Extensible Markup Language (XML) / XML
- JavaScript Object Notation (JSON) / JSON
- Hypertext Transfer Protocol (HTTP) / HTTP

G

generative models / Latent Dirichlet Allocation

H

Hamming distance
- about / Hamming distance
- Jaro-Winkler distance / Jaro-Winkler distance
- text, readability measuring / Measuring readability of a text
Heaps' law / Heaps' law
Hidden Markov Models (HMM), POS tagging
- about / Hidden Markov Models for POS tagging
- definitions / Basic definitions and notations
- notations / Basic definitions and notations
- implementing / Implementing HMMs
- Viterbi underflow / Viterbi underflow
- forward algorithm underflow / Forward algorithm underflow
- OpenNLP chunking / OpenNLP chunking
- chunk tags / Chunk tags
Hyper Text Markup Language (HTML) / HTML
Hypertext Transfer Protocol (HTTP) / HTTP

I

independent events
- for conditional probability / Independent events
inverse document frequency (IDF) / Inverse document frequency
Inverse Document Frequency (IDF) / Frequency-based feature selection
ISOMAP
- using / Implementation of SVD using R
- geodesic distance approximation, calculating / Implementation of SVD using R

J

JavaScript Object Notation (JSON) / JSON
Java Virtual Machine (JVM) / Training a model with new features
joint distribution / Joint distribution

K

k-fold cross-validation / k-Fold
kernel functions / Kernel Trick
Kernel Trick / Kernel Trick
kernlab
- implementations / Kernel Trick
- reference / Kernel Trick
koRpus package / koRpus

L

L-BFGS
- about / Maxent implementation in R
language detection
- about / Language detection
language models
- about / Language models
- N-gram models / N-gram models
- Markov assumption / Markov assumption
- hidden Markov models / Hidden Markov models
languageR package
- about / languageR
languageR package / Lexical richness
Latent Dirichlet Allocation (LDA) / Latent Dirichlet Allocation
Latent Semantic Analysis (LSA)
- about / Latent semantic analysis
- R Package / R Package for latent semantic analysis
- example / Illustrative example of LSA
learning curve
- about / Learning curve
leave-one-out method / Leave-one-out
lemma / Word tokenization
lexical diversity
- about / Lexical diversity
- analyse lexical diversity / Analyse lexical diversity
- calculating / Calculate lexical diversity
- readability / Readability
- automated readability index / Readability
lexical richness
- about / Lexical richness
- lexical variation / Lexical variation
- lexical density / Lexical density
- lexical originality / Lexical originality
- lexical sophistication / Lexical sophistication
linear kernel
- applying / How to apply SVM on a real world example?
linguistics
- quantitative methods / Quantitative methods in linguistics
lsa package / lsa

M

Maxent package
- implementing, in R / Maxent implementation in R
maxent package / maxent
maximum entropy classifiers / Number of instances is significantly larger than the number of dimensions, Maximum entropy classifier
model evaluation
- about / Model evaluation
- confusion matrix / Confusion matrix
- ROC curve / ROC curve
- precision-recall / Precision-recall
model files
- reference / OpenNLP
model validation methods
- leave-one-out / Leave-one-out
- k-fold cross-validation / k-Fold
- bootstrapping methods / Bootstrap
- stratified sampling / Stratified
multi-word expressions (MWE) / Collocation and contingency tables
multi-word units (MWU) / Collocation and contingency tables
MySQL software
- download link / Databases

N

n-fold cross-validation / Leave-one-out
named entity recognition
- about / Named entity recognition
- model, training with new features / Training a model with new features
natural language processing (NLP) / Collocation and contingency tables

O

occurrences
- counting / Counting occurrences
ODBC Bridge
- download link / Databases
OpenNLPmodels.language package
- installation link / OpenNLP
OpenNLP package / OpenNLP
operations, on document-term matrix
- frequent terms / Operations on a document-term matrix
- term association / Operations on a document-term matrix
OWLQN
- about / Maxent implementation in R

P

part-of-speech (POS) / N-gram models
pointwise mutual information (PMI) / N-gram models
Poisson distribution / Poisson distribution
POS tagging
- Hidden Markov Models (HMM) / Hidden Markov Models for POS tagging
pre-trained POS models, for OpenNLP
- reference / POS tagging with R packages
pre-trained sentence boundary detection models
- reference / Sentence boundary detection
precision-recall
- about / Precision-recall
precompiled binaries
- download link / PDF documents
principal component analysis (PCA)
- about / Principal component analysis
- R, using / Using R for PCA
probability
- about / Probability theory and basic statistics
- space / Probability space and event
probability distributions
- R, using / Probability distributions using R
probability frequency function / Probability frequency function

Q

quantitative methods, linguistics
- about / Quantitative methods in linguistics
- document term matrix / Document term matrix

R

R
- using, for probability distributions / Probability distributions using R
- used, for singular value decomposition (SVD) implementation / Implementation of SVD using R
R, using for principal component analysis (PCA)
- about / Using R for PCA
- FactoMineR package / Understanding the FactoMineR package
- Amap package / Amap package
- proportion of variance / Proportion of variance
- scree plot function / Scree plot
random variables
- about / Random variables
- discrete random variables / Discrete random variables
RcmdrPlugin.temis package / RcmdrPlugin.temis
RCurl / HTTP
Receiver Operating Characteristics Curve (ROC)
- about / ROC curve
reducible error components
- dealing with / Dealing with reducible error components
regular expressions
- used, for processing text / Processing text using regular expressions
relation between words, quantifying
- about / Quantifying the relation between words
- contingency tables / Contingency tables
- detailed analysis, on textual collocations / Detailed analysis on textual collocations
RKEA package / RKEA
R Package, for topic modeling
- about / R Package for topic modeling
- LDA model, fitting with VEM algorithm / Fitting the LDA model with the VEM algorithm
R packages, text mining
- OpenNLP / OpenNLP
- RWeka / RWeka
- RcmdrPlugin.temis / RcmdrPlugin.temis
- tm / tm
- languageR / languageR
- koRpus / koRpus
- RKEA / RKEA
- maxent / maxent
- lsa / lsa
R tau package / Counting occurrences
RTextTools
- about / RTextTools: a text classification framework
RWeka package / RWeka

S

segmentation / Tokenization and segmentation
sensitivity / Confusion matrix
sentence / Word tokenization
sentence boundary detection
- about / Sentence boundary detection
- Word token annotator / Word token annotator
sentence completion feature / Sentence completion
singular value decomposition (SVD)
- about / Multiple correspondence analysis
- implementing, with R / Implementation of SVD using R
speech tagging
- components / Parts of speech tagging
- POS tagging, with R packages / POS tagging with R packages
state distribution / Hidden Markov models
statistics
- origin / Probability theory and basic statistics
strata / Stratified
stratified sampling / Stratified
Support vector machines (SVM)
- applying, on real world example / How to apply SVM on a real world example?

T

table(tags) / POS tagging with R packages
Term Document Matrix (TDM) / Dimensionality reduction
term frequency (TF) / Inverse document frequency
text
- accessing, from diverse sources / Accessing text from diverse sources
- accessing, from file system / File system
- accessing, from databases / Databases
- processing, with regular expressions / Processing text using regular expressions
- tokenization / Tokenization and segmentation
TextCat / Language detection
text clustering
- about / Text clustering
- feature selection / Feature selection for text clustering
text mining
- R packages / R packages for text mining
texts
- normalizing / Normalizing texts
- lemmatization / Lemmatization and stemming
- stemming / Stemming, Lemmatization
- synonyms / Synonyms
TF*IDF / Inverse document frequency
tm package / tm
tokenization
- about / Tokenization and segmentation
- word tokenization / Word tokenization
- document-term matrix, operations / Operations on a document-term matrix
- sentence segmentation / Sentence segmentation
tokens / Word tokenization
topic models
- using / Topic modeling
- Latent Dirichlet Allocation (LDA) / Latent Dirichlet Allocation
- Correlated topic model (CTM) / Correlated topic model
types / Word tokenization

U

utterance / Word tokenization

W

word form / Word tokenization
word sense / Word tokenization

Z

Zipf's law / Zipf's law
