Mining Data from Text
MODELING TEXT USING NATURAL LANGUAGE PROCESSING
Janani Ravi
CO-FOUNDER, LOONYCORN
www.loonycorn.com
Expressing text as features
Overview
One-hot, frequency-based, and prediction-based encoding
Reducing dimensionality using hashing
Bag-of-words and bag-of-n-grams modeling
Converting text to features using the CountVectorizer, TfidfVectorizer, and HashingVectorizer
Prerequisites and Course Outline
Prerequisites
Basic Python programming
Basic understanding of machine learning
High school math
Course Outline
Modeling text to feed into ML models
Building classification models using text
Understanding and implementing topic modeling
Understanding and implementing keyword extraction
Mining Data from Text
Text as Feature Vectors - represent words as numbers (Count, Tf-Idf, Embeddings)
Topic Modeling - “What is this document about?” (NMF, LDA and LSA)
Keyword Extraction - “What is important in this document?” (RAKE)
Classification of Documents - “Positive or negative?” (sentiment analysis and many other uses)
Clustering of Documents - “Find groups of similar documents” (trivial after topic modeling)
Encoding Text Data in Numeric Form
Sentiment Analysis Using Neural Networks
“This is the worst restaurant in the metropolis, by a long way”
ML-based Classifier output: -1 (Not Positive)
Input is in the form of sentences, i.e. text, drawn from a corpus
ML algorithms only process numeric input; they don’t work with plain text
Document as Word Sequence
Model a document as an ordered sequence of words
d = “This is not the worst restaurant in the metropolis, not by a long way”
Tokenize document into individual words:
(“This”, “is”, “not”, “the”, “worst”, “restaurant”, “in”, “the”, “metropolis”, “not”, “by”, “a”, “long”, “way”)
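As a rough sketch (not the course demo), tokenization can be done with a simple regular expression; the pattern below is an assumption for illustration, and a library tokenizer such as NLTK's word_tokenize could be used instead.

```python
import re

d = ("This is not the worst restaurant in the metropolis, "
     "not by a long way")

# Split the document into word tokens (letters and digits only)
tokens = re.findall(r"[A-Za-z0-9]+", d)
print(tokens)
# ['This', 'is', 'not', 'the', 'worst', 'restaurant', 'in', 'the',
#  'metropolis', 'not', 'by', 'a', 'long', 'way']
```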
Represent Each Word as a Number
Each word wi (text) gets some numeric encoding xi, for i = 0, 1, …, n
Document as Tensor
d = [x0, x1, …, xn]
Represent each word as numeric data, aggregate into a tensor
The Big Question
xi = [?]
d = [[?], [?], …, [?]]
How best can words be represented as numeric data?
Word Embeddings
One-hot
Frequency-based
Prediction-based
Documents and Corpus
Reviews
Amazing!
Worst movie ever
Two thumbs up
Part 2 was bad, 3 the worst
Up there with the greats
D = Entire corpus
di = One document in corpus
One-hot Encoding
Reviews:
Amazing!
Worst movie ever
Two thumbs up
Part 2 was bad, 3 the worst
Up there with the greats
All words: amazing, worst, movie, ever, two, thumbs, up, Part, was, bad, 3, the, there, with, greats
Create a set of all words (all across the corpus)
One-hot Encoding
All words: amazing, worst, movie, ever, two, thumbs, up, Part, was, bad, 3, the, there, with, greats
Amazing!         = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Worst movie ever = (0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Two thumbs up    = (0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
Express each review as a tuple of 1,0 elements
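A minimal sketch of this encoding with scikit-learn's CountVectorizer and binary=True (assuming scikit-learn ≥ 1.0; the feature order comes out alphabetical rather than the order shown above, and the custom token_pattern is an assumption so single-character tokens like "2" and "3" are kept):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["Amazing!",
           "Worst movie ever",
           "Two thumbs up",
           "Part 2 was bad, 3 the worst",
           "Up there with the greats"]

# binary=True records presence/absence (1/0) instead of counts;
# the token_pattern keeps one-character tokens such as "2" and "3"
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is one review as a tuple of 1/0 elements
```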
Flaws of One-hot Encoding
Large vocabulary - enormous feature vectors
Unordered - lost all context
Binary - lost frequency information
One-hot encoding does NOT capture any semantic information or relationship between words
Frequency-based Embedding
Word Embeddings
One-hot
Frequency-based
Prediction-based
Frequency-based Embeddings
Count
TF-IDF
Co-occurrence
Frequency-based Embeddings
Count
TF-IDF
Co-occurrence
Captures how often a word occurs in a document, i.e. the counts or the frequency
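A quick illustration of a count vector in plain Python; a vectorizer such as CountVectorizer does the same bookkeeping over a whole corpus (the cleanup below is a simplification):

```python
from collections import Counter

d = "This is not the worst restaurant in the metropolis, not by a long way"

# Lowercase, strip the comma, and split into words (simplified cleanup)
tokens = d.lower().replace(",", "").split()

# Count how often each word occurs in this document
counts = Counter(tokens)
print(counts["not"], counts["the"], counts["worst"])  # 2 2 1
```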
Flaws of Count Vectors
Large vocabulary - enormous feature
vectors
Unordered - lost all context
Semantics and word relationships lost
Frequency-based Embeddings
Count
TF-IDF
Co-occurrence
Captures how often a word occurs in a document as well as in the entire corpus
Document as Tensor
d = [x0, x1, …, xn]
Represent each word as numeric data, aggregate into a tensor
Tf-Idf
xi = tf(wi) × idf(wi)
Represent each word with the product of two terms - tf and idf
Tf = Term Frequency; Idf = Inverse Document Frequency
Tf-Idf
Occurs frequently in a single document - might be important
Occurs frequently across the corpus - probably a common word like “a”, “an”, “the”
Evaluating Tf-Idf
Important advantages
- Frequency and relevance captured
One big drawback
- Context still not captured
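As a sketch of the xi = tf(wi) × idf(wi) formula, using a plain idf(w) = log(N / df(w)); this exact idf variant is an assumption for illustration, since library implementations such as scikit-learn's TfidfVectorizer add normalization and smoothing:

```python
import math

corpus = [["amazing"],
          ["worst", "movie", "ever"],
          ["two", "thumbs", "up"],
          ["part", "2", "was", "bad", "3", "the", "worst"],
          ["up", "there", "with", "the", "greats"]]

def tf(word, doc):
    # term frequency: how often the word occurs in this document
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # inverse document frequency: words rare across the corpus score higher
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df)

doc = corpus[3]  # "Part 2 was bad, 3 the worst"
print(round(tf("bad", doc) * idf("bad", corpus), 3))  # 0.23  (rare in the corpus)
print(round(tf("the", doc) * idf("the", corpus), 3))  # 0.131 (common in the corpus)
```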
Frequency-based Embeddings
Count
TF-IDF
Co-occurrence
Similar words will occur
together and will have similar
context
Context Window
A window centered around a word, which includes a
certain number of neighboring words
Co-occurrence
The number of times two words w1 and w2 have
occurred together in a context window
Evaluating Co-occurrence Vectors
Huge matrix size, huge memory
V x V matrix for vocabulary size V
Use top N words, matrix V x N still huge
Evaluating Co-occurrence Vectors
Preserves semantic relationship between words
- “man” and “woman” more likely to co-occur than “man” and “building”
Dimensionality reduction results in more accurate representations
Efficient computation, can be reused
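A minimal sketch of counting co-occurrences within a context window of ±2 words; the toy corpus and window size are assumptions for illustration:

```python
from collections import defaultdict

corpus = [
    "man and woman walked home",
    "the woman met the man",
    "the man entered the building",
]
window = 2  # context window: up to 2 neighbouring words on each side

cooc = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # Count each pair of words that fall inside the same window
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            cooc[tuple(sorted((word, tokens[j])))] += 1

print(cooc[("man", "woman")])       # 1 - they share a context window
print(cooc[("building", "woman")])  # 0 - they never share one
```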
Word Embeddings
One-hot
Frequency-based
Prediction-based
Word Embeddings
Numerical representations of text
which capture meanings and
semantic relationships
“Words of a feather flock together” (a play on “birds of a feather”)
Given words from its context,
predict the word
Given a word, predict the
words in its context
Prediction-based Word Embeddings
[Diagram: an ML-based algorithm, trained on a large corpus, maps each word (“London”, “Paris”, “New York”, …) to a numeric embedding]
Input = word
Output = word embedding
Very low dimensionality
Large corpus, e.g. Wikipedia
Unsupervised learning algorithm
Can use NN, deep learning
Magic
Word embeddings capture meaning
“Queen” ~ “King” == “Woman” ~ “Man”
“Paris” ~ “France” == “London” ~ “England”
Dramatic dimensionality reduction
Embeddings are a way to
encode words capturing the
context around them
Word2Vec and GloVe
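For example, a prediction-based embedding can be trained with gensim's Word2Vec implementation; gensim (≥ 4.0) and the toy corpus here are assumptions for illustration, and a real model would be trained on a large corpus such as Wikipedia:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences (a real model needs a huge corpus)
sentences = [
    ["london", "is", "a", "city", "in", "england"],
    ["paris", "is", "a", "city", "in", "france"],
    ["new", "york", "is", "a", "city", "in", "america"],
]

# sg=1 selects skip-gram: given a word, predict the words in its context
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)

print(model.wv["london"].shape)                # (50,) - a low-dimensional dense vector
print(model.wv.similarity("london", "paris"))  # cosine similarity of the two embeddings
```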
Hashing
Hashing
A technique that allows you to look up specific values very quickly
Hashing
Can also be used to perform dimensionality reduction
Hashing
Have a fixed number of categories or buckets
Hashing
Accessing a bucket is fast
Hashing
A hash function f determines which bucket each value belongs to
For any new value we know immediately which bucket it belongs to
Each value is hashed so it falls in one of these buckets
A value can only belong to one bucket and always belongs to the same bucket
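A tiny sketch of the bucket idea using Python's built-in hash (note that string hashing is randomized across interpreter runs unless PYTHONHASHSEED is fixed; production feature hashing uses a stable hash such as MurmurHash):

```python
N_BUCKETS = 8  # fixed number of buckets

def bucket(value, n_buckets=N_BUCKETS):
    # The hash function determines which bucket a value belongs to;
    # the same value always lands in the same bucket within a run.
    return hash(value) % n_buckets

for word in ["amazing", "worst", "movie", "amazing"]:
    print(word, "->", bucket(word))
# "amazing" falls in the same bucket both times
```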
Feature Hashing in Text
Apply a hash function to words to determine their
location in the feature vector representing a
document. Fast and memory efficient but has no
inverse transform.
Feature hashing uses the
“hashing trick” for
dimensionality reduction
Dimensionality Reduction
Input: N-dimensional data
Output: k-dimensional data
Where k < N
Hashing
Input: N-dimensional data
Output: 1-dimensional data
Output is the hash bucket the data
maps to
Hashing
Input: N-dimensional data
Output: k-dimensional data
Can easily extend hashing to output
desired dimensionality
Hashing
Thus hashing is a simple form of
dimensionality reduction
However, very similar inputs may be
mapped to very different hash values
Hashing performs dimensionality
reduction but does not keep similar
data points together
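A minimal sketch with scikit-learn's HashingVectorizer (used in the demo); n_features fixes the output dimensionality, and there is no inverse transform back to words. alternate_sign=False is an assumption here to keep the example to plain counts:

```python
from sklearn.feature_extraction.text import HashingVectorizer

reviews = ["Amazing!",
           "Worst movie ever",
           "Two thumbs up"]

# Every word is hashed to one of 16 feature positions,
# no matter how large the vocabulary grows
vectorizer = HashingVectorizer(n_features=16, alternate_sign=False)
X = vectorizer.fit_transform(reviews)

print(X.shape)         # (3, 16)
print(X.toarray()[0])  # hashed representation of "Amazing!"
```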
Bag-of-words and Bag-of-n-grams
Bag-based Models for Text
Bag-of-words
Bag of n-grams
Bag-of-words Model
Any model that represents the document as a bag
(multiset) of its constituent words, disregarding order
but maintaining multiplicity
Bag-of-words Models
Examples of bag-of-words models
- Count Vectorization
- TF-IDF Vectorization
Examples that are not bag-of-words models
- One-hot encoding (no multiplicity)
- Word embeddings
Bag-of-n-grams Model
Any model that represents the document as a bag
(multiset) of its constituent n-grams, disregarding
order but maintaining multiplicity
Bag-of-n-grams
A bag is a set with duplicates (i.e. a multiset)
Bag-of-words models contain only
individual words
Bag-of-n-gram models contain n-grams
Bag-of-n-grams
An n-gram model stores additional spatial information about a word - which words occur together with it
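A minimal sketch contrasting bag-of-words with bag-of-n-grams via CountVectorizer's ngram_range parameter (settings here are illustrative, not necessarily the demo's):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was not bad", "this movie was not good"]

# Bag-of-words: unigrams only, word order discarded
bow = CountVectorizer(ngram_range=(1, 1))
print(bow.fit(docs).get_feature_names_out())
# ['bad' 'good' 'movie' 'not' 'this' 'was']

# Bag-of-n-grams: unigrams plus bigrams, so word pairs that occur
# together are kept, e.g. 'not bad' vs 'not good'
bong = CountVectorizer(ngram_range=(1, 2))
print(bong.fit(docs).get_feature_names_out())
```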
Demo
Installing and setting up Python
libraries
Demo
Representing text data in numerical
form
- CountVectorizer
- TfidfVectorizer
- HashingVectorizer
Expressing text as features
Summary
One-hot, frequency-based, and prediction-based encoding
Reducing dimensionality using hashing
Bag-of-words and bag-of-n-grams modeling
Converting text to features using the CountVectorizer, TfidfVectorizer, and HashingVectorizer