Mining Data from Text: Modeling Text Using Natural Language Processing
Janani Ravi, Co-founder, Loonycorn
www.loonycorn.com

Overview
Expressing text as features
- One-hot, frequency-based, and prediction-based encoding
- Reducing dimensionality using hashing
- Bag-of-words and bag-of-n-grams modeling
- Converting text to features using the CountVectorizer, TfidfVectorizer, and HashingVectorizer

Prerequisites and Course Outline
Prerequisites
- Basic Python programming
- Basic understanding of machine learning
- High school math
Course Outline
- Modeling text to feed into ML models
- Building classification models using text
- Understanding and implementing topic modeling
- Understanding and implementing keyword extraction

Mining Data from Text
- Text as feature vectors: represent words as numbers (count, TF-IDF, embeddings)
- Topic modeling: NMF, LDA, and LSA - "What is this document about?"
- Keyword extraction: RAKE - "What is important in this document?"
- Classification of documents: "Positive or negative?" - sentiment analysis and many other uses
- Clustering of documents: "Find groups of similar documents" - trivial after topic modeling

Encoding Text Data in Numeric Form

Sentiment Analysis Using Neural Networks
"This is the worst restaurant in the metropolis, by a long way" -> -1 (not positive)
An ML-based classifier assigns a sentiment score to each document in the corpus. The input is in the form of sentences, i.e. text, but ML algorithms only process numeric input; they don't work with plain text.

Document as Word Sequence
d = "This is not the worst restaurant in the metropolis, not by a long way"
Model a document as an ordered sequence of words by tokenizing it into individual words:
("This", "is", "not", "the", "worst", "restaurant", "in", "the", "metropolis", "not", "by", "a", "long", "way")

Represent Each Word as a Number
Each word wi (text) is replaced by xi, some numeric encoding of that word.

Document as Tensor
d = [x0, x1, ... xn]
Represent each word as numeric data, then aggregate the values into a tensor.

The Big Question
How best can words be represented as numeric data?
xi = [?]    d = [[?], [?], ... [?]]
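To make the word-sequence idea concrete, here is a minimal Python sketch; the mapping from words to integers below is an arbitrary illustrative choice, not one of the encodings discussed in the rest of the module.

```python
# Minimal sketch: tokenize a document into an ordered word sequence, then
# replace each word with a numeric encoding (here just an arbitrary index).
d = "This is not the worst restaurant in the metropolis, not by a long way"

# Tokenize the document into individual words.
words = [w.strip(",.").lower() for w in d.split()]
# ['this', 'is', 'not', 'the', 'worst', 'restaurant', 'in', 'the',
#  'metropolis', 'not', 'by', 'a', 'long', 'way']

# Assign each distinct word some numeric encoding x_i (here: an integer id).
word_to_id = {w: i for i, w in enumerate(dict.fromkeys(words))}

# The document as a sequence of numbers: d = [x0, x1, ..., xn]
encoded = [word_to_id[w] for w in words]
print(encoded)
```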
Word Embeddings
Three families of encodings: one-hot, frequency-based, and prediction-based.

Documents and Corpus
Reviews:
- Amazing!
- Worst movie ever
- Two thumbs up
- Part 2 was bad, 3 the worst
- Up there with the greats
D = entire corpus; di = one document in the corpus

One-hot Encoding
Create a set of all words across the corpus:
amazing, worst, movie, ever, two, thumbs, up, part, was, bad, 3, the, there, with, greats
Express each review as a tuple of 1/0 elements over that vocabulary:
"Amazing!"         -> (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
"Worst movie ever" -> (0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
"Two thumbs up"    -> (0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

Flaws of One-hot Encoding
- Large vocabulary: enormous feature vectors
- Unordered: all context is lost
- Binary: frequency information is lost
One-hot encoding does NOT capture any semantic information or relationships between words.

Frequency-based Embeddings
Three frequency-based encodings: count, TF-IDF, and co-occurrence.

Count
Count vectors capture how often a word occurs in a document, i.e. the counts or the frequency.
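To make these two encodings concrete, here is a minimal pure-Python sketch using the toy review corpus above; the tokenize helper is an illustrative assumption, and the vocabulary order (alphabetical) differs from the slide.

```python
# Minimal sketch: one-hot and count encodings for the toy review corpus.
reviews = [
    "Amazing!",
    "Worst movie ever",
    "Two thumbs up",
    "Part 2 was bad, 3 the worst",
    "Up there with the greats",
]

def tokenize(text):
    # Lowercase and strip punctuation from each token.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

# Vocabulary: the set of all words across the corpus, in a fixed order.
vocabulary = sorted({w for review in reviews for w in tokenize(review)})

def one_hot(review):
    # 1 if the vocabulary word appears in the review, 0 otherwise.
    words = set(tokenize(review))
    return [1 if w in words else 0 for w in vocabulary]

def counts(review):
    # How many times each vocabulary word appears in the review.
    words = tokenize(review)
    return [words.count(w) for w in vocabulary]

print(vocabulary)
print(one_hot("Worst movie ever"))            # 1s at 'ever', 'movie', 'worst'
print(counts("Worst movie ever, worst ever"))  # 'worst' and 'ever' counted twice
```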
Flaws of Count Vectors
- Large vocabulary: enormous feature vectors
- Unordered: all context is lost
- Semantics and word relationships are lost

TF-IDF
TF-IDF captures how often a word occurs in a document as well as in the entire corpus. As before, the document is a tensor d = [x0, x1, ... xn], but each word is now represented by the product of two terms:
xi = tf(wi) x idf(wi)
Tf = term frequency; Idf = inverse document frequency.
- A word that occurs frequently in a single document might be important.
- A word that occurs frequently across the corpus is probably a common word like "a", "an", "the".

Evaluating TF-IDF
- Important advantages: frequency and relevance are captured
- One big drawback: context is still not captured

Co-occurrence
Similar words will occur together and will have similar context.
- Context window: a window centered around a word, which includes a certain number of neighboring words
- Co-occurrence: the number of times two words w1 and w2 have occurred together in a context window

Evaluating Co-occurrence Vectors
- Huge matrix size, huge memory: a V x V matrix for a vocabulary of size V; even restricting the columns to the top N words leaves a huge V x N matrix
- Preserves semantic relationships between words: "man" and "woman" are more likely to co-occur than "man" and "building"
- Dimensionality reduction of the co-occurrence matrix results in more accurate representations, efficient computation, and vectors that can be reused

Prediction-based Word Embeddings
Word embeddings are numerical representations of text which capture meanings and semantic relationships: "Words of a feather flock together." Two prediction tasks are used to learn them:
- Given the words in a word's context, predict the word
- Given a word, predict the words in its context
An ML-based algorithm is trained on a large corpus, e.g. Wikipedia; it is an unsupervised learning algorithm and can use neural networks and deep learning. The input is a word (e.g. "London") and the output is that word's embedding, a numeric vector of very low dimensionality, with similar words such as "New York" and "Paris" receiving similar embeddings.
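As a concrete illustration (not the course's own demo code), here is a minimal sketch of training prediction-based embeddings with the gensim library's Word2Vec implementation, assuming gensim 4.x; the toy corpus and hyperparameter values are illustrative only, since real embeddings are trained on a very large corpus such as Wikipedia.

```python
# Minimal sketch: learning prediction-based word embeddings with gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["london", "is", "a", "large", "city", "in", "england"],
    ["paris", "is", "a", "large", "city", "in", "france"],
    ["new", "york", "is", "a", "large", "city", "in", "america"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # very low dimensionality compared to one-hot vectors
    window=2,         # size of the context window
    min_count=1,      # keep every word, since the toy corpus is tiny
    sg=1,             # 1 = skip-gram (predict context from word), 0 = CBOW
)

print(model.wv["london"])                       # the embedding for "london"
print(model.wv.similarity("london", "paris"))   # similarity between two embeddings
```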
The Magic of Word Embeddings
Word embeddings capture meaning:
"Queen" ~ "King" == "Woman" ~ "Man"
"Paris" ~ "France" == "London" ~ "England"
They also give a dramatic dimensionality reduction. Embeddings are a way to encode words capturing the context around them; Word2Vec and GloVe are two widely used word embedding algorithms.

Hashing
- Hashing is a technique that allows you to look up specific values very quickly; it can also be used to perform dimensionality reduction.
- There is a fixed number of categories or buckets, and accessing a bucket is fast.
- A hash function f determines which bucket each value belongs to, so for any new value we know immediately which bucket it belongs to.
- Each value is hashed so that it falls into one of the buckets; a value can only belong to one bucket and always belongs to the same bucket.

Feature Hashing in Text
Apply a hash function to words to determine their location in the feature vector representing a document. This is fast and memory efficient but has no inverse transform. Feature hashing uses the "hashing trick" for dimensionality reduction.

Hashing as Dimensionality Reduction
- Dimensionality reduction: input is N-dimensional data, output is k-dimensional data, where k < N.
- Basic hashing: input is N-dimensional data, output is 1-dimensional data - the hash bucket the data maps to.
- Hashing can easily be extended to output any desired dimensionality k.
Thus hashing is a simple form of dimensionality reduction. However, very similar inputs may be mapped to very different hash values: hashing performs dimensionality reduction but does not keep similar data points together.

Bag-of-words and Bag-of-n-grams
A bag is a set with duplicates, i.e. a multiset.
- Bag-of-words model: any model that represents the document as a bag (multiset) of its constituent words, disregarding order but maintaining multiplicity.
- Examples of bag-of-words models: count vectorization, TF-IDF vectorization.
- Examples that are not bag-of-words models: one-hot encoding (no multiplicity), word embeddings.
- Bag-of-n-grams model: any model that represents the document as a bag (multiset) of its constituent n-grams, disregarding order but maintaining multiplicity.
- Bag-of-words models contain only individual words; bag-of-n-grams models contain n-grams and therefore store additional spatial information about words that occur together.

Demo
- Installing and setting up Python libraries
- Representing text data in numerical form with the CountVectorizer, TfidfVectorizer, and HashingVectorizer (a minimal sketch follows the summary below)

Summary
Expressing text as features
- One-hot, frequency-based, and prediction-based encoding
- Reducing dimensionality using hashing
- Bag-of-words and bag-of-n-grams modeling
- Converting text to features using the CountVectorizer, TfidfVectorizer, and HashingVectorizer
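As a rough companion to the demo (not the course's actual demo code), here is a minimal sketch of the three scikit-learn vectorizers on the toy review corpus; the ngram_range and n_features values are illustrative assumptions, and get_feature_names_out assumes scikit-learn 1.0 or later.

```python
# Minimal sketch: CountVectorizer, TfidfVectorizer, and HashingVectorizer
# applied to the toy review corpus used earlier in the module.
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfVectorizer,
    HashingVectorizer,
)

reviews = [
    "Amazing!",
    "Worst movie ever",
    "Two thumbs up",
    "Part 2 was bad, 3 the worst",
    "Up there with the greats",
]

# Bag-of-words: counts of each vocabulary word per document.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(reviews)
print(count_vec.get_feature_names_out())
print(X_counts.toarray())

# Bag-of-n-grams: unigrams and bigrams, weighted by TF-IDF.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf_vec.fit_transform(reviews)
print(X_tfidf.shape)

# Feature hashing: fixed, low dimensionality, no inverse transform.
hash_vec = HashingVectorizer(n_features=16)
X_hashed = hash_vec.fit_transform(reviews)
print(X_hashed.shape)   # (5, 16)
```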