Text Mining


Text mining is the process of summarizing a large amount of text into usable statistics. Vectorizing a document means turning that document into a vector, such as a list of the words found in the document paired with the number of times each word appears.
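As a minimal sketch of that idea in base R (no packages needed; the sample sentence is made up for illustration):

```r
# Vectorize one document by hand: lowercase it, split on whitespace,
# then count how often each word appears.
doc     <- "the cat sat on the mat"
words   <- unlist(strsplit(tolower(doc), "\\s+"))
wordVec <- table(words)
wordVec[["the"]]  # "the" appears twice
```

The packages introduced below automate exactly this kind of counting across many documents at once.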

Commonly Used Vocabulary and Functions

A vector space model is a way of representing a document, or a grouping of text, in vector form.

Stopwords are common words that are filtered out in a document so statistics concerning them will not be calculated. This allows for a clearer inspection of the more interesting words. An example of a stopword is "the".
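Stopword filtering can be sketched in base R as follows (the five-word stopword list here is invented for illustration; the tm package supplies a full list via stopwords("english")):

```r
# A tiny hand-picked stopword list -- illustrative only, not
# tm's actual stopwords("english") vector.
stops <- c("the", "a", "an", "is", "of")
words <- c("the", "dog", "chased", "a", "ball")
kept  <- words[!words %in% stops]  # drop any word found in the stopword list
kept                               # "dog" "chased" "ball"
```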

The following are some useful text mining functions from the tm and wordcloud packages (the example below also loads qdap, a related text mining package).

  • VectorSource(x): takes a character vector and treats each element of x as a separate document. There are many types of sources, but VectorSource() is the one made for character objects already in your R workspace.
  • VCorpus(x): takes a source object and creates a volatile corpus. In essence, a corpus is a collection of documents. Because the object is volatile, it is held entirely in memory: changes affect only the corresponding R object, and once that variable is destroyed, the corpus is destroyed with it.
  • TermDocumentMatrix(x, control = list()): takes a VCorpus object, x, and creates a matrix as a list object where the document names are the column names and the terms are the row names. Controls such as stopwords, tolower, etc. can be used to clean up the documents.
  • tm_map(x, FUN, ...): allows for the application of transformation functions such as tolower to each document within a corpus object, x.
  • removeSparseTerms(x, sparse): removes sparse terms (terms that appear in few documents) from a DocumentTermMatrix or TermDocumentMatrix. The required sparse argument, a number between 0 and 1, is the maximum sparsity allowed; terms whose proportion of zero entries exceeds it are dropped.
  • wordcloud(words, freq, max.words = Inf): turns words into a word cloud plot
    • words: the words to be analyzed for the cloud
    • freq: the frequency of the words
    • max.words: the maximum number of words to be used

Example

library(tm)
library(qdap)
library(wordcloud)
text <- c("This is just random text to serve as an example. Some of these words are really meaningless, but some of the words are meaningful. Text mining can be used in many fields of study. An example might be in emails to examine if an email is spam or not. When an email is sent to its destination, the text of the email will be analyzed by an algorithm to see wether or not the email should be placed in your inbox or spam folder. Another example could be looking for particular words or frequency of words used in tweets to see if messages have a positive, negative, or neutral tone.")

sourceVec <- VectorSource(text)                  # turn character into source object
corpVec <- VCorpus(sourceVec)                    # turn source object into corpus
corpVec <- tm_map(corpVec, removeWords, "words")  # remove the word 'words'
textTDM <- TermDocumentMatrix(corpVec,
                              control = list(tolower = TRUE,            # convert to lower case
                                             stopwords = TRUE,          # remove 'a', 'an', etc.
                                             removePunctuation = TRUE,
                                             wordLengths = c(3, Inf)))  # keep words of length
                                                                        # three or more

Now that we have cleaned up the text, let's turn the Term Document Matrix textTDM into an actual matrix and view the frequency of each word that appears in the example text.

textM <- as.matrix(textTDM)
textM

##              Docs
## Terms         1
##   algorithm   1
##   analyzed    1
##   another     1
##   can         1
##   destination 1
##   email       4
##   emails      1
##   examine     1
##   example     3
##   fields      1
##   folder      1
##   frequency   1
##   inbox       1
##   just        1
##   looking     1
##   many        1
##   meaningful  1
##   meaningless 1
##   messages    1
##   might       1
##   mining      1
##   negative    1
##   neutral     1
##   not         1
##   particular  1
##   placed      1
##   positive    1
##   random      1
##   really      1
##   see         2
##   sent        1
##   serve       1
##   spam        2
##   study       1
##   text        3
##   tone        1
##   tweets      1
##   used        2
##   wether      1
##   will        1

We can see that many words are used only once and a few are used more often. To summarize these results as a picture, let's first create a data frame and then plot a word cloud.

textFreqs <- rowSums(textM)
freqTab <- data.frame(term = names(textFreqs), num = textFreqs)
wordcloud(freqTab$term, freqTab$num, max.words = 100, colors = "green")

The resulting word cloud shows "email" as the largest word, with "example" and "text" equally sized.
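If a table is preferred over a picture, the same frequency vector can be sorted with base R. The counts below are copied from the term-document matrix output above, keeping only the words that appear more than once:

```r
# Named frequency vector; values taken from the matrix printed earlier.
topFreqs <- c(email = 4, example = 3, text = 3, see = 2, spam = 2, used = 2)
sort(topFreqs, decreasing = TRUE)  # "email" leads with 4 occurrences
```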
