Text Parser and Word Frequency using R

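I spent a little time looking for a simple parser for text files. The script below is straightforward and does the job: it reads a text file, splits it into words, counts how often each word occurs, and writes the sorted counts to a tab-separated file.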

Example Code

 # Simple word frequency   
 #   
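 # Read text, one element per line (choose.files() is Windows-only; file.choose() is the cross-platform alternative)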
 wordCount.text <- scan(file = choose.files(), what = 'char', sep = '\n')  

 # Collapse all lines into a single string
 wordCount.text <- paste(wordCount.text, collapse = " ")  

 # Convert to lower   
 wordCount.text <- tolower(wordCount.text)  

 # Split on runs of non-word characters (whitespace and punctuation)
 wordCount.words.list <- strsplit(wordCount.text, '\\W+', perl = TRUE)  

 # Convert the list returned by strsplit to a character vector
 wordCount.words.vector <- unlist(wordCount.words.list)  

 # table method builds word count list   
 wordCount.freq.list <- table(wordCount.words.vector)  

 # Sorts word count list   
 wordCount.sorted.freq.list <- sort(wordCount.freq.list, decreasing = TRUE)  

 # Concatenates each word and its count (tab-separated) before export
 wordCount.sorted.table <- paste(names(wordCount.sorted.freq.list), wordCount.sorted.freq.list, sep = '\t')  

 # Output to file   
 cat('Word\tFREQ', wordCount.sorted.table, file = choose.files(), sep = '\n')  

Example Results

 Word      FREQ  
 and       53  
 to        41  
 the       33  
 in        22  
 business  18  
 james     16  
 of        16  
 as        15  
 for       14  
 with      14  
 on        13  
 team      11  
 employee  10
 (... truncated for brevity)
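
For a quick check without any file dialogs, the same steps can be wrapped in a function and run on an in-memory character vector. This is only a minimal sketch: `word_freq` and `sample_lines` are names I made up for illustration and are not part of the script above. It also drops the empty token that `strsplit` returns when the text begins with a non-word character such as a quotation mark, which the original script would count as a word.

```r
# Hypothetical helper (not part of the original script): counts the words in a
# character vector of lines and returns a data frame sorted by frequency.
word_freq <- function(lines) {
  # Collapse the lines into one lower-case string
  text <- tolower(paste(lines, collapse = " "))

  # Split on runs of non-word characters and flatten to a character vector
  words <- unlist(strsplit(text, "\\W+", perl = TRUE))

  # Drop the empty token produced when the text starts with punctuation
  words <- words[nzchar(words)]

  # Count and sort, most frequent first
  counts <- sort(table(words), decreasing = TRUE)

  data.frame(Word = names(counts),
             FREQ = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample text, purely for illustration
sample_lines <- c('"The team and the business."', "The team met.")
freq <- word_freq(sample_lines)
print(freq)
```

The resulting data frame could then be exported with something like `write.table(freq, file = "word_freq.tsv", sep = "\t", quote = FALSE, row.names = FALSE)`, where the file name is just a placeholder.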
