Text Parser and Word Frequency using R

- March 15, 2016

I spent a little time looking for a simple parser for text files. The script below works and is straightforward:

Example Code


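 # Simple word frequency
 #
 # Read text, one element per line
 # (choose.files() is Windows-only; file.choose() is the portable alternative)
 # quote = "" stops scan() from treating quotation marks as field delimiters
 wordCount.text <- scan(file = choose.files(), what = character(), sep = '\n', quote = "")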

 # Collapse all lines into a single string
 wordCount.text <- paste(wordCount.text, collapse = " ")  

 # Convert to lower   
 wordCount.text <- tolower(wordCount.text)  

 # Split on runs of non-word characters (punctuation as well as white space)
 wordCount.words.list <- strsplit(wordCount.text, '\\W+', perl = TRUE)  

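 # Convert to vector and drop the empty string left when the text starts with punctuation
 wordCount.words.vector <- unlist(wordCount.words.list)
 wordCount.words.vector <- wordCount.words.vector[nzchar(wordCount.words.vector)]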

 # table method builds word count list   
 wordCount.freq.list <- table(wordCount.words.vector)  

 # Sorts word count list   
 wordCount.sorted.freq.list <- sort(wordCount.freq.list, decreasing = TRUE)  

 # Concatenates each word and its count, tab-separated, before export
 wordCount.sorted.table <- paste(names(wordCount.sorted.freq.list), wordCount.sorted.freq.list, sep = '\t')  

 # Output to file   
 cat('Word\tFREQ', wordCount.sorted.table, file = choose.files(), sep = '\n')

Example Results


 Word      FREQ  
 and       53  
 to        41  
 the       33  
 in        22  
 business  18  
 james     16  
 of        16  
 as        15  
 for       14  
 with      14  
 on        13  
 team      11  
 employee  10
 (..truncated for brevity)
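
For a non-interactive run, the same steps could be wrapped in a function that takes the lines directly instead of prompting for a file. This is only a sketch: the name `word_freq` and the sample sentences are made up for illustration and are not part of the original script.

```r
# Hypothetical helper (not from the original post): counts word frequencies
# in a character vector of lines, following the same steps as the script above.
word_freq <- function(lines) {
  # Join all lines into one lower-case string
  text <- tolower(paste(lines, collapse = " "))
  # Split on runs of non-word characters
  words <- unlist(strsplit(text, "\\W+", perl = TRUE))
  # Drop the empty string that appears when the text starts with punctuation
  words <- words[nzchar(words)]
  # Count and sort, most frequent first
  sort(table(words), decreasing = TRUE)
}

# Made-up sample input, for illustration only
sample_lines <- c("The team met the client.", "The client thanked the team.")
freq <- word_freq(sample_lines)

# Show the two most frequent words as a data frame
top <- head(freq, 2)
print(data.frame(Word = names(top), FREQ = as.vector(top)))
```

Lines read with `readLines()` or `scan()` can be passed straight to the function, which makes the counting step easy to reuse across several files.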
