
Text Parser and Word Frequency using R

I spent a little time looking for a simple text-file parser in R. The script below works and is straightforward:

Example Code

 # Simple word frequency   
 #   
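 # Read text, one element per line
 # (choose.files() is Windows-only; file.choose() is the portable alternative)
 wordCount.text <- scan(file = choose.files(), what = 'char', sep = '\n')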

 # Collapse all lines into a single string, separated by spaces
 wordCount.text <- paste(wordCount.text, collapse = " ")  

 # Convert to lower   
 wordCount.text <- tolower(wordCount.text)  

 # Split on runs of non-word characters (whitespace and punctuation)
 wordCount.words.list <- strsplit(wordCount.text, '\\W+', perl = TRUE)  

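 # Convert to vector and drop the empty token that appears
 # when the text starts with a non-word character
 wordCount.words.vector <- unlist(wordCount.words.list)
 wordCount.words.vector <- wordCount.words.vector[wordCount.words.vector != ""]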

 # table method builds word count list   
 wordCount.freq.list <- table(wordCount.words.vector)  

 # Sorts word count list   
 wordCount.sorted.freq.list <- sort(wordCount.freq.list, decreasing = TRUE)  

 # Concatenate each word and its count, tab-separated, before export
 wordCount.sorted.table <- paste(names(wordCount.sorted.freq.list), wordCount.sorted.freq.list, sep = '\t')  

 # Output to file   
 cat('Word\tFREQ', wordCount.sorted.table, file = choose.files(), sep = '\n')  
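
To try the same steps without picking files, here is a minimal sketch that wraps them in a function and runs it on a small in-memory sample. The function name `count_words` and the sample sentences are made up for illustration and are not part of the original script.

```r
# Hypothetical helper (not from the original post): counts words in a
# character vector of lines, using the same steps as the script above.
count_words <- function(lines) {
  # Collapse lines into one lower-case string
  text <- tolower(paste(lines, collapse = " "))
  # Split on runs of non-word characters
  words <- unlist(strsplit(text, "\\W+", perl = TRUE))
  # Drop the empty token produced by leading punctuation, if any
  words <- words[words != ""]
  # Build the frequency table and sort it, most frequent first
  sort(table(words), decreasing = TRUE)
}

# Made-up sample text, used only to exercise the function
sample_lines <- c("The team met the client.", "The client thanked the team!")
freq <- count_words(sample_lines)
print(freq)
```

Keeping the counting in a function makes it easy to check the logic on known input before pointing the script at a real file.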

Example Results

 Word      FREQ  
 and       53  
 to        41  
 the       33  
 in        22  
 business  18  
 james     16  
 of        16  
 as        15  
 for       14  
 with      14  
 on        13  
 team      11  
 employee  10
 (..truncated for brevity)
