Skip to main content

Decision Tree in R, with Graphs: Predicting State Politics from Big Five Traits

This was a continuation of prior explorations, logistic regression predicting Red/Blue state dichotomy by income or by personality. This uses the same five personality dimensions, but instead builds a decision tree. Of the Big Five traits, only two were found to useful in the decision tree, conscientiousness and openness.

Links to sample data, as well as to source references, are at the end of this entry.

Example Code


 # Decision Tree - Big Five and Politics  
 library("rpart")  
   
 # grow tree   
 input.dat <- read.table("BigFiveScoresByState.csv", header = TRUE, sep = ",")  
 fit <- rpart(Liberal ~ Openness + Conscientiousness + Neuroticism + Extraversion + Agreeableness, data = input.dat, method="poisson")  
   
 # display the results  
 printcp(fit)  
   
 # visualize cross-validation results    
 plotcp(fit)  
   
 # detailed summary of splits  
 summary(fit)   
   
 # plot tree   
 plot(fit, uniform = TRUE, main = "Classification Tree for Liberal")  
 text(fit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree   
 #post(fit, file = file.choose(), title = "Classification Tree for Liberal")  
   
 pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])  
   
 # plot the pruned tree   
 plot(pfit, uniform = TRUE, main = "Pruned Classification Tree for Liberal")  
 text(pfit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree   
 post(pfit, file = file.choose(), title = "Pruned Classification Tree for Liberal")  

Initial Results


 Rates regression tree:  
 rpart(formula = Liberal ~ Openness + Conscientiousness + Neuroticism +   
   Extraversion + Agreeableness, data = input.dat, method = "poisson")  
   
 Variables actually used in tree construction:  
 [1] Conscientiousness Openness       
   
 Root node error: 35.019/48 = 0.72956  
   
 n=48 (2 observations deleted due to missingness)  
   
     CP nsplit rel error xerror   xstd  
 1 0.36104   0  1.00000 1.03391 0.024507  
 2 0.13203   1  0.63896 0.70221 0.134896  
 3 0.01000   2  0.50692 0.63244 0.137426  
 > # detailed summary of splits  
 + summary(fit)   
 +   
 Call:  
 rpart(formula = Liberal ~ Openness + Conscientiousness + Neuroticism +   
   Extraversion + Agreeableness, data = input.dat, method = "poisson")  
  n=48 (2 observations deleted due to missingness)  
   
      CP nsplit rel error  xerror    xstd  
 1 0.3610441   0 1.0000000 1.0339150 0.02450674  
 2 0.1320324   1 0.6389559 0.7022068 0.13489629  
 3 0.0100000   2 0.5069236 0.6324368 0.13742579  
   
 Variable importance  
 Conscientiousness     Openness    Neuroticism   Agreeableness   
         43        16        15        14   
    Extraversion   
         11   
   
 Node number 1: 48 observations,  complexity param=0.3610441  
  events=20, estimated rate=0.4166667 , mean deviance=0.7295573   
  left son=2 (19 obs) right son=3 (29 obs)  
  Primary splits:  
    Conscientiousness < 53.7 to the right, improve=13.061310, (0 missing)  
    Openness     < 48.2 to the left, improve= 5.008984, (0 missing)  
    Agreeableness   < 45  to the right, improve= 4.720179, (0 missing)  
    Neuroticism    < 60.9 to the left, improve= 2.936428, (0 missing)  
    Extraversion   < 44.85 to the right, improve= 1.560031, (0 missing)  
  Surrogate splits:  
    Neuroticism  < 45.9 to the left, agree=0.729, adj=0.316, (0 split)  
    Agreeableness < 59.2 to the right, agree=0.708, adj=0.263, (0 split)  
    Extraversion < 54.9 to the right, agree=0.688, adj=0.211, (0 split)  
    Openness   < 60.3 to the right, agree=0.625, adj=0.053, (0 split)  
   
 Node number 2: 19 observations  
  events=1, estimated rate=0.09345794 , mean deviance=0.3311521   
   
 Node number 3: 29 observations,  complexity param=0.1320324  
  events=19, estimated rate=0.6369427 , mean deviance=0.5546051   
  left son=6 (13 obs) right son=7 (16 obs)  
  Primary splits:  
    Openness     < 47.35 to the left, improve=4.7031650, (0 missing)  
    Conscientiousness < 40.9 to the right, improve=2.2056170, (0 missing)  
    Agreeableness   < 45.25 to the right, improve=1.3349930, (0 missing)  
    Extraversion   < 44.95 to the right, improve=0.4743020, (0 missing)  
    Neuroticism    < 48.8 to the right, improve=0.2903434, (0 missing)  
  Surrogate splits:  
    Conscientiousness < 46.15 to the right, agree=0.724, adj=0.385, (0 split)  
    Agreeableness   < 50.75 to the right, agree=0.690, adj=0.308, (0 split)  
    Neuroticism    < 49.25 to the left, agree=0.655, adj=0.231, (0 split)  
    Extraversion   < 44.95 to the right, agree=0.655, adj=0.231, (0 split)  
   
 Node number 6: 13 observations  
  events=4, estimated rate=0.3246753 , mean deviance=0.7262304   
   
 Node number 7: 16 observations  
  events=15, estimated rate=0.8695652 , mean deviance=0.1261841


# plot tree plot(fit, uniform = TRUE, main = "Classification Tree for Liberal") text(fit, use.n = TRUE, all = TRUE, cex = .8)

Fitted Results

 #prune the tree  
 pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])  
   
 # plot the pruned tree   
 plot(pfit, uniform = TRUE, main = "Pruned Classification Tree for Liberal")  
 text(pfit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree - creates.ps, converted to .pdf, and then cropped for image (below)
 post(pfit, file = file.choose(), title = "Pruned Classification Tree for Liberal")



Divided we stand: Three psychological regions of the United States and their political, economic, social, and health correlates Abstract, PDF, PDF (copy)

List of United States presidential election results by state

Sample Data

Popular posts from this blog

Chi-Square in R on by State Politics (Red/Blue) and Income (Higher/Lower)

This is a significant result, but instead of a logistic regression looking at the income average per state and the likelihood of being a Democratic state, it uses Chi-Square. Interpreting this is pretty straightforward, in that liberal states typically have cities and people that earn more money. When using adjusted incomes, by cost of living, this difference disappears.

Example Code
# R - Chi Square rm(list = ls()) stateData <- read.table("CostByStateAndSalary.csv", header = TRUE, sep = ",") # Create vectors affluence.median <- median(stateData$Y2014, na.rm = TRUE) affluence.v <- ifelse(stateData$Y2014 > affluence.median, 1, 0) liberal.v <- stateData$Liberal # Solve pol.Data = table(liberal.v, affluence.v) result <- chisq.test(pol.Data) print(result) print(pol.Data)
Example Results
Pearson's Chi-squared test with Yates' continuity correction data: pol.Data X-squared = 12.672, df …

Mean Median, and Mode with R, using Country-level IQ Estimates

Reusing the code posted for Correlations within with Hofstede's Cultural Values, Diversity, GINI, and IQ, the same data can be used for mean, median, and mode. Additionally, the summary function will return values in addition to mean and median, Min, Max, and quartile values:

Example Code
oecdData <- read.table("OECD - Quality of Life.csv", header = TRUE, sep = ",") v1 <- oecdData$IQ # Mean with na.rm = TRUE removed NULL avalues mean(v1, na.rm = TRUE) # Median with na.rm = TRUE removed NULL values median(v1, na.rm = TRUE) # Returns the same data as mean and median, but also includes distribution values: # Min, Quartiles, and Max summary(v1) # Mode does not exist in R, so we need to create a function getmode <- function(v) { uniqv <- unique(v) uniqv[which.max(tabulate(match(v, uniqv)))] } #returns the mode getmode(v1)
Example Results
> oecdData <- read.table("OECD - Quality of L…