Decision Tree in R, with Graphs: Predicting State Politics from Big Five Traits

This was a continuation of prior explorations, logistic regression predicting Red/Blue state dichotomy by income or by personality. This uses the same five personality dimensions, but instead builds a decision tree. Of the Big Five traits, only two were found to useful in the decision tree, conscientiousness and openness.

Links to sample data, as well as to source references, are at the end of this entry.

Example Code


 # Decision Tree - Big Five and Politics  
 library("rpart")  
   
 # grow tree   
 input.dat <- read.table("BigFiveScoresByState.csv", header = TRUE, sep = ",")  
 fit <- rpart(Liberal ~ Openness + Conscientiousness + Neuroticism + Extraversion + Agreeableness, data = input.dat, method="poisson")  
   
 # display the results  
 printcp(fit)  
   
 # visualize cross-validation results    
 plotcp(fit)  
   
 # detailed summary of splits  
 summary(fit)   
   
 # plot tree   
 plot(fit, uniform = TRUE, main = "Classification Tree for Liberal")  
 text(fit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree   
 #post(fit, file = file.choose(), title = "Classification Tree for Liberal")  
   
 pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])  
   
 # plot the pruned tree   
 plot(pfit, uniform = TRUE, main = "Pruned Classification Tree for Liberal")  
 text(pfit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree   
 post(pfit, file = file.choose(), title = "Pruned Classification Tree for Liberal")  

Initial Results


 Rates regression tree:  
 rpart(formula = Liberal ~ Openness + Conscientiousness + Neuroticism +   
   Extraversion + Agreeableness, data = input.dat, method = "poisson")  
   
 Variables actually used in tree construction:  
 [1] Conscientiousness Openness       
   
 Root node error: 35.019/48 = 0.72956  
   
 n=48 (2 observations deleted due to missingness)  
   
     CP nsplit rel error xerror   xstd  
 1 0.36104   0  1.00000 1.03391 0.024507  
 2 0.13203   1  0.63896 0.70221 0.134896  
 3 0.01000   2  0.50692 0.63244 0.137426  
 > # detailed summary of splits  
 + summary(fit)   
 +   
 Call:  
 rpart(formula = Liberal ~ Openness + Conscientiousness + Neuroticism +   
   Extraversion + Agreeableness, data = input.dat, method = "poisson")  
  n=48 (2 observations deleted due to missingness)  
   
      CP nsplit rel error  xerror    xstd  
 1 0.3610441   0 1.0000000 1.0339150 0.02450674  
 2 0.1320324   1 0.6389559 0.7022068 0.13489629  
 3 0.0100000   2 0.5069236 0.6324368 0.13742579  
   
 Variable importance  
 Conscientiousness     Openness    Neuroticism   Agreeableness   
         43        16        15        14   
    Extraversion   
         11   
   
 Node number 1: 48 observations,  complexity param=0.3610441  
  events=20, estimated rate=0.4166667 , mean deviance=0.7295573   
  left son=2 (19 obs) right son=3 (29 obs)  
  Primary splits:  
    Conscientiousness < 53.7 to the right, improve=13.061310, (0 missing)  
    Openness     < 48.2 to the left, improve= 5.008984, (0 missing)  
    Agreeableness   < 45  to the right, improve= 4.720179, (0 missing)  
    Neuroticism    < 60.9 to the left, improve= 2.936428, (0 missing)  
    Extraversion   < 44.85 to the right, improve= 1.560031, (0 missing)  
  Surrogate splits:  
    Neuroticism  < 45.9 to the left, agree=0.729, adj=0.316, (0 split)  
    Agreeableness < 59.2 to the right, agree=0.708, adj=0.263, (0 split)  
    Extraversion < 54.9 to the right, agree=0.688, adj=0.211, (0 split)  
    Openness   < 60.3 to the right, agree=0.625, adj=0.053, (0 split)  
   
 Node number 2: 19 observations  
  events=1, estimated rate=0.09345794 , mean deviance=0.3311521   
   
 Node number 3: 29 observations,  complexity param=0.1320324  
  events=19, estimated rate=0.6369427 , mean deviance=0.5546051   
  left son=6 (13 obs) right son=7 (16 obs)  
  Primary splits:  
    Openness     < 47.35 to the left, improve=4.7031650, (0 missing)  
    Conscientiousness < 40.9 to the right, improve=2.2056170, (0 missing)  
    Agreeableness   < 45.25 to the right, improve=1.3349930, (0 missing)  
    Extraversion   < 44.95 to the right, improve=0.4743020, (0 missing)  
    Neuroticism    < 48.8 to the right, improve=0.2903434, (0 missing)  
  Surrogate splits:  
    Conscientiousness < 46.15 to the right, agree=0.724, adj=0.385, (0 split)  
    Agreeableness   < 50.75 to the right, agree=0.690, adj=0.308, (0 split)  
    Neuroticism    < 49.25 to the left, agree=0.655, adj=0.231, (0 split)  
    Extraversion   < 44.95 to the right, agree=0.655, adj=0.231, (0 split)  
   
 Node number 6: 13 observations  
  events=4, estimated rate=0.3246753 , mean deviance=0.7262304   
   
 Node number 7: 16 observations  
  events=15, estimated rate=0.8695652 , mean deviance=0.1261841


# plot tree plot(fit, uniform = TRUE, main = "Classification Tree for Liberal") text(fit, use.n = TRUE, all = TRUE, cex = .8)

Fitted Results

 #prune the tree  
 pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])  
   
 # plot the pruned tree   
 plot(pfit, uniform = TRUE, main = "Pruned Classification Tree for Liberal")  
 text(pfit, use.n = TRUE, all = TRUE, cex = .8)  
   
 # create attractive postscript plot of tree - creates.ps, converted to .pdf, and then cropped for image (below)
 post(pfit, file = file.choose(), title = "Pruned Classification Tree for Liberal")



Divided we stand: Three psychological regions of the United States and their political, economic, social, and health correlates Abstract, PDF, PDF (copy)

List of United States presidential election results by state

Sample Data

Comments

Popular posts from this blog

Charting Correlation Matrices in R

Attractive Confusion Matrices in R Plotted with fourfoldplot

Calculating Value at Risk (VaR) with Python or R