
OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting): Getting Data

OSEMN is an acronym for the elements of data science:
  • Obtaining data
  • Scrubbing data
  • Exploring data
  • Modeling data
  • iNterpreting data
As part of the Obtaining Data portion of the "Data science is OSEMN" module, I worked through the examples myself, along with some variants.

Example Code


 # Data science is OSEMN  
 """http://people.duke.edu/~ccc14/sta-663/DataProcessingSolutions.html#data-science-is-osemn"""  
   
 # Acquiring data 
 
 """Plain text files  
 We can open plain text files with the open function. This is a common 
 and very flexible format, but because no structure is involved, custom 
 processing methods to extract the information needed may be necessary.  
 Example 1: Suppose we want to find out how often the words alice and 
 drink occur in the same sentence in Alice in Wonderland. This example 
 uses Gutenberg: http://www.gutenberg.org/wiki/Main_Page, a good source 
 for files""" 
 
 # the data file is linked below  
 text = open('Data/pg989.txt', 'r').read().replace('\r\n', ' ').lower()  
   
 # sentence extraction and frequency  
 import re  
 punctuation = r'\.|\?|\!'  
 sentences = re.split(punctuation, text)  
 len(sentences)  
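
Example 1 above asks how often alice and drink occur in the same sentence, but the snippet stops at counting sentences. A minimal sketch of the final step, using an inline sample string in place of pg989.txt (which may not be on hand); note that plain substring tests would also match words like "drinking":

```python
import re

# inline sample standing in for the Gutenberg text (pg989.txt)
sample = "Alice took a drink. Alice waited. The bottle said drink me, and Alice did!"
text = sample.lower()

# split on sentence-ending punctuation, as in the snippet above
sentences = re.split(r'\.|\?|\!', text)

# keep sentences mentioning both words
both = [s for s in sentences if 'alice' in s and 'drink' in s]
print(len(both))  # number of sentences where alice and drink co-occur
```

On the full Gutenberg text, the same list comprehension applies unchanged to the `sentences` list built above.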
   
 # word extraction and frequency  
 import re  
 words = re.split(' ', text)  
 len(words)  
   
 # count word frequencies with a Counter (a dict subclass)  
 import collections as coll  
 counter = coll.Counter(words)  
 len(counter)  
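
Counter can also report the most frequent words directly via its most_common method; a small self-contained sketch with toy data:

```python
import collections as coll

# toy word list standing in for the tokenized book text
words = ['alice', 'drink', 'the', 'alice', 'the', 'the']
counter = coll.Counter(words)

# top-2 words by frequency, most frequent first
print(counter.most_common(2))  # [('the', 3), ('alice', 2)]
```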
   
 # if looking to exclude extraneous terms (the empty string comes from splitting)  
 exclude = ('the', 'of', 'to', 'and', 'that', 'in', ''
 , 'a', 'as', 'is', 'for', 'not', 'by', 'he', 'it'
 , 'which', 'with', 'or', 'i', 'from')  
 trimmedWords = [word for word in words if word not in exclude]  
 trimmedWordCounter = coll.Counter(trimmedWords)  
   
 # Reading a text file line by line, skipping commented rows  
 import csv  
 with open('Data/pg989.txt') as f:  
   for line in csv.reader(row for row in f if not row.startswith('#')):  
     if len(line) > 0:  
       print(line)  
   
 # Reading a csv line by line  
 import csv  
 with open('Data/OECD - Quality of Life.csv') as f:  
   for line in csv.reader(f):  
     if len(line) == 0:  
       continue  
     (Country, HofstederPowerDx, HofstederIndividuality, HofstederMasculinity
     , HofstederUncertaintyAvoidance, Diversity_Ethnic, Diversity_Linguistic
     , Diversity_Religious, ReligionMatters, Protestantism, Religiosity, IQ
     , Gini, Employment, Unemployment, EduReading, EduScience, TertiaryEdu
     , LifeExpectancy, InfantDeath, Obesity, HoursWorked, Prison
     , Carvandalism, Cartheft, Theftfromcar, Motorcycletheft, Bicycletheft
     , Assaultsandthreats, Sexualincidents, Burglaries, Robberies) = line  
     print(Country)  
   
 # Load into a data frame  
 import pandas as pd  
 df = pd.read_csv('Data/OECD - Quality of Life.csv', header=0)  
 df  
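
pd.read_csv can also parse an in-memory buffer, which is handy for trying the same call when the OECD file is not on hand; the columns below are a hypothetical subset of the field names unpacked above:

```python
import io
import pandas as pd

# small in-memory stand-in for 'Data/OECD - Quality of Life.csv'
csv_text = """Country,LifeExpectancy,Gini
Australia,82.1,0.33
Chile,78.9,0.50
"""

# same read_csv call, but fed a file-like buffer instead of a path
df = pd.read_csv(io.StringIO(csv_text), header=0)
print(df.shape)                  # (rows, columns)
print(df['Country'].tolist())    # quick sanity check on one column
```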


Sample Data
