OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting): Getting Data

OSEMN is an acronym for the elements of data science:
  • Obtaining data
  • Scrubbing data
  • Exploring data
  • Modeling data
  • iNterpreting data
As part of the "Data science is OSEMN" module, I worked through the Obtaining Data examples myself and also tried some variants.

Example Code


 # Data science is OSEMN  
 """http://people.duke.edu/~ccc14/sta-663/DataProcessingSolutions.html#data-science-is-osemn"""  
   
 # Acquiring data 
 
 """Plain text files  
 We can open plain text files with the open function. This is a common 
 and very flexible format, but because no structure is involved, custom 
 processing methods to extract the information needed may be necessary.  
 Example 1: Suppose we want to find out how often the words alice and 
 drink occur in the same sentence in Alice in Wonderland. This example 
 uses Gutenberg: http://www.gutenberg.org/wiki/Main_Page, a good source 
 for files""" 
 
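 # the data file is linked below
 # (raw string so the backslash in the Windows path is not read as an escape)
 with open(r'Data\pg989.txt', 'r') as f:
   # collapse line breaks (\r\n or \n) into spaces and lowercase the text
   text = f.read().replace('\r\n', ' ').replace('\n', ' ').lower()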
   
 # sentence extraction and count
 import re
 punctuation = r'[.?!]'  # sentence-ending punctuation
 sentences = re.split(punctuation, text)
 len(sentences)
   
 # word extraction and count
 words = text.split(' ')
 len(words)

 # count occurrences of each word (a Counter is a dict subclass; it is not sorted)
 import collections as coll
 counter = coll.Counter(words)
 len(counter)  # number of distinct words
   
 # if looking to exclude extraneous terms (common words and empty strings)
 exclude = {'the','of','to','and','that','in',''
 ,'a','as','is','for','not','by','he','it','which'
 ,'with','or','i','from'}
 trimmedWords = [word for word in words if word not in exclude]
 trimmedWordCounter = coll.Counter(trimmedWords)
   
 # Reading a text file line by line, skipping lines that start with '#'
 # (csv.reader splits each line on commas into a list of fields)
 import csv
 with open(r'Data\pg989.txt', newline='') as f:
   for line in csv.reader(row for row in f if not row.startswith('#')):
     if len(line) > 0:
       print(line)
   
 # Reading a csv line by line
 import csv
 with open(r'Data\OECD - Quality of Life.csv', newline='') as f:
   for line in csv.reader(f):
     # skip blank rows before unpacking; unpacking an empty row raises ValueError
     if len(line) == 0:
       continue
     (Country, HofstederPowerDx, HofstederIndividuality, HofstederMasculinity
     , HofstederUncertaintyAvoidance, Diversity_Ethnic, Diversity_Linguistic
     , Diversity_Religious, ReligionMatters, Protestantism, Religiosity, IQ
     , Gini, Employment, Unemployment, EduReading, EduScience, TertiaryEdu
     , LifeExpectancy, InfantDeath, Obesity, HoursWorked, Prison
     , Carvandalism, Cartheft, Theftfromcar, Motorcycletheft, Bicycletheft
     , Assaultsandthreats, Sexualincidents, Burglaries, Robberies) = line
     print(Country)
   
 # Load into a data frame  
 import pandas as pd  
 df = pd.read_csv(r'Data\OECD - Quality of Life.csv', header=0)
 df  
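
Example 1 asks how often alice and drink occur in the same sentence, but the code above stops at splitting the text into sentences and words. The sketch below is one possible way to finish that count. The function name count_cooccurrences and the sample string are my own additions for illustration and are not part of the original module or the data file.

```python
import re

def count_cooccurrences(text, first, second):
    """Count the sentences in which both words appear as whole words."""
    # Split on sentence-ending punctuation, as in the example code.
    sentences = re.split(r'[.?!]', text.lower())
    count = 0
    for sentence in sentences:
        # Keep only runs of letters so that "alice," still matches "alice".
        tokens = set(re.findall(r'[a-z]+', sentence))
        if first in tokens and second in tokens:
            count += 1
    return count

# Hypothetical sample text, not taken from the data file.
sample = "Alice found a bottle. 'Drink me,' it said, and Alice did drink. The end!"
print(count_cooccurrences(sample, 'alice', 'drink'))
```

To run this against the book, pass in the text variable read from the data file in place of sample. Matching whole tokens rather than substrings means that words such as "drinking" are not counted as "drink"; whether that is the behaviour you want depends on the question being asked.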


Sample Data
