OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting): Getting Data

OSEMN is an acronym for the elements of data science:
  • Obtaining data
  • Scrubbing data
  • Exploring data
  • Modeling data
  • iNterpreting data
As part of the "Data science is OSEMN" module on obtaining data, I worked through the examples myself and also tried some variants.

Example Code

 # Data science is OSEMN  
 # Acquiring data 
 """Plain text files  
 We can open plain text files with the open function. This is a common 
 and very flexible format, but because no structure is involved, custom 
 processing methods to extract the information needed may be necessary.  
 Example 1: Suppose we want to find out how often the words alice and 
 drink occur in the same sentence in Alice in Wonderland. This example 
 uses Gutenberg: http://www.gutenberg.org/wiki/Main_Page, a good source 
 for files""" 
 # the data file is linked below  
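 # forward slashes avoid the invalid '\p' escape and also work on Windows
 with open('Data/pg989.txt', 'r') as f:
   # text mode already turns '\r\n' into '\n', so replace '\n'
   text = f.read().replace('\n', ' ').lower()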
 # sentence extraction and frequency  
 import re  
 punctuation = r'[.?!]'  # raw string: sentence-ending punctuation
 sentences = re.split(punctuation, text)  
 # word extraction and frequency  
 # splitting on single spaces leaves '' entries where spaces repeat;
 # the exclude list below filters them out
 words = text.split(' ')
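 # converts to a sorted dictionary  
 import collections as coll  
 # assumption: the conversion step is missing here; count words, most frequent first
 wordCounts = coll.OrderedDict(coll.Counter(words).most_common())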
 # if looking to exclude extraneous terms  
 exclude = ('the','of','to','and','that','in','')
 trimmedWords = [word for word in words if word not in exclude]  
 # Reading a text file line by line  
 import csv  
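 with open('Data/pg989.txt') as f:
   lines = list()
   # skip comment lines starting with '#', then parse each remaining line
   for line in csv.reader(row for row in f if not row.startswith('#')):
     if len(line) > 0:
       lines.append(line)  # assumption: keep each non-empty parsed line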
 # Reading a csv line by line  
 import csv  
 with open('Data/OECD - Quality of Life.csv', newline='') as f:
   lines = list()
   for line in csv.reader(f):
     if len(line) > 0:  # check first: an empty row cannot be unpacked
       (Country, HofstederPowerDx, HofstederIndividuality, HofstederMasculinity
       , HofstederUncertaintyAvoidance, Diversity_Ethnic, Diversity_Linguistic
       , Diversity_Religious, ReligionMatters, Protestantism, Religiosity, IQ
       , Gini, Employment, Unemployment, EduReading, EduScience, TertiaryEdu
       , LifeExpectancy, InfantDeath, Obesity, HoursWorked, Prison
       , Carvandalism, Cartheft, Theftfromcar, Motorcycletheft, Bicycletheft
       , Assaultsandthreats, Sexualincidents, Burglaries, Robberies) = line
       lines.append(line)  # assumption: collect each parsed row
 # Load into a data frame  
 import pandas as pd  
 df = pd.read_csv('Data/OECD - Quality of Life.csv', header=0)
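
The listing above splits the text into sentences and words but stops short of answering the question posed in Example 1. A minimal sketch of one way to finish it follows. The function name count_cooccurrences and the short sample string are my own illustrations (the sample is made up, not quoted from the book), and matching on whole lowercase tokens is an assumption about what "occur in the same sentence" should mean.

```python
import re

def count_cooccurrences(text, word_a, word_b):
    """Count sentences in which both words appear as whole tokens."""
    # Collapse all whitespace and lowercase, mirroring the preprocessing above
    normalized = ' '.join(text.split()).lower()
    # Split on sentence-ending punctuation
    sentences = re.split(r'[.?!]', normalized)
    count = 0
    for sentence in sentences:
        # Tokenize into runs of letters and apostrophes
        tokens = set(re.findall(r"[a-z']+", sentence))
        if word_a in tokens and word_b in tokens:
            count += 1
    return count

# Hypothetical sample text, for illustration only
sample = ("Alice saw a bottle marked DRINK ME. "
          "She did not drink it at once! "
          "Would Alice drink it later?")
print(count_cooccurrences(sample, 'alice', 'drink'))  # prints 2
```

To run it on the book, pass the contents of the text file instead of the sample string. Note that possessives such as "alice's" count as a different token under this tokenization.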

Sample Data
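
Unpacking all 32 columns by position, as in the CSV loop above, breaks as soon as a column is added or reordered. As a rough alternative, csv.DictReader reads each row into a dictionary keyed by the header names. The sketch below uses an in-memory stand-in rather than the real file: the two column names come from the listing, but the rows are made-up placeholders, not actual OECD figures.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: two column names from the post,
# with made-up placeholder rows (not real data), including one blank line.
sample_csv = io.StringIO(
    "Country,Gini\n"
    "Examplia,0.30\n"
    "\n"
    "Samplestan,0.25\n"
)

rows = []
# DictReader takes the first row as the header and skips blank lines
for record in csv.DictReader(sample_csv):
    rows.append(record)

# Values arrive as strings, so convert explicitly where numbers are needed
gini_by_country = {r['Country']: float(r['Gini']) for r in rows}
print(gini_by_country)
```

With the real file, replace the StringIO object with the handle from open(..., newline='').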

