OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting): Getting Data
OSEMN is an acronym for the elements of data science:
- Obtaining data
- Scrubbing data
- Exploring data
- Modeling data
- iNterpreting data
Example Code
# Data science is OSEMN
"""http://people.duke.edu/~ccc14/sta-663/DataProcessingSolutions.html#data-science-is-osemn"""
# Acquiring data
"""Plain text files
We can open plain text files with the built-in open function. Plain text
is a common and very flexible format, but because it has no structure,
extracting the information we need may require custom processing.
Example 1: Suppose we want to find out how often the words "alice" and
"drink" occur in the same sentence in Alice in Wonderland. This example
uses a text from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page),
a good source of plain text files."""
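# the data file is linked below
# read the whole file, join its lines with spaces, and lowercase it
# (in text mode Python 3 already translates '\r\n' to '\n', so we replace '\n')
with open('Data/pg989.txt', 'r') as f:
    text = f.read().replace('\n', ' ').lower()
# sentence extraction and count
import re
punctuation = r'\.|\?|!'
sentences = re.split(punctuation, text)
len(sentences)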
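The stated goal of Example 1, counting the sentences in which alice and drink both occur, could be sketched as follows. This is a minimal sketch rather than the original author's code: the sample string and the helper name count_cooccurrences are assumptions made for illustration, and matching on whole words (so that "drinking" does not count as "drink") is a design choice the example does not specify.

```python
import re

def count_cooccurrences(text, word_a, word_b):
    """Count sentences that contain both word_a and word_b as whole words."""
    # split on sentence-ending punctuation, as in the sentence extraction step
    sentences = re.split(r'[.?!]', text.lower())
    count = 0
    for sentence in sentences:
        # compare whole words so that 'drinking' does not match 'drink'
        tokens = set(re.findall(r"[a-z']+", sentence))
        if word_a in tokens and word_b in tokens:
            count += 1
    return count

# hypothetical sample text, standing in for the contents of the data file
sample = ("Alice saw a bottle marked drink me. "
          "Alice looked around! Should she drink it? "
          "It was very nice, so Alice decided to drink.")
print(count_cooccurrences(sample, 'alice', 'drink'))
```

With the real data, the text read from the file would be passed in place of the sample string.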
# word extraction and frequency
import re
words = re.split(' ', text)
len(words)
# count how often each word occurs (a Counter is a dict subclass; it is not sorted)
import collections as coll
counter = coll.Counter(words)
len(counter)
# if looking to exclude extraneous terms
exclude = {'the', 'of', 'to', 'and', 'that', 'in', '',
           'a', 'as', 'is', 'for', 'not', 'by', 'he', 'it', 'which',
           'with', 'or', 'i', 'from'}
trimmedWords = [word for word in words if word not in exclude]
trimmedWordCounter = coll.Counter(trimmedWords)
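A Counter keeps its entries in insertion order, so a frequency ranking has to be requested explicitly. One way to do that is the most_common method, sketched below. The word list here is a hypothetical stand-in for the words split from the data file.

```python
import collections as coll

# hypothetical word list, standing in for the words split from the data file
words = ['the', 'alice', 'drink', 'the', 'alice', '', 'rabbit', 'the', 'alice']
exclude = {'the', ''}

# count only the words that are not in the exclusion set
trimmed_counter = coll.Counter(word for word in words if word not in exclude)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
top_two = trimmed_counter.most_common(2)
print(top_two)
```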
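# Reading a text file line by line
import csv
with open('Data/pg989.txt') as f:
    # drop lines that start with '#', then let csv.reader split each line on commas
    for line in csv.reader(row for row in f if not row.startswith('#')):
        # csv.reader returns an empty list for a blank line
        if len(line) > 0:
            print(line)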
# Reading a csv line by line
import csv
with open('Data/OECD - Quality of Life.csv', newline='') as f:
    for line in csv.reader(f):
        # skip blank rows before unpacking; unpacking an empty row raises ValueError
        if len(line) == 0:
            continue
        # every remaining row must have exactly these 32 fields
        (Country, HofstederPowerDx, HofstederIndividuality, HofstederMasculinity
         , HofstederUncertaintyAvoidance, Diversity_Ethnic, Diversity_Linguistic
         , Diversity_Religious, ReligionMatters, Protestantism, Religiosity, IQ
         , Gini, Employment, Unemployment, EduReading, EduScience, TertiaryEdu
         , LifeExpectancy, InfantDeath, Obesity, HoursWorked, Prison
         , Carvandalism, Cartheft, Theftfromcar, Motorcycletheft, Bicycletheft
         , Assaultsandthreats, Sexualincidents, Burglaries, Robberies) = line
        print(Country)
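Unpacking 32 positional fields works only while every row has exactly that shape. As a hedged alternative, csv.DictReader keys each row by the header line, so columns can be read by name. In the sketch below the in-memory file, the choice of three columns, and all values are made up for illustration. Only the column names come from the listing above.

```python
import csv
import io

# hypothetical stand-in for the CSV file: three of the column names, made-up values
sample_csv = io.StringIO(
    "Country,Gini,LifeExpectancy\n"
    "CountryA,30.5,80.1\n"
    "\n"
    "CountryB,41.0,76.3\n"
)

countries = []
for row in csv.DictReader(sample_csv):
    # each row is a dict keyed by the header line; blank rows are skipped by DictReader
    countries.append(row['Country'])
    print(row['Country'], float(row['Gini']))
```

With the real data, an open file object (opened with newline='') would take the place of the io.StringIO object.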
# Load into a data frame
import pandas as pd
df = pd.read_csv('Data/OECD - Quality of Life.csv', header=0)
df
Sample Data