Exercises: OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting)

I worked through these exercises as part of the Obtaining Data module of Data Science is OSEMN.

Example Code


 # Exercises  
 """http://people.duke.edu/~ccc14/sta-663/DataProcessingSolutions.html#exercises"""  
   
 """1. Write the following sentences to a file “hello.txt” using open and write.   
 There should be 3 lines in the resulting file.  
   
 Hello, world.  
 Goodbye, cruel world.  
 The world is your oyster."""  
   
 path = 'Data/Test.txt'  # forward slashes avoid backslash-escape surprises

 # 'w' truncates the file; the with block closes it when done
 with open(path, 'w') as f:
   f.write('Hello, world.\n')
   f.write('Goodbye, cruel world.\n')
   f.write('The world is your oyster.\n')

 # Writes the same thing, only in one statement
 with open(path, 'w') as f:
   f.write('Hello, world.\nGoodbye, cruel world.\nThe world is your oyster.\n')

 # Read the file back to check the result
 with open(path, 'r') as f:
   content = f.read()
   print(content)


 """2. Using a for loop and open, print only the lines from the
 file ‘hello.txt’ that begin with ‘Hello’ or ‘The’."""

 with open(path, 'r') as f:
   for line in f:
     # startswith accepts a tuple of prefixes
     if line.startswith(('Hello', 'The')):
       print(line, end='')  # line already ends with a newline


 """3. Most of the time, tabular files can be read correctly using
 convenience functions from pandas. Sometimes, however, line-by-line processing   
 of a file is unavoidable, typically when the file originated from   
 an Excel spreadsheet. Use the csv module and a for loop to create   
 a pandas DataFrame for the file ugh.csv."""  
   
 # Reading a csv line by line
 import csv
 import pandas as pd

 rows = []
 with open('Data/OECD - Quality of Life.csv', newline='') as f:
   for rowCount, line in enumerate(csv.reader(f)):
     if rowCount == 0:
       header = line  # the first row holds the column names
     else:
       rows.append(line)  # every later row is data
 # Build the frame once; DataFrame.set_value was removed in pandas 1.0
 tempDf = pd.DataFrame(rows, columns=header)
 print(tempDf.head())

 # the easy way
 otherDf = pd.read_csv('Data/OECD - Quality of Life.csv')
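
The line-by-line approach from exercise 3 can also be tried without the data file. What follows is only a minimal sketch: the sample text, its column names and values, and the helper name `parse_messy_csv` are all made up for illustration and are not taken from ugh.csv or the OECD file. It assumes a spreadsheet-style export with a title row and a blank row above the real header.

```python
import csv
import io

# Hypothetical sample: a title row and a blank row sit above the header.
# All names and values here are invented for illustration.
SAMPLE = """Example export,,
,,
Country,Score,Rank
Alpha,7.5,1
Beta,6.6,2
"""

def parse_messy_csv(handle, skip_rows=2):
    """Return (header, rows) from a CSV whose first skip_rows lines are junk."""
    header = None
    rows = []
    for i, line in enumerate(csv.reader(handle)):
        if i < skip_rows:
            continue            # ignore the title and blank rows
        if header is None:
            header = line       # the first kept row holds the column names
        elif any(cell.strip() for cell in line):
            rows.append(line)   # keep only rows with at least one value
    return header, rows

header, rows = parse_messy_csv(io.StringIO(SAMPLE))
print(header)
print(rows)
```

With pandas installed, `pd.DataFrame(rows, columns=header)` turns the result into a DataFrame. Every value is still a string at that point, so numeric columns need converting afterwards.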
   

Sample Data
