Exercises: OSEMN (Obtaining, Scrubbing, Exploring, Modeling, iNterpreting)
As part of the "Data science is OSEMN" module on Obtaining Data, I worked through the following exercises.
# Exercises
"""http://people.duke.edu/~ccc14/sta-663/DataProcessingSolutions.html#exercises"""

"""1. Write the following sentences to a file “hello.txt” using open and write.
There should be 3 lines in the resulting file.

Hello, world.
Goodbye, cruel world.
The world is your oyster."""
# Forward slashes work on Windows too and avoid accidental escape sequences.
path = 'Data/Test.txt'
f = open(path, 'w')
f.write('Hello, world.\n')
f.write('Goodbye, cruel world.\n')
f.write('The world is your oyster.')
f.close()

# Writes the same thing, only in one statement.
# Reopening in 'w' mode overwrites the file, so it still holds 3 lines.
with open(path, 'w') as f:
    f.write('Hello, world.\nGoodbye, cruel world.\nThe world is your oyster.')

with open(path, 'r') as f:
    content = f.read()
print(content)

"""2. Using a for loop and open, print only the lines from the file ‘hello.txt’
that begin with ‘Hello’ or ‘The’."""
with open(path, 'r') as f:
    for line in f:
        if line.startswith('Hello') or line.startswith('The'):
            # Strip the trailing newline so print does not add a blank line.
            print(line.rstrip('\n'))

"""3. Most of the time, tabular files can be read correctly using convenience
functions from pandas. Sometimes, however, line-by-line processing of a file is
unavoidable, typically when the file originated from an Excel spreadsheet.
Use the csv module and a for loop to create a pandas DataFrame for the file ugh.csv."""
# Reading a csv line by line
import csv
import pandas as pd

csv_path = 'Data/OECD - Quality of Life.csv'
rows = []
with open(csv_path, newline='') as f:
    for rowCount, line in enumerate(csv.reader(f)):
        if rowCount == 0:
            columns = line  # the first row holds the column names
        else:
            rows.append(line)
# Every value is still a string here; csv.reader does no type conversion.
tempDf = pd.DataFrame(rows, columns=columns)
tempDf

# the easy way
otherDf = pd.read_csv(csv_path)
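
As a quick self-check on exercises 1 and 2, here is a minimal sketch that writes the three sentences to a temporary file and then filters them again. The helper names `write_lines` and `lines_starting_with`, and the use of a temporary directory in place of a real `hello.txt`, are my own assumptions for illustration and not part of the course material.

```python
import tempfile
from pathlib import Path

SENTENCES = ["Hello, world.", "Goodbye, cruel world.", "The world is your oyster."]


def write_lines(path, lines):
    """Write each string on its own line, separated by newlines."""
    with open(path, "w") as f:
        f.write("\n".join(lines))


def lines_starting_with(path, prefixes):
    """Return the lines of the file that start with any of the given prefixes."""
    with open(path) as f:
        # str.startswith accepts a tuple, which replaces the chained `or`.
        return [line.rstrip("\n") for line in f if line.startswith(tuple(prefixes))]


# Hypothetical location: a throwaway directory, so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    hello_path = Path(tmp_dir) / "hello.txt"
    write_lines(hello_path, SENTENCES)
    line_count = len(hello_path.read_text().splitlines())
    matches = lines_starting_with(hello_path, ("Hello", "The"))

print(line_count)
print(matches)
```

Passing a tuple to `startswith` keeps the filter readable when more prefixes are added, and counting the lines afterwards confirms that the file really contains 3 lines, as the exercise asks.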