Posts

Showing posts from 2017

Clustering: Hierarchical and K-Means in R on Hofstede Cultural Patterns

Overview

What follows is an exploration of clustering, via hierarchical and K-Means methods, using the Hofstede patterns data, available from my Public folder.

For a deeper understanding of clustering and the various related techniques, I suggest the following:

Cluster analysis (Wikipedia)
An Introduction to Clustering and different methods of clustering

Load Data

# load data
Hofstede.df.preclean <- read.csv("HofstedePatterns.csv", na.strings = c("", "NA"))
#nrow(Hofstede.df.preclean)

# remove NULLs
Hofstede.df.preclean <- na.omit(Hofstede.df.preclean)
#nrow(Hofstede.df.preclean)

Hofstede.df <- Hofstede.df.preclean

Hierarchical Clustering

Run hclust, Generate Dendrogram

The first attempt is the simplest analysis, using the dist() and hclust() functions to generate a hierarchy of grouped data. The cluster size is derived from a reading of the dendrogram, although there are automated ways of selecting the cluster number, shown …
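As a minimal sketch of the approach the excerpt describes, assuming the cleaned Hofstede.df holds numeric columns for the cultural dimensions, the hierarchy, dendrogram, and a cluster cut might look like this (cutree() is one simple way to extract a fixed number of clusters):

# a minimal sketch: distance matrix, hierarchy, dendrogram, and cluster cut
# assumes Hofstede.df holds numeric columns for the cultural dimensions
Hofstede.dist <- dist(scale(Hofstede.df[sapply(Hofstede.df, is.numeric)]))
Hofstede.hc <- hclust(Hofstede.dist, method = "complete")
plot(Hofstede.hc, main = "Hofstede Dendrogram")   # read cluster count off the dendrogram
Hofstede.clusters <- cutree(Hofstede.hc, k = 4)   # k = 4 is illustrative only
table(Hofstede.clusters)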

F# is Part of Microsoft's Data Science Workloads

I have not worked in F# for over two (2) years, but am enthused that Microsoft has added it to its languages for Data Science Workloads, along with R and Python. To that end, I hope to repost some of my existing F# code, as well as explore Data Science Workloads utilizing all three languages. Prior work in F# is available from learning F#, and some solutions will be republished on this site.
Data Science Workloads
Build Intelligent Apps Faster with Visual Studio and the Data Science Workload

Published Work in F#
James Igoe MS Azure Notebooks, which utilizes MS's implementation of Jupyter Notebooks.
Mathematical Library, a basic mathematical NuGet package, with the source hosted on GitHub.
Basic Statistical Functions: a very basic F# class for performing standard deviation and variance calculations.
Various Number Functions: a collection of basic mathematical functions written in F# as part of my learning via Project Euler, functions for creating arrays or calculating values in vari…

Comparing Performance in R Using Microbenchmark

This post is a very simple display of how to use microbenchmark in R. Other sites have longer and more detailed posts, but this one is primarily to 'spread the word' about this useful function, and to show how to plot its results. An alternative version of this post exists in Microsoft's Azure Notebooks, as Performance Testing Results with Microbenchmark.
Load Libraries

memoise as part of the code to test, microbenchmark to show usage, and ggplot2 to plot the result.

library(memoise)
library(microbenchmark)
library(ggplot2)

Create Functions

Generate several functions with varied performance times: a base function, plus functions that leverage vectorization and memoisation.
# base function
monte_carlo = function(N) {
    hits = 0
    for (i in seq_len(N)) {
        u1 = runif(1)
        u2 = runif(1)
        if (u1 ^ 2 > u2) hits = hits + 1
    }
    return(hits / N)
}

# memoise test function
monte_carlo_memo <- memoise(monte_carl…
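To round out the truncated excerpt, here is a minimal sketch of the benchmark call and the plot; it compares only the base and memoised functions defined above, with times = 50 as an illustrative setting:

# a minimal sketch of running microbenchmark and plotting the results
results <- microbenchmark(
    base = monte_carlo(1000),
    memoised = monte_carlo_memo(1000),
    times = 50
)
print(results)
autoplot(results)   # ggplot2-based plot of the timing distributions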

Pluralsight Courses - Opinion

My list is kind of paltry, but I've sat through others, or started many and decided against finishing. The best courses I've finished have been along the lines of project management:
Project Management for Software Engineers
Project 2013 Fundamentals for Business Professionals

I've also sat through this one, which was useful, although very rudimentary:
Creating and Leading Effective Teams for Managers

I do my own reading for data science, and have my own side projects, but I've also taken some data science courses via Pluralsight. The beginner demos are done well, although less informative than the intermediate ones, which are ultimately more useful. For the latter, I typically code along on my own data sets, which helps me learn the material.

Beginner
Understanding Machine Learning
Understanding Machine Learning with R

Intermediate
Understanding and Applying Logistic Regression (using Excel, Python, or R)
Data Mining Algorithms in SSAS, Excel, and R

ARIMA, Time Series, and Charting in R

ARIMA is an acronym for Autoregressive Integrated Moving Average, and one explanation describes ARIMA models as...
...another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely-used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models were based on a description of trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. A detailed technical discussion can be found on Wikipedia.

For the first exploration I developed several ARIMA models using financial data, varying the parameters, noted as p, d, and q. A general overview of the parameters, from Wikipedia:
p is the order (number of time lags) of the autoregressive model
d is the degree of differencing (the number of times the data have had past values subtracted)
q is the order of the moving-average model

# filtered to start on a date
Portfolio.filtered <- P…
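Since the excerpt cuts off, here is a minimal, self-contained sketch of fitting ARIMA models with varying (p, d, q) orders; the simulated random walk stands in for the filtered portfolio prices:

# a minimal sketch: fit ARIMA models with varying (p, d, q) orders
# the simulated random walk stands in for the filtered price series
set.seed(42)
prices <- cumsum(rnorm(250, mean = 0.1))       # stand-in for closing prices
fit.011 <- arima(prices, order = c(0, 1, 1))    # differenced once, MA(1)
fit.110 <- arima(prices, order = c(1, 1, 0))    # AR(1) on the differenced series
AIC(fit.011)                                    # compare fits; lower is better
AIC(fit.110)
pred <- predict(fit.110, n.ahead = 10)          # base-R forecast of the next 10 steps
ts.plot(ts(prices), pred$pred, lty = c(1, 2))   # series plus dashed forecast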

Geert Hofstede | Defined Corporate Culture

I've been interested in Hofstede's work since B-school, back in the early 2000s, and at one time considered publishing using his country characteristics as predictors for economic and social welfare outcomes. Nowadays, I use the results of his analyses frequently in small R programming demonstrations.

He's an interesting researcher, who's done important work, as The Economist article describes:
The man who put corporate culture on the map—almost literally—Geert Hofstede (born 1928) defined culture along five different dimensions. Each of these he measured for a large number of countries, and then made cross-country comparisons. In the age of globalisation, these have been used extensively by managers trying to understand the differences between workforces in different environments. The Economist article gives a fuller picture of Geert Hofstede, and anyone interested in reading one of his works might enjoy Cultures and Organizations: Software of the Mind, T…

Microsoft Azure Notebooks - Live code - F#, R, and Python

I was exploring Jupyter notebooks, which combine live code, markdown, and data, through Microsoft's implementation, known as MS Azure Notebooks, and put together a small library of R and F# notebooks.

Microsoft's FAQ describes the service as: ...a multi-lingual REPL on steroids. This is a free service that provides Jupyter notebooks along with supporting packages for R, Python and F# as a service. This means you can just login and get going since no installation/setup is necessary. Typical usage includes schools/instruction, giving webinars, learning languages, sharing ideas, etc. Feel free to clone and comment...
In R
Azure Workbook for R - Memoisation and Vectorization
Charting Correlation Matrices in R

In F#
Charnownes Constant in FSharp.ipynb
Project Euler - Problems 18 and 67 - FSharp using Dynamic Programming

Efficient R Programming - A Quick Review

Efficient R Programming: A Practical Guide to Smarter Programming by Colin Gillespie
My rating: 5 of 5 stars

Simply a great book, chock full of tips and techniques for improving one's work with R.

View all my reviews

Performance Improvements in R: Vectorization & Memoisation

Efficient R Programming: A Practical Guide to Smarter Programming is full of potential coding improvements, and two of its suggestions are notable: vectorization, explained here and here, and memoisation, which caches prior results at the cost of additional memory use. Both proved relevant and significant.

What follows is a demonstration of the speed improvements that might be achieved using these concepts.

################################
# performance
# vectorization and memoization
################################

# clear memory between changes
rm(list = ls())

# load memoise
#install.packages('memoise')
library(memoise)

# create test function
monte_carlo = function(N) {
    hits = 0
    for (i in seq_len(N)) {
        u1 = runif(1)
        u2 = runif(1)
        if (u1 ^ 2 > u2) hits = hits + 1
    }
    return(hits / N)
}

# memoise test function
monte_carlo_memo <- memoise(monte_carlo)

# vectorize function
monte_carlo_vec <- …
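The excerpt cuts off at the vectorized version; the following is a minimal sketch of what such a function might look like (an assumption, not the book's exact code), with a rough timing comparison via system.time():

# a minimal sketch of a vectorized counterpart: draw all N pairs at
# once and count hits with mean(), instead of looping
monte_carlo_vec <- function(N) mean(runif(N) ^ 2 > runif(N))

# rough timing comparison
N <- 100000
system.time(monte_carlo(N))       # loop version
system.time(monte_carlo_vec(N))   # vectorized version, typically far faster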

Neural Networks (Part 4 of 4) - R Packages and Resources

While developing these demonstrations in logistic regression and neural networks, I used and discovered some interesting methods and techniques:
Better Methods

A few useful commands and packages...:

update.packages() for updating installed packages in one easy action
as.formula() for creating a formula that I can reuse and update in one action across all my code sections
View() for looking at data frames
fourfoldplot() for plotting confusion matrices
neuralnet for developing neural networks
caret, used with nnet, to create predictive models
plotnet() in NeuralNetTools, for creating attractive neural network models

Resources that I used or that I would like to explore...

MS Azure Notebooks, for working online with Python, R, and F#, all part of MS's data workflows
Efficient R Programming, which seems to have many good tips on working with R
Data Mining Algorithms in SSAS, Excel, and R, showing various algorithms in each technology
R Documentation, a high quality, useable resource

To explor…

Attractive Confusion Matrices in R Plotted with fourfoldplot

As part of writing analyses using neural networks I thought of displaying a confusion matrix, but went looking for something more. What I found was either ugly and simple, or attractive but complicated. The following code demonstrates using fourfoldplot, with original insight gained from fourfoldplot: A prettier confusion matrix in base R, and a great documentation page, R Documentation's page for fourfoldplot.

The code is below, as is the related graph. Source data can be found here.

###############################################
# Load and clean data
###############################################

Politics.df <- read.csv("BigFiveScoresByState.csv", na.strings = c("", "NA"))
Politics.df <- na.omit(Politics.df)

################################################
# Neural Net with nnet and caret
################################################

# load packages
library(nnet)
library(caret)

# set equat…
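Since the excerpt stops before the plot itself, here is a minimal sketch of the fourfoldplot() call on a 2x2 confusion matrix; the counts are made up for illustration:

# a minimal sketch: fourfoldplot() on a 2x2 confusion matrix
# the counts below are illustrative, not from the analysis above
confusion <- matrix(c(20, 4, 3, 21),
                    nrow = 2,
                    dimnames = list(Predicted = c("Blue", "Red"),
                                    Actual = c("Blue", "Red")))
fourfoldplot(confusion,
             color = c("#CC6666", "#99CC99"),   # shading for the two diagonals
             conf.level = 0,                    # suppress confidence rings
             margin = 1,
             main = "Confusion Matrix")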

Neural Networks in R (Part 3 of 4) - Neural Networks on Price Changes in Financial Data

This post is a demonstration using the caret and nnet packages on price changes in financial data, and a continuation of a series:

Neural Networks (Part 1 of 4) - Logistic Regression and neuralnet on State 'Personality' and Political Outcomes
Neural Networks (Part 2 of 4) - caret and nnet on State 'Personality' and Political Outcomes

This post uses a neural net to train and then test on the data. The neural net itself can be envisioned as follows:

Initially the results seemed promising, with a prediction accuracy of ~.85, but further analysis revealed that the bulk of accurate predictions were for securities with no price change, and that the price change classes were often predicted inaccurately. When I excluded security types without price changes, the average dropped even further. I then created a loop that built test and training data by security type, and those results indicated that the neural net was very right and very wrong, depending on the price change state. Only…
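A minimal sketch of the per-type loop described above; the data frame and column names (Prices.df, SecurityType, PriceChange) are assumptions for illustration:

# a minimal sketch of the per-security-type train/test loop;
# Prices.df, SecurityType, and PriceChange are assumed names
library(caret)
library(nnet)
set.seed(7)
for (type in unique(Prices.df$SecurityType)) {
    subset.df <- Prices.df[Prices.df$SecurityType == type, ]
    train.rows <- createDataPartition(subset.df$PriceChange, p = 0.8, list = FALSE)
    type.model <- train(PriceChange ~ ., data = subset.df[train.rows, ],
                        method = 'nnet', trace = FALSE)
    type.predicted <- predict(type.model, subset.df[-train.rows, ])
    print(paste(type, "- correct:",
                mean(type.predicted == subset.df[-train.rows, ]$PriceChange)))
}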

Neural Networks in R (Part 2 of 4) - caret and nnet on State 'Personality' and Political Outcomes

This is a continuation of a prior post, Neural Networks (Part 1 of 4) - Logistic Regression and neuralnet on State 'Personality' and Political Outcomes, the second in a series exploring neural networks.

This post is a demonstration using the caret and nnet packages on aggregate Big Five traits per state and political leanings. Given the small sample size, predictive ability was lower in the train/test scenario using an 80% split, around .66, than on the entire set, which was correct at .83. This is evident in the graphs, drawn using lm() and stat_smooth().

The code is below, as are some related graphs. Source data is here.

################################################
# Neural Net with nnet and caret
################################################

# load packages
library(nnet)
library(caret)

# train
Politics.model <- train(equation, Politics.df, method = 'nnet', linout = TRUE, trace = FALSE)
Politics.model.predicted <- predict(Po…
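To make the truncated excerpt concrete, here is a minimal sketch of the 80% train/test comparison described above; 'equation' is the reusable formula from the series, and the Politics factor column is an assumption:

# a minimal sketch of the 80/20 split described in the post;
# 'equation' and the Politics factor column are assumptions
set.seed(7)
train.rows <- createDataPartition(Politics.df$Politics, p = 0.8, list = FALSE)
Politics.train <- Politics.df[train.rows, ]
Politics.test <- Politics.df[-train.rows, ]
Politics.model <- train(equation, data = Politics.train,
                        method = 'nnet', trace = FALSE)
Politics.predicted <- predict(Politics.model, Politics.test)
mean(Politics.predicted == Politics.test$Politics)   # ~.66 per the post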

Neural Networks in R (Part 1 of 4) - Logistic Regression and neuralnet on State 'Personality' and Political Outcomes

This is an example of neural networks using the neuralnet package on one of my sample data sets, walking through a Pluralsight training series, Data Mining Algorithms in SSAS, Excel, and R.

In terms of results, the regression primarily predicts voter leanings based on five (5) traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism, although only the first two (2) have a significant impact. Logistic regression is about 85% predictive, and slightly better at predicting Red states than Blue states.

[1] "Logistic Regression (All) - Correct (%) = 0.854166666666667" [1] "Logistic Regression (Red) - Correct (%) = 0.892857142857143" [1] "Logistic Regression (Blue) - Correct (%) = 0.8"
For the neuralnet package, I created a loop to vary the hidden layers and the number of repetitions, since this is such a small number of records. This is obviously much less predictive than logistic regression, …
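A minimal sketch of such a loop; 'equation' and Politics.df are carried over from the series, and the hidden-layer sizes and repetition counts are illustrative:

# a minimal sketch of looping over hidden-layer sizes and repetitions;
# 'equation' and Politics.df are assumed from the series
library(neuralnet)
for (hidden.nodes in 1:4) {
    for (reps in c(1, 3, 5)) {
        nn <- neuralnet(equation, data = Politics.df,
                        hidden = hidden.nodes, rep = reps,
                        linear.output = FALSE)
        # plot(nn) would draw the fitted network
        print(paste("hidden =", hidden.nodes, "reps =", reps,
                    "error =", min(nn$result.matrix["error", ])))
    }
}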

Support Vector Machines on Big Five Traits and Politics

This is an example of Support Vector Machines, using one of my usual data sets, as part of a Pluralsight training presentation, Data Mining Algorithms in SSAS, Excel, and R.

In terms of results, the model primarily predicts voter leanings based on two (2) traits, openness and conscientiousness; although using all five (5) factors improved prediction quality, plotting that is problematic. As configured, the model is 96% predictive of Republican outcomes, but only 66% accurate in predicting Democratic leanings.
Politics.prediction Blue Red
               Blue   15   1
               Red     5  27

The code is below, as are some related graphs. Source data is here.
# Clear memory
rm(list = ls())

# set Working directory
getwd()
setwd('../Data')

# load data
Politics.df <- read.csv("BigFiveScoresByState.csv", na.strings = c("", "NA"))

# clean data - remove NULLs
Politics.df <- na.omit(Politics.df…
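The excerpt ends before the model itself; what follows is a minimal sketch of an SVM fit using e1071, where the package choice, kernel, and column names are assumptions:

# a minimal sketch of the SVM fit and confusion table, via e1071;
# package choice, kernel, and column names are assumptions
library(e1071)
Politics.svm <- svm(Politics ~ Openness + Conscientiousness,
                    data = Politics.df, kernel = 'radial')
Politics.prediction <- predict(Politics.svm, Politics.df)
table(Politics.prediction, Politics.df$Politics)
plot(Politics.svm, Politics.df, Openness ~ Conscientiousness)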

Charting Correlation Matrices in R

I noticed this very simple, very powerful article by James Marquez, Seven Easy Graphs to Visualize Correlation Matrices in R, in the Google+ community R Programming for Data Analysis, and thought to give it a try: I started some of my current analyses a decade ago by generating correlation matrices in Excel, which I've since redone and improved in R.

Some of these packages are designed only for display, or as extensions to ggplot2:

corrplot: Visualization of a Correlation Matrix
GGally: Extension to 'ggplot2'
ggcorrplot: Visualization of a Correlation Matrix using 'ggplot2'

These two are focused on more complex analysis:

PerformanceAnalytics: Econometric tools for performance and risk analysis
psych: Procedures for Psychological, Psychometric, and Personality Research

As for data, I used Hofstede's culture dimensions, limited to developed countries. Using a broader and larger set of countries would significantly reduce the correlations, in that only in…
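A minimal sketch of one such chart, using corrplot on the Hofstede data; the file name follows the clustering post above, and the numeric-column filter is an assumption:

# a minimal sketch: correlation matrix of the Hofstede dimensions via corrplot
# the file name and numeric-column selection are assumptions
library(corrplot)
Hofstede.df <- na.omit(read.csv("HofstedePatterns.csv", na.strings = c("", "NA")))
Hofstede.cor <- cor(Hofstede.df[sapply(Hofstede.df, is.numeric)])
corrplot(Hofstede.cor, method = 'ellipse', type = 'upper')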

Decision Trees (party) on Political Outcome Based on State-level Big Five Assessment

This decision tree demo is similar to a prior one I've done, but in this case it uses the party package, which produces much higher quality graphics than rpart, at least when used with plot(). This was done as part of a Pluralsight training presentation, Data Mining Algorithms in SSAS, Excel, and R.

Source data is here.

#
# Load Data
#

# Set working directory
setwd("../Data")
getwd()

# read from data frame
BigFivByState.df <- read.table("BigFiveScoresByState.csv", header = TRUE, sep = ",")

#
# Run Analysis
#

# Load package
#install.packages('party', dependencies = TRUE)
library(party)

# train the model
BigFivByState.dt <- ctree(Liberal ~ Openness + Conscientiousness + Extraversion + Neuroticism + Agreeableness,
                          data = BigFivByState.df)

# plot result
plot(BigFivByState.dt, main = "Classification Tree for Politics on Big Five")
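As a follow-on, a minimal sketch of checking the tree's in-sample accuracy, assuming Liberal is a factor column in the source data:

# a minimal sketch: in-sample accuracy of the fitted tree;
# assumes Liberal is a factor column
BigFivByState.predicted <- predict(BigFivByState.dt)
table(BigFivByState.predicted, BigFivByState.df$Liberal)
mean(BigFivByState.predicted == BigFivByState.df$Liberal)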

Naive Bayes on Political Outcome Based on State-level Big Five Assessment

As part of another Pluralsight training presentation, Data Mining Algorithms in SSAS, Excel, and R, I worked through various exercises, and from those I've adapted Naive Bayes to one of my existing data sets.

The code is below, as are some related graphs. Overall, the percentage predicted correctly from Big Five personality traits using the Naive Bayes calculation is 66%.

Source data is here.

# # Load & Explore Data # # read from data frame BigFivByState.df <- read.table("BigFiveScoresByState.csv", header = TRUE, sep = ",") # review data head(BigFivByState.df) nrow(BigFivByState.df) summary(BigFivByState.df) names(BigFivByState.df) # various aggregations # as "count this value" ~ grouped by this + this Liberal.dist <- aggregate(State ~ Liberal, data = BigFivByState.df, FUN = length) head(Liberal.dist) RedBlue.dist <- aggregate(State ~ Politics, data = BigFivByState.df, FUN = l…

Logistic Regression on Stock Data using Google and SPY (SPDR S&P 500)

As part of a Pluralsight training presentation, Understanding and Applying Logistic Regression, students worked through various exercises, one of which was predicting stock price changes, up or down, on Google, using Google and SPY closing prices.

As an ordered list of actions:
Load data - Yahoo financials for each day for 5 years, taking only date and closing price for this analysis
Transform sources: merge sources, change column headings, cast the Date column as DATE type, sort descending
Perform logistic regression
Create a frame of actual versus predicted changes, and add a column for the correct/incorrect prediction result
Find the percent correct, i.e., whether the price move was predicted correctly up or down

As a result, the lagged Google and SPY prices accurately predict next-day price direction about 63% of the time.

Source data is here.

# Clear memory
rm(list = ls())

# Set working directory
setwd("../Data")
getwd()

# load data
# Data is Yahoo financial price f…
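To round out the truncated excerpt, a minimal sketch of the regression step; the data frame and column names (Stocks.df, GoogUp, GoogLag, SpyLag) are assumptions for illustration:

# a minimal sketch of the logistic regression on lagged prices;
# the data frame and column names are assumptions
# GoogUp: 1 if Google closed up the next day, else 0
# GoogLag, SpyLag: prior-day closing prices
Stocks.glm <- glm(GoogUp ~ GoogLag + SpyLag,
                  data = Stocks.df, family = binomial)
Stocks.predicted <- ifelse(predict(Stocks.glm, type = 'response') > 0.5, 1, 0)
mean(Stocks.predicted == Stocks.df$GoogUp)   # ~.63 per the post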