Posts

Clustering: Hierarchical and K-Means in R on Hofstede Cultural Patterns

Image
Overview What follows is an exploration of clustering, via hierarchies and K-Means, using the Hofstede patterns data, available from my Public folder.

For a deeper understanding of clustering and the various related techniques I suggest the following: Cluster analysis (Wikipedia)An Introduction to Clustering and different methods of clustering Load Data # load data Hofstede.df.preclean <- read.csv("HofstedePatterns.csv", na.strings = c("", "NA")) #nrow(Hofstede.df.preclean) # remove NULLs Hofstede.df.preclean <- na.omit(Hofstede.df.preclean) #nrow(Hofstede.df.preclean) Hofstede.df <- Hofstede.df.preclean Hierarchical Clustering Run hclust, Generate Dendrogram The first attempt is the simplest analysis using the dist() and hclust() functions to generate a hierarchy of grouped data. The cluster size is derived from a reading of the dendrogram, although there are automated ways of selecting the cluster number, shown …

F# is Part of Microsoft's Data Science Workloads

Image
I have not worked in F# for over two (2) years, but am enthused that Microsoft has added it to it languages for Data Science Workloads, along with R and Python. To that end, I hope to repost some of my existing F# code, as well as explore Data Science Workloads utilizing all three languages. Prior work in F# is available from learning F#, and some solutions will be republished on this site.
Data Science Workloads Build Intelligent Apps Faster with Visual Studio and the Data Science Workload Published Work in F# James Igoe MS Azure Notebooks that utilizes MS's implementation of Jupyter Notesbooks. Mathematical Library, a basic mathematical NuGet package, with the source hosted on GitHub. Basic Statistical Functions: Very basic F# class for performing standard deviation and variance calculations. Various Number Functions: A collection of basic mathematical functions written in F# as part of my learning via Project Euler, functions for creating arrays or calculating values in vari…

Comparing Performance in R Using Microbenchmark

Image
This post is a very simple display of how to use microbenchmark in R. Other sites might have longer and more detailed posts, but this post is primarily to 'spread the word' about this useful function, and show how to plot it. An alternative version of this post exists in Microsoft's Azure Notebooks, as Performance Testing Results with Microbenchmark
Load Libraries Memoise as part of the code to test, microbenchmark to show usage, and ggplot2 to plot the result.
library(memoise) library(microbenchmark) library(ggplot2) Create Functions Generate several functions with varied performance times, a base function plus functions that leverage vectorization and memoisation.
# base function monte_carlo = function(N) { hits = 0 for (i in seq_len(N)) { u1 = runif(1) u2 = runif(1) if (u1 ^ 2 > u2) hits = hits + 1 } return(hits / N) } # memoise test function monte_carlo_memo <- memoise(monte_carl…

Pluralsight Courses - Opinion

My list is kind of paltry, but I’ve sat through others or started many but decided against finishing. The best courses I’ve finished have been along the lines of project management:
Project Management for Software EngineersProject 2013 Fundamentals for Business Professionals I’ve also sat through this, and useful, although very rudimentary:
Creating and Leading Effective Teams for Managers I do my own reading for data science, and have my own side projects, but I’ve also taken some data science courses via Pluralsight. The beginner demos are done well, although less informative than the intermediate ones, which are ultimately more useful. For the latter, I typically do simultaneous coding on my own data sets, which helps learn the material.

Beginner
Understanding Machine LearningUnderstanding Machine Learning with R Intermediate
Understanding and Applying Logistic Regression (using Excel, Python, or R) Data Mining Algorithms in SSAS, Excel, and R

ARIMA,Time Series, and Charting in R

Image
ARIMA is an acronym for Autoregressive Integrated Moving Average, and one explanation describes ARIMA models as...
...another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely-used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models were based on a description of trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. A detailed technical discussion can be found on Wikipedia.

For the first exploration I developed several ARIMA models using financial data, varying the parameters, noted as P, D and Q. A general overview of the parameters is from Wikipedia:
p is the order (number of time lags) of the autoregressive model d is the degree of differencing (the number of times the data have had past values subtracted) q is the order of the moving-average model # filtered to start on a date Portfolio.filtered <- P…

Geert Hofstede | Defined Corporate Culture

I've been interested in Hofstede's work since B-school, back in the early second millennia, and at one time considered publishing using his country characteristics as predictors for economic and social welfare outcomes. Nowadays, I use the results of his analyses frequently in small R programming demonstrations.

He's an interesting researcher, who's done important work, as The Economist article describes:
The man who put corporate culture on the map—almost literally—Geert Hofstede (born 1928) defined culture along five different dimensions. Each of these he measured for a large number of countries, and then made cross-country comparisons. In the age of globalisation, these have been used extensively by managers trying to understand the differences between workforces in different environments. The Economist article give a fuller picture of Geert Hofstede, and anyone interested in reading one of his works might enjoy Cultures and Organizations: Software of the Mind, T…

Microsoft Azure Notebooks - Live code - F#, R, and Python

I was exploring Jupyter notebooks, that combines live code, markdown and data, through Microsoft's implementation, known as MS Azure Notebooks, putting together a small library of R and F# notebooks.

As Microsoft's FAQ for the service describes it as : ...a multi-lingual REPL on steroids. This is a free service that provides Jupyter notebooks along with supporting packages for R, Python and F# as a service. This means you can just login and get going since no installation/setup is necessary. Typical usage includes schools/instruction, giving webinars, learning languages, sharing ideas, etc. Feel free to clone and comment...
In R Azure Workbook for R - Memoisation and Vectorization Charting Correlation Matrices in R In F# Charnownes Constant in FSharp.ipynb Project Euler - Problems 18 and 67 - FSharp using Dynamic Programming

Efficient R Programming - A Quick Review

Image
Efficient R Programming: A Practical Guide to Smarter Programming by Colin Gillespie
My rating: 5 of 5 stars

Simply a great book, chock full of tips and techniques for improving one's work with R.

View all my reviews

Performance Improvements in R: Vectorization & Memoisation

Image
Full of potential coding improvements, Efficient R Programming: A Practical Guide to Smarter Programming, the book makes two suggestions that are notable. Vectorization, explained here and here, and memoisation, caching prior results albeit with additional memory use, were relevant and significant.

What follows is a demonstration of the speed improvements that might be achieved using these concepts.

################################ # performance # vectorization and memoization ################################ # clear memory between changes rm(list = ls()) #load memoise #install.packages('memoise') library(memoise) # create test function monte_carlo = function(N) { hits = 0 for (i in seq_len(N)) { u1 = runif(1) u2 = runif(1) if (u1 ^ 2 > u2) hits = hits + 1 } return(hits / N) } # memoise test function monte_carlo_memo <- memoise(monte_carlo) # vectorize function monte_carlo_vec &l…

Neural Networks (Part 4 of 4) - R Packages and Resources

Image
While developing these demonstrations in logistic regression and neural networks, I used and discovered some interesting methods and techniques:
Better Methods A few useful commands and packages...:
update.packages() for updating installed packages in one easy actionas.formula() for creating a formula that I can reuse and update in one action across all my code sectionsView() for looking at data framesfourfoldplot() for plotting confusion matricesneuralnet for developing neural networkscaret, used with nnet, to create predictive modelplotnet() in NeuralNetTools, for creating attractive neural network models Resources that I used or that I would like to explore... MS Azure Notebooks, for working online with Python, R, and F#, all part of MS's data workflowsEfficient R Programming, that seems to have many good tips on working with RData Mining Algorithms in SSAS, Excel, and R, showing various algorithms in each technologyR Documentation, a high quality, useable resource To explor…