Logistic Regression on Stock Data using Google and SPY (SPDR S&P 500)

As part of a Pluralsight training course, Understanding and Applying Logistic Regression, students worked through various exercises, one of which was predicting whether Google's stock price will move up or down on a given day, using the previous day's Google and SPY (SPDR S&P 500) returns.

As an ordered list of actions:
  1. Load data: Yahoo Finance daily prices for five years, keeping only the date and adjusted closing price for this analysis
  2. Transform the sources: merge them on date, rename the columns, cast the Date column to Date type, and sort descending (newest first)
  3. Compute daily returns and pair each day with the previous day's (lagged) returns
  4. Perform logistic regression of the up/down indicator on the lagged returns
  5. Create a frame of actual versus predicted changes, and add a column flagging whether each prediction was correct
  6. Find the percent correct, i.e., how often the predicted direction, up or down, matched the actual move
As a result, the lagged Google and SPY returns correctly predict the direction of the next day's Google move about 63% of the time.

Source data is here.

 
 # Clear memory  
 rm(list = ls())  
 
 # Set working directory  
 setwd("../Data")  
 getwd()

 # load data
 # Data is Yahoo Finance daily prices for 5 years,
 # taking only the date and adjusted closing price for this analysis
 GOOG.data <- read.csv("GOOG.csv", header = TRUE, sep = ",")[, c("Date", "Adj.Close")]
 SPY.data <- read.csv("SPY.csv", header = TRUE, sep = ",")[, c("Date", "Adj.Close")]
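
 # alternative (a sketch, assuming the quantmod package is installed): the
 # same series can be pulled straight from Yahoo Finance instead of CSVs
 # library(quantmod)
 # getSymbols("GOOG", src = "yahoo", from = Sys.Date() - 5 * 365)
 # GOOG.data <- data.frame(Date = index(GOOG), Adj.Close = as.numeric(Ad(GOOG)))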

 # merge sources  
 GOOG.merged <- merge(GOOG.data, SPY.data, by = "Date")  

 # change column headings for Google and for SPY  
 names(GOOG.merged)[2] <- "GOOG.Close"  
 names(GOOG.merged)[3] <- "SPY.Close"  

 # cast as Date (as.Date assumes the default YYYY-MM-DD format here)
 GOOG.merged$Date <- as.Date(GOOG.merged$Date)
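 # note: in some data sets the date is formatted as MM/DD/YYYY; if so,
 # pass an explicit format string instead:
 # GOOG.merged$Date <- as.Date(GOOG.merged$Date, format = "%m/%d/%Y")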

 # sort descending by date, so the newest row comes first
 GOOG.merged <- GOOG.merged[order(GOOG.merged$Date, decreasing = TRUE),]

 # calculate simple returns from the previous row; with the newest row first,
 # the prior trading day sits one row below, so return = today / yesterday - 1
 GOOG.returns <- GOOG.merged
 GOOG.returns[-nrow(GOOG.merged), -1] <- GOOG.merged[-nrow(GOOG.merged), -1] / GOOG.merged[-1, -1] - 1
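
 # sanity check (illustrative): the newest return should equal the newest
 # close divided by the prior day's close, minus 1
 all.equal(GOOG.returns[1, 2], GOOG.merged[1, 2] / GOOG.merged[2, 2] - 1)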

 # change column headings for returns  
 names(GOOG.returns)[2] <- "GOOG.Returns"  
 names(GOOG.returns)[3] <- "SPY.Returns"  

 # remove the last row, which has no earlier price to compute a return from
 GOOG.returns <- GOOG.returns[-nrow(GOOG.returns),]

 # correlation of prices  
 (corr.Prices <- cor.test(GOOG.merged$GOOG.Close, GOOG.merged$SPY.Close))  

 # correlation of returns  
 (corr.Return <- cor.test(GOOG.returns$GOOG.Returns, GOOG.returns$SPY.Returns))  
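
 # the correlation estimates can be pulled out of the htest objects
 # directly if only the numbers are needed
 corr.Prices$estimate
 corr.Return$estimate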

 # pair each day's returns with the previous day's (lagged) returns:
 # dropping the last row of one copy and the first row of the other
 # shifts the two copies by one trading day
 GOOG.lagging <- data.frame(GOOG.returns[-nrow(GOOG.returns),], GOOG.returns[-1, -1])

 # rename the lagged columns  
 names(GOOG.lagging)[4:5] <- c("GOOG.Returns.Lagged", "SPY.Returns.Lagged")  
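
 # spot-check the alignment: GOOG.Returns.Lagged on each row should equal
 # GOOG.Returns from the row below (the previous trading day)
 head(GOOG.lagging[, c("Date", "GOOG.Returns", "GOOG.Returns.Lagged")])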

 # create the response for logistic regression: TRUE when GOOG moved up
 GOOG.lagging$Up <- GOOG.lagging$GOOG.Returns >= 0

 # perform logistic regression
 GOOG.logRegression <- glm(Up ~ GOOG.Returns.Lagged + SPY.Returns.Lagged, data = GOOG.lagging, family = binomial)
 summary(GOOG.logRegression)  
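
 # the fitted model can also score fresh observations (a sketch; the lagged
 # return values below are made-up, purely for illustration)
 new.day <- data.frame(GOOG.Returns.Lagged = 0.01, SPY.Returns.Lagged = -0.005)
 predict(GOOG.logRegression, newdata = new.day, type = "response")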

 # create a frame of actual versus predicted, calling a fitted
 # probability of 0.5 or more an "up" prediction
 GOOG.fitted <- data.frame(GOOG.lagging$Up, fitted(GOOG.logRegression) >= 0.5)
 names(GOOG.fitted) <- c("Actual", "Predicted")  

 # create column for correct/incorrect prediction
 GOOG.fitted$CorrectForecast <- GOOG.fitted$Actual == GOOG.fitted$Predicted

 # find percent correct: the mean of a logical vector is the share of TRUEs
 mean(GOOG.fitted$CorrectForecast)
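
 # a confusion matrix of actual versus predicted direction gives a fuller
 # picture than the single accuracy figure
 table(Actual = GOOG.fitted$Actual, Predicted = GOOG.fitted$Predicted)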
