Part 1 - Regression Models
In this tutorial series we’ll be going over how to create a simple regression analysis library in Java. If you have any prior knowledge of regression analysis, you will probably know that it is a very large field with a great many applications. We won’t be covering any massively advanced techniques in this series. Our final library will be able to produce the same results as you would find in Microsoft Excel (excluding the graph plotting), which in most basic circumstances will be plenty to get you some good results.
It’s best if you start this series with a sound knowledge of OOP (object-oriented programming) practices in Java as this series will include the use of abstract classes and polymorphism. You will also need a good knowledge of some of the more basic concepts in Java such as looping, methods and variables. I will do my best to explain the code as much as I can but it is advisable that you have some prior knowledge.
As regression analysis is a mathematical technique, this series will of course focus on mathematical concepts, so you will need a sound knowledge of algebra and graphs. I will again do my best to explain all of the concepts as much as possible to cater for beginners, but people who have a basic algebra or statistics course under their belts will find things a lot easier.
What is Regression Analysis?
So enough of the introductions, let’s get straight in! If you haven’t heard of regression analysis before, you are probably already asking what it is and why it is useful. From the Wikipedia article on regression analysis:
“a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analysing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.”
Well, that hasn’t really helped much, now has it? It is much simpler to understand if you think about the variables as the X and Y coordinates on a graph.
Consider the case where you have a simple scatter plot. You have a set of X and Y coordinates that are plotted on a graph with two axes - x and y. For example, take a graph where the data runs up to an X value of 11. Say these values are from a particular stock on the stock exchange (regression analysis has a lot of applications in stocks and shares). The X values each represent a month in the year and the respective Y coordinates are the average price of the stock in that particular month. From the plot we can see that the price of the shares is steadily increasing, but we don’t possess any data for the 12th month. Is the price going to increase or decrease in December? How can we find out? For market traders this is very important information that can make them or lose them millions. The answer - regression analysis!
So we have data up to November and we want to find out what the Y value is when X is 12. The trouble is it’s not December yet, so we don’t know what it is. We need a forecast model. Let’s revisit the situation. We have an X value and we need the Y value. Hopefully this is ringing some bells. It sounds an awful lot like a good use of a function such as Y = aX + b (or it could be any other function). We can insert an X value of 12 and we get back the corresponding Y value, which is the average stock price for December. Sounds great, but we have a problem. We don’t know the values of a (the gradient) and b (the intercept)! The function could have any intercept and gradient. We currently don’t have a clue. We could make them up, but someone like a market trader doesn’t want to risk their money on a made-up value. We need a way to find the values of a and b which, when put into the function, will give us back an accurate value for the price in December.
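To make the problem concrete, here is a minimal Java sketch of the function Y = aX + b. The class name `LineGuess` and both pairs of a and b values are made up purely for illustration; they are not fitted to any data. It simply shows that two different guesses give two quite different December forecasts, which is why guessing isn’t good enough.

```java
// Hypothetical illustration: evaluating Y = aX + b with guessed parameters.
class LineGuess {

    // Returns the Y value at x for a line with gradient a and intercept b.
    static double predict(double a, double b, double x) {
        return a * x + b;
    }

    public static void main(String[] args) {
        double december = 12.0; // X value for the 12th month

        // Two made-up (a, b) pairs; neither comes from real stock data.
        double guessOne = predict(3.0, 5.0, december);
        double guessTwo = predict(0.5, 20.0, december);

        System.out.println("Forecast with a=3.0, b=5.0:  " + guessOne);
        System.out.println("Forecast with a=0.5, b=20.0: " + guessTwo);
    }
}
```

Both lines are perfectly valid functions of X, yet they disagree about December. Without some way of choosing a and b from the data, there is no reason to trust one over the other.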
Armed with that knowledge, let’s go back to the Wikipedia definition. ‘estimating the relationships among variables’ - this kind of makes more sense now. As X increases, what does Y do? This is called the relationship between the two variables. If the Y values are increasing a lot as X increases, our forecast should reflect this relationship. We now need to label X and Y in more formal terms. Y is the dependent variable. It depends on the independent variable X and the parameters a and b to give it a value.
We can now again go back to the Wikipedia definition. ‘helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables or parameters are held fixed.’ Again this makes more sense now. We want to analyse how the dependent variable Y changes as the independent X value is varied and the parameters a and b are kept fixed. This is most often done through a function such as Y = aX + b.
So essentially we want to find some function that best fits the data points that we have for the other months. The function models the relationship between X and Y. Once we have this function we can plug in X values and get the Y values that follow the relationship. This has many uses!
Let’s go back to our example. We want to forecast the stock price in December. We therefore need to find some function that relates the month to the price. This is regression analysis in its simplest form. Things get harder when we have to figure out which function is best to model the relationship (is it a straight line, an exponential curve, etc.) and how good our model is at describing the relationship, but we will move on to that in later parts of this series.
The most basic form of regression analysis is linear regression - that is, finding a linear function that best models the relationship between the two variables. The base linear function is Y = aX + b from earlier. We want to find the price Y, where X is the month. We need to find the best values for a and b that produce a line that follows our current data as closely as possible. If the line is accurate, we can use it to forecast other months. Our function becomes PRICE = a * MONTH + b. A huge part of regression analysis is finding the best values of a and b that produce a line that closely models our current data set.
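As a rough preview of what “finding the best values” can look like, the sketch below fits a and b using ordinary least squares, one common definition of the “best” line. This part of the series hasn’t settled on a method yet, so treat that choice as an assumption. The class name `SimpleLinearFit` and the price figures are also made up for illustration and are not real stock data.

```java
// Hypothetical sketch: fitting PRICE = a * MONTH + b by ordinary least squares.
class SimpleLinearFit {

    // Returns {a, b}: the gradient and intercept that minimise the sum of
    // squared vertical distances between the line and the data points.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two (x, y) pairs of equal length");
        }
        int n = x.length;

        // Work out the mean of the x values and the mean of the y values.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("All x values are identical; the gradient is undefined");
        }

        double a = sxy / sxx;         // gradient
        double b = meanY - a * meanX; // intercept
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        // Made-up average prices for months 1 to 11 (illustrative only).
        double[] month = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        double[] price = {10.2, 11.1, 11.9, 13.4, 13.8, 15.1, 16.3, 16.8, 18.2, 19.0, 20.4};

        double[] ab = fit(month, price);
        double december = ab[0] * 12 + ab[1]; // PRICE = a * MONTH + b with MONTH = 12
        System.out.printf("a = %.3f, b = %.3f, forecast for month 12 = %.2f%n",
                ab[0], ab[1], december);
    }
}
```

The `fit` method returns the pair as a plain `double[]` to keep the sketch short; a fuller library would probably wrap a and b in a model class of its own.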