Regression Analysis
- Overview
Regression is one of the most common models of machine learning (ML). It differs from classification models because it estimates a numerical value, whereas classification models identify which category an observation belongs to.
Regression analysis is a statistical method that estimates the relationship between a dependent variable and one or more independent variables. It can be used to assess the strength of the relationship (correlation) between variables and to model future relationships.
Regression analysis can help identify associations between variables, and determine the statistical significance of those associations. It can also be used to analyze different factors that might influence an objective, and determine which factors are important.
The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.
The main uses of regression analysis are forecasting, time series modeling and finding the cause and effect relationship between variables.
Please refer to the following for more information:
- Wikipedia: Regression Analysis
- The Applications of Regression Analysis
Regression analysis is a powerful tool for revealing associations between variables observed in data, but it cannot easily show cause and effect. It is used in a variety of contexts in business, finance, and economics.
For example, it is used to help investment managers evaluate assets and understand the relationship between factors such as commodity prices and the stocks of companies that trade those commodities.
Solving regression problems is one of the most common applications for machine learning (ML) models, especially in supervised ML. Algorithms are trained to understand the relationship between independent variables and an outcome or dependent variable.
The model can then be leveraged to predict the outcome of new and unseen input data, or to fill a gap in missing data.
Regression analysis consists of three stages:
- Analyzing the correlation and directionality of the data
- Estimating the model, or fitting the line
- Evaluating the validity and usefulness of the model
- Relationships among Variables
In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".
Here are some examples of regression analysis:
- Age and height: Age and height can be described using a linear regression model. As a person's age increases, so does their height, so they have a linear relationship.
- Product launches, business growth, and marketing campaigns: Regression analysis can be used to analyze different factors that might influence an objective, and determine which factors are important.
Equations are used in mathematics to express the relationships among variables. In fields such as geometry or trigonometry, these mathematical equations, or functions, express the deterministic (exact) relationship among the variable of interest.
The equation A = s2 describes the relationship between s (the length of the side of a square) and A (the area of the square).
By substituting numerical values for the variables on the right-hand sides of these equations, we can determine the exact value of the quantities on the let-hand side.
- Statistical Relationships
Regression is a key element of predictive modelling, so can be found within many different applications of ML. Whether powering financial forecasting or predicting healthcare trends, regression analysis can bring organisations key insight for decision-making. It’s already used in different sectors to forecast house prices, stock or share prices, or map salary changes.
In the social sciences and in the fields such as business and government adminstration, exact relationships are generally observed among variables, but rather statistical relationships prevail.
A statistical relationship is a statistically significant association between two variables. This significance is based on the level of a probability test.
A statistical relationship is a combination of deterministic and random relationships. A deterministic relationship is an exact relationship between two variables. For example, if you earn $10 per hour, you earn ten dollars more for every hour you work.
A statistically significant relationship is one that is large enough to be unlikely to have occurred in the sample if there's no relationship in the population.
Two or more variables are considered to be related if their values change so that as the value of one variable increases or decreases, so does the value of the other variable. However, the change may be in the opposite direction.
For example, if males earn more than females, the values of Income increase when the values of Gender change.
Statistical relationships can be visualized using scatter plots and line plots. The strongest linear relationship occurs when the slope is 1. This means that when one variable increases by one, the other variable also increases by the same amount.
- Regression Models
Regression analysis is an integral part of any forecasting or predictive model, so is a common method found in machine learning powered predictive analytics.
Alongside classification, regression is a common use for supervised machine learning models. This approach to training models required labelled input and output training data.
A regression model is a statistical model that estimates the relationship between a dependent variable and one or more independent variables. It uses a line or plane to fit the observed data and describe the relationship between the variables.
Machine learning regression models need to understand the relationship between features and outcome variables, so accurately labelled training data is vital.
Regression models can be used for many types of predictions and for determining the effects on target variables. For example, a sales manager might use regression analysis to predict next month's sales.
Here are some types of regression models:
- Linear: A linear regression is a model where the relationship between inputs and outputs is a straight line.
- Multiple: Multiple regression indicates that there are more than one input variables that may affect the outcome.
- Regression models can also be logistic and nonlinear, which use a curved line.
To conduct a regression analysis, you can:
- Gather data on the variables in question.
- Plot all the information on a chart.
- Analyze the correlation and directionality of the data.
- Estimate the model, i.e., fitting the line.
- Evaluate the validity and usefulness of the model.
[More to come ...]