
Section 13.2 Developing a Regression Model

Now we are going to learn techniques for creating models that fit our data. The modeling procedure we will study is regression analysis. We will discuss how to perform the calculations involved in creating and assessing regression models, but in most cases we will rely on the “Regression” tool under Data Analysis on Excel's Data tab.
Figure 13.2.1. Regression Example powered by Desmos
Figure 13.2.2. Regression Line with Error powered by Desmos

Definition 13.2.3.

Regression analysis enables us to describe a straight line that best fits the data.
  • In simple regression analysis, there is one independent variable.
  • In multiple regression analysis, there is more than one independent variable.
The formula for the linear regression model created with sample data is
\begin{equation*} \hat{y} = b_0+b_1x_1+b_2x_2 + \cdots +b_kx_k \end{equation*}
where
\begin{align*} \hat{y} \amp= \text{the predicted value of $y$ given all the $x$'s in the model} \\ x_1,x_2,\cdots, x_k \amp= \text{ the independent variables in the model}\\ k \amp= \text{ the number of independent variables in the model} \\ b_0 \amp= \text{ the $y$-intercept of the regression line} \\ b_k \amp= \text{ the average change in } \hat{y} \text{ due to a one-unit change in } x_k \text{ with all }\\ \amp \text{ other $x$'s constant} \end{align*}
The \(b_k\) are called regression coefficients. The simple regression model is often just called the regression line.
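Although we will usually read these coefficients from the “Regression” tool's output, Excel's worksheet functions can produce them as well. Here is a minimal sketch, assuming a purely hypothetical layout with the independent variables in columns A through C (rows 2–21) and the dependent variable in D2:D21:
  • One independent variable (simple regression): =SLOPE(D2:D21, A2:A21) returns \(b_1\) and =INTERCEPT(D2:D21, A2:A21) returns \(b_0\text{.}\)
  • Several independent variables: =LINEST(D2:D21, A2:C21) returns the regression coefficients as an array, listed in reverse order \((b_3, b_2, b_1, b_0)\text{.}\)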

Exercise 13.2.4.

(Donnelly 14.56)
As a measure of productivity, Verizon Wireless records the number of customers each of its retail employees activates weekly. An activation is defined as either a new customer signing a cell phone contract or an existing customer renewing a contract. The data table found in this lesson’s Excel file shows the number of weekly activations for eight randomly selected employees along with their job-satisfaction levels rated on a scale of \(1-10\) (\(10=\) Most satisfied). external/sheets/VerizonExample.xlsx

(a)

Construct a scatter plot for these data. Let satisfaction be the independent variable and let activations be the dependent variable.
Answer.
(Excel file with solutions is in the answer to the last part of this problem.)

(b)

Determine the equation of the regression line for the data.
Answer.
\(\hat{y}=6.919x-23.468\)

(c)

Interpret the value of the slope in the equation.
Answer.
The slope means that for each one-unit increase in satisfaction score, the predicted number of weekly activations increases by 6.919 on average.

(d)

Predict the number of activations for an employee with a satisfaction level of 7.4.
Answer.
For a satisfaction score of 7.4, the predicted number of activations is
\begin{equation*} \hat{y}=6.919(7.4)-23.468=27.7326 \end{equation*}
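Excel can also make this substitution for us. As a sketch, assuming the satisfaction scores from VerizonExample.xlsx were placed in B2:B9 and the activations in C2:C9 (hypothetical cell locations):
  • Predicted activations at a satisfaction level of 7.4: =FORECAST.LINEAR(7.4, C2:C9, B2:B9) (in older versions of Excel, =FORECAST(7.4, C2:C9, B2:B9)).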

(e)

Calculate the correlation coefficient for this sample.
Answer.
\(r\approx 0.8437\)
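With the same hypothetical layout (satisfaction in B2:B9, activations in C2:C9), this value can be checked with a single worksheet function:
  • Sample correlation coefficient \(r\text{:}\) =CORREL(B2:B9, C2:C9)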

Exercise 13.2.5.

(Donnelly 15.6)
A hospital would like to develop a regression model to predict the total hospital bill for a patient based on the age of the patient (\(x_1\)), the patient’s length of stay (\(x_2\)), and the number of days in the hospital’s intensive care unit (ICU) (\(x_3\)). Data for these variables can be found in the table in the Excel file below. external/sheets/HospitalExample.xlsx

(b)

Interpret the meaning of the regression coefficients.
Answer.
Holding the other two variables constant: for each additional year of age, the predicted hospital bill increases by $113.56; for each additional day spent in the hospital, it increases by $1218.63; and for each additional day spent in the ICU, it increases by $2213.21.

(c)

Predict the average hospital bill for a 76-year-old person hospitalized for 5 days with 3 days spent in the ICU.
Answer.
The model predicts the hospital bill for a 76-year-old patient hospitalized for 5 days, with 3 of those days spent in the ICU, to be
\begin{equation*} \hat{y}=-462.5+113.555(76)+1218.626(5)+2213.213(3)\approx \$20{,}900.45 \end{equation*}
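If you would rather not type out the coefficients, Excel's TREND function produces the same prediction directly from the data. As a sketch, assuming age, length of stay, and ICU days occupy A2:C21 and the bills occupy D2:D21 (hypothetical ranges; adjust them to match the rows in HospitalExample.xlsx):
  • Predicted bill for this patient: =TREND(D2:D21, A2:C21, {76,5,3})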

Definition 13.2.6.

The regression line will not pass through each of the data points. Hence, there is error between the observed value of \(y\) from the data and the value, \(\hat{y}\text{,}\) predicted by the regression line. This difference, \(e_i=y_i-\hat{y_i}\text{,}\) is called the residual.
Figure 13.2.7. Regression Line with Error powered by Desmos

Exercise 13.2.8.

Let’s look at the regression model below and think about the actual versus the predicted \(y\)-values.
Figure 13.2.9. Regression Line Error powered by Desmos

(a)

    For \(x=1\text{,}\) what is the corresponding \(y\)-value, and what is \(\hat{y}\text{?}\)
  • \(y=3\)
    \(\hat{y}\approx 1.581\)
  • \(y\approx 1.581\)
    \(\hat{y}\approx 3\)
  • \(y=1\)
    \(\hat{y}=1\)

(b)

    For \(x=1\text{,}\) what is \(y-\hat{y}\text{?}\)
  • \(y-\hat{y}\approx 3-1.581=1.419\)
  • \(y-\hat{y}\approx 0.5-1.581=-1.081\)
  • \(y-\hat{y}\approx 0.5-0=0.5\)

Definition 13.2.10.

The mathematical procedure used to find the regression line is the least squares method. The least squares method minimizes the total squared error between the observed values of \(y\) and the predicted values \(\hat{y}\text{.}\) This sum is called the sum of squares error (SSE) and is defined by the formula
\begin{equation*} SSE=\sum_{i=1}^n (y_i-\hat{y_i})^2. \end{equation*}
Minimizing the SSE results in the best-fitting line through the data points.
Figure 13.2.11. Least Squares Visualization (Made in GeoGebra by Ivan Gatewood)
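The SSE can also be seen on a worksheet. As a sketch, again assuming the Verizon data with satisfaction in B2:B9 and activations in C2:C9 (hypothetical locations), and with the predicted values \(\hat{y_i}\) placed in E2:E9:
  • Predicted values \(\hat{y_i}\text{:}\) =TREND(C2:C9, B2:B9), entered over E2:E9 (older versions of Excel require entering this as an array formula).
  • SSE: =SUMXMY2(C2:C9, E2:E9), which computes \(\sum (y_i-\hat{y_i})^2\text{.}\)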

Definition 13.2.12.

There are two other “sum of squares” related to our data points:
  • the total sum of squares (SST):
    \begin{equation*} SST= \sum_{i=1}^n ( y_i-\bar{y} )^2 \end{equation*}
  • the sum of squares regression (SSR):
    \begin{equation*} SSR= \sum_{i=1}^n (\hat{y_i}-\bar{y} )^2 \end{equation*}
(where \(\bar{y}=\) the average value of the dependent variable from the sample)

Definition 13.2.13.

All of the sum of squares are related:
\begin{equation*} SST=SSR+SSE \end{equation*}
  • The total sum of squares (SST) measures the total variation in the dependent variable.
  • The sum of squares regression (SSR) measures the amount of variation in the dependent variable that is explained by the independent variable(s).
  • The ratio of these two numbers, \(R^2=\frac{SSR}{SST}\text{,}\) is called the coefficient of determination. It measures the percentage of the total variation of the dependent variable that is explained by the independent variable(s) in the model.
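Continuing the hypothetical worksheet from above (observed \(y\) in C2:C9, independent variable in B2:B9, predicted \(\hat{y}\) in E2:E9), each of these sums of squares can be checked directly:
  • SST: =DEVSQ(C2:C9), which computes \(\sum (y_i-\bar{y})^2\text{.}\)
  • SSE: =SUMXMY2(C2:C9, E2:E9)
  • SSR: =DEVSQ(C2:C9)-SUMXMY2(C2:C9, E2:E9), or equivalently =DEVSQ(E2:E9), since the fitted values have the same mean \(\bar{y}\text{.}\)
  • \(R^2\text{:}\) =RSQ(C2:C9, B2:B9) for a simple regression, which should agree with SSR/SST.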

Exercise 13.2.14.

Let’s go back to the Verizon example from earlier.
As a measure of productivity, Verizon Wireless records the number of customers each of its retail employees activates weekly. An activation is defined as either a new customer signing a cell phone contract or an existing customer renewing a contract. The data table found in this lesson’s Excel file shows the number of weekly activations for eight randomly selected employees along with their job-satisfaction levels rated on a scale of \(1-10\) (\(10=\) Most satisfied). external/sheets/VerizonExample.xlsx
(Here are the solutions to all parts from this problem: external/sheets/VerizonExampleRegressionSolutions.xlsx)

(a)

Identify the SST. (Hint: Use the “Regression” output from Excel.)

(b)

Partition the SST into the SSE and the SSR.

(c)

Calculate the coefficient of determination, \(R^2\text{.}\)

(d)

Interpret the coefficient of determination.

Note 13.2.15.

For a simple regression, “Multiple R” under “Regression Statistics” in the Excel output is the absolute value of the sample correlation coefficient, \(r\text{.}\)
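In general, “Multiple R” equals \(\sqrt{R^2}\text{.}\) As a quick check using the Verizon exercise above, where \(r\approx 0.8437\text{:}\)
\begin{equation*} R^2\approx (0.8437)^2\approx 0.7118 \qquad\text{and}\qquad \text{Multiple R}=\sqrt{R^2}\approx 0.8437 \end{equation*}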