Normal equation in Excel (statistics)
I'm attempting to learn about statistics in the infrastructure asset management industry.
I have an existing exponential regression equation that is used to find the condition of a given asset:
Y = B - eax
In a related question, someone has suggested that I "can get variable a
directly from the normal equation":
Basic Exponential Regression
That sounds promising. But as someone who doesn't have a background in math, I'm having trouble converting the mathematical notation in that equation to Excel syntax.
Question:
How can I calculate the normal equation using Excel 2016 syntax/formulas?
Solution 1:
Here's an alternative that deals with the issue you raised in your comment to the answer posted on math.stackexhange
Your model is:
y = B - exp(a*x)
and you have, a priori, determined that B = 21.
The model is not an exact relationship between the y(i) and x(i) values, so it is usual to add an error term e(i) and represent the model as
y(i) = B - exp(a*x(i)) + e(i)
The values of y estimated by the model for each of the x values are denoted by y'(i) where
y'(i) = B - exp(a*x(i))
The method of least squares aims to pick the value of a which minimises the sum of the squares of the differences between the actual y(i) values and the corresponding estimated or y'(i) values.
y(i) - y'(i) = [B - exp(a* x(i)) + e(i)] - [B - exp(a*x(i))] = e(i)
So the sum of the squares of the differences between the y(i) and y'(i) values is
Sum[(y(i) - y'(i))^2] = Sum[e(i)^2]
The right hand side is the sum of the squared error terms, so is called the Error Sum of Squares or ESS.
The process of transforming the y(i) values to z(i) values via
z(i) = LN(21 - y(i))
creates a linear model
z = ax
which allows use of the LINEST
function to estimate the value "best fit" value of a. With the data provided in your answer (or your question on math.stackexchange) this best fit value of a is 0.147233 - the same answer you derived in your answer implementing the Normal Equation.
The ESS associated with this value of a is 8.27991. However, this value is not the minimum achievable value of the ESS. That occurs when a takes the value of 0.149140 and the corresponding ESS is 6.66073.
The screenshot below shows the calculations.
The estimated y'(i) and associated error squared and ESS values are shown for two versions of the model y = 21 - exp(a*x).
In Version 1, a is derived using the LINEST
approach, based on the transformed model z = ax. In Version 2, a is the value that minimises the ESS of the (untransformed) model. More on how this value of a was obtained is provided below.
With linear models such as y = mx + c, the Normal Equations provide a convenient way of estimating the values of m and c which minimise the ESS. Function LINEST
implements (amongst other things) the Normal Equations.
For non-linear models (such as y = 21 - exp(a*x)) such convenient equations generally do not exist, so other methods need to be used to find the value of a which minimises the ESS.
One approach is to use search methods: essentially try a range of different possibilities for a and pick the one that results in the smallest ESS.
This is effectively what the next screenshot shows. It uses what Microsoft calls a Data Table. This is a poor choice of name since a Data Table is not a table of data. Rather it is a tool for determining how the value of a calculation changes as one or two elements within the calculation are changed. It is found on the ribbon in the Forecast group of the Data menu by selecting "Data Table..." from the "What-if Analysis" icon.
Microsft's documentation to create and use a Data Table is pretty awful, so I will provide a bit of a cookbook approach.
- The data table itself is provided in range N2:O23.
- Cell
O2
contains the calculated value to be explored. This cell contains the formula=J4
which is the ESS associated with the value of a in cellJ3
. - The possibilities for the different values of a are placed in range
N3:N23
and the results of the ESS will appear in cellsO3:O23
. This provides 21 possible values of a. This is an arbitrary choice, data tables can involve a greater or smaller number of possible values. - Once the a values are in place, select the range
N2:O23
and initiate the Data Table dialog by selecting "Data Table..." from the "What-if Analysis" icon in the Forecast group of the Data menu. - In the dialog, put
$J$3
into the field labelled "Column input cell:" and press the "OK" button. - The range
O3:O23
will now be populated with the ESS values corresponding to the values of a inN3:N23
. Changing any of the valuesN3:N23
will update the ESS values inO3:O23
.
The a values in N3:N23
are set by formulae rather than being typed in. The values are set using a search strategy which looks at increasingly finer sets of values for a.
The 21 a values in N3:N23
are based around a Central value in position 11 - cell N13
- with cells above and below this successively differing by an Increment amount so that the whole range of 21 values is in ascending order.
The search strategy goes through a number of steps, with the step number being controlled by the value in cell O1
.
In step 1, the Central value is set at 0.15 (in cell R3
) and the Increment is set as 0.001 (in cell S3
) giving values in N3:N23
ranging from 0.14 to 0.16. This range is chosen on the basis of the Version 1 value of a, with an anticipation that the minimum ESS value will fall within this range.
This proves to be the case. For the 21 values of a starting at 0.14 and increasing by 0.001 to 0.16, the corresponding ESS values start at over 39 (when a is 0.14), decrease as a increases until a has the value of 0.149 (when the ESS is 6.66972) and then increase reaching an ESS value of over 70 when a is 0.16. This shows that the value of a which minimises the ESS is in the vicinity of 0.149.
(Had it not proved to be the case that a minimum is found within the range of a values, the ESS values would have either all increased or decreased putting the minimum at one end of the range. In this case the Central value (in cell R3
) will require adjustment with possibly an increase in the Increment value (in cell S3
) until a mid-range minimum is found.)
For any range of values in N3:N23
, cells O27
and N27
respectively identify the minimum ESS value and the value of a that produces the minimum.
The value of a producing the minimum, provides the new Central value for the next step of the search. The new Increment is the previous value reduced by a factor of 10. These new Central and Increment values are entered manually into the "control table" in columns R
and S
and the step number is manually increased by 1 in cell O1
.
The search proceeds through successive steps, terminating when no practical reductions in the ESS value can be obtained.
The screenshot shows the results at step 2 of the search.
Solution 2:
Here's what it would look like in Excel:
I tried to describe it in pseudo code:
- For each record in the set, calculate x*LN(21-y). Calculate the sum of those values (we'll call it "sum 1").
- For each record in the set, calculate x^2. Calculate the sum of those values (we'll call it "sum 2").
- Divide sum 1 by sum 2.
From a colleague:
i represents the particular observation. All of these calculations assume there are a fixed number, usually called n, pairs of observations. For example, you had 20 pairs of observations in your data. Here, pairs means the x and y value together, usually denoted as
(x, y), (0, 20), (1, 20)....(20, 2)
. The i represents the i'th pair of observation amongst all n pairs.So, if i = 1, this means we are referring to the first pair,
(0, 20)
. If i = 14, we take the 14th pair,(14, 12)
. In general, mathematically, the ith observation pair is(xi, yi)
, i is in subscript.The sigma sign which says i = 1 to n, means essentially that we are taking all the pairs of observations starting from 1st observation, until the last.