How to interpolate intermediate values for arbitrary data in Excel
I have data tables like this example, nine entries at A1:B9 in this case:
A B
-- ---
1 2.9
2 5.06
3 7
4 8.84
5 10.87
6 13.24
7 16.22
8 20.25
9 36.7
The above represents nine measurements of a non linear increasing physical variable in B, Voltage for example, and A represents exactly each one of nine round minutes the measurement was done.
I want to create a second table, columns E and F, with a quantity of rows that is the "next integer" for the highest value in the B column. In this case, B9=36.7, thus it will have 37 rows. Column F1:F37 will contain integers 1 to 37, column E must have numeric values that correspond to F, in the same relationship as between columns A to B. In other words, interpolate the column E values corresponding to the column F values.
For example, A3=3 and B3=7. In this case, F7=7 and E7=3 because B already included the integer 7 and has a matching value in column A. However, F8=8, which is an intermediate value not contained in column B. So E8 will lie between 3 and 4, based on the original data, and must be interpolated.
The idea is when plotting a graph, A1:B9 will have the same shape as E1:F37. In this example, I will expand the data table to 37 integer results that would have occurred over the course of the original measurements, and will see what time (in column E, with decimal places), those values would have occurred.
What I've tried
In trying to solve this myself, I was able to find a time consuming formula (note that in my attempt, my E and F columns are reversed from what I described above).
- I created a column (K) containing the difference between the B column elements. K5 = B5-B4. That is the Y displacement for every X increment.
- Column E will contain as many sequential integer numbers (37), starting at 1, as the next integer value of the biggest element in B. In this case, B9 contains 36.7, so 37.
- On F1:F37 I input the following formula.
Cell F1 contains:
=IF(E1>$B$9,$A$9+(E1-$B$9)/$K$9,IF(E1>$B$8,$A$8+(E1-$B$8)
/$K$9,IF(E1>$B$7,$A$7+(E1-$B$7)/$K$8,IF(E1>$B$6,$A$6+(E1-$B$6)
/$K$7,IF(E1>$B$5,$A$5+(E1-$B$5)/$K$6,IF(E1>$B$4,$A$4+
(E1-$B$4)/$K$5,IF(E1>$B$3,$A$3+(E1-$B$3)/$K$4,IF(E1>$B$2,$A$2+
(E1-$B$2)/$K$3,IF(E1>$B$1,$A$1+(E1-$B$1)/$K$2,E1/$K$1)))))))))
It works pretty well. But it is not an automated formula; one must enter as many "IFs" as elements in the columns A+B (X+Y). I tested scatter charts with lines from A1:B9 and E1:F37 (reversed for correct X/Y sequence), and they generated exactly the same curve shape, so it works.
But it is not an effective solution because it requires a tedious, custom, manual process for each data set. I'm looking for a way to accomplish this in a more automated way with features built into Excel, or at least a more generic approach using formulas.
Solution 1:
Short Answer
Interpolation is based on an equation that relates X and Y values. If you know the actual equation, you can directly calculate any intermediate values you want. If you don't, you interpolate using an approximation. The quality of the approximation determines how accurate your intermediate values will be. Linear interpolation will be crude if you are approximating a curve with a limited number of points. There are several other approaches that will give you better results, and built-in analysis tools that will do most of the work.
Long Answer
You're looking for a "general formula" or solution that automates interpolating intermediate values. You can use linear interpolation for pretty much any data, but the results will be crude if there is a limited number of data points and significant curvature in the shape of the data. There isn't a "one size fits all" solution if you want accuracy. The best solution for a given dataset will depend on the characteristics of the data.
The Equation
No matter how you do it, interpolation is accomplished using an equation that defines the relationship between X and Y. The equation will be either the actual one or an estimate. If it is an estimate, there are a number of different approaches that are driven by the nature of the data and what you need to accomplish.
In your other question, you used data based on the equation Y=2^X
. When you have the actual equation, you can interpolate exactly. Pick a new value for either X
or Y
and the equation gives you the other value. If you don't know the actual equation, you need to find one that approximates it. I'll use this answer to focus on interpolation approaches. These generally use built-in analysis tools that do most of the work. If you need more detail on the mechanics of using a specific tool or a more automated approach, we can expand on that in another answer.
Try to find the actual equation
The best solution is to see if you can determine what the actual equation is. If you know the process that generated the data, that may tell you the nature of the equation. Many processes, when under controlled conditions so you are dealing with a single driving variable and no random noise, follow a simple curve for which the type of equation is known. So a first step is to look at the shape of the data and see if it is similar to one of those.
An easy way to do that is to graph the data and add a trend line. Excel has a number of common curves available to try to fit.
Let's try this with the 2^N
data from your other question. If you didn't recognize the number pattern and tried the trend line approach, you would see the icons of different shaped curves. The exponential curve is the same general shape, and that would give you this:
Excel uses e
rather than 2
as the base, which is just a translation (e0.693 is 2
). Visually, you can see that the trend line exactly follows the data. The R2 also tells you that. R2 is a statistical measure of how much of the variation in the data you account for with your equation. The value 1
means the equation accounts for 100% of the variation, or a perfect fit.
The example in this question also has sort of an exponential shape. If you try the same approach, you get this result:
So this data isn't exponential. We can try a polynomial, which describes some natural processes and is able to mimic a variety of curves (I'll talk more about it later):
As an approximation of the process behind the data, it isn't a great fit. At the third order (an equation containing powers of X up to X^3), it has more major inflection points than the data and still doesn't match. So the underlying equation doesn't look like a simple, common curve, which means the equation will need to be approximated.
Linear Interpolation
This is the approach you describe in your comments. It is straightforward, using a simple formula, and fairly easy to automate. It can be adequate if you have a lot of points, and straight lines between them are close enough. On many curves, short segments of some areas will be close to straight lines. However, it's a poor approximation for a curved line, and your results will be inaccurate in areas with any significant curvature. In your example, the area between X values of 7 and 8 would have a lot of curvature. In this area, a straight line compared to the actual curve would look like this:
You're looking for a general solution that will apply to any data. You may find that linear interpolation is too crude for some data.
Regression
People have suggested regression as an approach, here and in other posts. It can be done using trend lines or their underlying worksheet functions, or the analysis tools (I think that might be in the Analysis Toolkit, which might require loading that option into Excel, it may not be loaded by default).
Regression tries to fit a curve to your data with the objective of minimizing the total error between the data and the curve. In its normal usage, it isn't the right tool for this task (it's the method used for fitting the trend lines, and you saw how that compared to what you need).
-
It's intended for situations where your goal is to model the process behind the data. The data are assumed to be inaccurate, and regression suggests what it is really supposed to be. The curve found by regression may not pass through any of the actual data points. In your case, the data are given and assumed to be accurate. The curve must pass through every point.
-
Regression tries to fit a single equation to all of the data. It won't be effective if the process that created the data is not described by the types of equations available to try. With a lot of data points, linear interpolation of each segment can be a better approximation than a regression curve for all of the data.
However, rather than employing it in the usual way, regression can be "abused" as a workaround for what you want, and it will usually work. When you are trying to model a process, the simplest formula is usually valued (Occam's razor). On the other hand, with a complex enough equation, you can fit anything. You can always draw a scribble that will pass through every point. With N
points, you can find an N-1
order polynomial equation that will pass through all of the points (worst case scenario).
I say "usually" because in some cases, it is a pretty tortured line that would be useless for your purpose. And note that this approach doesn't really "model" anything in the sense that the resulting equation would predict behavior outside the range of the data.
Here's an analysis of your data using polynomial regression with successively higher-order equations (the first screenshot includes orders 3 - 5):
(Click on the image for a readable size.) Note that the anaysis tool includes the kind of interpolation you want to do; it generated the intermediate values. For each analysis, the a(n)
values are the coeficients of the equation it found. a(0)
is a constant, a(1)
is the coeficient for the X^1 term, etc. It shows the R2 value of the fit. It needs to be virtually 1
to be close enough for your purpose.
I've highlighted the original data values with the biggest differences. In this range of orders, the fit gets a little better with each successive order, but which specific points are more accurately described can change. Here's a chart of those three:
When we get to the 6th and 7th order polynomial, it looks like this:
If we went to an 8th order polynomial for your 9 values, it would be perfect, but the 7th order is probably close enough. For perspective, notice that the 7th order equation has an R2 of .99999 and still isn't perfect.
Using the regression analysis tool to find an adequate fit (in this case the 7th or 8th order equation), would produce the intermediate values you want. But it is a good idea to chart the result and eyeball the curve to ensure that it isn't a scribble.
Splines
If you chart your data and select the option for smooth lines, what Excel uses to produce that is splines. In fact, almost every application of computer graphics (including font definitions), is based on splines for smooth curves and curve transitions. It's named after the flexible rule that draftsmen once used to connect arbitrary points with a curve.
Splines create the curve for each section, a section at a time, considering the adjacent points. The curve passes through each point and there are no abrupt changes on either side of the point such as you get when connecting the points with straight lines.
The equations used for splines make no attempt to model the process that produced the data; it's strictly to look pretty. However, most processes follow some kind of continuous, smooth curve. When you deal with a single curve segment, many different equations that produce curves of generally similar shape will produce very similar values within the segment. So in most cases, splines will produce a good approximation for what you want (and it naturally passes through every point, unlike regression, which must be forced through each point).
Again, I say "most cases". Splines work great for data that are pretty uniform and regular, and follow the "rules" for a curve. It can do some unexpected things with unusual data. For example, a previous SU question was about this strange negative "dip" in the chart Excel produced of the data:
Splines are a little like Jello. Imagine a big hunk of Jello, and you constrain specific spots where you want them. The rest of the Jello will bulge in places it needs to. An equation can define certain kinds of curves. If you force the curve through specific points, the same thing happens. With splines, the effect is limited to an odd bulge or unnatural looking curve segment; high-order regression equations can follow a wild path.
This is how splines represent the curve of your data:
If you compare this with the high-order regression curves, the splines are more "responsive" to local variations.
I did this analysis using LibreOffice Calc, which has an analysis add-in that includes splines. As you can see, this also produces for splines, the interpolated results you're looking for. I don't have ready access to Excel's Analysis Toolkit, so I don't know if Excel's includes splines. If not, LO Calc will run in Windows and it's free.
Bottom Line
This covers the approaches you can use to interpolate the intermediate values. It could be that different approaches work better with different data. Or, your requirements could be anything approximate, fast, and easy. Decide what kind of interpolation you need. If you need more detail on how to accomplish it, we can address the mechanics in another answer.
Solution 2:
Reading through your comments and revisions to the question, there are a couple of things you want to do that aren't really covered in my previous answer. This answer will deal with those items, and I've included a step-by-step walk-through of how you would accomplish the whole interpolation process.
Inaccurate Data
You describe the process that generated the data as taking readings at a time interval, and the numbers are rounded times. The equation is only as good as the data. In your actual analysis, you should use the most precise numbers available (perhaps you were just keeping your example simple by showing rounded times).
However, the data you show don't precisely fit the kind of curve you typically see for a physical process. Theoretical curves are generally smooth when there is just one driving variable and no noise. If you are using very precise equipment both to trigger a reading at a preset interval and to provide an accurate measurement, you can accept the results as precise. However, if you are manually timing the reading and manually taking the reading, the X
values can be at imprecise times even if the readings, themselves, are accurate. Shifting individual X
values a little one way or the other will introduce the kinds of small irregularities you see in the curve of your data (unless the example is just numbers you made up for the purposes of an example).
If this is the case, you might benefit from using regression to estimate best fit.
Using Y as X
In your problem, you want to define values for Y
(integer values from 1 to 37 in this example), and find the associated X values. That was easy enough to do in your Y=2^X
problem because that simple equation can easy be reversed to X=log(Y)/log(2)
, and you can directly calculate any value you want. If the equation isn't something simple, there often isn't a practical way to invert it. The "abused" regression approach in my previous answer gives you a high order equation, but it's "one-direction", often not practical to solve for the reverse equation.
The simplest approach is just to reverse X
and Y
from the start. This gives you an equation you can use with the integer values you introduce (the analysis gives you the coeficients of the equation as described in the previous answer).
It never hurts to see if a simple curve will work. Here's the reversed data, and you can see that there's not a useful fit:
So, try a polynomial fit. However, this is a case like I described in the previous answer. The values from 1 through 8 fit well, but 9 gives it indigestion. A 3rd order polynomial gives you a bump:
It gets progressively more "interesting" as the order of the equation increases. By the 7th order, you get this:
It goes almost exactly through every point, but the curve between 8 and 9 isn't useful. One solution would be to make do with linear interpolation between 8 and 9. In this case, though, you could get better values by incorporating splines for the upper end. The splines option provides a good looking fit, and a curve that makes more sense between 8 and 9:
Unfortunately, the spline equations are a bit convoluted and the equations are not provided. However, you could do the linear interpolation on the intermediate values provided by the analysis, which should get you very close to numbers that fit a reasonable curve.
Extrapolation vs. Interpolation
In this example, your first Y
value is 2.9. You want to produce values for 1
and 2
, which are outside the range of the data. That requires extrapolation rather than interpolation, which is a very different requirement.
-
If the equation is known, like your
Y=2^X
example, you can calculate any value you want. -
If the process generating the data is known to follow a simple curve, and you are confident of the fit, you can project values outside the data range, and even get a meaningful confidence interval for the range that the values could actually be (based on how much variation there is between the data and the curve inside the data's range).
-
If you're force-fitting a high-order equation to the data, projections outside the data's range are usually meaningless.
-
If you are using splines, there is no basis for projecting outside the data range.
Whatever projections you make outside the range of your data are only as good as the equation you use, and if you're not using an exact equation, the farther you get from your data, the more inaccurate it will be.
Looking at the log curve in the first graph, you can see that it would project a very different value than what you would expect.
For the polynomial equations, the zero-power coeficient is a constant, and that's the value that would be produced for an X
value of 0
. So that's a simple way to look at where the curve would go in that direction.
Note that by the 4th or 5th order, the points 1 through 8 are pretty accurate. But once you go outside the range, the equations can behave very differently.
Extrapolation using limited data
One way to improve things is to fit only the points at that end, and include as many successive points as follow the shape of the curve at that end. Point 9 is obviously out. There are several inflections in the curve before that, one being around point 5 or 6, so points higher than that follow a different curve. Using just the points 1 through 5, you get close to a perfect fit with a 3rd order polynomial. That equation would project a zero point of 0.12095 (compare to the table above), and for an X
value of 1
, 0.3493
.
What happens if you just fit a straight line to the first five points:
That projects a zero point of -0.5138 and for an X
of 1
, -0.0071
.
That range of possible outcomes indicates the level of uncertainty outside the range of your data. There is no right answer. And this was at the "well-behaved" end of your curve. The Y
value for an X
of 9
is 36.7
. You want to go to 37. The splines suggest that the curve is asymptotic at 9
. Projecting a straight line in the raw data would produce a value a little more than 9
(same with a 4th order polynomial). A 3rd order polynomial suggests a value less than 9
(as do 5th and 6th orders). A 7th order polynomial suggests a value substantially above 9
. So anything outside of the data range is a guess, or anything you want it to be.
Putting it all together
So let's step through what the actual solution would look like. We'll assume you've already tried to find an exact equation and tested common curves using a trend line. The next step would be to try regression because that gives you the formula for the curve and you can plug in your integer values.
I don't have ready access to Excel 2013 or the Analysis Toolkit. I'll use LibreOffice Calc to illustrate this. It's not identical, but it's close enough that you should be able to follow it in Excel. In LO Calc, this is actually a free extension that needs to be loaded. I'm using CorelPolyGUI, which can be downloaded here. My recollection of the Analysis Toolkit is that it didn't include splines. If that's still the case and you want to do this in Excel, I came across this free add-in (which I have not tested). An alternative would be to use LO Calc, which will run in Windows and is free.
Here, I've entered the X and Y values (reversed) in columns A and B, and opened the analysis dialog. Highlighting the X values and clicking the X button loads the data ranges, and I've selected polynomial.
On the next tab, I specify that I want to use 0
to 7
degrees (a 7th order polynomial with all of the orders).
To specify the output, I select C1 and click Columns, and it registers the columns needed for the output. I select that I want it to output the original data, the calculated results, and I've selected to have it add three intermediate points between each original data point. And I tell it I want a graph of the results on a new chart. Then go to the calculate menu and click calculate.
And there it is. If you look at the calculated values, you may notice a problem. It will become apparent in the next step.
Here, I've added the 1
through 37
values. At this point, we only want to deal with interpolation, so I've added a formula to calculate only the values 3
through 36
. The formula just expands the coeficients listed in the results (the a(n) values). The formula in I2 is:
=D$4+D$5*H3+D$6*H3^2+D$7*H3^3+D$8*H3^4+D$9*H3^5+D$10*H3^6+D$11*H3^7
This is just each coeficient multiplied by the associated power of the X value. Drag this down and you have your results. Well not quite; you have to look at it to see if it passes the sanity test. We knew there was a problem between 8
and 9
, but that turns out to be half of the values you want. We could use the values from 3
through 20
, but there's no sense in combining that many values from another method. So let's just use splines for the whole thing.
Open the analysis dialog again, and change the method to "splines" on the input tab (not shown here). Give it a new output range and tell it to calculate. That's all it takes.
We have new results to work with. Dividing the data range into this many segments keeps each segment short, so linear interpolation should be pretty good (way better than using it on the original data).
The process of curve fitting or interpolation involves creating data points; using your own judgement about what the curve "should" (or should not), look like (regression assumes that even the original data are imprecise).
Giving this data a sanity check shows that even splines produce a connecting curve with a bulge; one value goes slightly over 9
, which is likely an artifact rather than a reflection of the process you were measuring. In this case, a curve asymptotic at 9
is more likely, so I arbitrarily assigned the high point a value that's a hair less than 9
by eyeballing it. The assumption isn't that my value is precise, only that it's an improvement. For this illustration, I created a new column with the values that will be used.
I added a column with your numbers 1
through 37
. From the previous discussion, we don't have a reliable basis for projecting values for 1
and 2
, so I left them blank. For 37
, I went with the asymptotic assumption and made it 9
. The values for 3
through 36
are found by linear interpolation (and it's a formula you could adapt to other data). The formula in Q3 is:
=TREND(OFFSET($M$1,MATCH(P3,M$1:M$33)-1,2,2),OFFSET($M$1,MATCH(P3,M$1:M$33)-1,0,2),P3)
The TREND function just interpolates when the range is two points. The syntax is:
TREND(Y_range, X_range, X_value)
The OFFSET function is used for each range. In each case, it uses the MATCH function to find the first row of the range containing the target value. The -1
values are because these are offsets rather than locations; a match in the first row is an offset of 0
from the reference row. And note that the Y
column is offset by 2
, in this case, because I added an extra column to manually adjust a value. The OFFSET parameters pick the column containing the Y or X values, and select a range height of 2, which gives you the values below and above the target.
The result:
The analysis wizard does the heavy lifting, and whether you're using polynomial regression or splines, it required just one formula to generate the result.