Is there a term for finding a minimal set of N points that approximate a curve?
I have spent some time answering How do I discretize a continuous function avoiding noise generation (see picture), and throughout, I felt like I was reinventing the wheel.
Essentially, the problem is:
- You are given a curve function: for any x, you can obtain y.
- You want to approximate the curve with a piecewise-linear function with exactly N points, based on some error metric, e.g. the distance to the curve, or minimizing the absolute difference of the area under the curves (thanks to @QuangHoang for pointing out that these are different; a small sketch contrasting the two metrics follows below).
Here's an example of a curve I approximated using 20 points:
Question: I've coded this up using repeated bisections. Is there a library I could have used? Is there a nice term for this type of problem that I failed to google? Does this generalize to a broader class of problems?
Edit: upon request, here's how I've done it: Google Colab
Data:
import numpy as np
from scipy.signal.windows import gaussian  # scipy.signal.gaussian now lives in scipy.signal.windows

N_MOCK = 2000

# A nice-ish mock distribution
xs = np.linspace(-10.0, 10.0, num=N_MOCK)
sigmoid = 1 / (1 + np.exp(-xs))
gauss = gaussian(N_MOCK, std=N_MOCK / 10)
ys = gauss - sigmoid + 1

# Rescale xs from [-10, 10] to [0, 1]
xs += 10
xs /= 20
Plotting:
import matplotlib.pyplot as plt

def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    # Setting axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)

    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete", color='#00ff00')

    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)
Here's how I solved it, but I hope there's a more standard way:
import warnings
warnings.simplefilter('ignore', np.RankWarning)

def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0, y0) -> (x1, y1). Then sample the ideal line multiple times and compute the distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)
    perfect_ys = ideal_line(xs)

    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err
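# Aside: the quantity above is essentially the integral of |segment - curve| over [x0, x1].
# A roughly equivalent sketch (names are my own, not used by the code below) with np.interp
# and np.trapz avoids the degree-1 polyfit:
def line_error_trapz(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Sketch: integrate the absolute gap between the straight segment and the curve."""
    xs = np.linspace(x0, x1, num=integral_points)
    segment_ys = np.interp(xs, [x0, x1], [y0, y1])  # the straight segment itself
    return np.trapz(np.abs(segment_ys - ideal_line(xs)), xs)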
def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
    # If that gives bad results, you can compute the edges in other ways, e.g. with np.histogram_bin_edges
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))

    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]

    while len(new_xs) < bin_count:
        # Find the segment whose straight-line approximation has the largest error...
        errors = []
        for i in range(len(new_xs) - 1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)
        max_segment_id = np.argmax(errors)

        # ...and bisect it at its midpoint
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id + 1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id + 1, new_x)
        new_ys.insert(max_segment_id + 1, new_y)

    return new_xs, new_ys
Run:
BIN_COUNT = 25
new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)
plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))
Note: while I prefer numpy, the answer can be a library in any language, or the name of the mathematical term. Please do not write lots of code, as I have done that myself already :)
Is there a nice term of this problem type that I failed to google out? Does this generalize to a broader problem set?
I know this problem under the name Expected Improvement (EI), an acquisition function used in Bayesian optimization (permalink on archive.org). Given an expensive black-box function for which you would like to find the global maximum, this algorithm yields the next position at which to check for that maximum.
At first glance, this is different from your problem. You are looking for a way to approximate a curve with a small number of samples, while EI provides the places where the function most likely has its maximum. But both problems are equivalent insofar as you minimize an error function (which changes each time you add another sample to your approximation) using the fewest possible points.
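As a rough sketch of the idea (a toy example of my own; the surrogate model and variable names are assumptions, not taken from the paper), expected improvement on top of a Gaussian-process surrogate can be computed with scikit-learn and SciPy:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(X_candidates, gpr, y_best, xi=0.01):
    """EI for maximization: expected amount by which a candidate beats the best value seen so far."""
    mu, sigma = gpr.predict(X_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero at already-sampled points
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy usage: pick the next x at which to evaluate an expensive 1-D black-box function
f = lambda x: np.sin(3 * x) - x**2            # stand-in for the expensive function
X_sampled = np.array([[0.0], [1.0], [2.0]])   # points evaluated so far
y_sampled = f(X_sampled).ravel()
gpr = GaussianProcessRegressor().fit(X_sampled, y_sampled)
X_grid = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
next_x = X_grid[np.argmax(expected_improvement(X_grid, gpr, y_sampled.max()))]

The candidate with the largest EI is where evaluating the function next is expected to pay off the most, which is what makes the method frugal with samples.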
I believe this is the original research paper.
Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13, 455-492. doi:10.1023/A:1008306431147.
From section 1:
[...] the technique often requires the fewest function evaluations of all competing methods. This is possible because, with typical engineering functions, one can often interpolate and extrapolate quite accurately over large distances in the design space. Intuitively, the method is able to ‘see’ obvious trends or patterns in the data and ‘jump to conclusions’ instead of having to move step-by-step along some trajectory.
As to why it is efficient:
[...] the response surface approach provides a credible stopping rule based on the expected improvement from further searching. Such a stopping rule is possible because the statistical model provides confidence intervals on the function’s value at unsampled points – and the ‘reasonableness’ of these confidence intervals can be checked by model validation techniques.