Return Unique Elements with a Tolerance
In Matlab, there is a unique command that returns the unique rows in an array. This is a very handy command.
But the problem is that I can't assign a tolerance to it -- in double precision, we always have to compare two elements within some precision. Is there a built-in command that returns unique elements within a specified tolerance?
Solution 1:
With R2015a, this question finally has a simple answer (see my other answer to this question for details). For releases prior to R2015a, there is such a built-in (undocumented) function: _mergesimpts. A safe guess at the composition of the name is "merge similar points".
The function is called with the following syntax:
xMerged = builtin('_mergesimpts',x,tol,[type])
The data array x is N-by-D, where N is the number of points and D is the number of dimensions. The tolerances for each dimension are specified by a D-element row vector, tol. The optional input argument type is a string, either 'first' (the default) or 'average', indicating how to merge similar elements.
The output xMerged will be M-by-D, where M <= N. It is sorted.
Examples, 1D data:
>> x = [1; 1.1; 1.05]; % elements need not be sorted
>> builtin('_mergesimpts',x,eps) % but the output is sorted
ans =
1.0000
1.0500
1.1000
Merge types:
>> builtin('_mergesimpts',x,0.1,'first')
ans =
1.0000 % first of [1, 1.05] since abs(1 - 1.05) < 0.1
1.1000
>> builtin('_mergesimpts',x,0.1,'average')
ans =
1.0250 % average of [1, 1.05]
1.1000
>> builtin('_mergesimpts',x,0.2,'average')
ans =
1.0500 % average of [1, 1.1, 1.05]
Examples, 2D data:
>> x = [1 2; 1.06 2; 1.1 2; 1.1 2.03]
x =
1.0000 2.0000
1.0600 2.0000
1.1000 2.0000
1.1000 2.0300
All 2D points unique to machine precision:
>> xMerged = builtin('_mergesimpts',x,[eps eps],'first')
xMerged =
1.0000 2.0000
1.0600 2.0000
1.1000 2.0000
1.1000 2.0300
Merge based on second dimension tolerance:
>> xMerged = builtin('_mergesimpts',x,[eps 0.1],'first')
xMerged =
1.0000 2.0000
1.0600 2.0000
1.1000 2.0000 % first of rows 3 and 4
>> xMerged = builtin('_mergesimpts',x,[eps 0.1],'average')
xMerged =
1.0000 2.0000
1.0600 2.0000
1.1000 2.0150 % average of rows 3 and 4
Merge based on first dimension tolerance:
>> xMerged = builtin('_mergesimpts',x,[0.2 eps],'average')
xMerged =
1.0533 2.0000 % average of rows 1 to 3
1.1000 2.0300
>> xMerged = builtin('_mergesimpts',x,[0.05 eps],'average')
xMerged =
1.0000 2.0000
1.0800 2.0000 % average of rows 2 and 3
1.1000 2.0300 % row 4 not merged because of second dimension
Merge based on both dimensions:
>> xMerged = builtin('_mergesimpts',x,[0.05 .1],'average')
xMerged =
1.0000 2.0000
1.0867 2.0100 % average of rows 2 to 4
Solution 2:
This is a difficult problem. I'd even claim it to be impossible to solve in general, because of what I'd call the transitivity problem. Suppose that we have three elements in a set, {A,B,C}. I'll define a simple function isSimilarTo, such that isSimilarTo(A,B) will return a true result if the two inputs are within a specified tolerance of each other. (Note that everything I will say here is meaningful in one dimension as well as in multiple dimensions.) So if two numbers are known to be "similar" to each other, then we will choose to group them together.
So suppose we have values {A,B,C} such that isSimilarTo(A,B) is true, and that isSimilarTo(B,C) is also true. Should we decide to group all three together, even though isSimilarTo(A,C) is false?
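As a small sketch of the dilemma (the values and the isSimilarTo helper here are made up for illustration):
isSimilarTo = @(a,b,tol) abs(a - b) <= tol;   % simple 1D similarity test
tol = 0.06;
A = 1.00; B = 1.05; C = 1.10;
isSimilarTo(A, B, tol)   % true:  |1.00 - 1.05| = 0.05 <= 0.06
isSimilarTo(B, C, tol)   % true:  |1.05 - 1.10| = 0.05 <= 0.06
isSimilarTo(A, C, tol)   % false: |1.00 - 1.10| = 0.10 >  0.06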
Worse, move to two dimensions. Start with k points equally spaced around the perimeter of a circle. Assume the tolerance is chosen such that any point is within the specified tolerance of its immediate neighbors, but not of any other point. How would you choose to resolve which points are "unique" in this setting?
I'll claim that this problem of intransitivity makes the grouping problem impossible to resolve perfectly, and certainly not in any efficient manner. Perhaps one might try an approach based on a k-means style of aggregation, but that would be quite inefficient as well, and such an approach generally needs to know in advance the number of groups to look for.
Having said that, I would still offer a compromise, something that can sometimes work within limits. The trick is the one used in Consolidator, available on the MATLAB Central File Exchange. My approach there was to effectively round the inputs to within the specified tolerance. Having done that, a combination of unique and accumarray allows the aggregation to be done efficiently, even for large sets of data in one or many dimensions.
This is a reasonable approach when the tolerance is large enough that when multiple pieces of data belong together, they will be rounded to the same value, with occasional errors made by the rounding step.
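A minimal sketch of that rounding idea (not the Consolidator code itself, just the approach it builds on), for 1D data with averaging of grouped values:
tol = 0.1;                                  % tolerance used for rounding
x = [1; 1.1; 1.05; 2.3];                    % some sample data
xr = round(x / tol) * tol;                  % round each value to the nearest multiple of tol
[~, ~, group] = unique(xr);                 % values that rounded to the same point share a group index
xMerged = accumarray(group, x, [], @mean)   % average the original values within each group
Note the caveat just mentioned: with this tolerance, 1.05 rounds up and is grouped with 1.1 rather than with 1, which is exactly the kind of occasional error the rounding step can make.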
Solution 3:
As of R2015a, there is finally a function to do this, uniquetol (before R2015a, see my other answer):
uniquetol - Set unique within a tolerance.
uniquetol is similar to unique. Whereas unique performs exact comparisons, uniquetol performs comparisons using a tolerance.
The syntax is straightforward:
C = uniquetol(A,TOL) returns the unique values in A using tolerance TOL.
As are the semantics:
Each value of C is within tolerance of one value of A, but no two elements in C are within tolerance of each other. C is sorted in ascending order. Two values u and v are within tolerance if:
abs(u-v) <= TOL*max(A(:),[],1)
It can also operate "ByRows", and the tolerance can be scaled by an input "DataScale" rather than by the maximum value in the input data.
But there is an important note about uniqueness of the solutions:
There can be multiple valid C outputs that satisfy the condition, "no two elements in C are within tolerance of each other." For example, swapping columns in A can result in a different solution being returned, because the input is sorted lexicographically by the columns. Another result is that uniquetol(-A,TOL) may not give the same results as -uniquetol(A,TOL).
There is also a new function, ismembertol, which is related to ismember in the same way that uniquetol is related to unique.
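A quick sketch of its use (data made up; the membership results depend on the tolerance scaling described above):
A = [1; 1.1; 1.05];
B = [1.04; 2];
ismembertol(A, B, 0.05)   % true where an element of A is within tolerance of some element of B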
Solution 4:
There is no such function that I know of. One tricky aspect is that if your tolerance is, say, 1e-10, and you have a vector with values that are equally spaced at 9e-11, the first and the third entry are not the same, but the first is the same as the second, and the second is the same as the third - so how many "uniques" are there?
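To see the ambiguity concretely, here is a small sketch using exactly that spacing and tolerance:
tol = 1e-10;
v = [0; 9e-11; 1.8e-10];    % values equally spaced at 9e-11
abs(v(2) - v(1)) <= tol     % true:  first and second are "the same"
abs(v(3) - v(2)) <= tol     % true:  second and third are "the same"
abs(v(3) - v(1)) <= tol     % false: first and third are not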
One way to solve the problem is to round your values to a desired precision and then run unique on that. You can do that using round2 (http://www.mathworks.com/matlabcentral/fileexchange/4261-round2), or in the following simple way:
r = rand(100,1); % some random data
roundedData = round(r*1e6)/1e6; % round to 1e-6
uniqueValues = unique(roundedData);
You could also do it using the hist command, as long as the precision is not too high:
r = rand(100,1); % create 100 random values between 0 and 1
grid = 0:0.001:1; % create a vector of evenly spaced bin centers
counts = hist(r,grid); % now you know for each element in 'grid' how many values there are
uniqueValues = grid(counts>0); % and these are the uniques
Solution 5:
I've come across this problem before. The trick is to first sort the data and then use the diff function to find the difference between consecutive items. Then check whether each difference exceeds your tolerance. This is the code that I use:
tol = 0.001;
[Y, I] = sort(items(:));          % sort the data, keeping the original indices
uni_mask = [true; diff(Y) > tol]; % keep an item if it is more than tol above its predecessor
% if you just want the unique items:
uni_items = Y(uni_mask);          % in sorted order
uni_items = items(I(uni_mask));   % in the original order
This doesn't take care of "drifting" ... so something like 0:0.00001:100 would actually return one unique value.
If you want something that can handle "drifting", then I would use histc, but you need to make some sort of rough guess as to how many items you're willing to have.
NUM = round(numel(items) / 10); % a rough guess
bins = linspace(min(items), max(items), NUM);
counts = histc(items, bins);
uni_items = bins(counts > 0);
BTW: I wrote this in a text editor away from Matlab, so there may be some stupid typos or off-by-one errors.
Hope that helps.