Any option faster than Base.unique() in Julia?
I have a 2-dimensional matrix whose unique rows I want to find. For example, if
M=[1 2 3;
4 7 8;
1 2 3;
0 2 3]
then the output of the Base.unique command
Base.unique(M, dims=1)
is the matrix
[1 2 3;
4 7 8;
0 2 3]
which is the one that I am looking for. However, Base.unique seems to be quite slow when the matrix is large. Is there a faster option?
PS: The ThreadsX module has a similar function, ThreadsX.unique, which, according to this link, is faster than Base.unique. However, it seems to accept only one-dimensional vectors (or at least that is what I have inferred from that function).
Solution 1:
Finding the unique elements in an array is a fundamentally hard algorithmic problem. Most naive implementations are O(n²), and while you can do it in O(n) time for a single list that is already sorted, well, then you have to sort first, and even quicksort is, as we know, no better than O(n log n) on average.
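For intuition, here is a minimal single-pass sketch of hash-based deduplication (roughly the strategy Base.unique takes: a Set makes membership tests expected O(1), so the whole pass is expected O(n), at the cost of hashing every element). The name myunique is mine, purely for illustration:

```julia
# Illustrative hash-based unique: one pass over the input, keeping first
# occurrences in order. Expected O(n) because the Set lookup is expected O(1).
function myunique(v::AbstractVector{T}) where {T}
    seen = Set{T}()
    out = T[]
    for x in v
        if !(x in seen)
            push!(seen, x)   # remember we have seen this value
            push!(out, x)    # keep the first occurrence, preserving order
        end
    end
    return out
end

myunique([1, 2, 1, 3, 2])  # == [1, 2, 3]
```

The constant factors here (hashing, branching, growing two containers) are exactly why this is hard to speed up further.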
I spent a bit of time trying to cook up something that beats Base.unique, but I could only manage a few percent even in the best case so far; usually, I can only beat Base if I can get the problem to make better use of the CPU's SIMD instructions (e.g. AVX on x86 or Neon on ARM) with LoopVectorization.jl, and this particular problem is just not very SIMD-friendly.
There is one trick I can think of which might help you a bit here, though. If unique is hard, then unique on the rows of a matrix is, in technical terms, a pain in the a** to optimize. For example, consider the performance difference of:
julia> using BenchmarkTools
julia> M = rand(1:20,10^6,4)
1000000×4 Matrix{Int64}:
17 9 17 4
2 19 19 18
13 9 14 7
⋮
19 8 2 17
20 20 5 9
julia> @benchmark unique($M, dims=1)
BenchmarkTools.Trial: 58 samples with 1 evaluation.
Range (min … max): 67.044 ms … 105.911 ms ┊ GC (min … max): 0.00% … 11.76%
Time (median): 85.795 ms ┊ GC (median): 0.00%
Time (mean ± σ): 86.759 ms ± 10.448 ms ┊ GC (mean ± σ): 3.87% ± 5.09%
█ █ ▁ ▁ ▁▄▁ ▄ ▁ ▄ ▁ ▁ ▁▁▁ ▁
▆▆▁▆▆▁▁▁▁▁▆█▁█▁▁▆█▁▁█▁███▁▁█▆▆█▁▁▆▆▆▆▁▁▁█▆█▁█▁███▁█▆▆▆▁▁▁▆▁▆ ▁
67 ms Histogram: frequency by time 106 ms <
Memory estimate: 27.14 MiB, allocs estimate: 47.
julia> V = prod(M, dims=2)
1000000×1 Matrix{Int64}:
10404
12996
11466
⋮
5168
18000
julia> @benchmark unique($V)
BenchmarkTools.Trial: 555 samples with 1 evaluation.
Range (min … max): 7.739 ms … 12.680 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.512 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.989 ms ± 1.044 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▆█▁▃▄
▃▄▅▇████████▅▄▄▄▅▅▅▃▃▃▄▄▅▄▅▃▅▄▅▄▃▄▄▃▄▃▂▄▃▁▃▂▃▂▃▃▃▃▃▃▂▃▃▃▃▃ ▃
7.74 ms Histogram: frequency by time 11.9 ms <
Memory estimate: 103.06 KiB, allocs estimate: 24.
about an order of magnitude. So, if there is any way you can make your matrix into a vector, that could help you a lot.
In particular, if the numbers in your real matrix fit into a smaller type than Int64 or Float64, and there are only a few of them in each row, then you can potentially reversibly reinterpret each row as a single 64-bit value:
julia> M = Int16.(M)
1000000×4 Matrix{Int16}:
17 9 17 4
2 19 19 18
13 9 14 7
8 1 2 12
8 16 19 13
⋮
12 19 2 15
4 5 10 14
19 8 2 17
20 20 5 9
julia> Mᵥ = vec(reinterpret(UInt64, M'))
1000000-element reshape(reinterpret(UInt64, adjoint(::Matrix{Int16})), 1000000) with eltype UInt64:
0x0004001100090011
0x0012001300130002
0x0007000e0009000d
0x000c000200010008
0x000d001300100008
⋮
0x000f00020013000c
0x000e000a00050004
0x0011000200080013
0x0009000500140014
julia> reinterpret(Int16, unique(Mᵥ)')'
159674×4 adjoint(reinterpret(Int16, adjoint(::Vector{UInt64}))) with eltype Int16:
17 9 17 4
2 19 19 18
13 9 14 7
8 1 2 12
8 16 19 13
⋮
17 5 5 11
1 18 14 17
2 20 5 9
15 17 14 4
N.B.: you have to make sure the number of bits in each row adds up to the right total (padding if necessary), and you have to transpose first, because Julia is column-major!
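To make the padding concrete, here is a small sketch (the variable names are mine): a 3-column Int16 matrix carries only 48 bits per row, so we append a zero column to reach a full 64 bits before reinterpreting, and drop it again afterwards:

```julia
# A 3-column Int16 matrix: 3 × 16 = 48 bits per row, not enough for UInt64.
M3 = Int16[1 2 3; 4 7 8; 1 2 3; 0 2 3]

# Pad with a zero column so each row is exactly 4 × 16 = 64 bits.
Mpad = hcat(M3, zeros(Int16, size(M3, 1)))

# Transpose first (Julia is column-major), then pack each row into one UInt64.
Mᵥ = vec(reinterpret(UInt64, Mpad'))

# unique on the packed vector, then unpack back to Int16 rows.
U = reinterpret(Int16, unique(Mᵥ)')'

U[:, 1:3]  # drop the padding column to recover the unique rows
```

The zero padding is safe here because it is identical in every row, so it never changes which rows compare equal.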
However, because reinterpret is essentially free, this whole operation can still end up significantly faster than unique-by-rows:
julia> @benchmark unique($M, dims=1)
BenchmarkTools.Trial: 65 samples with 1 evaluation.
Range (min … max): 60.212 ms … 93.517 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 78.691 ms ┊ GC (median): 0.00%
Time (mean ± σ): 77.909 ms ± 6.788 ms ┊ GC (mean ± σ): 3.85% ± 5.57%
▁ ▃ ▁ ▃▃▁ █ ▁ ▁
▄▁▁▁▄▁▁▁▁▁▁▁▁▄▄▄▄▄▇▇▄▁▄▁▄█▇▁▁▁▄▄▄▄▇█▄█▇▁███▄▄▄█▁▁▁▄▇▁▁▁█▁▁█ ▁
60.2 ms Histogram: frequency by time 89.1 ms <
Memory estimate: 23.48 MiB, allocs estimate: 47.
julia> @benchmark reinterpret(Int16,unique(vec(reinterpret(UInt64, $M')))')'
BenchmarkTools.Trial: 137 samples with 1 evaluation.
Range (min … max): 30.070 ms … 46.186 ms ┊ GC (min … max): 0.00% … 15.62%
Time (median): 36.334 ms ┊ GC (median): 0.00%
Time (mean ± σ): 36.433 ms ± 3.491 ms ┊ GC (mean ± σ): 1.65% ± 4.51%
▁ ▁ ▃ ▁ █▆ ▁▁▃▆ ▁▃▆▁▁▃▁█▃█▁▆▃▁ █ ▃
▄▁█▇▇█▇█▇█▇██▇████▇██████████████▇█▇▁▄▁▄▄█▁▁▇▁▄▄▄▁▁▄▁▄▁▁▄▁▄ ▄
30.1 ms Histogram: frequency by time 46.2 ms <
Memory estimate: 5.96 MiB, allocs estimate: 44.
Since we've made your matrix into a vector, we could now in principle also substitute ThreadsX.unique, but in my tests so far this turned out not to be faster:
julia> @benchmark reinterpret(Int16,ThreadsX.unique(vec(reinterpret(UInt64, $M')))')'
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 69.633 ms … 136.835 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 80.427 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.547 ms ± 18.360 ms ┊ GC (mean ± σ): 10.91% ± 13.49%
▄ █ ▁
▆▆▁▄█▇▆█▇▁▇▇▁▆▁▄▁▁▁▁▁▁▁▁▄▄▁▁▁▄▁▁▆▁▄▄▁▁▁▁▄█▆▁▄▄▁▁▁▁▄▁▁▄▁▁▁▁▁▄ ▁
69.6 ms Histogram: frequency by time 131 ms <
Memory estimate: 121.28 MiB, allocs estimate: 3310.
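Putting the pieces together, the whole trick could be wrapped in a convenience function. This is only a sketch under the assumptions above (the name unique_rows and the width check are my own, not from any package); it takes the packed path only when the rows happen to fill exactly 64 bits, and otherwise falls back to plain unique:

```julia
# Sketch of a wrapper for the reinterpret trick: pack each row into a UInt64,
# deduplicate the resulting vector, and unpack. Assumes rows fill exactly
# 64 bits (e.g. four Int16 columns); anything else falls back to Base.
function unique_rows(M::AbstractMatrix{T}) where {T}
    rowbits = size(M, 2) * 8 * sizeof(T)
    rowbits == 64 || return unique(M, dims=1)   # packed path not applicable
    Mᵥ = vec(reinterpret(UInt64, M'))           # transpose: Julia is column-major
    return Matrix(reinterpret(T, unique(Mᵥ)')') # unpack into a plain Matrix{T}
end

unique_rows(Int16[1 2 3 0; 4 7 8 0; 1 2 3 0])  # == Int16[1 2 3 0; 4 7 8 0]
```

A more general version would pad narrower rows as shown earlier, but exact-fit is the common case when you control the element type.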