Any faster option than Base.unique() in Julia?

I have a 2-dimensional matrix and I want to find its unique rows. For example, if

M=[1 2 3;
   4 7 8;
   1 2 3;
   0 2 3]

then the output of the Base.unique command,

Base.unique(M, dims=1)

is the matrix

[1 2 3;
 4 7 8;
 0 2 3]

which is exactly the output I am looking for. However, Base.unique seems to be quite slow when the matrix is large. Is there any faster option?

PS: The ThreadsX package has a similar function, ThreadsX.unique, which, according to this link, is faster than Base.unique. However, it seems that it only accepts one-dimensional vectors (or at least that is what I have inferred from it).


Solution 1:

Finding the unique elements in an array is fundamentally a hard problem, algorithmically speaking. Most naive implementations are O(n^2), and while you can do it in O(n) time for a single list that is already sorted, you then have to pay for the sort first, and even quicksort is, as we know, no better than O(n log n) on average.
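
For reference, the O(n) pass over an already-sorted vector mentioned above looks roughly like this (a minimal sketch for illustration only; unique_sorted is not part of Base, and it only works because equal elements are adjacent in sorted input):

function unique_sorted(v::AbstractVector)
    # Single O(n) pass: keep each element that differs from its predecessor.
    # Correct only if v is already sorted, so that equal values are adjacent.
    out = similar(v, 0)
    for x in v
        if isempty(out) || out[end] != x
            push!(out, x)
        end
    end
    return out
end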

I spent a bit of time trying to cook up something that beats Base.unique, but I could only manage a few percent even in the best case so far; usually, I can only beat Base if I can get the problem to make better use of your CPU's SIMD instructions (e.g. AVX on x86 or NEON on ARM) with LoopVectorization.jl, and this particular problem is just not very SIMD-friendly.

There is one trick I can think of which might help you a bit here, though. If unique is hard, then unique on rows of a matrix is, in technical terms, a pain in the a** to optimize. For example, consider the performance difference of:

julia> using BenchmarkTools

julia> M = rand(1:20,10^6,4)
1000000×4 Matrix{Int64}:
 17   9  17   4
  2  19  19  18
 13   9  14   7
  ⋮
 19   8   2  17
 20  20   5   9

julia> @benchmark unique($M, dims=1)
BenchmarkTools.Trial: 58 samples with 1 evaluation.
 Range (min … max):  67.044 ms … 105.911 ms  ┊ GC (min … max): 0.00% … 11.76%
 Time  (median):     85.795 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   86.759 ms ±  10.448 ms  ┊ GC (mean ± σ):  3.87% ±  5.09%

             █ █   ▁  ▁ ▁▄▁  ▄  ▁         ▄ ▁ ▁ ▁▁▁ ▁
  ▆▆▁▆▆▁▁▁▁▁▆█▁█▁▁▆█▁▁█▁███▁▁█▆▆█▁▁▆▆▆▆▁▁▁█▆█▁█▁███▁█▆▆▆▁▁▁▆▁▆ ▁
  67 ms           Histogram: frequency by time          106 ms <

 Memory estimate: 27.14 MiB, allocs estimate: 47.

julia> V = prod(M, dims=2)
1000000×1 Matrix{Int64}:
 10404
 12996
 11466
     ⋮
  5168
 18000

julia> @benchmark unique($V)
BenchmarkTools.Trial: 555 samples with 1 evaluation.
 Range (min … max):  7.739 ms … 12.680 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.512 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.989 ms ±  1.044 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃▆█▁▃▄
  ▃▄▅▇████████▅▄▄▄▅▅▅▃▃▃▄▄▅▄▅▃▅▄▅▄▃▄▄▃▄▃▂▄▃▁▃▂▃▂▃▃▃▃▃▃▂▃▃▃▃▃ ▃
  7.74 ms        Histogram: frequency by time        11.9 ms <

 Memory estimate: 103.06 KiB, allocs estimate: 24.

That's about an order of magnitude. So, if there is any way you can make your matrix into a vector, that could help you a lot.

In particular, if the numbers in your real matrix fit into a smaller type than Int64 or Float64, and there are only a few of them in each row, then you can potentially (and reversibly) reinterpret each whole row as a single 64-bit value:

julia> M = Int16.(M)
1000000×4 Matrix{Int16}:
 17   9  17   4
  2  19  19  18
 13   9  14   7
  8   1   2  12
  8  16  19  13
  ⋮
 12  19   2  15
  4   5  10  14
 19   8   2  17
 20  20   5   9

julia> Mᵥ = vec(reinterpret(UInt64, M'))
1000000-element reshape(reinterpret(UInt64, adjoint(::Matrix{Int16})), 1000000) with eltype UInt64:
 0x0004001100090011
 0x0012001300130002
 0x0007000e0009000d
 0x000c000200010008
 0x000d001300100008
                  ⋮
 0x000f00020013000c
 0x000e000a00050004
 0x0011000200080013
 0x0009000500140014

julia> reinterpret(Int16, unique(Mᵥ)')'
159674×4 adjoint(reinterpret(Int16, adjoint(::Vector{UInt64}))) with eltype Int16:
 17   9  17   4
  2  19  19  18
 13   9  14   7
  8   1   2  12
  8  16  19  13
  ⋮
 17   5   5  11
  1  18  14  17
  2  20   5   9
 15  17  14   4

N.B. that you have to make sure the number of bits in each row adds up to exactly 64 (padding if you have to), and you have to transpose first, because Julia is column-major!
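
If it helps, here is the whole trick wrapped into a helper with that check built in. This is just a sketch under the assumptions above: the name unique_rows_packed is mine, and it insists that each row packs into exactly one UInt64 (e.g. 4 × Int16 or 8 × Int8).

function unique_rows_packed(M::AbstractMatrix{T}) where {T}
    # Assumption: each row packs into exactly 64 bits; pad with extra
    # columns yourself if it does not.
    isbitstype(T) || error("element type must be a plain bits type")
    sizeof(T) * size(M, 2) == sizeof(UInt64) ||
        error("each row must pack into exactly 64 bits")
    Mᵥ = vec(reinterpret(UInt64, M'))    # transpose first: Julia is column-major
    return reinterpret(T, unique(Mᵥ)')'  # unpack the unique 64-bit values back into rows
end

With the Int16 matrix above, unique_rows_packed(M) should give the same rows in the same order as unique(M, dims=1); note that the result is a lazy reinterpreted wrapper, so call Matrix(...) on it if you need a plain array.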

However, because reinterpret is basically free, this whole operation can still end up being significantly faster than unique-by-rows:

julia> @benchmark unique($M, dims=1)
BenchmarkTools.Trial: 65 samples with 1 evaluation.
 Range (min … max):  60.212 ms … 93.517 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     78.691 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   77.909 ms ±  6.788 ms  ┊ GC (mean ± σ):  3.85% ± 5.57%

                           ▁         ▃ ▁  ▃▃▁   █        ▁  ▁
  ▄▁▁▁▄▁▁▁▁▁▁▁▁▄▄▄▄▄▇▇▄▁▄▁▄█▇▁▁▁▄▄▄▄▇█▄█▇▁███▄▄▄█▁▁▁▄▇▁▁▁█▁▁█ ▁
  60.2 ms         Histogram: frequency by time        89.1 ms <

 Memory estimate: 23.48 MiB, allocs estimate: 47.

julia> @benchmark reinterpret(Int16,unique(vec(reinterpret(UInt64, $M')))')'
BenchmarkTools.Trial: 137 samples with 1 evaluation.
 Range (min … max):  30.070 ms … 46.186 ms  ┊ GC (min … max): 0.00% … 15.62%
 Time  (median):     36.334 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.433 ms ±  3.491 ms  ┊ GC (mean ± σ):  1.65% ±  4.51%

    ▁  ▁ ▃ ▁ █▆ ▁▁▃▆ ▁▃▆▁▁▃▁█▃█▁▆▃▁ █      ▃
  ▄▁█▇▇█▇█▇█▇██▇████▇██████████████▇█▇▁▄▁▄▄█▁▁▇▁▄▄▄▁▁▄▁▄▁▁▄▁▄ ▄
  30.1 ms         Histogram: frequency by time        46.2 ms <

 Memory estimate: 5.96 MiB, allocs estimate: 44.

Since we've made your matrix into a vector, in principle we could now also substitute ThreadsX.unique, but in my tests so far this turned out not to be faster:

julia> @benchmark reinterpret(Int16,ThreadsX.unique(vec(reinterpret(UInt64, $M')))')'
BenchmarkTools.Trial: 56 samples with 1 evaluation.
 Range (min … max):  69.633 ms … 136.835 ms  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     80.427 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   89.547 ms ±  18.360 ms  ┊ GC (mean ± σ):  10.91% ± 13.49%

      ▄  █                                 ▁
  ▆▆▁▄█▇▆█▇▁▇▇▁▆▁▄▁▁▁▁▁▁▁▁▄▄▁▁▁▄▁▁▆▁▄▄▁▁▁▁▄█▆▁▄▄▁▁▁▁▄▁▁▄▁▁▁▁▁▄ ▁
  69.6 ms         Histogram: frequency by time          131 ms <

 Memory estimate: 121.28 MiB, allocs estimate: 3310.