Can System.arraycopy(…) be faster than O(n)?

Suppose I have an array of 64 bits in a 64-bit POSIX system.

Let's assume the processor cache contains my array and the registers are 64+ bits long.

Can I expect O(1) time from System.arraycopy(…)? In the sense that the time of copying doesn't depend on the length of my array.

The question is related to Time complexity of System.arraycopy(...)?.

I feel I can expect it, but does it work like that under the bonnet? Do I become system- and JVM-dependent in this case?


Let's assume the processor cache contains my array

That isn't really relevant.

What is relevant: first of all, the type of the underlying array data.

As in: only when you have, say, byte data[] = new byte[8], do you have 64 bits that really sit sequentially, one after the other, in memory.

Because when you have Byte data[] = new Byte[8], that assumption is already broken: the slots in the array are just pointers, pointing somewhere on the heap!
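To see the difference, here's a small sketch (my own illustration, not anything from the JDK): arraycopy on a primitive array copies the data itself, whereas on a boxed array it copies only the references.

```java
public class ArrayCopyLayout {
    public static void main(String[] args) {
        // byte[]: eight primitive bytes, laid out contiguously in memory
        byte[] primitives = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] primCopy = new byte[8];
        System.arraycopy(primitives, 0, primCopy, 0, 8); // copies the raw 64 bits of data

        // Byte[]: eight slots holding references to Byte objects on the heap
        Byte[] boxed = {1, 2, 3, 4, 5, 6, 7, 8};
        Byte[] boxCopy = new Byte[8];
        System.arraycopy(boxed, 0, boxCopy, 0, 8); // copies only the references

        // The copied slot points at the very same object, not at new data:
        System.out.println(boxCopy[0] == boxed[0]); // true: identical reference
    }
}
```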

So, let's rephrase: assuming you have a 64-bit architecture, then sure: the CPU should be able to treat 64 bits with a single instruction, be it for reading the whole array into the CPU cache or a CPU register, or for copying that data to a different position in memory.

But I completely agree with the comment by user Joachim: Big-O isn't really applicable here. We use Big-O to determine/estimate the "cost" of an algorithm. It is not a tool for assessing the differences between implementations. Especially given that the OP mixes in so many different levels/layers here. Details such as the CPU cache or "register width" aren't elements that matter to Big-O!

So, as said, what we can state is: 64 bits can be moved around in "one shot" on a system that supports 64 bits top to bottom.
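As an illustration of that "one shot" idea (my own sketch, not JVM internals): the eight bytes of such an array can be viewed, and moved, as a single 64-bit long.

```java
import java.nio.ByteBuffer;

public class OneShotMove {
    public static void main(String[] args) {
        byte[] data = {0, 0, 0, 0, 0, 0, 0, 42}; // 64 bits of data
        // Read all eight bytes as one 64-bit value (big-endian by default):
        long asOneWord = ByteBuffer.wrap(data).getLong();
        System.out.println(asOneWord); // 42
        // Writing it back elsewhere is, conceptually, a single 64-bit store:
        byte[] dest = new byte[8];
        ByteBuffer.wrap(dest).putLong(asOneWord);
        System.out.println(dest[7]); // 42
    }
}
```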


64-bit POSIX system.

POSIX has nothing to do with CPU features related to chunked data copying.

Let's assume the processor cache contains my array

It would save a trip to main memory while executing the copy instructions, but it does not affect the Big-O order.

the registers are 64+ bits long.

Even if your architecture supports AVX-512 with 512-bit-wide zmm registers and you run JDK 9+ (AFAIK the runtime is aware of AVX-512 starting with JDK 9), that would allow you to copy 8 packed 64-bit integers per instruction, but it does not affect the order of complexity.

So to copy, say, 1024 64-bit integers, you would need to execute at least 128 vector instructions, again yielding O(n) complexity, but with a lower constant.
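The arithmetic can be sketched like this (a back-of-the-envelope helper of my own, not a JDK API):

```java
public class VectorCopyCount {
    // How many register-wide moves are needed to copy longCount 64-bit values,
    // given registers of registerBits width (e.g. 512 for AVX-512 zmm registers).
    static long instructionsNeeded(long longCount, int registerBits) {
        long longsPerInstruction = registerBits / 64;      // 8 for zmm registers
        // Round up: a partial last chunk still costs one instruction.
        return (longCount + longsPerInstruction - 1) / longsPerInstruction;
    }

    public static void main(String[] args) {
        System.out.println(instructionsNeeded(1024, 512)); // 128
        System.out.println(instructionsNeeded(2048, 512)); // 256: doubles with n, so still O(n)
    }
}
```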


HotSpot implementation note:

The architecture-dependent code for the arraycopy implementation is generated during the JVM's global bootstrapping phase, in StubRoutines::initialize2.

In particular, the code generation for the chunked copy routine is done in the platform-dependent section of the HotSpot code by the copy_bytes_forward function (using HotSpot's own macro assembler implementation).

The crucial parts of it are the CPU feature checks, like:

if (UseAVX > 2) {
  __ evmovdqul(xmm0, Address(end_from, qword_count, Address::times_8, -56), Assembler::AVX_512bit);
  __ evmovdqul(Address(end_to, qword_count, Address::times_8, -56), xmm0, Assembler::AVX_512bit);
} else if (UseAVX == 2) {
  __ vmovdqu(xmm0, Address(end_from, qword_count, Address::times_8, -56));
  __ vmovdqu(Address(end_to, qword_count, Address::times_8, -56), xmm0);
  __ vmovdqu(xmm1, Address(end_from, qword_count, Address::times_8, -24));
  __ vmovdqu(Address(end_to, qword_count, Address::times_8, -24), xmm1);
} else {
  // ...
}

which produce the code based on the available CPU features. The feature detector itself is generated and called earlier, in the architecture-dependent generator generate_get_cpu_info, based on the cpuid instruction.