Very long D2H time vs. H2D (CUDA)

Solution 1:

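As Robert pointed out, Nsight displays the time from the start of the API call to its completion, so the interval between when the copy API is called and when the copy actually starts (after the previous kernels are done) is included in the reported D2H time. Kernel launches are asynchronous, so the host reaches the D2H copy call almost immediately and the call then absorbs the wait for the kernels. The H2D copy has nothing queued ahead of it, which is why it looks so much shorter.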
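To see the effect in isolation, here is a minimal sketch (not the original poster's code) that times the same D2H `cudaMemcpy` from the host side, once right after a kernel launch and once after an explicit `cudaDeviceSynchronize()`. The kernel `busyKernel`, the helper `timeD2HMs`, and the sizes are made up for illustration; the only assumption is that the kernel runs on the default stream and takes noticeably longer than the copy itself.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical kernel that just burns time so the copy has something to wait on.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host wall-clock duration (ms) of the D2H cudaMemcpy call, API start to finish.
// If syncFirst is true, the kernel is drained before the timer starts.
double timeD2HMs(bool syncFirst) {
    const int n = 1 << 20;      // illustrative size
    const int iters = 20000;    // illustrative amount of work
    const size_t bytes = n * sizeof(float);

    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice));

    // The launch returns to the host immediately; the kernel runs asynchronously.
    busyKernel<<<(n + 255) / 256, 256>>>(dev, n, iters);
    CHECK(cudaGetLastError());

    if (syncFirst) {
        CHECK(cudaDeviceSynchronize());  // wait here, outside the timed region
    }

    auto t0 = std::chrono::steady_clock::now();
    // Blocking copy on the default stream: it cannot begin until the kernel is done.
    CHECK(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost));
    auto t1 = std::chrono::steady_clock::now();

    CHECK(cudaFree(dev));
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    printf("D2H call without prior sync: %.3f ms\n", timeD2HMs(false));
    printf("D2H call after sync:         %.3f ms\n", timeD2HMs(true));
    return 0;
}
```

Without the sync, the timed call should cover the kernel's remaining run time plus the transfer; with the sync, it should be close to the transfer alone. For a measurement of the transfer by itself inside a larger program, either synchronize before the copy as above or look at the memory-operation row on the GPU timeline rather than the API row.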