Will M.2 NVMe or integrated graphics slow discrete graphics cards

TLDR:

Will M.2 NVMe or integrated graphics slow discrete graphics cards

In this specific case probably not.


Deep Learning involves running CUDA based algorithms for days on end on the GPU, and piping a lot of data back and forth between GPU/CPU/SSD. Bottlenecks will add up.

I have no CUDA experience at all, but from what I remember reading in articles the main issue with doing calculations on a GPU is usually not the raw transfer speed. Instead it is latency.

Mind the 'usually'. For any one specific program things could be quite different.

Latency is partially an issue of PCI-e speeds. PCI-e v3 lanes run at a higher transfer rate than PCI-e v2 lanes (8 GT/s vs 5 GT/s), so you want the cards with GPUs on the v3 lanes.

Looking at ark.intel.com, it shows that the Core i7 7700 has 16 PCI-e lanes coming directly from the CPU. (More can come from the chipset.)

The PCI-e lanes connected directly to the CPU are PCI-e v3 and can be used as either 1x16, 2x8, 1x8+2x4. I am guessing that for doing CUDA with two discrete graphics cards you want to use two slots, both using 8 lanes.
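
Once both cards are installed, it is worth checking what link each one actually negotiated rather than trusting the slot labels. Below is a minimal Python sketch that does this, assuming an NVIDIA driver with `nvidia-smi` on the PATH. The function names are made up for illustration; the query fields (`pcie.link.gen.current`, `pcie.link.width.current`) are the ones `nvidia-smi --help-query-gpu` lists.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,name,pcie.link.gen.current,pcie.link.width.current"


def parse_link_report(csv_text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(parts[0]),
            # Re-join the middle fields in case the GPU name contains a comma.
            "name": ", ".join(parts[1:-2]),
            "gen": int(parts[-2]),
            "width": int(parts[-1]),
        })
    return gpus


def read_link_report():
    """Run nvidia-smi and return the parsed link report (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_link_report(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in read_link_report():
            print(f"GPU {gpu['index']} ({gpu['name']}): "
                  f"PCIe v{gpu['gen']} x{gpu['width']}")
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"Could not query nvidia-smi: {err}")
```

With two cards sharing the CPU lanes I would expect both to report x8. The link generation may read lower while a card is idle because of power saving, so check it while a job is running.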

So I'm concerned about PCIe Lanes. As I understand, the CPU only has 16.

Correct.

From what I understand, 8x lanes is plenty for GPU but 4x can see some slowdown.

Sort-of, yes. Most of the time PCI-e bus speed has exceeded what graphics cards needed. That changes over time. GPUs get faster. High end ones start to push the maximum bandwidth. PCI-e versions get upgraded...

Usually a top-end graphics card has enough with about 8 lanes. Tom's Hardware did tests in the PCI-e v2 era, and I think their results are still valid for modern setups with modern graphics cards on PCI-e v3:

x16: Maxed speed.
x8: A few % speed loss (say 2-5%)
x4: Still working fine for most things. Just do not put a high end dual GPU card in an x4 slot and run a 4k display with all games settings maxed out.
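
To put rough numbers on those link widths, here is a small Python sketch that computes the theoretical one-direction bandwidth for PCI-e v2 and v3. The transfer rates (5 and 8 GT/s) and line encodings (8b/10b and 128b/130b) come from the PCI-e specifications; the function name is just for illustration, and real throughput will be lower because of protocol overhead.

```python
# Theoretical one-direction PCIe bandwidth per link width.
# Real-world throughput is lower because of packet and protocol overhead.
GENERATIONS = {
    2: {"gt_per_s": 5.0, "encoding": 8 / 10},     # 8b/10b line encoding
    3: {"gt_per_s": 8.0, "encoding": 128 / 130},  # 128b/130b line encoding
}


def link_bandwidth_gb_s(gen, lanes):
    """Return the theoretical bandwidth in GB/s for one direction of a link."""
    spec = GENERATIONS[gen]
    bits_per_s = spec["gt_per_s"] * 1e9 * spec["encoding"] * lanes
    return bits_per_s / 8 / 1e9


if __name__ == "__main__":
    for gen in (2, 3):
        for lanes in (4, 8, 16):
            bandwidth = link_bandwidth_gb_s(gen, lanes)
            print(f"PCIe v{gen} x{lanes}: {bandwidth:.2f} GB/s")
```

On paper a v3 x8 link comes out at almost the same bandwidth as a v2 x16 link, which is one reason to think the v2-era results carry over.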

(I originally worried NVMe SSD would take lanes but seems CPU has 4 extra IO lanes)

It hasn't. Current Intel consumer chips have rather few PCI-e lanes: no more than most of their target audience needs. That is a sane economic decision.

Their Xeon range has CPUs with more PCI-e lanes, as do some of the new AMD chips (64 PCI-e lanes on AMD's Threadripper CPUs, and twice that on their server products).

Instead your motherboard has the Z270 chipset which provides additional PCI-e lanes.

The chipset supports these in x1, x2, and x4 configurations.

With a max of x4 it seems that 2x8 from the CPU still is the best bet for CUDA.

A lot of the remaining lanes are used by the motherboard for SATA, USB, network etc. Four of them are available to the end users via 4 PCIe 3.0 x1 expansion slots. Do not put the graphics cards in these!

As for NVMe: the board has two PCIe Gen3 x4 Ultra M.2 slots. The manual for this board mentions:

* If M2_1 is occupied by a SATA-type M.2 device, SATA_5 will be disabled.  
* If M2_2 is occupied by a SATA-type M.2 device, SATA_0 will be disabled.

I did not spot any of the usual 'If M.2 is used with an NVMe device then ...' notes, which means it probably has dedicated lanes from the chipset.

Do I leave the 2 GPUs for compute, or will the integrated graphics create a bottleneck (ie force 8x 4x 4x, vs 8x 8x lanes for GPU) for the GPUs (meaning maybe I should disable integrated and share 1 GPU with desktop UI).

I do not see any reason why the integrated graphics should bottleneck you. Nor do I expect problems if you disable it. A machine that only does calculations could even run without graphics at all (just SSH in).

Having two graphics cards dedicated to calculations and the integrated graphics for the rest (e.g. a simple terminal screen) seems cleanest to me. But that is not anything I can quantify.

My goal is to get max performance from GPUs. If there are bottlenecks, I want to know if it can cause more than 5% performance difference, which is 8hrs multiplied over a week of GPU usage

Best advice on this is to measure what your system is doing. If at all possible, spend a week with it in test mode. Run it with both cards in x8. Measure. Use software to downgrade the PCI-e lanes to x4. Check how much performance is lost. If that is less than 1%, try with one card at x16 and one card in an x1 slot. I am guessing that 2 x8 will be much better, but I might be wrong.
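
To turn such measurements into the "hours per week" figure from the question, something like the following Python sketch would do. It assumes you time the same fixed workload under each configuration; the function names and the timings in the example are hypothetical, not measurements.

```python
HOURS_PER_WEEK = 7 * 24


def slowdown_fraction(baseline_s, candidate_s):
    """Extra run time of the candidate relative to the baseline (0.05 == 5% slower)."""
    return (candidate_s - baseline_s) / baseline_s


def extra_hours_per_week(baseline_s, candidate_s):
    """Extra hours the candidate needs for work that takes the baseline one week."""
    return slowdown_fraction(baseline_s, candidate_s) * HOURS_PER_WEEK


if __name__ == "__main__":
    # Hypothetical timings in seconds for the same benchmark run.
    baseline = 1000.0   # e.g. both cards at x8
    candidate = 1050.0  # e.g. both cards forced to x4
    print(f"Slowdown: {slowdown_fraction(baseline, candidate):.1%}")
    print(f"Extra hours per week: {extra_hours_per_week(baseline, candidate):.1f}")
```

A 5% slowdown works out to about 8.4 extra hours over a 168-hour week, which matches the rough figure in the question. Repeat each timing a few times so that run-to-run noise is not mistaken for a lane effect.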

Do similar things with other settings. E.g. try running without hyperthreading. (HT on is generally 30%-ish faster, but occasionally it actually slows things down, so test.) Try disabling power saving, and so on.

Then, after a few days of testing with a few different tests, go run production.