Unable to get Mellanox ConnectX-3 to work with transceiver/cable
TL;DR: Auto-negotiation was turned off on my switch because I had followed the manufacturer's recommended settings for a 40G connection. Turning auto-negotiation back on solved the issue.
I want to answer my own question with details from my adventure getting my 40 gigabit network set up, so that anyone else who tries this in the future has some reference points.
I think it's important to note that I used my 40G NICs in Ethernet mode, not InfiniBand. The plain Ethernet drivers did seem to work, but I ended up staying on the OFED drivers because they worked and I didn't want to mess with it anymore. If you plan on building a setup like this, make sure your card is capable of Ethernet mode!
What I tried
Once I got the switch, NICs and cables, I installed the OFED (OpenFabrics Enterprise Distribution) drivers/software provided by Mellanox/Nvidia. When those failed to establish a link, I used the tools built into the software to update the cards' firmware. It was rather simple; the only issue I had was finding the latest firmware .bin file for my specific cards. The firmware I ended up on was 2.33.5000, still quite old but newer than what was on the cards.
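For reference, the query/burn procedure with the Mellanox firmware tools looks roughly like the sketch below. The device path and firmware file name are placeholders; use whatever mst status reports for your card and the .bin you found for your exact part number.
# Start the Mellanox Software Tools service and list detected devices
mst start
mst status
# Check the firmware version and PSID currently on the card
# (/dev/mst/mt4099_pci_cr0 is a placeholder path taken from `mst status`)
flint -d /dev/mst/mt4099_pci_cr0 query
# Burn the new firmware image (placeholder file name)
flint -d /dev/mst/mt4099_pci_cr0 -i fw-ConnectX3-rel-2_33_5000.bin burn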
After that failed, I assumed the cables/transceivers (one unit) were the culprit. I swapped the cables I had originally gotten (a 56G 10m AOC and a 56G 2m DAC) for a pair of customized cables (a 40G 11m AOC and a 40G 1m DAC) designed for the specific Mikrotik switch I had bought. Since these were customized, they took a month to arrive. When they arrived and didn't work either, I was stumped and proceeded to seek help on various forums. Before long it was suggested I purchase a tool from FS.com that would let me re-code the vendor field on the transceivers to hopefully trick the NIC into working.
Since the cables were customized for the switch, I assumed it was the NIC that wasn't cooperating. Setting the transceiver vendor to either IBM or Mellanox didn't work. After seeking further assistance, a couple of people suggested I find documentation on the NICs and look up compatible cables/transceivers. I did find a PDF (although not provided or made by IBM/Mellanox) that listed some compatible FS.com part numbers, so I purchased the IBM 49Y7890 1m DAC from FS.com.
Once it arrived I found this was not the solution either. Out of desperation, I found a few threads from people who had flashed their cards to true Mellanox firmware and decided to try my hand at it. After some troubleshooting to get the updater working, I successfully flashed firmware version 2.42.5000 with the new PSID MT_1100120019 (see paragraph 4 of 'This wasn't the end of it' for details about how this can mess things up, and see here for how to cross flash). After this attempt also failed, further discussion eventually concluded that I should test the NICs directly connected to one another. Once I had the NICs connected together and their subnet set up, I saw speeds of 36.5GBit/s across several simultaneous iperf tests (a single iperf or iperf3 instance can't saturate the link, so you'll need to run several in parallel; I ran 16 instances, each set to use 10 parallel streams). Having eliminated the NICs from the list of culprits, I started to wonder whether the auto-negotiation setting on the switch was the issue. Turning it back on, I immediately saw 'link ok'.
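If you want to repeat the direct NIC-to-NIC test, a rough sketch of launching multiple parallel iperf3 instances follows. The IP address, port range and stream counts are examples, not the exact values I used.
# Receiving side: start 16 iperf3 servers on consecutive ports
for i in $(seq 0 15); do
    iperf3 -s -p $((5201 + i)) &
done
# Sending side: start 16 clients, 10 parallel streams each
# (192.168.0.2 is a placeholder address for the receiving NIC)
for i in $(seq 0 15); do
    iperf3 -c 192.168.0.2 -p $((5201 + i)) -P 10 -t 30 &
done
wait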
This wasn't the end of it
I had gotten the setup to work; it turns out there was no compatibility issue, and I likely never needed to swap out my cables or buy the IBM one. I was ecstatic, but this was far from over. I had intended to run this setup with Proxmox on my server and Windows on a client machine, both equipped with 40G.
Since I knew I was going to mess up the Proxmox install several times, I first backed up everything to another drive. Once that was done, I proceeded to install the Mellanox OFED drivers on Proxmox. There are a couple of issues with trying this: the OFED installer tries to remove very critical packages from Proxmox because they supposedly 'interfere' with the drivers (they don't). So I edited the mlnxofedinstaller script and commented out all calls to the 'remove_old_packages' function. This prevented the installer from giving Proxmox a lobotomy.
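If you want to do the same, something along these lines will locate the call sites (the script and function names are as described above; the exact layout depends on your OFED version, so check what you're commenting out before re-running the installer):
# Find every line that references the package-removal function
grep -n 'remove_old_packages' mlnxofedinstaller
# Comment out the call sites in an editor, then run the installer again
./mlnxofedinstaller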
At this point most things worked; the only issue I was having was sending data to the server. It wasn't accepting more than a few megabytes per second, far less than I should have been getting. I tried many different versions of the software: the Ubuntu 20.04 and 19.XX builds didn't work because of dependencies that Proxmox doesn't have but those two distributions do, so I was forced to install the Ubuntu 18.04 drivers, the latest without dependency issues. Installing the drivers this way still didn't solve the speed issue, so I tried installing only the kernel packages using the --kernel-only flag on the installer. At some point I had gotten the speeds I was looking for, but this turned out to be a fluke, as I wasn't able to replicate them later. I then tried a variation of the Debian 10 drivers and got slightly better speeds, around 20MB/s. After some time bouncing ideas off someone else, I tried setting the 40G network to 9000 MTU. This led to some seriously odd results: speeds of barely 1 gigabit, even though the entire path had an MTU of 9000. I switched it back to 1500 to do further testing on Ubuntu instead of Proxmox, since I had seen good speeds on Ubuntu. That failed to pan out; the speed tests I ran originally must have been a fluke.
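For anyone experimenting with jumbo frames the same way, the MTU can be set either temporarily or persistently; a minimal sketch (ens19 is the 40G interface name used later in this post, so adjust to match your system, and remember both ends plus the switch need matching MTUs):
# Temporary change, reverts on reboot
ip link set dev ens19 mtu 9000
# Persistent change: add an mtu line to the interface stanza in /etc/network/interfaces
iface ens19 inet static
...
mtu 9000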
I decided to swap the NICs between the systems, labeling them 1 and 2 as I took them out so I didn't get them confused. After running more speed tests, it turned out the card that had been in the Proxmox system was the issue: it could send at full speed but not receive at full speed. I recalled the drivers updating the firmware on that NIC and hadn't thought much of it since it was the latest version, so I re-flashed the cross-flashed firmware I had originally installed. After further testing, we concluded the remaining limits of 22GBit/s up and 11GBit/s down were the result of various bottlenecks between the systems. Specifically, testing on RAM disks with a 30 gibibyte file, we found the server with twice the DIMMs populated was able to write at twice the speed. Using the NVMe drive with an NTFS file system on the test system did poorly because the compatibility layer is single threaded. After a dozen more iperf tests, everything was running smoothly, even with the server running Proxmox.
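The RAM disk testing can be reproduced with a tmpfs mount and a large test file; a rough sketch (sizes and paths are examples only):
# Create a 32G RAM disk so storage is taken out of the equation
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=32G tmpfs /mnt/ramdisk
# Generate a 30 GiB test file to copy across the 40G link
dd if=/dev/zero of=/mnt/ramdisk/testfile bs=1M count=30720 status=progress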
One caveat when using the OFED drivers: you will lose the ability to connect to CIFS network shares. The OFED install unloads the CIFS kernel module and keeps it out until the driver is no longer running. The plain Ethernet drivers work, but it may be necessary to cross flash to Mellanox firmware to use them.
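A quick way to check whether the CIFS module is currently available (purely a diagnostic sketch; it won't fix the conflict while the OFED stack is loaded):
# See whether the cifs module is currently loaded
lsmod | grep cifs
# Try loading it manually; with the OFED drivers active this may fail or be unloaded again
modprobe cifs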
The route ahead
Since I was on a budget of about $1,500, I had to go with some of the cheapest equipment I could find. Hence the $60 network cards. When I found this Mikrotik switch new for $500 I was stoked. It had all I needed for the best price I could find, even beating some used units. It didn't have port licensing and came with one of the top tier software licenses. It was really a hard deal to beat. Of course everything comes with compromises.
Even though I didn't really intend to use the 10G SFP+ ports, I wanted them for future expansion. I had gotten an SFP+ to RJ45 adapter and a 10G NIC so I had something to test while the 40G equipment was in shipping. I was able to receive a total of 2 gigabits per second on the 10G NIC, which was all the data I could feed it between my 1 gigabit internet connection and my 1 gigabit equipped server. But attempting to run a gigabit upload to the internet from the 10G card resulted in much lower speeds than I was expecting: around 300Mbps, despite quite reliably being able to hit 900Mbps otherwise. I asked around, and the conclusion was that the switch doesn't have the buffer capacity to step 10G traffic down to 1G. This theory is supported by what happened when I moved my router's 1G uplink to a 10G port and tried a gigabit upload from a 40G system (only a 4x step-down instead of 10x): speeds dropped to ~1Mbps, which suggests the 48 1G ports share a buffer.
This isn't really an issue for my Windows machine, since I never upload at those speeds anyway. But for my server this is a pretty big deal. Having the upload bandwidth chopped to a third could end up being a real issue. After digging around some I found I could use routing metrics to force traffic through either the 40G NIC or the 1G NIC depending on where it's going. While this solution isn't 100% perfect, it still works quite well.
Using the route -n command, I'm able to see my current route paths. The goal is to modify the routes so that the 40G link is preferred for local connections and the 1G link is preferred for internet connections. The higher the metric on a route, the more expensive it is to use, so the system will pick the least expensive matching route.
Proxmox ships with ifupdown by default; it's more stable and has more features. Netplan can add routes, but it can't remove or modify them, and it doesn't let you run commands before, during or after interface start. You can still use netplan, but you would need to set up a separate service to remove/modify the additional routes.
This is my current /etc/network/interfaces config; I had to add post-up commands to my NICs to add the routes:
# 1 Gigabit NIC
auto ens18
iface ens18 inet static
...
# Make the local subnet expensive over the 1G link
post-up /usr/sbin/route add -net 192.168.0.0/24 metric 1000 ens18

# 40 Gigabit NIC
auto ens19
iface ens19 inet static
...
# Expensive default route over the 40G link (the 1G NIC keeps the cheap default)
post-up /usr/sbin/route add -net 0.0.0.0/0 gw 192.168.0.1 metric 1000 ens19
# Cheap local-subnet route over the 40G link, then drop the metric 0 route that was added automatically
post-up /usr/sbin/route add -net 192.168.0.0/24 metric 1 ens19
post-up /usr/sbin/route del -net 192.168.0.0/24 metric 0 ens19
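After editing the file, the interfaces need to be bounced (or the box rebooted) for the post-up commands to run; something like the following, assuming the ens18/ens19 names above:
# Re-apply the network config (do this from a console, not over the interface you're restarting)
ifdown ens18 ens19 && ifup ens18 ens19
# Verify the resulting routes
route -n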
Your routes should then look like this:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    1      0        0 ens18
0.0.0.0         192.168.0.1     0.0.0.0         UG    1000   0        0 ens19
192.168.0.0     0.0.0.0         255.255.255.0   U     1      0        0 ens19
192.168.0.0     0.0.0.0         255.255.255.0   U     1000   0        0 ens18
Obviously these interfaces will need different local IPs. I suggest using the IP assigned to the 40G NIC for anything local, and the gigabit NIC for anything that needs to be port forwarded. It should be fine to use the gigabit NIC locally as long as you aren't sending more than about 100MB at a time. With this routing, even 40 gigabit local traffic sent to the IP bound to the gigabit port can work, although it isn't always consistent.
It's important to note that if you're modifying a route, you should add the modified version before you remove the old one. Also, your setup may not need exactly the same commands as posted above; for example, my Proxmox install already adds a route for ens18, so I had to remove it after adding the one I wanted.
And that's it! I have finally gotten my setup completed with the speeds I wanted. I'm able to transfer to my server at about 1.7GB/s and from it at about 1GB/s (the limitation being either NTFS or one of the SSDs).