Low performance with both iSCSI and AoE

We are looking for reasonably fast storage. Because of a low budget we decided to use software iSCSI or AoE targets. Before we change our production infrastructure, we are doing some tests to choose the best technology.

For testing we use:

  • Fujitsu Siemens RX200 S4 as target
  • Fujitsu Siemens RX200 S4 as initiator
  • NetGear managed 1GBit switch
  • onboard NICs (Broadcom w/TOE), EdiMax NICs, Broadcom NICs w/TOE - all 1GBit
  • target server is using a QLogic controller with six 2TB WD Blue SATA drives.
  • both target and initiator operating systems are Ubuntu 16.04 LTS with all updates. The switch is dedicated to storage traffic. We test bonds and multipathing.

Our problem is low read speed. For testing we use dd and a 40-100GB file.

  • local read and write on the target server is over 300MB/s.
  • writing to the server over iSCSI or AoE is over 200MB/s, which satisfies us.
  • reading from the server is always 95-99MB/s.

We have tried ietd, aoetools and LIO. We have used bonds of 2 NICs (balance-rr and LACP) and round-robin multipathing. We used both normal and jumbo frames. Finally we even made a direct Ethernet connection between target and initiator (no switch).

All tests give more or less the same results (of course, using ordinary NICs without TOE with iSCSI gave 20-30% worse results).

Testing the network with iperf showed transfers of about 200MB/s (2Gbit). Watching NIC usage on the target with bmon showed equal utilization of both devices (each about 50MB/s for reading, about 100MB/s for writing).

As we had no luck, we decided to add a third NIC (on both sides, of course). The results were strange:

  • 2 NICs - 50MB/s each
  • 3 NICs - 33MB/s each

Is there any limit in the target software that caps output at 1Gbit/s?

What are we doing wrong?


Solution 1:

To squeeze maximum performance out of iSCSI-connected storage you should use jumbo frames and MPIO (not LACP). RDMA/iSER is recommended if you can do it.
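
Jumbo frames are just an MTU change that has to be applied on every hop: the NICs on both sides and the switch ports. A minimal sketch, assuming the storage NIC is eth1 and <target-ip> stands in for your target's address:

    # raise the MTU (the switch ports must also allow jumbo frames)
    ip link set dev eth1 mtu 9000

    # verify that 9000-byte frames really pass without fragmentation
    # (8972 = 9000 - 20 bytes IP header - 8 bytes ICMP header)
    ping -M do -s 8972 <target-ip>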

AoE (ATA over Ethernet) is old and is shit. We got rid of Coraid years ago. We have been using StarWind https://www.starwindsoftware.com/ as an iSCSI target for quite a while now, and StarWind asked us to migrate off Coraid to whatever storage we could.

So right now we are doing very well with iSCSI provided by StarWind, using Windows, ESX and SCST http://scst.sourceforge.net/ on Linux as initiators. With RDMA/iSER it does up to 10Gbit; we are very happy so far.

Solution 2:

Your expectations of how Ethernet link aggregation works are incorrect.

All aggregation methods other than balance-rr (i.e. all methods whose mode > 0) do not give you greater single-connection throughput; rather, they increase the total available bandwidth when multiple connections are established from/to the affected hosts. In other words, LAG/LACP will not give you any benefit in this single-connection scenario.

The only aggregation method that can give you single-session throughput greater than what you can normally get from a single interface is balance-rr, which distributes packets in a round-robin fashion. You have to set balance-rr on both the initiator and the target. However, a big catch is that this is largely switch-dependent.
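
For reference, a balance-rr bond on Ubuntu 16.04 (ifupdown with the ifenslave package) could look roughly like this; interface names and addresses are assumptions, and the slave interfaces may also need their own 'bond-master bond0' stanzas depending on your setup:

    # /etc/network/interfaces fragment on one machine
    auto bond0
    iface bond0 inet static
        address 192.168.100.1
        netmask 255.255.255.0
        bond-slaves eth1 eth2
        bond-mode balance-rr      # mode 0: per-packet round-robin
        bond-miimon 100           # link monitoring interval in ms

Keep in mind that balance-rr reorders packets across the slaves, which can itself limit TCP throughput.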

Anyway, if you set both target and initiator to balance-rr, directly connecting the two machines should give you increased performance. If not, can you post an iperf run with balance-rr and both machines directly connected (no switch)? Also, please post the exact dd command you used for benchmarking.
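
For reference, commands along these lines would produce the numbers I am asking for (the address and device name are placeholders):

    # on the target
    iperf -s

    # on the initiator, over the bonded link
    iperf -c <target-ip> -t 30

    # sequential read of the imported block device, bypassing the page cache
    dd if=/dev/sdX of=/dev/null bs=1M count=40960 iflag=direct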

Solution 3:

Note: I'm only talking about iSCSI here. I have no experience with AoE beyond reading about it, and I wouldn't implement it in any new infrastructure anyways (it's pretty much defunct).

Don't use balance-rr for anything other than some very specific point-to-point protocols. It has horrible performance when under almost any kind of real world load, and causes a slew of network issues (such as a LOT of jitter). Definitely don't use it with a switch.

Use MPIO without any bonding on the initiator side to accomplish load balancing and fault tolerance. To ensure that your paths do not get "mixed up" by sending all of your traffic down a single path, put individual paths (gigabit NICs between target and initiator, in your case) on separate subnets.
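
A rough sketch of that layout with open-iscsi and dm-multipath, assuming two paths on 192.168.10.0/24 and 192.168.20.0/24 (addresses and the IQN are placeholders):

    # log in to the same target once per path/subnet
    iscsiadm -m discovery -t sendtargets -p 192.168.10.1
    iscsiadm -m discovery -t sendtargets -p 192.168.20.1
    iscsiadm -m node -T iqn.2016-01.example:store1 -p 192.168.10.1 --login
    iscsiadm -m node -T iqn.2016-01.example:store1 -p 192.168.20.1 --login

    # /etc/multipath.conf fragment: spread I/O across both sessions
    defaults {
        path_grouping_policy multibus
        path_selector        "round-robin 0"
    }

'multipath -ll' should then show one multipath device with both paths active.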

Feel free to bond the target side with LACP per path (as in two bonds for two paths for a total of four NICs, as an example target port configuration). This works great, and can balance multiple initiator connections that use the same paths. Also use jumbo frames and iSER if possible. Using LACP on the target will balance connections to each path among several NICs.
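
As an example of that target-side per-path bond, an 802.3ad stanza could look like the sketch below (interface names and the address are assumptions; the switch ports have to be in a matching LACP group):

    # /etc/network/interfaces fragment for one path's bond on the target
    auto bond1
    iface bond1 inet static
        address 192.168.10.1
        netmask 255.255.255.0
        bond-slaves eth3 eth4
        bond-mode 802.3ad                # LACP
        bond-miimon 100
        bond-lacp-rate 1                 # fast LACPDUs
        bond-xmit-hash-policy layer3+4   # hash on IP+port so different
                                         # initiator sessions can land on
                                         # different slaves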

Using LACP on the initiator will only be effective if it's making many target portal connections with simultaneous use (not common for just about any workload). Even if you were to effectively implement LACP per path on the initiator, it would quickly become a cabling nightmare to use (for example) four additional fabrics to each box. If you need more than ~2Gbit/s throughput to a single initiator, consider 10Gbit Ethernet.

Solution 4:

Most of the responses on AoE are totally incorrect, counterfactual, and show a lack of AoE knowledge and experience. First off, it's not defunct. CORAID is the vendor behind AoE, and they restarted as "SouthSuite" while retaining the CORAID trademark. They are the same developers, too. They are making new products and supporting most of the old ones, and they are pushing AoE development forward as well, as their open technical mailing lists clearly show. Check the website; it's all up to date and tells the whole story on their history page.

Someone said AoE won't benefit from jumbo frames; that is also flat wrong. Jumbo frame support arrived with version 13 of 'vbladed'. You do need to tune your MTU to support the new frame size, but otherwise it works great.
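
With vblade as the target, the only requirement is that the NICs on both sides (and the switch) carry the larger MTU before you export. A sketch with assumed interface and device names, using the usual vblade arguments (shelf, slot, interface, device):

    # raise the MTU on both sides first (and on the switch ports)
    ip link set dev eth1 mtu 9000

    # export a disk as shelf 0, slot 1 on that interface
    vbladed 0 1 eth1 /dev/sdb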

iSCSI runs in layer 5 of the OSI model. Its usual transport is TCP. That gives you some error correction (due to checksums in TCP) and allows you to route the traffic over IP at layer 3. That's about where iSCSI's advantages stop. Its real-world performance is downright awful when you actually fairly compare it to something like FCP, AoE, or FCoE. I'd invite you to google "iscsi performance comparison" for the horror show.

Your read-speed issue could have been due to a network misconfiguration: turn off flow control and make sure you use a large enough socket buffer. You also didn't mention whether your underlying filesystem had been tuned for read-prefetching. Based on your scenario, that could help you a lot, but be careful not to use it with certain databases that demand that caching be disabled.
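
Hedged examples of those knobs (interface and device names are placeholders, and the values are only starting points to verify against your own hardware):

    # disable Ethernet flow control on the storage NIC
    ethtool -A eth1 rx off tx off

    # allow larger socket buffers
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216

    # increase block-device read-ahead on the backing store (in 512-byte sectors)
    blockdev --setra 8192 /dev/sdb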

802.3ad aggregation will not increase your single-stream throughput very much, even in a round-robin scenario. It will also complicate your network config and give you a couple of new opportunities to shoot yourself in the foot by mismatching the PDU intervals or misconfiguring your Cisco VPC link to support the active-active state.

Don't use LACP with AoE; let it handle its own multipathing and multiplexing. Later versions of AoE handle this beautifully, and in most cases more gracefully than even FCP, since it's all automatic. Additional Ethernet ports give you more bandwidth and more resiliency. If you spread the host and initiator Ethernet ports over multiple switches, that can provide even more redundancy. There is no need to configure a bonding mode. Also, don't run IP on the same interfaces you use for AoE; that's been known to be problematic for performance at times as well.

In short, don't listen to the AoE naysayers; they sound like they don't have much experience and are just riding trendy brainwaves. Shun the herd. Go configure a backing store with hand-tuned prefetching and you'll probably see your read throughput go way up. Drop the use of aggregation protocols and run screaming from iSCSI. One last thing: stop using 'dd'; it's not a great test and is subject to bad caching effects. Use a real benchmark tool like 'fio', 'iozone', or 'dbench'. Those give much more reliable results.
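
For instance, a simple sequential-read job with fio that bypasses the page cache (the device path is a placeholder):

    fio --name=seqread --filename=/dev/sdX --rw=read --bs=1M \
        --ioengine=libaio --iodepth=16 --direct=1 \
        --runtime=60 --time_based --group_reporting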