A roundrobin for incoming files

Solution 1:

Here is one solution to what you're looking for. No java is involved in the making of this system, just readily available Open Source bits. The model presented here can work with other technologies than the ones I'm using as an example.

Scalable Upload

  1. Files are HTTP POSTed to a specific Round-Robin DNS address.
  2. The system POSTing the files then drops a job into an AMQP system (Rabbit MQ here), by way of another pair of load-balancers, to start the processing workflow.
  3. The Load Balancers receiving the HTTP POST are each in front of a group of OpenStack Swift object store servers.
    • The load-balancers each have two or more OpenStack Swift object-store servers behind them.
    • 'Round Robin is not HA' can be if the targets are HA themselves. YMMV.
    • For extra durability, the IPs in the RRDNS could be individual hot-standby LB clusters.
  4. The Object Store server that actually gets the POST delivers the file to a Gluster-based file-system.
    • The Gluster system should be both Distributed (a.k.a. sharded) and Replicated. This allows it to scale to silly densities.
  5. The AMQP system dispatches the first job, make the backup, to an available processing node.
  6. Processing node copies the file from main storage to backup storage and reports success/failure as needed.
    • Failure mode processing is not diagrammed here. Essentially, keep trying until it works. And if it never works, run through an exceptions process.
  7. Once the backup is complete AMQP then dispatches the Processing job to an available processing node.
  8. Processing node either pulls the file to its local file-system or processes it directly from Gluster.
  9. Processing node deposits processing product wherever that goes and reports success to AMQP.

This setup should be able to ingest files at extreme rates of speed given enough servers. Getting 10GbE aggregate ingestion speeds should be doable if you upsize it enough. Of course, processing that much data that fast will require even more servers in your Processing machine-class. This setup should scale up to a thousand nodes, and probably beyond (though how far depends on what, exactly, you're doing with all of this).

The deep engineering challenges will be in the workflow management process hidden inside the AMQP process. That's all software, and probably custom built to your system's demands. But it should be well fed with data!

Solution 2:

Given that you've clarified that files will arrive via scp, I don't see any reason for the front-end server to exist at all, as the transport mechanism is something that can be redirected at layer 3.

I'd put an LVS director (pair) in front, with a processing server pool behind and a round-robin redirection policy. That makes it very easy to add and subtract servers to/from the pool, it increases reliability because there's no front-end server to fall over, and it means we don't have to address the pull/push question about getting the files from the front-end to the processing servers because there is no front-end.

Each pool server should then do two things when receiving a file - firstly, copy it to archival storage, then process the file and send it on its way.