Difference between different ways of writing records to Cassandra using Flink

I was going through lots of posts over SO and the official documentation of Flink but I couldn't get what I was looking for. I am looking for the difference between RichSinkFunction, RichAsyncFunction, AsyncIO and CassandraSink for writing records faster/multithreaded in the Cassandra DB using Flink.

My understanding is as follows:

  1. RichSinkFunction - If implemented properly, then it will do the work for you. Since it opens and closes the connection once.
  2. RichAsyncFunction- Implementation same as RichSinkFunction. It works in Sync mode originally. I can use executorService for multithreading purposes. Also, I read that if the capacity is passed thoughtfully, it can give you higher throughput.
  3. AsyncIO - Doesn't work multithreaded by default. Also, according to one of the SO answers, we can use executorService same as RichAsyncFunction for creating separate threads which are not mentioned in the documentation.
  4. CassandraSink - Provided by Flink with various properties. Will using setMaxConcurrentReqeuests give me faster results?

What would be the best to use among the mentioned classes for the purpose I am looking for via the Flink program?


Solution 1:

I think you can use CassandraSink, which uses DataStax's java-driver(https://github.com/datastax/java-driver) to access Cassandra. Flink uses executeAsync function to achieve better speed. As you mentioned, setMaxConcurrentRequests sets the max number of requests that can be sent from the same session.

RichSinkFunction is a fundamental function in Flink. We can implement our own Cassandra Sink with RichSinkFunction, but it requires work like initialising the client, creating threads, etc.

RichAsyncFunction is mainly for the AsyncIO operator. We still have to initialise the Cassandra client. Except that, we only have to focus on implementing the asyncInvoke function and failure handling.

CassandraSink implements RichSinkFunction with Cassandra client's async API. This is the easiest async API with the least code.