Difference between different ways of writing records to Cassandra using Flink
I was going through lots of posts over SO
and the official documentation of Flink but I couldn't get what I was looking for. I am looking for the difference between RichSinkFunction
, RichAsyncFunction
, AsyncIO
and CassandraSink
for writing records faster/multithreaded
in the Cassandra DB
using Flink
.
My understanding is as follows:
-
RichSinkFunction
- If implemented properly, then it will do the work for you. Since it opens and closes the connection once. -
RichAsyncFunction
- Implementation same as RichSinkFunction. It works inSync
mode originally. I can useexecutorService
for multithreading purposes. Also, I read that if the capacity is passed thoughtfully, it can give you higher throughput. -
AsyncIO
- Doesn't work multithreaded by default. Also, according to one of the SO answers, we can useexecutorService
same asRichAsyncFunction
for creating separate threads which are not mentioned in the documentation. -
CassandraSink
- Provided byFlink
with various properties. Will usingsetMaxConcurrentReqeuests
give me faster results?
What would be the best to use among the mentioned classes for the purpose I am looking for via the Flink program?
Solution 1:
I think you can use CassandraSink
, which uses DataStax's java-driver(https://github.com/datastax/java-driver) to access Cassandra. Flink uses executeAsync
function to achieve better speed. As you mentioned, setMaxConcurrentRequests
sets the max number of requests that can be sent from the same session.
RichSinkFunction
is a fundamental function in Flink. We can implement our own Cassandra Sink with RichSinkFunction
, but it requires work like initialising the client, creating threads, etc.
RichAsyncFunction
is mainly for the AsyncIO
operator. We still have to initialise the Cassandra client. Except that, we only have to focus on implementing the asyncInvoke
function and failure handling.
CassandraSink
implements RichSinkFunction
with Cassandra client's async API. This is the easiest async API with the least code.