Apache Spark vs Akka [closed]

Early versions of Apache Spark were built on Akka, which Spark used internally for communication between nodes.

Akka is a general-purpose framework for creating reactive, distributed, parallel and resilient concurrent applications in Scala or Java. Akka uses the Actor model to hide all the thread-related code and gives you simple, helpful interfaces for implementing a scalable and fault-tolerant system. A good example use of Akka is a real-time application that consumes and processes data coming from mobile phones and sends it to some kind of storage.
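To make the Actor idea concrete, here is a minimal sketch in plain Python. It is not Akka's API: the `Actor` class, `tell`, `stop`, and `run_pipeline` are hypothetical names invented for illustration, and the temperature readings are made-up sample data. It only shows the core idea that each actor owns a mailbox and handles one message at a time, so the calling code never touches locks directly.

```python
import queue
import threading


class Actor:
    """A tiny actor: one mailbox, one worker thread, messages handled one at a time."""

    _STOP = object()  # sentinel that tells the worker loop to exit

    def __init__(self, handler):
        self._handler = handler          # function called once per message
        self._mailbox = queue.Queue()    # thread-safe FIFO mailbox
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Stop after all earlier messages are handled, then wait for the worker."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._handler(message)


def run_pipeline(readings):
    """Send phone readings through a processing actor into a storage actor."""
    storage = []  # stands in for "some kind of storage"
    store_actor = Actor(storage.append)

    def process(reading):
        # Convert the reading, then forward the result to the storage actor.
        celsius = round((reading["fahrenheit"] - 32) * 5 / 9, 1)
        store_actor.tell({"device": reading["device"], "celsius": celsius})

    process_actor = Actor(process)
    for reading in readings:
        process_actor.tell(reading)
    process_actor.stop()  # every reading has been processed and forwarded
    store_actor.stop()    # every forwarded record has been stored
    return storage
```

Calling `run_pipeline` with a list of `{"device": ..., "fahrenheit": ...}` dictionaries returns the stored records in arrival order. A real Akka system adds what this sketch leaves out, such as supervision, remoting, and scheduling many actors onto a small thread pool.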

Apache Spark (not Spark Streaming) is a framework for processing batch data using a generalized version of the map-reduce algorithm. A good example use of Apache Spark is calculating metrics over stored data to get better insight into that data. The data gets loaded and processed on demand.
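The map-reduce pattern mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the pattern on a single machine, not Spark code: `map_reduce` and the `stored_logs` sample data are hypothetical, and Spark's real value is in distributing these same phases across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group the emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle phase: group values by key
    # reduce phase: fold each key's values into a single result
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical stored data: (user, bytes_transferred) log entries.
stored_logs = [("alice", 120), ("bob", 300), ("alice", 80), ("bob", 50), ("carol", 10)]


def by_user(entry):
    """Mapper: key each log entry by its user."""
    user, size = entry
    return [(user, size)]


# Two metrics computed over the same stored data with different reducers.
total_bytes = map_reduce(stored_logs, by_user, lambda a, b: a + b)
max_bytes = map_reduce(stored_logs, by_user, max)
```

Swapping the reducer is enough to get a different metric out of the same data, which is the sense in which the approach is "generalized".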

Apache Spark Streaming can perform similar actions and functions on small, near real-time batches of data, the same way you would if the data were already stored.
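The micro-batch idea can be illustrated with a short plain-Python sketch. It is not the Spark Streaming API: `micro_batches` and `average` are hypothetical helpers, and the batches here are cut by item count to keep the example deterministic, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def average(values):
    """An ordinary batch function, the same one you would run on stored data."""
    return sum(values) / len(values)


# A made-up incoming stream; in practice this would be an unbounded source.
incoming = iter([10, 20, 30, 40, 50, 60, 70])
per_batch_average = [average(batch) for batch in micro_batches(incoming, 3)]
```

The point is that `average` knows nothing about streaming: the stream is chopped into small batches and the unchanged batch logic is applied to each one.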

UPDATE APRIL 2016

As of Apache Spark 1.6.0, Spark no longer relies on Akka for communication between nodes. Thanks to @EugeneMi for the comment.

Spark is to data processing what Akka is to managing data and instruction flow in an application.

TL;DR

Spark and Akka are two different frameworks with different uses and use cases.

When building applications, distributed or otherwise, one may need to schedule and manage tasks through a parallel approach such as by using threads. Imagine a huge application with lots of threads. How complicated would that be?

The Akka toolkit from Typesafe (now called Lightbend) lets you use Actor systems (inspired by Erlang), which give you an abstraction layer over threads. These actors communicate with each other by passing messages of any kind, and do their work in parallel without blocking other code.

As a cherry on top, Akka also provides ways to run the Actors in a distributed environment.

Apache Spark, on the other hand, is a data processing framework for massive datasets that cannot be handled manually. Spark makes use of what we call an RDD (Resilient Distributed Dataset), a distributed, list-like abstraction layer over your traditional data structures, so that operations can be performed on different nodes in parallel.
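As a rough mental model of an RDD, consider the following plain-Python sketch. `PartitionedList` is a hypothetical toy class, not Spark's RDD API: the partitions live in one process and threads stand in for cluster nodes. It mimics two properties of RDDs: data is split into partitions that are processed independently, and transformations are only recorded until an action such as `collect` runs them.

```python
from concurrent.futures import ThreadPoolExecutor


class PartitionedList:
    """Toy stand-in for an RDD: partitioned data with lazily applied transformations."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions   # list of lists, one per pretend node
        self._steps = tuple(steps)      # recorded transformations, not yet run

    @classmethod
    def from_list(cls, data, num_partitions):
        """Split data into contiguous chunks, one per partition."""
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        """Record a map step; nothing is computed yet."""
        return PartitionedList(self._partitions, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Record a filter step; nothing is computed yet."""
        return PartitionedList(self._partitions, self._steps + (("filter", predicate),))

    def _run_partition(self, partition):
        """Apply every recorded step to a single partition."""
        items = partition
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        """Action: process all partitions in parallel and gather results in order."""
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(self._run_partition, self._partitions))
        return [x for part in results for x in part]


numbers = PartitionedList.from_list(list(range(1, 11)), 3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
```

What the sketch leaves out is the "resilient" part: a real RDD remembers how each partition was derived so that a lost partition can be recomputed on another node.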

Earlier versions of Spark (before 1.6.0) made use of the Akka toolkit for communication between different nodes, for example when scheduling jobs.

Apache Spark:

Apache Spark™ is a fast and general engine for large-scale data processing.

Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as well as the source of data (batch v. real-time streaming data).

Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run in Standalone mode
Provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing.

We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.

Have a look at the InfoQ and Toptal articles for a better understanding.

Major Use cases for Spark:

Machine Learning algorithms
Interactive analytics
Streaming data

Akka: from Letitcrash

Akka is an event-driven middleware framework, for building high performance and reliable distributed applications in Java and Scala. Akka decouples business logic from low-level mechanisms such as threads, locks and non-blocking IO. With Akka, you can easily configure how actors will be created, destroyed, scheduled, and restarted upon failure.

Have a look at this Typesafe article for a better understanding of the Actor framework.

Akka provides fault tolerance based on supervisor hierarchies. Every Actor can create other Actors, which it will then supervise, deciding whether they should be resumed, restarted, or retired (stopped), or whether the problem should be escalated.
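The four supervision decisions can be sketched in plain Python. This is not Akka's API: `supervise`, `Escalate`, `make_counter`, and the `max_restarts` parameter are hypothetical names for illustration, and the "child" here is just a function with internal state rather than a real actor.

```python
class Escalate(Exception):
    """Raised by a supervisor to pass a failure up to its own parent."""


def supervise(make_child, messages, decide, max_restarts=3):
    """Feed messages to a child; on failure apply the directive chosen by decide(exc).

    decide returns one of "resume", "restart", "stop", "escalate".
    Returns the list of results the child produced.
    """
    child = make_child()
    results = []
    restarts = 0
    for message in messages:
        try:
            results.append(child(message))
        except Exception as exc:
            directive = decide(exc)
            if directive == "resume":
                continue                       # drop the bad message, keep child state
            if directive == "restart" and restarts < max_restarts:
                restarts += 1
                child = make_child()           # fresh child, internal state reset
                continue
            if directive == "stop":
                break                          # retire the child permanently
            raise Escalate(str(exc)) from exc  # escalate, or restart budget is spent
    return results


def make_counter():
    """Child with internal state: returns the running total of the numbers it has seen."""
    total = 0

    def handle(message):
        nonlocal total
        total += int(message)  # raises ValueError on non-numeric input
        return total

    return handle
```

With the messages `["1", "2", "x", "3"]`, a "resume" decision keeps the running total across the failure, a "restart" decision resets it, a "stop" decision ends processing at the bad message, and an "escalate" decision raises `Escalate` for the caller to handle.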

Have a look at Akka article & SO questions

Major use cases for Akka:

Transaction processing
Concurrency/parallelism
Simulation
Batch processing
Gaming and Betting
Complex Event Stream Processing

The choice between Apache Spark, Akka, and Kafka depends heavily on the use case, in particular the context and background of the services being designed and deployed. Factors include latency, volume, third-party integrations, and the nature of the processing required (batch, streaming, etc.). I found this resource particularly helpful: https://conferences.oreilly.com/strata/strata-ca-2016/public/schedule/detail/47251
