How to integrate Apache Spark with MySQL for reading database tables as a spark dataframe? [closed]

I want to run my existing application with Apache Spark and MySQL.


From PySpark, this works for me:

dataframe_mysql = mySqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/my_db_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="my_tablename",
    user="root",
    password="root").load()

With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter: use SparkSession.read to access a DataFrameReader and Dataset.write to access a DataFrameWriter.

The examples below assume you are using spark-shell.

read example

val prop = new java.util.Properties()
prop.put("user", "username")
prop.put("password", "yourpassword")
val url = "jdbc:mysql://host:port/db_name"

val df = spark.read.jdbc(url, "table_name", prop)
df.show()

read example 2

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

(adapted from the Spark documentation)

read example3

If you want to read data from a query result rather than a whole table, wrap the query in parentheses and give it an alias:

val sql="""select * from db.your_table where id>1"""
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql:dbserver")
  .option("dbtable",  s"( $sql ) t")
  .option("user", "username")
  .option("password", "password")
  .load()
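For large tables you can also parallelize the read by splitting it across several JDBC connections with the partitioning options; a minimal sketch, assuming the table has an indexed numeric id column (the bounds and partition count here are illustrative):

val partitionedDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbserver:3306/db_name")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id")  // numeric column to split on
  .option("lowerBound", "1")        // smallest value of the partition column
  .option("upperBound", "100000")   // largest value of the partition column
  .option("numPartitions", "10")    // number of parallel JDBC connections
  .load()

All four partitioning options must be given together.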

write example

import org.apache.spark.sql.SaveMode

val prop = new java.util.Properties()
prop.put("user", "username")
prop.put("password", "yourpassword")
val url = "jdbc:mysql://host:port/db_name"
// df is a DataFrame containing the data you want to write
df.write.mode(SaveMode.Append).jdbc(url, "table_name", prop)
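If you want to replace the table contents instead of appending, use SaveMode.Overwrite; be aware that by default it drops and recreates the target table:

df.write.mode(SaveMode.Overwrite).jdbc(url, "table_name", prop)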



Using Scala, this worked for me. Launch spark-shell with the MySQL connector JAR (and any other JARs you need) on the classpath:

sudo -u root spark-shell --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar,/mnt/resource/lokeshtest/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar --packages com.databricks:spark-csv_2.10:1.2.0

import org.apache.spark.sql.SQLContext

val sqlcontext = new SQLContext(sc)

val dataframe_mysql = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://Public_IP:3306/DB_NAME")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tblage")
  .option("user", "sqluser")
  .option("password", "sqluser")
  .load()

dataframe_mysql.show()
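You can also register the result as a temporary table and query it with Spark SQL; a quick sketch using the names above (the age column is an assumption about tblage's schema):

dataframe_mysql.registerTempTable("tblage")
sqlcontext.sql("select * from tblage where age > 30").show()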

For Scala, if you use sbt, this will also work.

In your build.sbt file:

libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.6.2",
    "org.apache.spark" %% "spark-sql" % "1.6.2",
    "org.apache.spark" %% "spark-mllib" % "1.6.2",
    "mysql" % "mysql-connector-java" % "5.1.12"
)

Then you just need to load the driver class and read as usual:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// load the MySQL JDBC driver class
Class.forName("com.mysql.jdbc.Driver").newInstance

val conf = new SparkConf().setAppName("MY_APP_NAME").setMaster("MASTER")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val data = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<HOST>:3306/<database>")
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  .option("dbtable", "MYSQL_QUERY")
  .load()
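Alternatively, you can skip the Class.forName call and hand the driver class to Spark through the JDBC options; a minimal sketch:

val data = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<HOST>:3306/<database>")
  .option("driver", "com.mysql.jdbc.Driver")  // driver class passed as an option
  .option("user", "<USERNAME>")
  .option("password", "<PASSWORD>")
  .option("dbtable", "MYSQL_QUERY")
  .load()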

For Java (using Maven), add the Spark dependencies and the MySQL JDBC driver dependency to your pom.xml file:

<properties>
    <java.version>1.8</java.version>
    <spark.version>1.6.3</spark.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
 <dependencies>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>6.0.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Sample code: suppose your MySQL server runs locally, the database name is test, the user name is root, the password is password, and the test database contains two tables, table1 and table2.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf sparkConf = new SparkConf();
SparkContext sc = new SparkContext("local", "spark-mysql-test", sparkConf);
SQLContext sqlContext = new SQLContext(sc);

// here you can run a sql query
String sql = "(select * from table1 join table2 on table1.id=table2.table1_id) as test_table";
// or use an existing table directly
// String sql = "table1";
DataFrame dataFrame = sqlContext
    .read()
    .format("jdbc")
    .option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
    .option("user", "root")
    .option("password", "password")
    .option("dbtable", sql)
    .load();

// continue with your own logic here
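Note: with mysql-connector-java 6.x (as declared in the pom above), the driver class is com.mysql.cj.jdbc.Driver rather than com.mysql.jdbc.Driver; if Spark cannot resolve the driver automatically, pass it explicitly with .option("driver", "com.mysql.cj.jdbc.Driver").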