Using pyspark to connect to PostgreSQL

Solution 1:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html

Then replace the database configuration values with your own.

from pyspark.sql import SparkSession

# Point spark.jars at the downloaded PostgreSQL JDBC driver jar
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

# Read the table over JDBC; replace url, dbtable, user and password with your values
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Solution 2:

The following worked for me with postgres on localhost:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.

For the pyspark shell, use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit, use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:

df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename',
            user='username', password='password').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext yourself; the pyspark shell creates it for you.
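
A minimal sketch of that, assuming the Spark 1.x-style SQLContext API used above (the app name is just an example):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Under spark-submit the contexts are not created for you,
# unlike in the pyspark shell, so build them explicitly.
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)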

Solution 3:

It is necessary to copy postgresql-42.1.4.jar to all nodes... in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"), and it works fine in the pyspark console and Jupyter.
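
As a rough sketch of those two steps (assuming the same install path and driver version; the copy has to happen on every node):

$ cp postgresql-42.1.4.jar /opt/spark-2.2.0-bin-hadoop2.7/jars/
$ echo 'export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"' >> ~/.bashrc
$ source ~/.bashrc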

Solution 4:

You normally need one of the following:

  1. to install the Postgres driver on your cluster,
  2. to provide the Postgres driver jar from your client with the --jars option, or
  3. to provide the Maven coordinates of the Postgres driver with the --packages option (see the sketch below).
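
For example, a sketch of options 2 and 3 (placeholder path and driver version; adjust them to your setup):

$ spark-submit --jars /path/to/postgresql-42.2.5.jar script.py
$ spark-submit --packages org.postgresql:postgresql:42.2.5 script.py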

If you tell us how you are launching pyspark, we can give you more details.

Some clues/ideas:

spark-cannot-find-the-postgres-jdbc-driver

Not able to connect to postgres using jdbc in pyspark shell

Solution 5:

One approach, building on the example from the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
 --packages org.postgresql:postgresql:9.4.1211 \
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
 --master local[4] main.py

And in main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
    database='my_db',
    dbtable='my_table'
).load()

dataframe.show()