Reading images in Spark using PySpark
Solution 1:
Your data looks like the raw bytes of a real image file (JPG?). The problem with your data is that it should be bytes, not unicode. You have to figure out how to convert from unicode to bytes. There is a whole can of worms full of encoding traps to deal with, but you may get lucky with img.encode('iso-8859-1'). I don't know, and I won't deal with that in this answer.
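Whether that works depends on how the data was originally decoded, but the reason iso-8859-1 is a good first guess is that it maps code points 0-255 one-to-one onto bytes, so the round trip is lossless. A quick sketch (the data value here is hypothetical):

```python
# Hypothetical: image bytes that were mistakenly decoded into a unicode str.
img = '\x89PNG\r\n\x1a\n'

# iso-8859-1 maps each code point 0-255 to the same byte value,
# so encoding recovers the original bytes without loss.
raw = img.encode('iso-8859-1')
```

If the string contains code points above 255, it was not decoded with a single-byte codec and this trick will raise an error instead of silently corrupting data.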
The raw data for a PNG image looks like this:
rawdata = b'\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82'
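As a sanity check before decoding, you can verify the fixed 8-byte signature that every PNG file starts with (the helper name here is mine):

```python
# Every valid PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def looks_like_png(raw):
    # Compare the first 8 bytes of the buffer against the PNG signature.
    return raw[:8] == PNG_SIGNATURE

looks_like_png(b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR')  # True
```

JPEG data would fail this check (it starts with b'\xff\xd8\xff' instead), which makes it a cheap way to spot mixed image formats in a directory.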
Once you have it in bytes, you can create a PIL image from the raw data and read it into a NumPy array:
>>> from io import BytesIO  # on Python 2: from StringIO import StringIO
>>> from PIL import Image
>>> import numpy as np
>>> np.asarray(Image.open(BytesIO(rawdata)))
array([[[255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        ...,
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0]]], dtype=uint8)
All you need to make it work on Spark is SparkContext.binaryFiles:
>>> images = sc.binaryFiles("path/to/images/")
>>> image_to_array = lambda rawdata: np.asarray(Image.open(BytesIO(rawdata)))
>>> images.values().map(image_to_array)
Solution 2:
In Spark 2.3 or later you can use built-in Spark tools to load image data into a Spark DataFrame. In 2.3:
from pyspark.ml.image import ImageSchema
ImageSchema.readImages("path/to/images/")
In Spark 2.4 or later:
spark.read.format("image").load("path/to/images/")
This creates an object with the following schema:
root
|-- image: struct (nullable = true)
| |-- origin: string (nullable = true)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
| |-- nChannels: integer (nullable = false)
| |-- mode: integer (nullable = false)
| |-- data: binary (nullable = false)
where the image content is loaded into the image.data field.
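Once you have collected rows, the flat image.data buffer can be reshaped using the height, width and nChannels fields from the same struct. A minimal sketch (the function name is mine; note that Spark stores multi-channel pixel data in OpenCV-compatible BGR order):

```python
import numpy as np

def image_row_to_array(height, width, n_channels, data):
    # Reinterpret the flat binary buffer as a (height, width, nChannels)
    # uint8 array; for 3-channel images the channels are in BGR order.
    return np.frombuffer(data, dtype=np.uint8).reshape(height, width, n_channels)

# Hypothetical 2x2, 3-channel image encoded as 12 raw bytes.
arr = image_row_to_array(2, 2, 3, bytes(range(12)))
arr.shape  # (2, 2, 3)
```

In practice you would pull the four arguments out of a collected Row, e.g. `image_row_to_array(row.image.height, row.image.width, row.image.nChannels, row.image.data)`.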
At the moment this functionality is experimental and lacks the required ecosystem, but it should improve in the future.