Creating a Spark data structure from a multiline record

PySpark has supported Hadoop input formats since version 1.1. You can set the textinputformat.record.delimiter option to use a custom record delimiter, as below:

from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)
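# The delimiter is consumed on every split, so each record except the
# first loses its leading 'id,'. Restore it, then split into lines.
records = (retrosheet
    .filter(itemgetter(1))   # keep (offset, text) pairs with non-empty text
    .values()                # drop the byte-offset keys
    .filter(lambda x: x)     # redundant after the filter above, but harmless
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))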
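
To see what the pipeline above does to each record without a cluster, here is a minimal pure-Python sketch of the same logic. It assumes that splitting on the delimiter behaves roughly like str.split, which only approximates how the Hadoop input format handles record boundaries. The helper name split_records and the sample data are hypothetical and only for illustration.

```python
DELIMITER = "\nid,"

def split_records(text, delimiter=DELIMITER):
    """Split raw text on the delimiter and restore the consumed 'id,' prefix."""
    records = []
    for chunk in text.split(delimiter):
        if not chunk:
            continue  # skip empty chunks, like the filter steps
        # Only the first chunk still carries its original 'id,' prefix.
        restored = chunk if chunk.startswith("id") else "id,{0}".format(chunk)
        records.append(restored.splitlines())
    return records

# Hypothetical sample in a retrosheet-like layout (not real data).
sample = (
    "id,GAME001\n"
    "version,2\n"
    "info,visteam,AAA\n"
    "id,GAME002\n"
    "version,2\n"
    "info,visteam,BBB\n"
)

records = split_records(sample)
for record in records:
    print(record)
```

Each element of the result is one multiline record as a list of lines, with the first line being the id row.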

Since Spark 2.4, you can also read the data into a DataFrame using the text reader with a custom line separator:

spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
