Can I get the file names for synced text files in my pipeline in Foundry?

The response from ollie299792458 will work only if dataFrameReaderClass is com.palantir.foundry.spark.input.TextDataFrameReader.

Alternatively, you can get the file name when reading the dataset in Code Repositories or Workbooks using Spark's input_file_name function:

Creates a string column for the file name of the current Spark task.

If you're going straight into Code Repositories or Code Workbooks, then you can use the input_file_name() function (see proggeo's answer). This is likely simpler than the schema method below, but it won't work if you're going to do something else with the data.

Schema Method

If you open your dataset and go to Details -> Schema, you can edit the schema to add a file path column. For each row, this column holds the path of the file that the row comes from.

The key parts are the _filePath member of fieldSchemaList and "addFilePath": true under customMetadata. The first is a special column that TextDataFrameReader populates with the file path; the second tells the reader to populate that column. The other column in the full schema example (content) just contains everything in each file.

For more details, see the Metadata section under Foundry core backend in the platform documentation. This is also possible for CSVs and more structured data with different reader classes.

Full schema example

{
  "fieldSchemaList": [
    {
      "type": "STRING",
      "name": "content",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "STRING",
      "name": "_filePath",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    }
  ],
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "textParserParams": {
      "parser": "SINGLE_COLUMN_PARSER",
      "nullValues": null,
      "nullValuesPerColumn": null,
      "charsetName": "UTF-8",
      "addFilePath": true,
      "addByteOffset": false,
      "addImportedAt": false
    }
  }
}
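
If you'd rather not hand-edit the JSON, the two changes described above (adding the _filePath field and switching on addFilePath) can be sketched as a small Python helper. This only manipulates a schema you have copied out of Details -> Schema as a dict; it does not talk to Foundry, and the name with_file_path is made up for illustration. The field layout simply mirrors the example above.

```python
import copy
import json

FILE_PATH_COLUMN = "_filePath"


def with_file_path(schema: dict) -> dict:
    """Return a copy of a text-reader schema with the file path column enabled.

    Hypothetical helper: it edits the JSON locally and nothing else.
    """
    updated = copy.deepcopy(schema)  # leave the caller's schema untouched

    # 1. Add the special _filePath column unless it is already present.
    fields = updated.setdefault("fieldSchemaList", [])
    if not any(field.get("name") == FILE_PATH_COLUMN for field in fields):
        fields.append({
            "type": "STRING",
            "name": FILE_PATH_COLUMN,
            "nullable": None,
            "userDefinedTypeClass": None,
            "customMetadata": {},
            "arraySubtype": None,
            "precision": None,
            "scale": None,
            "mapKeyType": None,
            "mapValueType": None,
            "subSchemas": None,
        })

    # 2. Tell the reader to populate that column.
    if updated.get("customMetadata") is None:
        updated["customMetadata"] = {}
    metadata = updated["customMetadata"]
    if metadata.get("textParserParams") is None:
        metadata["textParserParams"] = {}
    metadata["textParserParams"]["addFilePath"] = True
    return updated


if __name__ == "__main__":
    original = {
        "fieldSchemaList": [{"type": "STRING", "name": "content"}],
        "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
        "customMetadata": {"textParserParams": {"parser": "SINGLE_COLUMN_PARSER"}},
    }
    # Paste the printed JSON back into the schema editor.
    print(json.dumps(with_file_path(original), indent=2))
```

The helper is idempotent, so running it on a schema that already has _filePath leaves the column list unchanged.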
