Using Tabula to pull tables out of PDFs
We have standard reports uploaded as PDFs on a daily basis. In the PDFs are some tables that we want to pull into datasets. I have tabula imported in Code Repositories, but I can't seem to get Code Repositories to bring in the PDF.
I receive this error:
"Error loading inputs"
"{"message":"Unable to resolve alias","alias":"/US Office/COO/pdf_tests/test.pdf","fallbackBranches":["master"]}"
And this is the very basic code I am using:
from transforms.api import transform_df, Input, Output
import tabula

@transform_df(
    Output("/US Office/COO/pdf_tests/datasets/pdf_read"),
    source_df=Input("/US Office/COO/pdf_tests/test.pdf"),
)
def compute(source_df):
    df = source_df
    df = tabula.read_pdf(df, pages='all')
    return df
I can easily make tabula output a CSV with my local Python install, just not in Foundry. Any help you can give would be great, as I am very new to Palantir Foundry and Code Repositories.
Thank You!
First things first, thank you to fmsf for getting me going in the right direction.
Next, the answer:
I was able to get it to work once I got rid of the .read() when building the result variable. Tabula wants to see a path (which is hard in this situation) or a Python file-like object, which is what fs.open() returns. (The original "Unable to resolve alias" error happened because Input() has to point to a dataset, not to a file path inside one, so the PDF first needs to be uploaded into a dataset.)
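For anyone hitting the same wall, here is the fix in miniature (just a sketch; f is the file handle returned by fs.open on the input dataset's filesystem):

with fs.open(latest_file, mode='rb') as f:
    # f.read() returns raw bytes, which tabula.read_pdf cannot use:
    # tables = tabula.read_pdf(f.read(), pages='all')  # fails
    # f itself is a file-like object, which tabula.read_pdf accepts:
    tables = tabula.read_pdf(f, pages='all')

Below is the full code that works and will hopefully help get others started.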
from transforms.api import transform, Input, Output
import tabula
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

@transform(  # note that this is @transform and not @transform_df
    out=Output("/US Office/COO/pdf_tests/outdata"),  # the output dataset path
    raw=Input("/US Office/COO/pdf_tests/datasets/dataset_pdf"),  # the raw input dataset that contains the PDF
)
def compute(ctx, raw, out):
    rows = []
    result = []  # define the results list
    fs = raw.filesystem()  # handle for accessing the input dataset's filesystem
    latest_file = 'test.pdf'  # the name of the PDF inside the dataset
    df_schema = StructType([  # define the Spark schema
        StructField('system', StringType(), True),
        StructField('status', StringType(), True),
        StructField('date', StringType(), True),
        StructField('user', StringType(), True),
        StructField('id', StringType(), True)
    ])
    with fs.open(latest_file, mode='rb') as f:  # open the PDF as f; mode 'rb' is essential, as it has to be opened as a raw binary file
        result = tabula.read_pdf(f, pages='all', multiple_tables=True)  # tabula returns a list of tables (one pandas DataFrame per table)
        rows.append(result[0])  # keep the first table from that list
    df = pd.DataFrame(rows[0])  # create a pandas DataFrame from the first table in the PDF
    df = ctx.spark_session.createDataFrame(df, schema=df_schema)  # convert the pandas df to a PySpark df using the defined schema
    out.write_dataframe(df)  # write the PySpark DataFrame to the output dataset
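One more note: @transform (rather than @transform_df) is what gives you access to the input dataset's filesystem, which is why it is used here. Also, since the reports land daily, you probably won't want to hardcode latest_file forever. Something along these lines should let you pick the newest PDF instead; fs.ls() with a glob argument and the modified field on its results are my reading of the Foundry FileSystem API, so double-check against the transforms documentation:

pdfs = list(fs.ls(glob='*.pdf'))                        # every PDF file in the dataset
latest_file = max(pdfs, key=lambda f: f.modified).path  # most recently modified one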