R: extract dates and numbers from PDF

I'm really struggling to extract the right information from several thousand PDF files from the NTSB (some dates and numbers, to be specific); these PDFs don't need to be OCRed, and each report is almost identical in length and layout.

I need to extract the date and time of the accident (first page) and some other information, such as the pilot's age or their flight experience. What I tried does the job for several files but doesn't work for every file, since the code I'm using is poorly written.

# an example with a single file
library(pdftools)
library(readr)

# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(),"/example.pdf")
download.file(file, destfile)

pdf <- pdf_text(destfile)
rows <- scan(textConnection(pdf),
             what = "character", sep = "\n")

# Extract the date of the accident based on the 'Date & Time' occurrence.
date <- rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, "  ")
date[[1]][9] #this method is not desirable since the date will not be always in that position

# Pilot age 
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = '  ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age)

The main issue I have is with extracting the date and time of the accident. Is it possible to extract that exact piece of information without relying on fixed list positions as I did here?


Solution 1:

I think the best approach to achieve what you want is to use regex. In this case I use the stringr library. The main idea with regex is to find the desired string pattern; in this case, the date 'July 29, 2014, 11:15'.

Keep in mind that you'll have to check the date format for each PDF file.

library(pdftools)
library(readr)
library(stringr)

# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)

pdf <- pdf_text(destfile)

## New code

# Regex pattern for a date like 'July 29, 2014, 11:15': match 'Time:' and
# capture everything up to an HH:MM time. Note that [T|t] would also match a
# literal '|', so [Tt] is used instead; the colon needs no escaping.
regex_pattern <- "[Tt]ime:(.*\\d{2}:\\d{2})"

# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)

# str_match_all returns a list with one matrix per input string: the full
# match is in column 1 and the capture group in column 2
raw_date <- grouped_matched[[1]][2] # First row, capture-group column
# Clean date
date <- trimws(raw_date)
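Once the raw string is cleaned, you can parse it into a proper date-time object. A minimal sketch with base R's `strptime`, assuming the 'July 29, 2014, 11:15' layout holds (the format string is an assumption; `%B` matches the full month name and is locale-dependent, so adjust it if other reports differ):

```r
# Parse the cleaned string, e.g. "July 29, 2014, 11:15"
# (format is an assumption based on the example above)
parsed <- strptime(date, format = "%B %d, %Y, %H:%M")

# From here you can pull out components or reformat, e.g.:
format(parsed, "%Y-%m-%d")  # ISO date
format(parsed, "%H:%M")     # time of day
```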


# The same extraction with the magrittr pipe (re-exported by dplyr)
library(dplyr)

date <- pdf[1] %>%
            str_match_all(regex_pattern) %>%
            .[[1]] %>% # First list element (the match matrix)
            .[2] %>%   # Capture group (column 2)
            trimws()   # Remove extra white space

You can make a function to extract the date, changing the regex pattern for different files.
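For example, a sketch of such a function (the name `extract_date` and the folder path are hypothetical; the default pattern is the one above, and you can pass a different one for files with another layout):

```r
library(pdftools)
library(stringr)

# Hypothetical helper: extract the accident date-time string from the
# first page of one NTSB report. Returns NA if the pattern doesn't match.
extract_date <- function(pdf_path,
                         pattern = "[Tt]ime:(.*\\d{2}:\\d{2})") {
  first_page <- pdf_text(pdf_path)[1]
  m <- str_match(first_page, pattern)   # one row: full match, capture group
  if (is.na(m[1, 2])) return(NA_character_)
  trimws(m[1, 2])
}

# Apply it to every downloaded report in a folder (path is an example):
# files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# dates <- vapply(files, extract_date, character(1))
```

Returning `NA` for non-matching files makes it easy to spot which reports need a different pattern afterwards.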

Regards