R: extract dates and numbers from PDF
I'm struggling to extract the right information (some dates and numbers, to be specific) from several thousand NTSB PDF files. These PDFs don't need OCR, and the reports are almost identical in length and layout.
I need to extract the date and time of the accident (first page) and some other information, such as the pilot's age or flight experience. What I tried works for several files but not for all of them, since the code I am using is poorly written.
# an example with a single file
library(pdftools)
library(readr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(),"/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
rows <-scan(textConnection(pdf),
what="character", sep = "\n")
# Extract the date of the accident based on the 'Date & Time' occurrence.
date <-rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, " ")
date[[1]][9] #this method is not desirable since the date will not be always in that position
# Pilot age
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = ' ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age) #
My main problem is extracting the date and time of the accident. Is it possible to extract that exact piece of information without relying on a fixed position in a list, as I did here?
Solution 1:
I think the best approach is to use a regular expression (regex). Here I use the stringr library. The main idea is to describe the pattern of the string you want, in this case the date 'July 29, 2014, 11:15', and let the regex find it.
Keep in mind that you'll have to check the date format in each PDF file.
library(pdftools)
library(readr)
library(stringr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
## New code
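# Regex pattern for a date like 'July 29, 2014, 11:15' following 'Time:'
# [Tt] matches either case (inside a character class, | would be a literal pipe)
regex_pattern <- "[Tt]ime:(.*\\d{2}:\\d{2})"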
# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)
# This returns a list with one matrix per input string:
# column 1 is the full match, column 2 is the capture group
raw_date <- grouped_matched[[1]][, 2][1] # First entry of the capture group (NA if no match)
# Clean date
date <- trimws(raw_date)
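# Same steps with the pipe (%>%), here loaded via dplyr
library(dplyr)
date <- pdf[1] %>%
str_match_all(regex_pattern) %>%
.[[1]] %>% # First list element (matrix of matches)
.[, 2] %>% # Capture group column
.[1] %>% # First match (NA if none)
trimws() # Remove extra white spaces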
You can wrap this in a function that extracts the date, changing the regex pattern for different files.
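A minimal sketch of such a function, assuming the field sits on one line after its label. The helper name `extract_field`, the sample text, and the `Age:` pattern are illustrative assumptions rather than part of the code above, and the patterns may need adjusting for reports with a different layout.

```r
library(stringr)

# Hypothetical helper: returns the first capture group of `pattern`
# found in `text`, trimmed, or NA if the pattern does not match.
extract_field <- function(text, pattern) {
  m <- str_match(text, pattern) # column 1 = full match, column 2 = capture group
  trimws(m[1, 2])
}

# Assumed patterns; check them against the actual report text.
date_pattern <- "[Tt]ime:\\s*(.*?\\d{1,2}:\\d{2})"
age_pattern <- "Age:\\s*(\\d+)"

# Made-up text standing in for pdf_text(destfile)[1]
sample_page <- paste(
  "Location:        Anytown, TX",
  "Date & Time:     July 29, 2014, 11:15 Local",
  "Certificate:     Private          Age: 55",
  sep = "\n"
)

accident_date <- extract_field(sample_page, date_pattern)
pilot_age <- as.integer(extract_field(sample_page, age_pattern))

# For many files (hypothetical folder name), apply the helper to page 1 of each:
# files <- list.files("reports", pattern = "\\.pdf$", full.names = TRUE)
# dates <- vapply(files, function(f) extract_field(pdftools::pdf_text(f)[1], date_pattern),
#                 character(1))
```

Because the helper returns NA instead of failing when a pattern is absent, the files that need a different pattern can be found afterwards with `is.na()`.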
Regards