How to convert a PDF file into a text file?
Is there an easy way to extract plain text from a PDF file?
On *nix systems I used to have a command, ps2ascii, that would do the job, but it seems that this command is not installed by default on my Mac.
What would be the easiest way to extract text from a PDF file or, alternatively, how can I get ps2ascii on my system?
Adobe Reader has a "Save as Text…" option under the File menu. Easiest way.
ps2ascii is part of Ghostscript, which can be installed on Mac OS X (it might even be installed already by default).
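Once Ghostscript is installed, the conversion can also be scripted. This is a minimal Python 3 sketch, assuming ps2ascii is on your PATH and is called with the input file followed by the output file. The function names ps2ascii_command and pdf_to_text are made up for illustration.

```python
import shutil
import subprocess


def ps2ascii_command(pdf_path, txt_path):
    """Build the argument list: ps2ascii <input file> <output file>."""
    return ["ps2ascii", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run ps2ascii if it is installed.

    Returns False when ps2ascii is not on the PATH (assumption: it is
    provided by a Ghostscript install), True when the command succeeds.
    """
    if shutil.which("ps2ascii") is None:
        return False
    # check=True raises CalledProcessError if ps2ascii exits with an error.
    subprocess.run(ps2ascii_command(pdf_path, txt_path), check=True)
    return True
```

Calling pdf_to_text("myPDF.pdf", "myPDF.txt") returns False rather than raising if ps2ascii is missing, which makes it easy to fall back to one of the other methods in this thread.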
If you don't mind using a GUI, you can select and copy text from a PDF opened in Preview.app.
The following Python script will output the text from a PDF document to a .txt file. (Note: there is no guarantee that the text will be in 'logical', human-readable order, because of the way data is held in the PDF format.)
The script will create text files for any PDF files supplied as arguments on the command line (e.g. pdf2txt.py myPDF.pdf), or you can use it in Automator's "Run Shell Script" action, setting the shell type to python and "Pass input" to "As arguments".
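#!/usr/bin/python
# coding: utf-8
# Extract the text of each PDF given on the command line into a .txt file.
# Requires the PyObjC bridge, which provides the Quartz and CoreFoundation modules.

import os, sys
from Quartz import PDFDocument
from CoreFoundation import (NSURL, NSString)

NSUTF8StringEncoding = 4

def pdf2txt():
    for filename in sys.argv[1:]:
        # Python 2 passes arguments as byte strings; Python 3 already gives text.
        inputfile = filename.decode('utf-8') if isinstance(filename, bytes) else filename
        shortName = os.path.splitext(filename)[0]
        # "myPDF.pdf" is written out as "myPDF text.txt" in the same folder.
        outputfile = shortName + " text.txt"
        pdfURL = NSURL.fileURLWithPath_(inputfile)
        pdfDoc = PDFDocument.alloc().initWithURL_(pdfURL)
        # pdfDoc is None if the file could not be opened as a PDF; skip it.
        if pdfDoc:
            pdfString = NSString.stringWithString_(pdfDoc.string())
            pdfString.writeToFile_atomically_encoding_error_(outputfile, True, NSUTF8StringEncoding, None)

if __name__ == "__main__":
    pdf2txt()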
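To see where the output ends up, here is the script's naming rule on its own. It is only a sketch in plain Python with no PyObjC needed, and text_output_path is a made-up helper name that is not part of the script.

```python
import os


def text_output_path(pdf_path):
    """Return the .txt path that the script's naming rule produces for a PDF path."""
    # splitext drops only the last extension, so "report.final.pdf"
    # keeps its ".final" part.
    short_name = os.path.splitext(pdf_path)[0]
    return short_name + " text.txt"


for path in ["myPDF.pdf", "/Users/me/Docs/report.final.pdf"]:
    print(path, "->", text_output_path(path))
```

Note that the output name contains a space, so quote it when you use it in a shell command.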