Is there an efficient way to copy text from a PDF without the line breaks?
I need to get thousands of snippets of text from PDFs into a spreadsheet. They are short, seldom more than 2-3 lines, but each line break creates a new cell, and I have to repair that manually, which costs a lot of time.
Because I have so many of them, the "paste into Word and do a find-and-replace" workaround is just too time-consuming for me. Is there a way to make the line breaks disappear on copy? Maybe there is a viewer which offers a special copy mode for this, or has a plugin?
The documents are scientific articles. The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything. (If such a thing happens, I think I'll deal with it manually.) The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example. The texts are in English, so it is OK if the solution only works with ASCII or strips all non-alphanumeric ASCII characters from the copied text.
I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.
Solution 1:
I had a similar problem while I was working on a text to speech script a while ago. My script would try to break up the text input into chunks by looking for newlines. With PDF files this would result in a mess because of the way each line ends with a newline.
So what I did was compose a few sed and tr commands that only treat newlines preceded by a full stop as actual line breaks. It wasn't very pretty but it worked.
Using this snippet I wrote a small script for you that I hope will help:
#!/bin/bash
# title: copy_without_linebreaks
# author: Glutanimate (github.com/glutanimate)
# license: MIT license
# Parses currently selected text and removes
# newlines that aren't preceded by a full stop
SelectedText="$(xsel)"
ModifiedText="$(echo "$SelectedText" | \
sed 's/\.$/.|/g' | sed 's/^\s*$/|/g' | tr '\n' ' ' | tr '|' '\n')"
# - first sed command: replace end-of-line full stops with '|' delimiter and keep original periods.
# - second sed command: replace empty lines with same delimiter (e.g.
# to separate text headings from text)
# - subsequent tr commands: remove existing newlines; replace delimiter with
# newlines
# This is less than elegant but it works.
echo "$ModifiedText" | xsel -bi
The script uses xsel to parse the currently highlighted text and then modifies it with the sed and tr command line I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.
Here's how you can use the script in your scenario:
- Make sure you have xsel installed (sudo apt-get install xsel on (K)Ubuntu)
- Save the script as copy_without_linebreaks or something similar and make it executable
- Assign the script to a hotkey of your choice in your WM preferences
- Highlight some text and press the hotkey
- The clipboard should automatically be filled with the modified text
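For anyone who would rather not chain sed and tr, the same idea can be sketched in Python. This is only an illustration and not a port of the script above: the function name join_wrapped_lines is made up for the example, and unlike the shell pipeline it drops blank lines instead of turning them into extra newlines.

```python
def join_wrapped_lines(text: str) -> str:
    """Join hard-wrapped lines; keep a break only after a line that
    ends in a full stop or where the input has a blank line."""
    paragraphs = []   # finished output lines
    current = []      # pieces of the line being assembled
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: close whatever has been collected so far
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if stripped.endswith("."):
            # A full stop at the end of a line counts as a real break
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "The quick brown\nfox jumps.\nNext sentence\ncontinues here."
    print(join_wrapped_lines(sample))
```

Getting the selection in and the result back out would still need something like xsel around it, as in the shell script.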
Solution 2:
This has been bugging me for years, so I figured out a general (Windows) solution using AutoHotkey. AutoHotkey is a lightweight, free, open-source scripting tool for Windows that lets you create hotkeys for almost anything imaginable.
When Ctrl+C is hit, the code only fires if the active window is a PDF reader; otherwise the selection is simply copied as usual. In a PDF reader, it copies the selection, removes line breaks and double spaces, and puts the result into the clipboard. If nothing is selected, the previous clipboard content is restored.
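#IfWinActive ahk_class classFoxitReader
^c::
    ; back up the current clipboard, then empty it so ClipWait can detect new data
    old := ClipboardAll
    clipboard := ""
    send ^c
    clipwait 0.1
    ; nothing was selected: restore the backup
    if clipboard =
        clipboard := old
    else {
        ; replace line breaks between non-whitespace text with a single space
        tmp := RegExReplace(clipboard, "(\S.*?)\R(.*?\S)", "$1 $2")
        clipboard := tmp
        ; collapse double spaces into single spaces
        StringReplace clipboard, clipboard, % "  ", % " ", A
        clipwait 0.1
    }
    ; release the temporary variables
    old := ""
    tmp := ""
    return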
The only thing you need to find out before applying this code is the window class name (ahk_class) of your reader. I use a single PDF reader for all cases (and I assume most people do that), FoxitReader, and its ahk_class is classFoxitReader. You can figure out the class for your own software easily with the WinGetClass command (e.g. AcrobatSDIWindow for Acrobat Reader).
If you prefer to read PDFs in your browser, this is not your solution. Alternatively, you could simply remove the #IfWinActive ahk_class classFoxitReader line so that the code always fires, but in that case everything you copy, in any program, will be stripped of line breaks and double spaces.
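To make it clearer what the hotkey does to the copied text, here is a rough Python sketch of the same clean-up. The function name flatten_selection is hypothetical, and the regular expressions are a simplification of the AutoHotkey ones rather than a faithful port.

```python
import re


def flatten_selection(text: str) -> str:
    """Turn a multi-line selection into one line, as needed for a single cell."""
    # Any run of line breaks (\r\n, \r or \n) becomes one space
    flat = re.sub(r"[\r\n]+", " ", text)
    # Runs of two or more spaces collapse into one
    flat = re.sub(r" {2,}", " ", flat)
    # Drop leading and trailing whitespace left over from the selection
    return flat.strip()


if __name__ == "__main__":
    print(flatten_selection("two  column\r\ntext with\nbreaks "))
```

Since the result never contains a line break, pasting it into a spreadsheet should fill exactly one cell.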
Solution 3:
Another thing that worked for me was saving the PDF file as HTML. Paragraphs in the HTML stay intact, ready for copy and paste. Other file formats, such as TXT or RTF, work as well. This should also work on Linux systems.
Solution 4:
Another approach, using Word macros, is shown here, but I haven't tried it. I have pasted the macros below for future reference; macro 2 is by the author of the source, "Deborah Savadra", and macro 1 is by her reader "Benjamin":
macro 1:
Sub pagebreaks()
'
' pagebreaks Macro
'
'
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
With Selection.Find
.Text = "^p^p"
.Replacement.Text = "¬ ¬"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.Text = "¬"
.Replacement.Text = " "
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
macro 2:
Sub pagebreaks()
'
' pagebreaks Macro
'
'
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
With Selection.Find
.Text = "^p^p"
.Replacement.Text = "|"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.Text = "^p"
.Replacement.Text = " "
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.Text = "|"
.Replacement.Text = "^p^p"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
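The three find-and-replace passes of macro 2 boil down to a placeholder trick: protect the double paragraph marks, turn the remaining single ones into spaces, then restore the protected ones. Outside Word this could be sketched in Python as follows, assuming the paragraph marks (^p) arrive as plain newline characters. The function name and the NUL placeholder are assumptions for the example; the macro itself uses | as its placeholder.

```python
def unwrap_keep_paragraphs(text: str, placeholder: str = "\x00") -> str:
    """Remove single line breaks but keep blank-line paragraph breaks."""
    # Normalise Windows line endings so the passes below only see "\n"
    text = text.replace("\r\n", "\n")
    # Pass 1: protect real paragraph breaks (two newlines in a row)
    text = text.replace("\n\n", placeholder)
    # Pass 2: every remaining newline is a wrapped line, so make it a space
    text = text.replace("\n", " ")
    # Pass 3: put the protected paragraph breaks back
    return text.replace(placeholder, "\n\n")


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew para\ncontinues"
    print(unwrap_keep_paragraphs(sample))
```

A placeholder that cannot occur in the copied text is the safer choice, because a literal | in the source would otherwise be turned into a paragraph break.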
Solution 5:
There is a Windows solution shown here. You have to download the file "PDF Copy-Paster.exe" and run it before copying and pasting. I tried it out and it works just fine, except that it removes all line breaks. So if you copy multiple paragraphs, you end up with only one.
There is a related question on SU with a little more explanation, which may be of interest to someone...