Is there an efficient way to copy text from a PDF without the line breaks?

I need to get thousands of text snippets from PDFs into a spreadsheet. They are short, seldom more than 2-3 lines, but each line break creates a new cell, and I have to repair that manually, which costs a lot of time.

Because I have so many of them, the "paste into Word and do a find-and-replace" workaround is just too time-consuming for me. Is there a way to make the line breaks disappear on copy? Maybe there is a viewer that offers a special copy mode for this, or has a plugin?

The documents are scientific articles. The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything. (If such a thing happens, I think I'll deal with it manually). The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example. The texts are in English, it is OK if the solution only works in ASCII/strips all non-alphanumeric ASCII of the copied text.

I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.

Solution 1:

I had a similar problem while I was working on a text to speech script a while ago. My script would try to break up the text input into chunks by looking for newlines. With PDF files this would result in a mess because of the way each line ends with a newline.

So what I did was compose a few sed and tr commands that treat only newlines preceded by a full stop as actual line breaks. It wasn't very pretty but it worked.

Using this snippet I wrote a small script for you that I hope will help:

#!/bin/bash

# title: copy_without_linebreaks
# author: Glutanimate (github.com/glutanimate)
# license: MIT license

# Parses currently selected text and removes 
# newlines that aren't preceded by a full stop

SelectedText="$(xsel)"

ModifiedText="$(echo "$SelectedText" | \
    sed 's/\.$/.|/g' | sed 's/^\s*$/|/g' | tr '\n' ' ' | tr '|' '\n')"

#   - first sed command: append a '|' delimiter after each end-of-line full
#     stop (the full stop itself is kept)
#   - second sed command: replace empty lines with the same delimiter (e.g.
#     to separate text headings from text)
#   - subsequent tr commands: turn existing newlines into spaces; replace the
#     delimiter with newlines
# This is less than elegant but it works.

echo "$ModifiedText" | xsel -bi

The script uses xsel to read the currently highlighted text and then modifies it with the sed and tr pipeline I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.
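To try the text clean-up without touching the clipboard, the same pipeline can be wrapped in a function that reads standard input. This is only a sketch: the name join_wrapped_lines is made up for illustration, GNU sed is assumed (for \s), and the text is assumed to contain no literal '|' characters, since those would be turned into newlines too.

```shell
#!/bin/bash
# Sketch: the sed/tr pipeline from the script above as a stdin filter.
# join_wrapped_lines is a hypothetical name; GNU sed is assumed for \s.
join_wrapped_lines() {
    sed 's/\.$/.|/' | sed 's/^\s*$/|/' | tr '\n' ' ' | tr '|' '\n'
}

# Three input lines; only the newlines after a full stop should survive.
printf '%s\n' 'This sentence is wrapped' 'over two lines.' 'Next one.' | join_wrapped_lines
```

Note that every line after the first starts with a space, because the newline that follows each delimiter is turned into a space as well.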

Here's how you can use the script in your scenario:

1. Make sure you have xsel installed (sudo apt-get install xsel on (K)Ubuntu).
2. Save the script as copy_without_linebreaks or something similar and make it executable.
3. Assign the script to a hotkey of your choice in your WM preferences.
4. Highlight some text and press the hotkey.
5. The clipboard should automatically be filled with the modified text.

Solution 2:

This has been bugging me for years, so I figured out a general (Windows) solution using AutoHotkey. AutoHotkey is a lightweight, free, open-source scripting tool for Windows that lets you create hotkeys for almost anything imaginable.

When Ctrl+C is pressed, the code only fires if the active window is a PDF reader; otherwise the selection is simply copied as usual. In a PDF reader, it copies the selection, removes line breaks and double spaces, and puts the result into the clipboard. If nothing is selected, the previous clipboard content is restored.

#IfWinActive ahk_class classFoxitReader
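^c::
    ; Save the current clipboard so it can be restored if nothing gets copied
    old := ClipboardAll
    clipboard := ""
    send ^c
    clipwait 0.1
    if clipboard = 
        clipboard := old
    else {
        ; Replace line breaks between lines of text with a single space
        tmp := RegExReplace(clipboard, "(\S.*?)\R(.*?\S)", "$1 $2")
        clipboard := tmp
        ; Replace double spaces with single spaces
        StringReplace clipboard, clipboard, % "  ", % " ", A
        clipwait 0.1
    }
    ; Release the temporary copies
    old := ""
    tmp := ""
return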

The only thing you need to do before using this code is to find out the window class name (ahk_class) of your reader. I use a single PDF reader for all cases (and I assume most people do that), FoxitReader, and its ahk_class is classFoxitReader. You can easily find the class for your own software with the WinGetClass command (e.g. AcrobatSDIWindow for Acrobat Reader).

If you prefer to read PDFs in your browser, this solution will not work as is. You could simply remove the #IfWinActive ahk_class classFoxitReader line so that the code always fires, but then everything you copy, in any program, will be stripped of line breaks and double spaces.
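For Linux, the same clean-up (drop all line breaks, collapse repeated spaces) can be sketched as a small shell filter. This is not a port of the AutoHotkey script, just a rough equivalent under stated assumptions: flatten_text is a hypothetical name, xsel is assumed to be installed for the clipboard part, and tabs are left untouched.

```shell
#!/bin/bash
# Sketch: flatten a selection into one line, e.g. for a single spreadsheet cell.
# flatten_text is a hypothetical helper name.
flatten_text() {
    # Turn CR and LF into spaces, squeeze runs of spaces, trim both ends.
    tr '\r\n' '  ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

# Possible clipboard use (assumes xsel is installed):
#   xsel -o | flatten_text | xsel -bi
printf 'first  line \nsecond line\n' | flatten_text
```

Unlike the AutoHotkey version, this removes every line break, so several paragraphs copied at once end up as one line.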

Solution 3:

Another thing that worked for me was saving the PDF file as HTML. Paragraphs in the HTML stay intact, ready for copy and paste. Other file formats work as well, such as txt or rtf. This should also work on Linux systems.

Solution 4:

Another approach, using Word macros, is shown here, but I haven't tried it. I have pasted the macros below for future reference; macro 2 is by the author of the source, "Deborah Savadra", and macro 1 is by her reader "Benjamin":

macro 1:

Sub pagebreaks()
'
' pagebreaks Macro
'
'
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = "¬ ¬"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "¬"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

macro 2:

Sub pagebreaks()
'
' pagebreaks Macro
'
'
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = "|"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "|"
        .Replacement.Text = "^p^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub
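
For anyone without Word, roughly the same result as macro 2 (single line breaks become spaces, blank lines between paragraphs are kept) can be had on the command line. The sketch below uses awk's paragraph mode instead of the placeholder trick from the macro; reflow_paragraphs is a hypothetical name, and it assumes that paragraphs are separated by truly empty lines.

```shell
#!/bin/bash
# Sketch: join the lines inside each paragraph, keep the paragraph breaks.
# reflow_paragraphs is a hypothetical helper name.
reflow_paragraphs() {
    # RS="" puts awk in paragraph mode: records are separated by blank lines
    # and newlines count as field separators. Assigning $1 to itself rebuilds
    # the record with single spaces between the fields.
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }'
}

printf 'First paragraph,\nwrapped.\n\nSecond\nparagraph.\n' | reflow_paragraphs
```

Each paragraph comes out as one line, followed by an empty line.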

Solution 5:

There is a Windows solution shown here. You have to download the file "PDF Copy-Paster.exe" and run it before copying and pasting. I tried it out and it works just fine, except that it removes all line breaks, so if you copy multiple paragraphs you end up with only one.

There is a related question on SU with a bit of explanation, which may be of interest to someone.