tabula extract table from pdf remove line break
You need to add the lattice=True parameter. Replace
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1)
table[0]
with
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1, lattice = True)
table[0]
This is according to the tabula-py documentation.
Here is an example:
See the article "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"
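import tabula
import pandas as pd
# Local copy of the article linked above
file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
# lattice=True tells tabula to use the ruling lines of the table to delimit cells
table = tabula.read_pdf(file1, pages=3, lattice=True)
# read_pdf returns a list of DataFrames; take the first table found on the page
df = table[0]
# Drop the columns that are not needed for this demonstration
df = df.drop(['Unnamed: 1', 'Unnamed: 2', 'Description', 'Unnamed: 3'], axis=1)
df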
returns:
Unnamed: 0 \
0 NaN
1 Spectrum effect
2 Context bias
3 Selection bias
4 NaN
5 Variation in test execution
6 Variation in test technology
7 Treatment paradox
8 Disease progression bias
9 NaN
10 Inappropriate reference\rstandard
11 Differential verification bias
12 Partial verification bias
13 NaN
14 Review bias
15 Clinical review bias
16 Incorporation bias
17 Observer variability
18 NaN
19 Handling of indeterminate\rresults
20 Arbitrary choice of threshold\rvalue
Source of Systematic Bias
0 Population
1 Tests may perform differently in various sampl...
2 Prevalence of the target condition varies acco...
3 The selection process determines the compositi...
4 Test Protocol: Materials and Methods
5 A sufficient description of the execution of i...
6 When the characteristics of a medical test cha...
7 Occurs when treatment is started on the basis ...
8 Occurs when the index test is performed an unu...
9 Reference Standard and Verification Procedure
10 Errors of imperfect reference standard bias th...
11 Part of the index test results is verified by ...
12 Only a selected sample of patients who underwe...
13 Interpretation
14 Interpretation of the index test or reference ...
15 Availability of clinical data such as age, sex...
16 The result of the index test is used to establ...
17 The reproducibility of test results is one det...
18 Analysis
19 A medical test can produce an uninterpretable ...
20 The selection of the threshold value for the i...
The three dots in the Source of Systematic Bias column show that the display is only truncated: everything that was in that cell, line breaks included, is treated as a single cell (item), not as multiple cells. Further evidence of that is
df.iloc[2,1]
returns the cell content:
'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
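If the goal is also to get rid of the \r characters that remain inside each cell, one possible approach is to replace them with a space using pandas after extraction. This is only a minimal sketch: the small DataFrame below is a stand-in built from a few values copied from the output above (it does not call tabula), and the helper name strip_line_breaks is hypothetical.

```python
import pandas as pd

# Stand-in for the DataFrame returned by tabula (values copied from the output above).
df = pd.DataFrame({
    "Unnamed: 0": [
        None,
        "Spectrum effect",
        "Inappropriate reference\rstandard",
        "Handling of indeterminate\rresults",
    ]
})


def strip_line_breaks(frame):
    """Return a copy of frame where runs of \\r or \\n inside string cells become one space.

    Hypothetical helper for illustration; missing values are left untouched.
    """
    return frame.replace(r"[\r\n]+", " ", regex=True)


cleaned = strip_line_breaks(df)
print(cleaned)
```

With the real data, the same call would be applied to the df obtained from table[0].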
There must be something different about your PDF. If it's available online, share the link and I'll take a look.