How to create a multi-page PDF with pytesseract?
I'm trying to mark only a few words in a PDF, and I want to save the results as a new PDF using only pytesseract.
Here is the code:
images = convert_from_path(name, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i in images:
    img = cv.cvtColor(np.array(i), cv.COLOR_RGB2BGR)
    d = pytesseract.image_to_data(img, output_type=Output.DICT, lang='eng+equ', config="--psm 6")
    boxes = len(d['level'])
    for i in range(boxes):
        for e in functionEvent:  # functionEvent is a list of strings
            if e in d['text'][i]:
                (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
                cv.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    pdf = pytesseract.image_to_pdf_or_hocr(img, extension='pdf')
    with open('results.pdf', 'w+b') as f:
        f.write(pdf)
What I have tried:
with open('results.pdf', 'a+b') as f:
    f.write(pdf)
If you know how I can fix this, please let me know. I also don't mind if you recommend another module or share your opinion on how I should write the code.
Thanks in advance!
Try using PyPDF2 to join your PDFs together. First, run Tesseract OCR on each page image to get a single-page PDF (as bytes) and store the results in a list, like this:
pdf_pages = []  # one single-page PDF (as bytes) per image
for filename in tqdm(os.listdir(in_dir)):
    img = Image.open(os.path.join(in_dir, filename))
    pdf = pytesseract.image_to_pdf_or_hocr(img, lang='slk', extension='pdf')
    pdf_pages.append(pdf)
Then iterate through the stored pages, wrap each one's bytes in io.BytesIO, and add its page to a PdfFileWriter using PdfFileReader, like this (do not forget to import io):
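# PyPDF2 3.0 and later (and its successor pypdf) replace these names with
# PdfWriter, PdfReader, add_page() and reader.pages[0].
pdf_writer = PdfFileWriter()
for page in pdf_pages:
    pdf = PdfFileReader(io.BytesIO(page))
    pdf_writer.addPage(pdf.getPage(0))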
Finally, create the output file and write the data to it:
# out_dir is the path of the output PDF file
with open(out_dir, "w+b") as file:
    pdf_writer.write(file)
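As for why the 'a+b' attempt in the question does not produce a multi-page file: each call to image_to_pdf_or_hocr returns a complete PDF document with its own header and end-of-file marker, so appending only glues separate documents together instead of adding pages to one. Here is a rough illustration using only the standard library. The byte strings are hand-written placeholders (not real Tesseract output, and not valid PDFs), and count_pdf_headers is a hypothetical helper made up for this example.

```python
# Stand-ins for the bytes returned by two image_to_pdf_or_hocr calls.
# Simplified placeholders only: a real PDF also contains objects, a
# cross-reference table and a trailer between header and end marker.
page_1 = b"%PDF-1.5\n...page 1 objects...\n%%EOF\n"
page_2 = b"%PDF-1.5\n...page 2 objects...\n%%EOF\n"


def count_pdf_headers(data: bytes) -> int:
    """Count '%PDF-' header markers, i.e. how many documents are in the bytes."""
    return data.count(b"%PDF-")


# Writing each page to a file opened in 'a+b' mode is equivalent to this:
appended = page_1 + page_2

print(count_pdf_headers(page_1))    # 1
print(count_pdf_headers(appended))  # 2
```

The appended file still contains two independent documents, and how a viewer handles that is up to the viewer. Merging through a PDF library, as in the steps above, builds a single document that lists every page.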