How can I convert a file of any format to text using Python 3.6?

I am trying to build a converter that can turn a file of any format into text, so that further processing becomes easier for me. I am using the Python textract library.
Here is the documentation: https://textract.readthedocs.io/en/stable/

I installed it with pip and tried to use it, but I got an error and could not work out how to resolve it.

>>> import textract
>>> text = textract.process('C:\Users\beta\Desktop\Projects Done With Specification.pdf', method='pdfminer')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

I also tried the same call without specifying a method.

>>> import textract
>>> text = textract.process('C:\Users\beta\Desktop\Projects Done With Specification.pdf')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Please let me know how I can get rid of this error. If there is another library that would be handier than textract, please suggest that as well; I would like to hear about it.

The \ character means different things in different contexts. In Windows pathnames, it is the directory separator. In Python strings, it introduces escape sequences. When specifying paths, you have to account for this.

Try any one of these:

text = textract.process('C:\\Users\\beta\\Desktop\\Projects Done With Specification.pdf', method='pdfminer')
text = textract.process(r'C:\Users\beta\Desktop\Projects Done With Specification.pdf', method='pdfminer')
text = textract.process('C:/Users/beta/Desktop/Projects Done With Specification.pdf', method='pdfminer')
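To check that the three spellings above name the same file, you can compare them with pathlib.PureWindowsPath, which applies Windows path rules on any operating system. This is only a sketch: the variable names are made up for illustration, and textract itself is not called.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path (variable names are illustrative).
doubled = 'C:\\Users\\beta\\Desktop\\Projects Done With Specification.pdf'
raw = r'C:\Users\beta\Desktop\Projects Done With Specification.pdf'
forward = 'C:/Users/beta/Desktop/Projects Done With Specification.pdf'

# The first two produce the identical string; only the source spelling differs.
same_string = (doubled == raw)

# The forward-slash version is a different string, but Windows path rules
# treat '/' and '\' as the same separator, so the paths compare equal.
same_path = (PureWindowsPath(forward) == PureWindowsPath(raw))

print(same_string, same_path)
```

Both comparisons should come out true, so whichever form you find most readable should work as the argument to textract.process.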

The problem is with the string

'C:\Users\beta\Desktop\Projects Done With Specification.pdf'

The \U starts a Unicode escape that must be followed by exactly eight hexadecimal digits, such as '\U00014321'. In your code, the \U is followed by the character 's', which is not a hex digit, so the escape is invalid.

You either need to double all backslashes, or prefix the string with r (to produce a raw string).
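The behaviour can be reproduced without crashing your session by compiling the bad literal from a string. This is a rough sketch using only built-ins; the shortened path 'C:\Users' and the variable names are just for illustration.

```python
# A complete \U escape: eight hex digits that name a single code point.
one_char = '\U00002603'
print(len(one_char))  # a single character

# Doubling the backslashes and using a raw string give the same result.
doubled = 'C:\\Users'
raw = r'C:\Users'
print(doubled == raw)

# The source text of the broken literal, held safely inside a raw string.
broken_source = r"'C:\Users'"

# Compiling it triggers the same SyntaxError that the interpreter reported.
error = None
try:
    compile(broken_source, '<example>', 'eval')
except SyntaxError as exc:
    error = exc

print(type(error).__name__, error)
```

The error is raised when the source is compiled, before textract is ever called, which is why changing the method argument made no difference.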
