I am trying to OCR a mix of file types (PDF, DOC, XLS, etc.) in batches of ~14,000 on a remote Windows Server 2012 machine (on which I do not have full administrative permissions). I am using Python 3 to run Tika 1.24, and writing the results (via pymongo) to MongoDB on the same machine. The problem ..
Error reading a PDF file using tika in Python
I’m trying to read a PDF file using Python. I’ve tried PyPDF2, but its output is not very accurate. I read here (How to extract text from a PDF file?) that tika would give better results. This is my code: from tika ..
Tika throwing AttributeError through Python on Windows
    pdf = parser.from_file("Sample.pdf")
    print(pdf['content'])
I am trying to use the Tika module for Python to read a PDF file (code above), but the following error is thrown: AttributeError: module 'os' has no attribute 'setsid'. From my understanding, setsid is only available on Unix, and I am on Windows. Does this ..
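For context on the error above: `os.setsid` exists only on POSIX systems, so any code that references it unconditionally fails on Windows with this AttributeError. The following is a minimal sketch of how a launcher can guard against that using only the standard library. It is not the tika package's own fix, and the helper name `detached_popen_kwargs` is hypothetical, introduced here for illustration only.

```python
import os
import subprocess
import sys


def detached_popen_kwargs():
    """Return Popen keyword arguments that start a child in its own session or group.

    Hypothetical helper for illustration; not part of the tika package.
    """
    if hasattr(os, "setsid"):
        # POSIX: start_new_session=True makes the child call setsid() itself,
        # so the parent never has to reference os.setsid directly.
        return {"start_new_session": True}
    # Windows: os.setsid does not exist; a new process group is the closest analogue.
    return {"creationflags": subprocess.CREATE_NEW_PROCESS_GROUP}


if __name__ == "__main__":
    # Launch a trivial child process using the platform-appropriate arguments.
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('child started')"],
        **detached_popen_kwargs(),
    )
    proc.wait()
```

The `hasattr` check keeps the Windows-only constant and the POSIX-only behaviour on separate branches, so neither platform evaluates a name it does not define.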