Tika OCR slowing down over time running on Windows Server

  apache-tika, python-3.x, windows

I am trying to OCR a mix of file types (PDF, DOC, XLS, etc.) in batches of ~14,000 on a remote Windows Server 2012 machine (over which I do not have full administrative permissions). I am using Python 3 to run Tika 1.24 and writing the results (via pymongo) to MongoDB on the same machine.

The problem is that with each batch of files I OCR, the process slows down significantly, to the point where I’m not sure I’ll ever be able to finish OCR’ing the files. I’ve made sure to shut down PowerShell (which is how I’m running the Python script) between batches, in case multiple instances of the Tika server were running in the background and slowing things down, but it doesn’t help. I looked at this post to see whether it was a memory leak problem, but since the default file batch setting is 100,000, it doesn’t seem like this would help me much. It does look like memory leaks might have been an issue in an earlier version of Tika (and it looks like it’s technically an upstream issue, not one in Tika itself).

I’m especially flummoxed because, running essentially the same script on my local Mac, I can complete many batches of files in quick succession.

What incredibly obvious thing am I missing that makes this script slow down over time on Windows Server 2012 but is not a problem on a Mac?
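
To put numbers on "slows down over time", per-file timings could be recorded and averaged in chunks. The following is only a minimal sketch and is not part of the original script: `timed` and `chunk_means` are hypothetical helper names, and the chunk size is an arbitrary choice.

```python
import time
from statistics import mean


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def chunk_means(durations, chunk_size):
    """Average the durations over consecutive chunks of chunk_size items."""
    return [
        mean(durations[i:i + chunk_size])
        for i in range(0, len(durations), chunk_size)
    ]
```

In the loop, the parse call could be wrapped as `parsed, seconds = timed(parser.from_file, str(path), ...)` and each `seconds` value appended to a list. If the chunk averages climb steadily within a batch, the slowdown is tied to the running process; if they are already high at the start of a fresh batch, something outside the script is more likely involved.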

Here is the script:

import sys
from pathlib import Path

import tika
from tika import parser
from pymongo import MongoClient

DIRECTORY = 'pathtofiles'

tika.initVM()
counter = 0
pathlist = Path(DIRECTORY).glob('**/*')

for path in pathlist:
    counter+=1
    try:
        ext = path.suffix
        if ext in ('', '.DS_Store', '.zip', '.pages', '.db', '.rtf', '.rar', '.swf', '.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp'):  # extensions to skip
            pass
        else:
            file_dict = {}
            # NOTE: `headers` is not defined anywhere in this snippet; it must be set before the loop.
            parsed = parser.from_file(str(path), requestOptions={'headers': headers, 'timeout': 300})
            file_dict['filename']=path.name
            file_dict['metadata']=parsed["metadata"]
            file_dict['file_contents']=parsed["content"]

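            try:
                client = MongoClient()
            except Exception as e:
                print(e)
                print("Could not connect to MongoDB")
                continue  # skip this file: there is no client to write with
            db = client.DBNAME
            collection = db.COLLNAME
            try:
                # replace_one supersedes the deprecated collection.update()
                collection.replace_one({"filename": path.name}, file_dict, upsert=True)
            except Exception as e:
                print(e)
                print("Could not insert to collection")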

    except Exception as e:
        print('')
        print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e).__name__, e)
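
The extension check in the script could also be pulled out into a small helper. This is a sketch under stated assumptions rather than the original logic: `should_skip` and `SKIP_SUFFIXES` are hypothetical names, and the comparison is made case-insensitive, whereas the script above compares suffixes case-sensitively.

```python
from pathlib import Path

# Same extensions as the tuple in the script, minus the duplicate '.jpg'.
# '.DS_Store' is omitted because Path('.DS_Store').suffix is '' and is
# therefore already covered by the empty-suffix entry.
SKIP_SUFFIXES = {
    '', '.zip', '.pages', '.db', '.rtf', '.rar', '.swf',
    '.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp',
}


def should_skip(path):
    """Return True if the file should not be sent to Tika."""
    # Assumption: matching is case-insensitive, so 'SCAN.JPG' is skipped too.
    return Path(path).suffix.lower() in SKIP_SUFFIXES
```

With this in place, the loop body would start with `if should_skip(path): continue`, which keeps the parsing and database code one indentation level shallower.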

Source: Windows Questions
