Use pdfplumber with multi-threading

I am trying to extract tables with pdfplumber, one page per thread.

Code:

from threading import Thread

import pdfplumber


def print_tables(p, ts):
    tables = p.extract_tables(table_settings=ts)

    for table in tables:
        for row in table:
            print(row)


pdf = pdfplumber.open("/path/to/file.pdf")

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

threads = []

for page in pdf.pages:
    thread = Thread(target=print_tables, args=(page, ts))
    threads.append(thread)  # keep a reference so the join loop below has something to wait on
    thread.start()

for thread in threads:
    thread.join()

However, it fails with the error TypeError: unsupported operand type(s) for *: 'int' and 'PSLiteral'. The same extraction works without threads.

Top GitHub Comments

AgoloASehaily commented, Jun 27, 2022

If you access page.chars on each page before the threading block, it should work fine.
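A minimal sketch of that suggestion follows. It assumes that reading page.chars makes pdfplumber parse the page and keep the result, so the threads only work on already-parsed objects. The names extract_tables_threaded and main are hypothetical helpers, not part of pdfplumber, and this does not guarantee that the rest of the extraction is thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_tables_threaded(pages, table_settings, max_workers=4):
    """Hypothetical helper: parse pages on one thread, then extract tables in a pool."""
    pages = list(pages)

    # Touch page.chars sequentially so that the parsing step (assumed to be
    # the part that is unsafe across threads) never runs concurrently.
    for page in pages:
        _ = page.chars

    def extract(page):
        return page.extract_tables(table_settings=table_settings)

    # pool.map returns results in the same order as the input pages.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, pages))


def main(path):
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    with pdfplumber.open(path) as pdf:
        for tables in extract_tables_threaded(pdf.pages, ts):
            for table in tables:
                for row in table:
                    print(row)
```

Calling main("/path/to/file.pdf") would print the rows page by page. Because the results come back in page order, the output is not interleaved the way it can be when each thread prints on its own.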

jsvine commented, Feb 13, 2021

I spent some time today looking into this, and have a partial answer. Commit https://github.com/jsvine/pdfplumber/commit/a019517346ff84b914fc20c399ba15767957f3f9 makes some small changes (assigning the device and interpreter at the page level, rather than the PDF level) that eliminate the initial (and most common) exception @samkit-jain was seeing, based on the code he shared. However, it does not fully solve the problem, for at least two reasons:

First, the results of the table extraction are sometimes (unpredictably/randomly) incorrect, in a very specific way related to the positioning of characters. But that’s just for the PDF and code @samkit-jain shared; there may be (and very likely are) other ways in which the results would be incorrect for other PDFs and code.

Second, the code still occasionally throws an exception; here’s the most recent one I noticed:

  File "[...]/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "[...]/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "[...]/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "[...]/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "[...]/pdfminer/pdffont.py", line 618, in __init__
    widths = list_value(spec.get('Widths', [0]*256))
  File "[...]/pdfminer/pdftypes.py", line 161, in list_value
    x = resolve1(x)
  File "[...]/pdfminer/pdftypes.py", line 82, in resolve1
    x = x.resolve(default=default)
  File "[...]/pdfminer/pdftypes.py", line 70, in resolve
    return self.doc.getobj(self.objid)
  File "[...]/pdfminer/pdfdocument.py", line 683, in getobj
    obj = self._getobj_parse(index, objid)
  File "[...]/pdfminer/pdfdocument.py", line 646, in _getobj_parse
    (_, kwd) = self._parser.nexttoken()
  File "[...]/pdfminer/psparser.py", line 495, in nexttoken
    token = self._tokens.pop(0)
IndexError: pop from empty list

Based on those two observations, and after examining the pdfminer.six code more closely, my hunch is that the main obstacle to getting multithreading to work may be the document-wide ._tokens stack in PSBaseParser/PDFParser. It seems to get reset when a page starts parsing, so two pages being parsed simultaneously would likely interfere with each other.
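To illustrate that hunch, here is a toy model of a parser with a single shared token buffer. ToyParser and interleaved_parse are made up for this example and are not pdfminer.six code. The interleaving is written out by hand instead of using real threads, so the outcome is deterministic.

```python
class ToyParser:
    """Toy stand-in for a parser with one shared token buffer (not pdfminer code)."""

    def __init__(self):
        self._tokens = []

    def start(self, source):
        # Starting a new parse replaces the shared buffer.
        self._tokens = list(source)

    def nexttoken(self):
        return self._tokens.pop(0)


def interleaved_parse():
    """Simulate two page parses taking turns on the same parser object."""
    parser = ToyParser()

    parser.start(["a1", "a2"])   # "page A" begins parsing
    first = parser.nexttoken()   # A reads its own first token

    parser.start(["b1"])         # "page B" begins and wipes A's remaining token
    second = parser.nexttoken()  # A resumes and gets B's token instead of "a2"

    try:
        parser.nexttoken()       # B resumes and finds the buffer empty
    except IndexError as exc:
        return first, second, str(exc)


print(interleaved_parse())
```

In this model the first parse silently receives a token that belongs to the other one, and the second parse then fails on an empty buffer. Those are the same two kinds of symptom described above: occasionally wrong results and an occasional IndexError.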

Given how deeply entwined that logic is within pdfminer.six, I don’t think there’s much we can do in pdfplumber to fully resolve this. If you want to get simultaneous-page-parsing working, I’d suggest opening an issue in pdfminer.six using an example from that library.