Use pdfplumber with multi-threading


I am trying to extract tables with pdfplumber, page by page, using multithreading.

Code:

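from threading import Thread

import pdfplumber


def print_tables(p, ts):
    # Extract every table on page `p` using the table settings `ts`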
    tables = p.extract_tables(table_settings=ts)

    for table in tables:
        for row in table:
            print(row)


pdf = pdfplumber.open("/path/to/file.pdf")

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

threads = list()

for page in pdf.pages:
    thread = Thread(target=print_tables, args=(page, ts))
    threads.append(thread)  # keep a reference so the join loop below has something to wait on
    thread.start()

for thread in threads:
    thread.join()

However, it fails with the error TypeError: unsupported operand type(s) for *: 'int' and 'PSLiteral'. The same code works without threads.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
AgoloASehaily commented, Jun 27, 2022

If you access page.chars before the threading block, it should work fine.
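A minimal sketch of that suggestion, assuming `pdf` is the object returned by pdfplumber.open(...). The function name extract_tables_after_warmup and the max_workers parameter are made up for illustration; only page.chars and page.extract_tables(table_settings=...) come from pdfplumber itself. The idea is to touch page.chars on every page in the main thread first, then hand the pages to worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_tables_after_warmup(pdf, table_settings, max_workers=4):
    """Return one list of tables per page, in page order.

    `pdf` is assumed to be an object returned by pdfplumber.open(...).
    """
    pages = list(pdf.pages)

    # Warm-up suggested in the comment above: access page.chars serially,
    # in the main thread, before any worker thread starts.
    for page in pages:
        page.chars

    def tables_for(page):
        return page.extract_tables(table_settings=table_settings)

    # pool.map preserves input order, so results line up with pdf.pages.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tables_for, pages))
```

With the question's variables this would be called as extract_tables_after_warmup(pdf, ts). Because the warm-up loop runs serially, any speed-up is probably limited to the table-finding step. The maintainer's comment in this thread also reports occasionally incorrect results under threading, so it seems prudent to compare the output against a single-threaded run.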

1 reaction
jsvine commented, Feb 13, 2021

I spent some time today looking into this, and have a partial answer. Commit https://github.com/jsvine/pdfplumber/commit/a019517346ff84b914fc20c399ba15767957f3f9 makes some small changes, assigning the device and interpreter at the page level rather than the PDF level, that eliminate the initial (and most common) exception @samkit-jain was seeing, based on the code he shared. However, it does not fully solve the problem, for at least two reasons:

First, the results of the table extraction are sometimes (unpredictably/randomly) incorrect, in a very specific way related to the positioning of characters. But that’s just for the PDF and code @samkit-jain shared; there may be (and very likely are) other ways in which the results would be incorrect for other PDFs and code.

Second, occasionally the code does throw an exception; here’s the most recent I noticed:

  File "[...]/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "[...]/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "[...]/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "[...]/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "[...]/pdfminer/pdffont.py", line 618, in __init__
    widths = list_value(spec.get('Widths', [0]*256))
  File "[...]/pdfminer/pdftypes.py", line 161, in list_value
    x = resolve1(x)
  File "[...]/pdfminer/pdftypes.py", line 82, in resolve1
    x = x.resolve(default=default)
  File "[...]/pdfminer/pdftypes.py", line 70, in resolve
    return self.doc.getobj(self.objid)
  File "[...]/pdfminer/pdfdocument.py", line 683, in getobj
    obj = self._getobj_parse(index, objid)
  File "[...]/pdfminer/pdfdocument.py", line 646, in _getobj_parse
    (_, kwd) = self._parser.nexttoken()
  File "[...]/pdfminer/psparser.py", line 495, in nexttoken
    token = self._tokens.pop(0)
IndexError: pop from empty list

Based on those two things, and after examining the pdfminer.six code more closely, my hunch is that the main obstacle to getting multithreading to work may be PSBaseParser/PDFParser’s document-wide ._tokens stack. It seems to get reset when starting to parse a page, but if two pages are being parsed simultaneously, that would seem to cause problems.
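To make that hunch concrete, here is a toy model of a token buffer shared by two page parses. It is not pdfminer.six code: ToyParser, start_page and interleaved_parse are invented names, and the interleaving is written out by hand rather than produced by real threads, so that the outcome is reproducible.

```python
class ToyParser:
    """Toy stand-in (NOT pdfminer code): one token buffer for the whole document."""

    def __init__(self):
        self._tokens = []

    def start_page(self, tokens):
        # Starting a page resets the shared buffer.
        self._tokens = list(tokens)

    def nexttoken(self):
        return self._tokens.pop(0)


def interleaved_parse():
    """Simulate two page parses taking turns on the same parser."""
    parser = ToyParser()
    parser.start_page(["a1", "a2"])  # "thread A" starts page A
    first = parser.nexttoken()       # A reads its first token
    parser.start_page(["b1"])        # "thread B" starts page B, discarding A's tokens
    parser.nexttoken()               # B reads its only token
    try:
        parser.nexttoken()           # A resumes and finds the buffer empty
    except IndexError as exc:
        return first, str(exc)
    return first, None
```

In this toy, the resumed parse of page A fails with an IndexError from popping an empty list, the same kind of failure as the last frame of the traceback above. Whether the real parser fails for exactly this reason remains a hunch, as the comment says.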

Given how deeply entwined that logic is within pdfminer.six, I don’t think there’s much we can do in pdfplumber to fully resolve this. If you want to get simultaneous-page-parsing working, I’d suggest opening an issue in pdfminer.six using an example from that library.
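One possible workaround, which is not proposed in this thread and is offered only as a sketch, is to avoid sharing a document between workers at all: each worker process opens its own copy of the file, so no parser state is shared. The names tables_for_page and extract_tables_in_processes, the page_count argument, and the open_pdf hook (there so the worker can be exercised without a real PDF) are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def tables_for_page(path, table_settings, page_index, open_pdf=None):
    """Open the PDF privately and return the tables of a single page."""
    if open_pdf is None:
        # Imported lazily; pdfplumber.open is the default opener.
        import pdfplumber
        open_pdf = pdfplumber.open
    with open_pdf(path) as pdf:
        return pdf.pages[page_index].extract_tables(table_settings=table_settings)


def extract_tables_in_processes(path, page_count, table_settings, max_workers=2):
    """Return one list of tables per page, using a separate process per task."""
    worker = partial(tables_for_page, path, table_settings)
    # Each task re-opens the file, so no document or parser object
    # is ever shared between processes.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, range(page_count)))
```

Re-opening the file for every page is wasteful for large documents, so batching several pages per task may be worth trying. On platforms that spawn worker processes, the call to extract_tables_in_processes should sit under an `if __name__ == "__main__":` guard.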
