Use pdfplumber with multi-threading
I am trying to extract tables with pdfplumber, page by page, using multithreading.
Code:
import pdfplumber
from threading import Thread

def print_tables(p, ts):
    # Extract every table on the page and print it row by row.
    tables = p.extract_tables(table_settings=ts)
    for table in tables:
        for row in table:
            print(row)

pdf = pdfplumber.open("/path/to/file.pdf")
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}
threads = list()
for page in pdf.pages:
    thread = Thread(target=print_tables, args=(page, ts))
    threads.append(thread)  # keep a reference so the thread can be joined below
    thread.start()
for thread in threads:
    thread.join()
But it fails with the error TypeError: unsupported operand type(s) for *: 'int' and 'PSLiteral'. It works without threads, though.
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:9 (6 by maintainers)
If you run page.chars before the threading block, it should work fine.
I spent some time today looking into this, and have a partial answer. Commit https://github.com/jsvine/pdfplumber/commit/a019517346ff84b914fc20c399ba15767957f3f9 makes some small changes (assigning the device and interpreter at the page level, rather than the PDF level) that eliminate the initial (and most common) exception @samkit-jain was seeing, based on the code he shared. However, it does not fully solve the problem, for at least two reasons:
First, the results of the table extraction are sometimes (unpredictably/randomly) incorrect, in a very specific way related to the positioning of characters. But that’s just for the PDF and code @samkit-jain shared; there may be (and very likely are) other ways in which the results would be incorrect for other PDFs and code.
Second, occasionally the code does throw an exception; here’s the most recent I noticed:
Based on those two things, and after examining the pdfminer.six code more closely, my hunch is that the main obstacle with getting multithreading to work may be PSBaseParser/PDFParser’s document-wide ._tokens stack. It seems to get reset when starting to parse a page, but if two pages are getting parsed simultaneously that would seem to cause problems.
Given how deeply entwined that logic is within pdfminer.six, I don’t think there’s much we can do in pdfplumber to fully resolve this. If you want to get simultaneous-page-parsing working, I’d suggest opening an issue in pdfminer.six using an example from that library.
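The workaround from the first reply (touch page.chars before the threading block) could look roughly like the sketch below. It assumes that pdfplumber keeps a page's parsed objects on the page after the first access, so the pdfminer.six parsing, where the shared state lives, happens one page at a time in the main thread and the worker threads only read already-parsed data. The helper names extract_tables_threaded and print_tables_for_file are made up for this illustration and are not part of pdfplumber. This has not been checked against every PDF, so treat it as a starting point rather than a guaranteed fix.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_tables_threaded(pages, table_settings, max_workers=4):
    """Parse pages one at a time, then extract tables in worker threads.

    Hypothetical helper (not a pdfplumber API). `pages` is any iterable of
    objects with a `.chars` attribute and an `.extract_tables()` method,
    such as `pdf.pages`.
    """
    pages = list(pages)

    # Touch page.chars in the main thread so that parsing is done
    # sequentially, before any thread starts (assumption: the parsed
    # objects are then reused by later calls on the same page).
    for page in pages:
        _ = page.chars

    def extract(page):
        return page.extract_tables(table_settings=table_settings)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as `pages`.
        return list(pool.map(extract, pages))


def print_tables_for_file(path, table_settings):
    """Hypothetical usage wrapper: print every table row in a PDF file."""
    import pdfplumber  # imported here so the helper above works on any page-like object

    with pdfplumber.open(path) as pdf:
        for tables in extract_tables_threaded(pdf.pages, table_settings):
            for table in tables:
                for row in table:
                    print(row)
```

Collecting the results and printing them afterwards, instead of printing inside each thread, keeps the output in page order.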