pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648


Describe the bug

Following the deduplication example in CodeParrot, I hit an Arrow array capacity error when deduplicating larger datasets.

Steps to reproduce the bug

from datasets import load_dataset

dataset_name = "the_pile"
ds = load_dataset(dataset_name, split="train")
# preprocess and num_workers are defined in the gists linked below
ds = ds.map(preprocess, num_proc=num_workers)
uniques = set(ds.unique("hash"))

Gists for a minimal reproducible example:
https://gist.github.com/conceptofmind/c5804428ea1bd89767815f9cd5f02d9a
https://gist.github.com/conceptofmind/feafb07e236f28d79c2d4b28ffbdb6e2

Expected results

Chunking and writing out a deduplicated dataset.

Actual results

return dataset._data.column(column).unique().to_pylist()
File "pyarrow/table.pxi", line 394, in pyarrow.lib.ChunkedArray.unique
File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 124, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

albertvillanova commented, Aug 22, 2022 (1 reaction)

Thanks @loubnabnl for pointing out the solution to this issue.

conceptofmind commented, Aug 20, 2022 (1 reaction)

Hi @loubnabnl,

Yes, the issue is solved in the discussion thread.

I will close this issue.

Thank you again for all of your help.

Enrico

