ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
Hi, I'm trying to load a dataset from a pandas DataFrame, but I get this error:
---------------------------------------------------------------------------
ArrowCapacityError Traceback (most recent call last)
<ipython-input-7-146b6b495963> in <module>
----> 1 dataset = Dataset.from_pandas(emb)
~/miniconda3/envs/dev/lib/python3.7/site-packages/nlp/arrow_dataset.py in from_pandas(cls, df, features, info, split)
223 info.features = features
224 pa_table: pa.Table = pa.Table.from_pandas(
--> 225 df=df, schema=pa.schema(features.type) if features is not None else None
226 )
227 return cls(pa_table, info=info, split=split)
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
591 for i, maybe_fut in enumerate(arrays):
592 if isinstance(maybe_fut, futures.Future):
--> 593 arrays[i] = maybe_fut.result()
594
595 types = [x.type for x in arrays]
~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
426 raise CancelledError()
427 elif self._state == FINISHED:
--> 428 return self.__get_result()
429
430 self._condition.wait(timeout)
~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/thread.py in run(self)
55
56 try:
---> 57 result = self.fn(*self.args, **self.kwargs)
58 except BaseException as exc:
59 self.future.set_exception(exc)
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
557
558 try:
--> 559 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
560 except (pa.ArrowInvalid,
561 pa.ArrowNotImplementedError,
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
My code is:
from nlp import Dataset
dataset = Dataset.from_pandas(emb)
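For context, the 2147483646 in the message is 2**31 - 2: Arrow's default list type stores 32-bit offsets into a single child array, so one list column can hold at most that many child elements in total (and 2147483648, the count in the error, is exactly 2**31). A quick way to see the two list flavors in pyarrow (illustration only, not the issue's data):

import pyarrow as pa

print(pa.list_(pa.float64()))       # list<item: double>: 32-bit offsets, capped near 2**31 child elements
print(pa.large_list(pa.float64()))  # large_list<item: double>: 64-bit offsets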
It looks like it’s going to be fixed in pyarrow 2.0.0 😃
In the meantime I suggest chunking big dataframes into several small datasets and then concatenating them with concatenate_datasets, as in the sketch below.
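A minimal sketch of that workaround, assuming emb is the DataFrame from the traceback and a hypothetical chunk size of 100,000 rows:

from nlp import Dataset, concatenate_datasets

# Hypothetical chunk size: pick it so each chunk's total child-element count
# stays well below the 2**31 - 2 cap.
chunk_size = 100_000
chunks = [
    emb.iloc[i : i + chunk_size].reset_index(drop=True)  # uniform fresh index per chunk
    for i in range(0, len(emb), chunk_size)
]
dataset = concatenate_datasets([Dataset.from_pandas(chunk) for chunk in chunks])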
It looks like a PyArrow limitation. I was able to reproduce the error with:
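(The exact snippet isn't preserved on this page; below is a hypothetical stand-in that triggers the same error, assuming an embeddings-style list column sized so its total child-element count passes 2**31 - 2.)

import numpy as np
import pandas as pd
import pyarrow as pa

# Hypothetical sizes: 2_800_000 rows x 768 floats = 2_150_400_000 child elements,
# just over the 2_147_483_646 cap. Beware: this allocates roughly 17 GB.
emb = pd.DataFrame({"embedding": list(np.zeros((2_800_000, 768)))})
pa.array(emb["embedding"], from_pandas=True)  # ArrowCapacityError on pyarrow < 2.0.0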
I also tried with 50% of the dataframe and it actually works. I created an issue on Apache Arrow's JIRA here.
One way to fix that would be to chunk the dataframe and concatenate the resulting Arrow tables, for example:
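A rough sketch of that approach, again assuming emb and a hypothetical chunk_size; pa.concat_tables combines the per-chunk tables into one table of chunked columns, so no single array has to hold all the child elements:

import pyarrow as pa

chunk_size = 100_000  # hypothetical
tables = [
    pa.Table.from_pandas(emb.iloc[i : i + chunk_size])
    for i in range(0, len(emb), chunk_size)
]
table = pa.concat_tables(tables)  # columns become ChunkedArrays; each chunk keeps its own 32-bit offsets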