ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648

See original GitHub issue

Hi, I’m trying to load a dataset from a DataFrame, but I get this error:

---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
<ipython-input-7-146b6b495963> in <module>
----> 1 dataset = Dataset.from_pandas(emb)

~/miniconda3/envs/dev/lib/python3.7/site-packages/nlp/arrow_dataset.py in from_pandas(cls, df, features, info, split)
    223         info.features = features
    224         pa_table: pa.Table = pa.Table.from_pandas(
--> 225             df=df, schema=pa.schema(features.type) if features is not None else None
    226         )
    227         return cls(pa_table, info=info, split=split)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    591         for i, maybe_fut in enumerate(arrays):
    592             if isinstance(maybe_fut, futures.Future):
--> 593                 arrays[i] = maybe_fut.result()
    594 
    595     types = [x.type for x in arrays]

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    426                 raise CancelledError()
    427             elif self._state == FINISHED:
--> 428                 return self.__get_result()
    429 
    430             self._condition.wait(timeout)

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    557 
    558         try:
--> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    560         except (pa.ArrowInvalid,
    561                 pa.ArrowNotImplementedError,

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648

My code is:

from nlp import Dataset
dataset = Dataset.from_pandas(emb)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
lhoestq commented, Sep 25, 2020

It looks like it’s going to be fixed in pyarrow 2.0.0 😃

In the meantime, I suggest chunking big dataframes to create several small datasets, and then concatenating them using concatenate_datasets.
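
A minimal sketch of that workaround, assuming the dataframe from the original post is called emb and using a hypothetical chunk size (any size that keeps each chunk comfortably under the limit should work):

from nlp import Dataset, concatenate_datasets

chunk_size = 100_000  # hypothetical value, tune to your data
parts = [
    # reset_index avoids carrying the original index into each chunk
    Dataset.from_pandas(emb.iloc[i : i + chunk_size].reset_index(drop=True))
    for i in range(0, len(emb), chunk_size)
]
dataset = concatenate_datasets(parts)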

0 reactions
lhoestq commented, Sep 11, 2020

It looks like a pyarrow limitation. I was able to reproduce the error with:

import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
pa.Table.from_pandas(df)

I also tried with 50% of the dataframe, and it actually works. I created an issue on Apache Arrow’s JIRA here.

One way to fix that would be to chunk the dataframe and concatenate arrow tables.
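
A minimal sketch of that approach with pyarrow alone, reusing the reproduction above (chunk_size is a hypothetical value; anything that keeps each chunk’s list column under the 2147483646 child-element limit should do):

import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})

chunk_size = 500_000  # hypothetical value
tables = [
    pa.Table.from_pandas(df.iloc[i : i + chunk_size], preserve_index=False)
    for i in range(0, len(df), chunk_size)
]
# concat_tables keeps the columns as chunked arrays, so no single
# array has to hold all child elements at once
table = pa.concat_tables(tables)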


