ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648

See original GitHub issue

Hi, I’m trying to load a dataset from a DataFrame, but I get this error:

---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
<ipython-input-7-146b6b495963> in <module>
----> 1 dataset = Dataset.from_pandas(emb)

~/miniconda3/envs/dev/lib/python3.7/site-packages/nlp/arrow_dataset.py in from_pandas(cls, df, features, info, split)
    223         info.features = features
    224         pa_table: pa.Table = pa.Table.from_pandas(
--> 225             df=df, schema=pa.schema(features.type) if features is not None else None
    226         )
    227         return cls(pa_table, info=info, split=split)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    591         for i, maybe_fut in enumerate(arrays):
    592             if isinstance(maybe_fut, futures.Future):
--> 593                 arrays[i] = maybe_fut.result()
    594 
    595     types = [x.type for x in arrays]

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    426                 raise CancelledError()
    427             elif self._state == FINISHED:
--> 428                 return self.__get_result()
    429 
    430             self._condition.wait(timeout)

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/miniconda3/envs/dev/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    557 
    558         try:
--> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    560         except (pa.ArrowInvalid,
    561                 pa.ArrowNotImplementedError,

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/miniconda3/envs/dev/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648

My code is:

from nlp import Dataset
dataset = Dataset.from_pandas(emb)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
lhoestq commented, Sep 25, 2020

It looks like it’s going to be fixed in pyarrow 2.0.0 😃

In the meantime, I suggest chunking big dataframes to create several small datasets, and then concatenating them using concatenate_datasets.
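
A minimal sketch of that workaround, assuming the dataframe from the original post is called emb and using a hypothetical chunk size (any size that keeps each chunk comfortably under the limit should work):

from nlp import Dataset, concatenate_datasets

chunk_size = 100_000  # hypothetical value, tune to your data
parts = [
    # reset_index avoids carrying the original index into each chunk
    Dataset.from_pandas(emb.iloc[i : i + chunk_size].reset_index(drop=True))
    for i in range(0, len(emb), chunk_size)
]
dataset = concatenate_datasets(parts)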

0 reactions
lhoestq commented, Sep 11, 2020

It looks like a pyarrow limitation. I was able to reproduce the error with:

import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
pa.Table.from_pandas(df)

I also tried with 50% of the dataframe, and it actually works. I created an issue on Apache Arrow’s JIRA here.

One way to fix that would be to chunk the dataframe and concatenate arrow tables.
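
A minimal sketch of that approach with pyarrow alone, reusing the reproduction above (chunk_size is a hypothetical value; anything that keeps each chunk’s list column under the 2147483646 child-element limit should do):

import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})

chunk_size = 500_000  # hypothetical value
tables = [
    pa.Table.from_pandas(df.iloc[i : i + chunk_size], preserve_index=False)
    for i in range(0, len(df), chunk_size)
]
# concat_tables keeps the columns as chunked arrays, so no single
# array has to hold all child elements at once
table = pa.concat_tables(tables)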


