[BUG-REPORT] - ArrowInvalid: offset overflow while concatenating arrays
Description
When operating on large dataframes with long string columns (~500 characters), slicing the dataframe results in the error
ArrowInvalid: offset overflow while concatenating arrays
This doesn't happen with small datasets, and it doesn't happen with short strings either. It is specifically a problem with many long strings.
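A rough back-of-the-envelope check of why row count and string length matter together. This is a sketch under an assumption: that the column ends up as Arrow's default string type, which stores offsets as signed 32-bit integers, so one contiguous array can address at most 2**31 - 1 bytes of text. The helper name and the 450-byte average are illustrative, not taken from Vaex.

```python
# Arrow's default string type uses signed 32-bit offsets, so a single
# contiguous string array can hold at most 2**31 - 1 bytes of character data.
INT32_MAX_OFFSET = 2**31 - 1


def fits_in_int32_offsets(num_rows: int, avg_bytes_per_string: int) -> bool:
    """Return True if the total string payload fits under the int32 offset limit."""
    return num_rows * avg_bytes_per_string <= INT32_MAX_OFFSET


# Hypothetical numbers resembling the report: 10 million rows of ~450-byte strings.
num_rows = 10_000_000
avg_len = 450
total_bytes = num_rows * avg_len

print(f"total payload: {total_bytes:,} bytes, limit: {INT32_MAX_OFFSET:,} bytes")
print("fits:", fits_in_int32_offsets(num_rows, avg_len))
```

With short strings (say 20 bytes each) the same 10 million rows stay well under the limit, which would be consistent with the error only showing up for many long strings.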
Example
import vaex
from vaex.dataframe import DataFrame
from random import random
import numpy as np

# One long string (str(random()) repeated 25 times), reused for every row
x = str(random())*25

def create_test_df(
    num_samples: int = 10000000, num_classes: int = 20
):
    id_column = np.arange(num_samples)
    val1 = np.random.randint(0, 20, size=num_samples)
    val2 = np.random.randint(0, 20, size=num_samples)
    text_data = [x for _ in range(num_samples)]
    score = np.random.uniform(0, 1.0, size=num_samples)
    matrix = {
        'id': id_column,
        'val1': val1,
        'val2': val2,
        'score': score,
        'text': text_data
    }
    return vaex.from_arrays(**matrix)

d2 = create_test_df(num_samples=10000000)
# Sorting and then slicing triggers the error
d2.sort(by='score')[0:500].to_records()
In the trace, I see this:
~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/vaex/column.py in __getitem__(self, slice)
    283 take_indices[mask] = 0
    284 if isinstance(ar_unfiltered, supported_arrow_array_types):
--> 285     ar = ar_unfiltered.take(vaex.array_types.to_arrow(take_indices))
    286 else:
    287     ar = ar_unfiltered[take_indices]
This led me to investigate, and I found this and this. I think you need to switch to using .slice instead of .take
Do you have any ideas for a workaround I can use for now?
Software information
- Vaex version (import vaex; vaex.__version__):
{'vaex': '4.5.0',
'vaex-core': '4.5.1',
'vaex-viz': '0.5.0',
'vaex-hdf5': '0.10.0',
'vaex-server': '0.6.1',
'vaex-astro': '0.9.0',
'vaex-jupyter': '0.6.0',
'vaex-ml': '0.14.0'}
- Vaex was installed via: pip
- OS: macOS Big Sur
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:10 (5 by maintainers)
I just tried on a 600M-row dataframe, and it works just fine, so I am afraid some kind of reproducible example that we can run locally is a must.
Also, it may be best to open a new issue if you can reproduce it, so we can track it better.
@JovanVeljanoski I sent you an email with a reproducible example using my real data (too big to attach here). I discovered a workaround during troubleshooting, which led me to what I believe is the source of the bug, in either Arrow or Vaex.