[BUG-REPORT] - ArrowInvalid: offset overflow while concatenating arrays

Description

When operating on large dataframes with long string columns (~500 characters), slicing the dataframe raises the error

ArrowInvalid: offset overflow while concatenating arrays

This doesn’t happen with small datasets, and it doesn’t happen with short strings. It occurs specifically when there are many long strings.
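A possible explanation, not confirmed in this thread: Arrow's default string type stores value offsets as signed 32-bit integers, so one contiguous string array can hold at most 2**31 - 1 bytes of character data. The sketch below only does the arithmetic for the sizes in this report; the helper names are hypothetical, and the 450-byte row length is an estimate for the repeated string used in the example.

```python
# Arrow's default `string` type uses signed 32-bit offsets, so a single
# contiguous array can address at most 2**31 - 1 bytes of string data.
INT32_MAX = 2**31 - 1


def total_string_bytes(num_rows: int, bytes_per_row: int) -> int:
    """Bytes of character data if every row holds `bytes_per_row` bytes."""
    return num_rows * bytes_per_row


def fits_in_int32_offsets(num_rows: int, bytes_per_row: int) -> bool:
    """True if the combined string data is addressable with 32-bit offsets."""
    return total_string_bytes(num_rows, bytes_per_row) <= INT32_MAX


# Assumed sizes: 10 million rows, ~450 bytes per row for the long strings
# versus ~10 bytes per row for short strings.
long_case = fits_in_int32_offsets(10_000_000, 450)
short_case = fits_in_int32_offsets(10_000_000, 10)
print(long_case, short_case)
```

Under these assumptions the long-string column needs about 4.5 GB of character data, which exceeds the 32-bit limit, while the short-string column stays well below it. That is consistent with the error appearing only for many long strings.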

Example

import vaex
import numpy as np
from random import random

# One long string (a few hundred characters), reused for every row
x = str(random())*25
def create_test_df(
    num_samples: int = 10000000, num_classes: int = 20
):

    id_column = np.arange(num_samples)
    val1 = np.random.randint(0, 20, size=num_samples)
    val2 = np.random.randint(0, 20, size=num_samples)
    text_data = [x for _ in range(num_samples)]

    score = np.random.uniform(0, 1.0, size=num_samples)

    matrix = {
        'id': id_column,
        'val1': val1,
        'val2': val2,
        'score': score,
        'text': text_data
    }
    return vaex.from_arrays(**matrix)

d2 = create_test_df(num_samples=10000000)
# Sorting and then slicing raises the ArrowInvalid error
d2.sort(by='score')[0:500].to_records()

In the trace, I see this

~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/vaex/column.py in __getitem__(self, slice)
    283             take_indices[mask] = 0
    284         if isinstance(ar_unfiltered, supported_arrow_array_types):
--> 285             ar = ar_unfiltered.take(vaex.array_types.to_arrow(take_indices))
    286         else:
    287             ar = ar_unfiltered[take_indices]

That led me to investigate, and I found this and this. I think you need to switch to using .slice instead of .take

Do you have any ideas for a workaround I can use for now?

Software information

  • Vaex version (import vaex; vaex.__version__):
{'vaex': '4.5.0',
 'vaex-core': '4.5.1',
 'vaex-viz': '0.5.0',
 'vaex-hdf5': '0.10.0',
 'vaex-server': '0.6.1',
 'vaex-astro': '0.9.0',
 'vaex-jupyter': '0.6.0',
 'vaex-ml': '0.14.0'}
  • Vaex was installed via: pip
  • OS: macOS Big Sur

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

JovanVeljanoski commented, Oct 12, 2022 (1 reaction)

I just tried on a 600M dataframe, and it works just fine, so I am afraid a reproducible example that we can run locally is a must.

Also, it may be best to open a new issue if you can reproduce it, so we can track it better.

hermidalc commented, Oct 13, 2022 (0 reactions)

@JovanVeljanoski I sent you an email with a reproducible example using my real data (too big to attach here). While troubleshooting, I discovered a workaround, which led me to what I believe is the source of the bug in Arrow or Vaex.
