Error when loading a HUGE json file (pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries)

See original GitHub issue

Hi, thanks for the great library. I have used this brilliant library for a couple of small projects, and I am now using it for a fairly big project. When loading a huge json file of 500GB, pyarrow complains as follows:

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 531, in incomplete_dir
    yield tmp_dir
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 573, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 650, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 1027, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/app/.cache/huggingface/modules/datasets_modules/datasets/json/9498524fd296a6cca99c66d6c5be507d1c0991f5a814e535b507f4a66096a641/json.py", line 83, in _generate_tables
    parse_options=self.config.pa_parse_options,
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

When using only a small portion of the sample file, say first 100 lines, it works perfectly well…

I see that it is the error from pyarrow, but could you give me a hint or possible solutions? #369 describes the same error and #372 claims to have fixed the issue, but I have no clue why I am still getting this one. Thanks in advance!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
jpilaul commented, Apr 8, 2021

I ran more tests. I used a smaller dataset and got the same error, which means it was not necessarily linked to dataset size. To make both my smaller and larger datasets work, I got rid of the lists in the json file. I had the following data format:

[
  {'key': "a", 'value': ['one', 'two', 'three']},
  {'key': "b", 'value': ['four', 'five', 'six']}
]

I changed it to one object per line:

  {'key': "a", 'value': 'one\ntwo\nthree'}
  {'key': "b", 'value': 'four\nfive\nsix'}

and that worked!

I used the following to reformat my json file:

import json

with open(file_name, "w", encoding="utf-8") as f:
    for item in list_:  # list_ holds the records from the original file
        f.write(json.dumps(item) + "\n")

This works with block_size_10MB = 10 << 20 (10 MiB) or without specifying block_size at all.
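The reformatting step above can be sketched end to end as follows. This is a minimal, self-contained version that loads a top-level JSON array and rewrites it as JSON Lines so every record ends with a newline delimiter; the file paths and sample records here are placeholders, not the original data:

```python
# Sketch: convert a file containing one big JSON array into JSON Lines,
# one record per line, so a chunked parser always finds a line delimiter.
import json
import os
import tempfile

# Stand-in for the original file holding a single JSON array.
records = [
    {"key": "a", "value": ["one", "two", "three"]},
    {"key": "b", "value": ["four", "five", "six"]},
]
src = os.path.join(tempfile.mkdtemp(), "data.json")
dst = src.replace(".json", ".jsonl")
with open(src, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Reformat: load the array, then write one JSON object per line.
with open(src, encoding="utf-8") as f:
    list_ = json.load(f)
with open(dst, "w", encoding="utf-8") as f:
    for item in list_:
        f.write(json.dumps(item) + "\n")

with open(dst, encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
print(n_lines)  # 2
```

For a 500GB source file this exact approach would not fit in memory; a streaming JSON parser would be needed for the load step, but the write loop stays the same.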

2 reactions
lhoestq commented, Apr 7, 2021

We’re using the JSON loader of pyarrow. It parses the file chunk by chunk to load the dataset. This error happens when there’s no delimiter in one chunk of data. For JSON Lines, the delimiter is the end of line, so with a big value for chunk_size this should have worked, unless you have one extremely long line in your file.

Also, what version of pyarrow are you using?

Finally, I wonder if it could be an issue on pyarrow’s side when using big json files. (I haven’t tested json files as big as yours.)
