Error when loading a HUGE json file (pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries)
Hi, thanks for the great library. I have used this brilliant library for a couple of small projects, and am now using it for a fairly big one. When loading a huge json file of 500GB, pyarrow complains as follows:
```
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 531, in incomplete_dir
    yield tmp_dir
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 573, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 650, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/datasets/builder.py", line 1027, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/user/.pyenv/versions/3.7.9/lib/python3.7/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/app/.cache/huggingface/modules/datasets_modules/datasets/json/9498524fd296a6cca99c66d6c5be507d1c0991f5a814e535b507f4a66096a641/json.py", line 83, in _generate_tables
    parse_options=self.config.pa_parse_options,
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
```
When using only a small portion of the file, say the first 100 lines, it works perfectly well…
I see that the error comes from pyarrow, but could you give me a hint or possible solutions? #369 describes the same error and #372 claims to have fixed it, but I have no clue why I am still getting this one. Thanks in advance!
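For reference, the call that triggers this is roughly the following (a minimal sketch; the file name is a placeholder for the actual 500GB file):

```python
from datasets import load_dataset

# "huge_file.json" stands in for the real ~500GB input.
dataset = load_dataset("json", data_files="huge_file.json")
```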
I made more tests. I used a smaller dataset and was getting the same error, which means the problem was not necessarily linked to dataset size. To make both my smaller and larger datasets work, I got rid of lists in the json file: I changed the data format so that each line holds exactly one JSON object, and that worked! I used a small script to reformat my json file.
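A minimal sketch of that kind of conversion, assuming the original format was a single top-level JSON array (file names are placeholders, and the in-memory approach is illustrative only; a streaming parser such as ijson would be needed at 500GB scale):

```python
import json

# Sketch: convert a top-level JSON array into JSON Lines, one object per
# line, so pyarrow's reader always finds a newline delimiter per block.
# json.load reads everything into memory; use a streaming parser for
# very large files.
with open("data.json", encoding="utf-8") as f_in:
    records = json.load(f_in)

with open("data.jsonl", "w", encoding="utf-8") as f_out:
    for record in records:
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")
```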
This works with `block_size_10MB = 10 << 20` or without specifying `block_size`.

We're using the JSON loader of pyarrow. It parses the file chunk by chunk to load the dataset. This issue happens when there's no delimiter in one chunk of data. For json lines, the delimiter is the end of line. So with a big value for `block_size` this should have worked, unless you have one extremely long line in your file.
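To make that concrete, here is a minimal sketch (assuming the packaged `json.py` builder shown in the traceback forwards `block_size` to `pyarrow.json.ReadOptions`; the file name and the tiny 32-byte block are illustrative only):

```python
import io

import pyarrow as pa
import pyarrow.json as paj
from datasets import load_dataset

# A single JSON object spanning several read blocks reproduces the error:
# the chunker never finds a newline delimiter inside a block.
one_long_line = ('{"text": "' + "x" * 200 + '"}\n').encode("utf-8")
try:
    paj.read_json(
        io.BytesIO(one_long_line),
        read_options=paj.ReadOptions(block_size=32),  # absurdly small, demo only
    )
except pa.ArrowInvalid as e:
    print(e)  # -> straddling object straddles two block boundaries ...

# Larger blocks avoid the problem; block_size is assumed to be forwarded
# to pyarrow.json.ReadOptions by the datasets version in this thread.
# "my_file.json" is a placeholder.
block_size_10MB = 10 << 20
dataset = load_dataset("json", data_files="my_file.json", block_size=block_size_10MB)
```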
Also, what version of pyarrow are you using?

Finally, I wonder if it could be an issue on pyarrow's side when using big json files. (I haven't tested big json files like yours.)