ValueError: memoryview is too large when reading from a large memory-mapped file in Dask
What happened:
My Dask run crashes with ValueError: memoryview is too large when converting a large NumPy array (stored as a pickle and read with a memory map) to a Zarr file with Dask.
This only happens when:
- the memory-mapped file is large enough,
- the file is read with memory mapping (via joblib.load(file, mmap_mode='r')), and
- the code runs on a cluster (either local or remote).
See the full log of this happening on a local Dask cluster in dask_failure.log.
Could this be circumvented somehow from the Dask side?
What you expected to happen:
Reading a memory-mapped file should work on a cluster.
Minimal Complete Verifiable Example:
Reproducible with a LocalCluster:
import dask
import dask.array as da
import dask.distributed
import joblib
import numpy as np

cluster = dask.distributed.LocalCluster(
    n_workers=1,
    threads_per_worker=10,
    processes=False,
    memory_limit='55GB'
)
client = dask.distributed.Client(address=cluster)
display(client)

in_path = '/tmp/test.pkl'  # 7.5G
out_path = '/tmp/test.zarr'

joblib.dump(
    value=np.random.rand(100_000, 10_000),
    filename=in_path
)

def to_zarr(in_path, out_path):
    data = joblib.load(filename=in_path, mmap_mode='r')
    data_da = da.from_array(
        x=data,
        chunks='64 MiB',
        name=False,
    )
    da.to_zarr(
        arr=data_da,
        url=out_path,
        compute=True,
    )

dask.delayed(to_zarr)(in_path, out_path).compute()
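As a possible workaround (a sketch only, assuming the failure comes from serializing the whole memory map as one buffer), the file can be re-opened inside each task so that only one small block at a time ever crosses the serialization boundary. The helper names `_load_block` and `mmap_to_dask` are hypothetical, not part of any library:

```python
import dask
import dask.array as da
import joblib
import numpy as np

def _load_block(path, start, stop):
    # Hypothetical helper: re-open the memory map inside the task and
    # copy out a single row block, so only this block gets serialized.
    data = joblib.load(filename=path, mmap_mode='r')
    return np.array(data[start:stop])  # copy out of the memory map

def mmap_to_dask(path, shape, dtype, block_rows):
    # Hypothetical helper: build a dask array from per-block delayed
    # loads instead of wrapping the whole memory-mapped array at once.
    blocks = []
    for start in range(0, shape[0], block_rows):
        stop = min(start + block_rows, shape[0])
        block = dask.delayed(_load_block)(path, start, stop)
        blocks.append(da.from_delayed(
            block, shape=(stop - start,) + shape[1:], dtype=dtype))
    return da.concatenate(blocks, axis=0)
```

With this shape, the resulting dask array can be passed to da.to_zarr as before; each task's payload stays well under any per-buffer serialization limit.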
Anything else we need to know?:
The last lines of the log seem to be coming from msgpack:
msgpack/_packer.pyx in msgpack._cmsgpack.Packer._pack()
ValueError: memoryview is too large
Looking at a related issue (https://github.com/explosion/spaCy/issues/6875), this seems to be related to a hard-coded limit in msgpack, which results in the ValueError.
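For reference, a minimal sketch of the size check involved: a msgpack bin 32 header stores the payload length in 4 bytes, so no single bytes-like item can exceed 2**32 - 1 bytes. The constant and function names below are illustrative assumptions, not msgpack's actual identifiers:

```python
# msgpack cannot emit a single raw/bin item longer than 2**32 - 1 bytes,
# because the bin 32 header encodes the length in 4 bytes.
ITEM_LIMIT = 2**32 - 1  # assumed mirror of msgpack's hard-coded limit

def fits_in_msgpack(nbytes: int) -> bool:
    # A ~7.5 GB memoryview (as in the traceback above) fails this check,
    # while any buffer under 4 GiB passes.
    return nbytes <= ITEM_LIMIT
```

This is why chunking the data (so each serialized buffer stays under the limit) avoids the error, while handing over the whole memory map in one piece does not.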
Environment:
- Dask version: 2021.04.0
- Msgpack version: 1.0.2
- Zarr version: 2.7.0
- Xarray version: 0.17.0
- Joblib version: 1.0.1
- NumPy version: 1.20.2
- Python version: 3.9.0
- Operating System: Docker container with Debian GNU/Linux 10 (buster) running on AWS ubuntu AMI.
- Install method (conda, pip, source): conda 4.10.0
Thanks for the update with an example! I'm able to reproduce the msgpack error with the latest dask and distributed release (version 2021.04.0), but the issue seems to be resolved with the latest development version of dask and distributed. This issue was probably resolved by https://github.com/dask/dask/pull/7525, which fixed a related-looking issue (https://github.com/dask/distributed/issues/4652). Could you try using the development version of dask and distributed to confirm your issue is fixed?

Glad to hear it, thanks for following up here! You should be able to install directly from GitHub with