ValueError: memoryview is too large; when reading from a large memory mapped file in Dask.

What happened:

My Dask run crashes with a ValueError: memoryview is too large when converting a large NumPy array (stored as a pickle and read with memory mapping) to a Zarr file with Dask.

This only happens when:

The memory-mapped file is large enough.
The file is read with memory mapping (via joblib.load(file, mmap_mode='r')).
The computation runs on a cluster (either local or remote).

See full log of this happening on a local Dask cluster at dask_failure.log

I was wondering if this could be circumvented somehow from the Dask side?
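One possible way around this, sketched here under assumptions and not verified against this exact failure, is to open the memory map inside each task so that only the file path and row bounds travel through the task graph. The sketch assumes the array is stored as a plain .npy file (not the joblib pickle used in this report) so that np.load(path, mmap_mode="r") can be used; the helper names load_rows and lazy_rows are hypothetical.

```python
import numpy as np
import dask
import dask.array as da


def load_rows(path, start, stop):
    # Open the memory map inside the task, so the graph only carries
    # the path and the row bounds, never the array itself.
    data = np.load(path, mmap_mode="r")
    return np.asarray(data[start:stop])


def lazy_rows(path, shape, dtype, rows_per_chunk):
    # Build one delayed loader per block of rows and stitch them together.
    blocks = []
    for start in range(0, shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, shape[0])
        block = da.from_delayed(
            dask.delayed(load_rows)(path, start, stop),
            shape=(stop - start,) + tuple(shape[1:]),
            dtype=dtype,
        )
        blocks.append(block)
    return da.concatenate(blocks, axis=0)
```

The resulting Dask array could then be passed to da.to_zarr in place of the da.from_array result. Whether this avoids the msgpack error on a given Dask release is an assumption that would need testing.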

What you expected to happen:

Reading a memory-mapped file should work on a cluster.

Minimal Complete Verifiable Example:

Reproducible with a LocalCluster:

import dask
import dask.array as da
import dask.distributed
import joblib
import numpy as np

cluster = dask.distributed.LocalCluster(
    n_workers=1,
    threads_per_worker=10,
    processes=False,
    memory_limit='55GB'
)
client = dask.distributed.Client(address=cluster)
print(client)  # in a Jupyter notebook, display(client) gives a richer view


in_path = '/tmp/test.pkl'  # 7.5G
out_path = '/tmp/test.zarr'


joblib.dump(
    value=np.random.rand(100_000, 10_000),
    filename=in_path
)


def to_zarr(in_path, out_path):
    data = joblib.load(filename=in_path, mmap_mode='r')
    data_da = da.from_array(
        x=data,
        chunks='64 MiB',
        name=False,
    )
    data_zarr = da.to_zarr(
        arr=data_da,
        url=out_path,
        compute=True,
    )


dask.delayed(to_zarr)(in_path, out_path).compute()

Anything else we need to know?:

The last lines of the log seem to be coming from msgpack:

msgpack/_packer.pyx in msgpack._cmsgpack.Packer._pack()
ValueError: memoryview is too large

Looking at a related issue (https://github.com/explosion/spaCy/issues/6875), this seems to be related to a hard-coded limit in msgpack, which results in the ValueError.
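As a rough check of that explanation, the arithmetic below compares the size of the example array with the largest binary payload the msgpack format can describe. It assumes the limit in question is the bin 32 maximum of 2**32 - 1 bytes; the variable names are illustrative only.

```python
# Assumption: the limit hit here is the msgpack bin 32 maximum of 2**32 - 1 bytes.
MSGPACK_BIN32_MAX = 2**32 - 1

shape = (100_000, 10_000)  # array shape from the example above
itemsize = 8               # bytes per float64, the dtype np.random.rand returns

nbytes = shape[0] * shape[1] * itemsize
exceeds_limit = nbytes > MSGPACK_BIN32_MAX

print(f"array payload: {nbytes:,} bytes")
print(f"bin 32 limit:  {MSGPACK_BIN32_MAX:,} bytes")
print(f"exceeds limit: {exceeds_limit}")
```

The array is roughly 8 GB while the limit is roughly 4.3 GB, which would be consistent with the error only appearing once the file is large enough.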

Environment:

Dask version: 2021.04.0
Msgpack version: 1.0.2
Zarr version: 2.7.0
Xarray version: 0.17.0
JobLib version: 1.0.1
Numpy version: 1.20.2
Python version: Python 3.9.0
Operating System: Docker container with Debian GNU/Linux 10 (buster) running on AWS ubuntu AMI.
Install method (conda, pip, source): conda 4.10.0

Top GitHub Comments

jrbourbeau commented, Apr 13, 2021

Thanks for the update with an example! I’m able to reproduce the msgpack error with the latest dask and distributed release (version 2021.04.0) but the issue seems to be resolved with the latest development version of dask and distributed. This issue was probably resolved by https://github.com/dask/dask/pull/7525, which fixed a related-looking issue (https://github.com/dask/distributed/issues/4652). Could you try using the development version of dask and distributed to confirm your issue is fixed?

jrbourbeau commented, Apr 13, 2021

Glad to hear it, thanks for following up here! You should be able to install directly from GitHub with

pip install git+https://github.com/dask/dask
