ValueError: memoryview is too large; when reading from a large memory mapped file in Dask.

What happened:

My Dask run crashes with ValueError: memoryview is too large when converting a large NumPy array (stored as a pickle and read via memory mapping) to a Zarr file with Dask.

This only happens when:

  • The memory mapped file is large enough.
  • The file is read with memory mapping (via joblib.load(file, mmap_mode='r')).
  • The code runs on a cluster (either local or remote).

See full log of this happening on a local Dask cluster at dask_failure.log

I was wondering if this could be circumvented somehow from the Dask side?

What you expected to happen:

Reading a memory mapped file should work on a cluster.

Minimal Complete Verifiable Example:

Reproducible with a LocalCluster:

import dask
import dask.array as da
import dask.distributed
import joblib
import numpy as np

cluster = dask.distributed.LocalCluster(
    n_workers=1,
    threads_per_worker=10,
    processes=False,
    memory_limit='55GB'
)
client = dask.distributed.Client(address=cluster)
display(client)


in_path = '/tmp/test.pkl'  # 7.5G
out_path = '/tmp/test.zarr'


joblib.dump(
    value=np.random.rand(100_000, 10_000),
    filename=in_path
)


def to_zarr(in_path, out_path):
    data = joblib.load(filename=in_path, mmap_mode='r')
    data_da = da.from_array(
        x=data,
        chunks='64 MiB',
        name=False,
    )
    data_zarr = da.to_zarr(
        arr=data_da,
        url=out_path,
        compute=True,
    )


dask.delayed(to_zarr)(in_path, out_path).compute()

Anything else we need to know?:

The last lines of the log seem to be coming from msgpack:

msgpack/_packer.pyx in msgpack._cmsgpack.Packer._pack()
ValueError: memoryview is too large

Looking at a related issue (https://github.com/explosion/spaCy/issues/6875), this seems to be related to a hard-coded limit in msgpack, which results in the ValueError.
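
For a sense of scale: the MessagePack format stores the length of a binary payload in at most 32 bits, so a single buffer can be at most 2**32 - 1 bytes. The sketch below just does that arithmetic for the array in the example above. The helper names are made up for illustration, and it is an assumption (based on the format spec, not on the log) that this is exactly the ceiling the msgpack packer enforces.

```python
import math

# Largest length the MessagePack "bin 32" type can encode, per the format spec.
MSGPACK_BIN_LIMIT = 2**32 - 1


def buffer_nbytes(shape, itemsize):
    """Size in bytes of a dense array with the given shape and item size."""
    return math.prod(shape) * itemsize


def exceeds_bin_limit(shape, itemsize):
    """True if the array would not fit in a single MessagePack bin payload."""
    return buffer_nbytes(shape, itemsize) > MSGPACK_BIN_LIMIT


# The float64 array from the example: 100_000 x 10_000 items of 8 bytes each.
print(buffer_nbytes((100_000, 10_000), 8))      # 8000000000
print(exceeds_bin_limit((100_000, 10_000), 8))  # True
# Half as many rows stays just under the ceiling.
print(exceeds_bin_limit((50_000, 10_000), 8))   # False
```

This would be consistent with the error only showing up once the memory mapped file is large enough.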

Environment:

  • Dask version: ‘2021.04.0’
  • Msgpack version: 1.0.2
  • Zarr version: ‘2.7.0’
  • Xarray version: ‘0.17.0’
  • JobLib version: ‘1.0.1’
  • Numpy version: ‘1.20.2’
  • Python version: Python 3.9.0
  • Operating System: Docker container with Debian GNU/Linux 10 (buster) running on AWS ubuntu AMI.
  • Install method (conda, pip, source): conda 4.10.0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

jrbourbeau commented, Apr 13, 2021

Thanks for the update with an example! I’m able to reproduce the msgpack error with the latest dask and distributed release (version 2021.04.0) but the issue seems to be resolved with the latest development version of dask and distributed. This issue was probably resolved by https://github.com/dask/dask/pull/7525, which fixed a related-looking issue (https://github.com/dask/distributed/issues/4652). Could you try using the development version of dask and distributed to confirm your issue is fixed?

jrbourbeau commented, Apr 13, 2021

Glad to hear it, thanks for following up here! You should be able to install directly from GitHub with

pip install git+https://github.com/dask/dask
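
For anyone who cannot move to the development version, one possible workaround (not something suggested in this thread, and not verified against 2021.04.0) is to keep the memory mapped array out of the task graph altogether. In the sketch below each task receives only the file path and a row range and opens the memory map itself, so the graph should contain nothing but small strings and integers. The names load_rows and lazy_array_from_pickle are hypothetical, and the caller is assumed to know the array's shape and dtype up front.

```python
import dask
import dask.array as da
import joblib
import numpy as np


def load_rows(path, start, stop):
    """Open the memory map inside the task and copy out one block of rows."""
    data = joblib.load(path, mmap_mode='r')
    # np.array copies, so the returned block no longer references the file.
    return np.array(data[start:stop])


def lazy_array_from_pickle(path, shape, dtype, rows_per_chunk):
    """Build a Dask array whose graph holds only the path and row ranges."""
    blocks = []
    for start in range(0, shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, shape[0])
        block = dask.delayed(load_rows)(path, start, stop)
        block_shape = (stop - start,) + tuple(shape[1:])
        blocks.append(da.from_delayed(block, shape=block_shape, dtype=dtype))
    return da.concatenate(blocks, axis=0)
```

With the file from the example above, lazy_array_from_pickle('/tmp/test.pkl', (100_000, 10_000), 'float64', rows_per_chunk=800) would give chunks of 64 MB each, and the result can be passed to da.to_zarr in place of the da.from_array output.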