PyArrow as optional dependency

Is your feature request related to a problem? Please describe.

The PyArrow package contains some very large libraries (e.g., libarrow.so (50 MB) and libarrow_flight.so (14 MB)). This makes it very hard to use Pandera in a serverless environment, since deployment packages have strict size limits and Pandera requires PyArrow. As a result, Pandera is practically unusable on AWS Lambda.
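
To see which files dominate an installed package, a small script along these lines can help. It is only a sketch: the helper names (package_dir, largest_files, total_size_mb) are made up for illustration and are not part of Pandera or PyArrow.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the install directory of a package, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def largest_files(root, top=5):
    """Return (size_in_bytes, path) pairs for the `top` largest files under root."""
    files = [(p.stat().st_size, p) for p in Path(root).rglob("*") if p.is_file()]
    return sorted(files, key=lambda item: item[0], reverse=True)[:top]


def total_size_mb(root):
    """Return the combined size of all files under root, in MB (1024 * 1024 bytes)."""
    total = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total / (1024 * 1024)
```

In an environment where pyarrow is installed, largest_files(package_dir("pyarrow")) should list the shared libraries mentioned above near the top, and total_size_mb(package_dir("pyarrow")) gives the overall footprint.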

Describe the solution you’d like

PyArrow does not seem to be part of Pandera’s core. I would therefore suggest making pyarrow an optional dependency, so that Pandera can be used in environments with strict size constraints.

Describe alternatives you’ve considered

Not applicable.

Additional context

Not applicable.

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 2
  • Comments: 6

Top GitHub Comments

2 reactions
cosmicBboy commented, May 16, 2022

This use case makes sense… I agree pyarrow should be an optional dependency (not a package extra).

To remove it as a dependency from the project, we need to:

  • remove it from the environment.yml file
  • run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
  • remove it from the setup.py file

Additionally, we should explicitly raise a TypeError when users try to use the pyarrow-backed pandas string dtype string[pyarrow], which pandera translates to a pandera DataType here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L464-L483. The error can probably be raised in the __post_init__ method if storage == "pyarrow".
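
A minimal sketch of that check might look like the following. The class and helper names here are hypothetical stand-ins, not pandera's actual code, and the sketch assumes the error should only fire when pyarrow is not installed, which the comment above does not spell out.

```python
import importlib.util
from dataclasses import dataclass


def pyarrow_installed() -> bool:
    """Return True if pyarrow can be found, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def validate_storage(storage: str, pyarrow_available: bool) -> None:
    """Raise TypeError if pyarrow-backed storage is requested but unavailable."""
    if storage == "pyarrow" and not pyarrow_available:
        raise TypeError(
            "string[pyarrow] requires the optional 'pyarrow' package; "
            "install it or use storage='python'."
        )


@dataclass(frozen=True)
class StringStorageSketch:
    # Hypothetical stand-in for the pandera string DataType, not the real class.
    storage: str = "python"

    def __post_init__(self) -> None:
        # Assumption: only reject "pyarrow" storage when pyarrow is missing.
        validate_storage(self.storage, pyarrow_installed())
```

Keeping the check in a separate function that takes the availability flag as an argument makes it easy to test both branches regardless of whether pyarrow is installed in the test environment.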

@jeffzi anything I might have missed here?

0 reactions
jeffzi commented, Jul 14, 2022

Unfortunately serverless does not automatically ignore layer dependencies. You have to list them in the python plugin config.

> you are supposed to specify certain dependencies not to be deployed but if it’s combined with Poetry this setting is (silently) ignored.

I also use Poetry, and I agree the plugin is inconsistent. In some cases, I had to resort to slimPatterns to force it to ignore dependencies:

  pythonRequirements:
    useStaticCache: false
    useDownloadCache: false
    zip: false
    slim: true
    slimPatterns:
      - "botocore*"
      - "cache*"
      - "lxml*"

> Another important issue is that layers do actually count for the total size of the deployment package. So splitting packages by layer will not help with that.

Forgot about that 😦 At work I have a dedicated layer for pyarrow. The pyarrow layer strips the *.so files you mentioned. It reduces the size, but it’s obviously better to remove pyarrow as a dependency of pandera.
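
Since layers count toward the total, the size budget is just a sum. The sketch below assumes AWS Lambda's documented 250 MB limit on the unzipped deployment package including layers; the function name and the example sizes are hypothetical.

```python
MB = 1024 * 1024
# AWS Lambda limit for the unzipped deployment package, layers included.
UNZIPPED_LIMIT_BYTES = 250 * MB


def fits_lambda_limit(function_bytes, layer_bytes, limit=UNZIPPED_LIMIT_BYTES):
    """Return (fits, total_bytes) for a function package plus all of its layers."""
    total = function_bytes + sum(layer_bytes)
    return total <= limit, total


# Hypothetical sizes: a 100 MB function package with one or two layers.
one_layer = fits_lambda_limit(100 * MB, [100 * MB])
two_layers = fits_lambda_limit(100 * MB, [100 * MB, 60 * MB])
```

With these made-up numbers the first configuration fits and the second does not, because moving a dependency into a layer changes where the bytes live but not how many are counted.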

PyArrow as optional dependency · Issue #851 - GitHub