PyArrow as optional dependency
Is your feature request related to a problem? Please describe.
The PyArrow package contains some very large libraries (e.g., libarrow.so (50 MB) and libarrow_flight.so (14 MB)). This makes it very hard to use Pandera in a serverless environment, because deployment packages have strict size limits and Pandera requires PyArrow. As a result, Pandera is practically unusable on AWS Lambda.
Describe the solution you’d like
It seems that PyArrow is not really part of the core of Pandera. Therefore, I suggest making pyarrow an optional dependency so that Pandera can be used in environments with strict size constraints.
Describe alternatives you’ve considered
Not applicable.
Additional context
Not applicable.
This use case makes sense… I agree pyarrow should be an optional dependency (not a package extra).
To remove this as a dependency from the project, we need to:
- Remove it from the environment.yml file
- Run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
- Remove it from the setup.py file
Additionally, we should explicitly raise a TypeError when users try to use the pyarrow pandas string format string[pyarrow], which pandera translates to a pandera DataType here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L464-L483. We can probably raise the error in the __post_init__ method if storage == "pyarrow".
@jeffzi anything I might have missed here?
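The TypeError suggestion could look roughly like the sketch below. It is not pandera's actual code: the names StringDtypeSketch, check_storage and pyarrow_installed are hypothetical, and it assumes the error should only be raised when pyarrow is not installed, which the comment does not state explicitly.

```python
import importlib.util
from dataclasses import dataclass


def pyarrow_installed() -> bool:
    """Return True if pyarrow can be imported, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def check_storage(storage: str, pyarrow_available: bool) -> None:
    """Raise TypeError if pyarrow storage is requested but unavailable.

    Hypothetical helper; kept separate from the dataclass so the rule
    can be exercised without depending on the local environment.
    """
    if storage == "pyarrow" and not pyarrow_available:
        raise TypeError(
            "string[pyarrow] requires the optional 'pyarrow' package, "
            "which is not installed."
        )


@dataclass(frozen=True)
class StringDtypeSketch:
    """Stand-in for a string data type with a configurable storage backend."""

    storage: str = "python"

    def __post_init__(self) -> None:
        # Assumption: the check runs at construction time, as suggested
        # in the comment ("in the __post_init__ method").
        check_storage(self.storage, pyarrow_installed())
```

With this shape, StringDtypeSketch() always works, while StringDtypeSketch(storage="pyarrow") fails early with a clear message on installations that leave pyarrow out.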
Unfortunately, serverless does not automatically ignore layer dependencies. You have to list them in the python plugin config.
I also use poetry and I agree the plugin is inconsistent. In some cases, I had to resort to using slimPatterns to force ignore deps.
Forgot about that 😦 At work I have a dedicated layer for pyarrow. The pyarrow layer strips the *.so files you mentioned. It reduces the size, but it's obviously better to remove pyarrow as a dependency of pandera.