Pandas doesn't recognize Pyarrow as a Parquet engine even though it's installed

Code Sample, a copy-pastable example if possible

In [6]: pd.io.parquet.get_engine('auto')
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-77cb1d6c8933> in <module>
----> 1 pd.io.parquet.get_engine('auto')

~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in get_engine(engine)
     30             pass
     31
---> 32         raise ImportError("Unable to find a usable engine; "
     33                           "tried using: 'pyarrow', 'fastparquet'.\n"
     34                           "pyarrow or fastparquet is required for parquet "

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support

Problem description

Pandas doesn’t recognize Pyarrow as a Parquet engine even though it’s installed: the output of pd.show_versions() below reports pyarrow 0.12.0.

Expected Output

In [2]: pd.io.parquet.get_engine('auto')
Out[2]: <pandas.io.parquet.PyArrowImpl at 0x119c78f28>

Output of pd.show_versions()

In [5]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 3.9.3
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

bmwilly commented, Jan 28, 2019 (16 reactions)

TLDR: I got it working by uninstalling via conda and installing with pip. So it appears that there’s something off about that specific conda version. Sorry for the noise.

Details below for others.

I didn’t have multiple versions of pyarrow installed.
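For anyone who wants to double-check which copy of pyarrow Python would actually pick up, here is a minimal sketch using only the standard library. The helper name locate_module is made up for illustration; it reports the file a module would be loaded from without importing it, so a broken shared library does not get in the way.

```python
import importlib.util


def locate_module(name):
    """Return the file Python would load `name` from, or None if it cannot be found.

    `locate_module` is a hypothetical helper, not part of pandas or pyarrow.
    For a top-level package such as "pyarrow", find_spec does not import it.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec can raise for malformed names or odd sys.modules entries.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # A path under site-packages tells you which environment owns the install.
    print("pyarrow ->", locate_module("pyarrow"))
    print("fastparquet ->", locate_module("fastparquet"))
```

If the printed path is not inside the environment you expect, a second install is probably shadowing the one you meant to use.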

I uninstalled via conda, verified I didn’t have pyarrow from pip, reinstalled via conda, and got the same error:

brandon@datascience-dev:~
$ conda uninstall pyarrow parquet-cpp
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/brandon/miniconda3

  removed specs:
    - parquet-cpp
    - pyarrow


The following packages will be REMOVED:

  parquet-cpp-1.5.1-1
  pyarrow-0.11.1-py36hbbcf98d_1002


Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
brandon@datascience-dev:~
$ pip uninstall pyarrow
Skipping pyarrow as it is not installed.
brandon@datascience-dev:~
$ python -c 'import pyarrow'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'
brandon@datascience-dev:~
$ pip show pyarrow
brandon@datascience-dev:~
$ conda list pyarrow
# packages in environment at /home/brandon/miniconda3:
#
# Name                    Version                   Build  Channel
brandon@datascience-dev:~
$ conda install -y pyarrow
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/brandon/miniconda3

  added / updated specs:
    - pyarrow


The following NEW packages will be INSTALLED:

  parquet-cpp        conda-forge/noarch::parquet-cpp-1.5.1-2
  pyarrow            conda-forge/linux-64::pyarrow-0.12.0-py36hbbcf98d_2

The following packages will be UPDATED:

  arrow-cpp                        0.11.1-py36h0e61e49_1004 --> 0.12.0-py36h0e61e49_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
brandon@datascience-dev:~
$ python -c 'import pyarrow'

And then

brandon@datascience-dev:~
$ jupyter console
Jupyter console 6.0.0

Python 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.



In [1]: import pandas as pd

In [2]: pd.io.parquet.PyArrowImpl()
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in __init__(self)
     82             import pyarrow
---> 83             import pyarrow.parquet
     84         except ImportError:

~/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py in <module>
     29 import pyarrow.lib as lib
---> 30 import pyarrow._parquet as _parquet
     31

ImportError: /home/brandon/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.12: undefined symbol: _ZNK5boost16re_detail_10680031cpp_regex_traits_implementationIcE17transform_primaryB5cxx11EPKcS4_

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-2-9b1c87fa6892> in <module>
----> 1 pd.io.parquet.PyArrowImpl()

~/miniconda3/lib/python3.6/site-packages/pandas/io/parquet.py in __init__(self)
     84         except ImportError:
     85             raise ImportError(
---> 86                 "pyarrow is required for parquet support\n\n"
     87                 "you can install via conda\n"
     88                 "conda install pyarrow -c conda-forge\n"

ImportError: pyarrow is required for parquet support

you can install via conda
conda install pyarrow -c conda-forge

or via pip
pip install -U pyarrow
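Note that the traceback above shows pandas catching the original ImportError (the undefined symbol in libparquet.so.12) and re-raising a generic "pyarrow is required" message, which is why the engine looks missing even though the package is installed. A rough sketch for surfacing the underlying error is to import the modules directly. The helper name probe_parquet_engines is hypothetical, and the module list simply mirrors the imports visible in the traceback.

```python
import importlib


def probe_parquet_engines(modules=("pyarrow", "pyarrow.parquet", "fastparquet")):
    """Try to import each module and return {name: error text or None}.

    `probe_parquet_engines` is a hypothetical helper for illustration only.
    A value of None means the import succeeded.
    """
    results = {}
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # Keep the original message, e.g. an "undefined symbol" linker error.
            results[name] = "{}: {}".format(type(exc).__name__, exc)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    for name, error in probe_parquet_engines().items():
        print(name, "->", "OK" if error is None else error)
```

In the situation described here, one would expect the pyarrow.parquet entry to carry the undefined-symbol message rather than a plain "No module named" error, which points at a broken binary build rather than a missing package.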

I got it working by uninstalling via conda and installing with pip:

$ pip install -U pyarrow
Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/42/be/8682e7f6c12dd42f31f32b18d5d61b4f578f12aaa9058081aef954a481d3/pyarrow-0.12.0-cp36-cp36m-manylinux1_x86_64.whl (12.5MB)
    100% |████████████████████████████████| 12.5MB 6.0MB/s
Requirement already satisfied, skipping upgrade: six>=1.0.0 in /home/brandon/miniconda3/lib/python3.6/site-packages (from pyarrow) (1.11.0)
Requirement already satisfied, skipping upgrade: numpy>=1.14 in /home/brandon/miniconda3/lib/python3.6/site-packages (from pyarrow) (1.15.4)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.12.0
brandon@datascience-dev:~/miniconda3/lib/python3.6/site-packages
$ python
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.io.parquet.PyArrowImpl()
<pandas.io.parquet.PyArrowImpl object at 0x7fc8ab9ee0f0>

So it appears that there’s something off about that specific conda version.
