Does pandera support validating enum.Enum or subclasses of it?


Discussed in https://github.com/unionai-oss/pandera/discussions/907

<div type='discussions-op-text'>

Originally posted by davidandreoletti, August 8, 2022. Assuming a pydantic-like class declaration with:

import enum

class SizeEnum(enum.Enum):
    BIG = "big"
    SMALL = "small"

class SummaryDFSchema(pandera...):
    size: pandera.typing.Series[SizeEnum]
    name: ...

Currently pandera fails (by raising an exception), apparently because pandas does not recognize the Enum as a registered custom dtype.

What methods or workarounds could be used to make pandera check that the column contains SizeEnum members (rather than one of their string values, such as “big”)?</div>
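One possible workaround, sketched here with plain pandas rather than a pandera dtype, is to check element-wise that every value in the column is a SizeEnum member. The helper name `all_members` and the two dataframes are made up for illustration; the same predicate could presumably be wrapped in an element-wise pandera check, which is not shown here.

```python
import enum

import pandas as pd


class SizeEnum(enum.Enum):
    BIG = "big"
    SMALL = "small"


def all_members(column: pd.Series, enum_cls: type) -> bool:
    """Return True if every element of the column is an instance of enum_cls."""
    # Hypothetical helper, not part of pandera or pandas.
    return bool(column.map(lambda value: isinstance(value, enum_cls)).all())


members_df = pd.DataFrame({"size": [SizeEnum.BIG, SizeEnum.SMALL]})
strings_df = pd.DataFrame({"size": ["big", "small"]})

print(all_members(members_df["size"], SizeEnum))  # column of Enum members
print(all_members(strings_df["size"], SizeEnum))  # column of raw strings
```

The first column holds Enum members and passes; the second holds only the string values and does not.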


Top GitHub Comments

dantheand commented, Aug 17, 2022 (2 reactions)

For string-type enums, I’ve found a fairly elegant solution is to pass them directly to pandera.Field as the isin= argument. I don’t know if this would work for non-hashable Python objects.

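import enum

import pandas as pd
import pandera
from pandera.typing import Series

class TestEnum(str, enum.Enum):
    CLASS_1 = "class 1"
    CLASS_2 = "class 2"

class Schema(pandera.SchemaModel):
    # isin= is given the Enum class itself; thanks to the str mixin,
    # its members are str instances.
    class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)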
the-matt-morris commented, Aug 12, 2022 (2 reactions)

This is a cool idea. It’s something I’ve added in for a specific use case of mine, though I admit all my Enum subclasses are strings so I’m not hitting on any obscure cases. Gotta be a better way to do this than what I hacked together, but I added a step to SchemaModel.__init_subclass__ to update fields with Enum annotations to have categorical types with the categories defined by the Enum values.

One argument for using categorical type is that it can handle data types other than strings:

This one is the first example in the enum docs:

from enum import Enum

import pandas as pd
from pandera import SchemaModel
from pandera.typing import Series

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


# Relies on the patched SchemaModel.__init_subclass__ described above,
# which turns the Enum annotation into a categorical dtype.
class MySchema(SchemaModel):
    color: Series[Color]


df = pd.DataFrame({"color": [1, 2, 3]})
MySchema.validate(df)
  color
0     1
1     2
2     3
df = pd.DataFrame({"color": [1, 2, 3, 4]})
MySchema.validate(df)
...
pandera.errors.SchemaError: Error while coercing 'color' to type category: Could not coerce <class 'pandas.core.series.Series'> data_container into type category:
   index  failure_case
0      3             4
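The categorical mechanism behind that failure can be sketched with plain pandas, without the patched `SchemaModel`. This is only an approximation of what the coercion does, assuming the categories are the Enum values; the variable names are illustrative.

```python
from enum import Enum

import pandas as pd


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


# Categories are the enum *values* (1, 2, 3), not the members.
color_dtype = pd.CategoricalDtype(categories=[member.value for member in Color])

raw = pd.Series([1, 2, 3, 4], name="color")
# Values outside the categories become NaN when cast to the categorical dtype.
coerced = raw.astype(color_dtype)

# Anything present before the cast but missing afterwards failed coercion.
failure_cases = raw[coerced.isna() & raw.notna()]
print(failure_cases.tolist())
```

Here the only failure case is the value 4 at index 3, matching the error report above.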

That said, not sure what to do about Enum subclasses that have values that are not scalars:

class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    VENUS   = (4.869e+24, 6.0518e6)
    EARTH   = (5.976e+24, 6.37814e6)
    MARS    = (6.421e+23, 3.3972e6)
    JUPITER = (1.9e+27,   7.1492e7)
    SATURN  = (5.688e+26, 6.0268e7)
    URANUS  = (8.686e+25, 2.5559e7)
    NEPTUNE = (1.024e+26, 2.4746e7)
    def __init__(self, mass, radius):
        self.mass = mass       # in kilograms
        self.radius = radius   # in meters

In this case, the expected values in the series would be (float, float) tuples, each matching the value of a Planet member. Maybe that is ok.
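As a rough illustration of why tuple values may be fine: tuples are hashable, so the member values can still be collected into a set and used for a membership test. The sketch below uses a shortened Planet enum and an invented column, and plain Python rather than pandera.

```python
from enum import Enum


class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    EARTH = (5.976e+24, 6.37814e6)

    def __init__(self, mass, radius):
        self.mass = mass       # in kilograms
        self.radius = radius   # in meters


# Each member's .value is the (mass, radius) tuple it was defined with.
allowed_values = {member.value for member in Planet}

column = [(3.303e+23, 2.4397e6), (5.976e+24, 6.37814e6), (1.0, 2.0)]
invalid = [value for value in column if value not in allowed_values]
print(invalid)
```

Only the last tuple is flagged, since it does not match any Planet value.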

An important question seems to be whether the expected series values are instances of the Enum (i.e. Color.RED) or the Enum values (1). I would think the values, but thoughts on that?
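The distinction matters because a plain Enum member does not compare equal to its value, while a member of a str-mixin Enum does. A small standard-library sketch, reusing the enum names from this thread:

```python
from enum import Enum


class Color(Enum):
    RED = 1


class TestEnum(str, Enum):
    CLASS_1 = "class 1"


# A plain Enum member does not compare equal to its value...
plain_equal = Color.RED == 1
# ...but a member of a str-mixin Enum does.
mixin_equal = TestEnum.CLASS_1 == "class 1"

# Converting between the two representations is explicit either way.
member_from_value = Color(1)
value_from_member = Color.RED.value

print(plain_equal, mixin_equal, member_from_value, value_from_member)
```

So a check written against values will not accept a column of plain Enum members, and vice versa, unless one side is converted first.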
