New @strict_dataclass decorator for dataclass validation #2895

Draft: wants to merge 4 commits into main

Wauplin commented Feb 28, 2025

Follow-up to huggingface/transformers#36329 and (private) Slack discussions.

The idea is to add a layer of validation on top of Python's built-in dataclasses.

Example

from huggingface_hub.utils import strict_dataclass, validated_field

def positive_int(value: int):
    if not value >= 0:
        raise ValueError(f"Value must be positive, got {value}")

@strict_dataclass
class User:
    name: str
    age: int = validated_field(positive_int)

user = User(name="John", age=30)
(...)

# assign invalid type
user.age = "31"
# huggingface_hub.errors.StrictDataclassFieldValidationError: Validation error for field 'age':
#   TypeError: Field 'age' expected int, got str (value: '31')

# assign invalid value
user.age = -1
# huggingface_hub.errors.StrictDataclassFieldValidationError: Validation error for field 'age':
#   ValueError: Value must be positive, got -1

What does it do?

  1. Provides a @strict_dataclass decorator built on top of @dataclass. Field values of a decorated class are validated.
  2. Fields are validated against their type annotations (str, bool, dict, etc.). Annotations can be deeply nested (e.g. Dict[str, List[Optional[DummyClass]]] is correctly validated).
  3. Users can define custom validators in addition to the type check using validated_field (built on top of field).
  4. Fields are validated on value assignment, meaning at initialization but also each time someone updates a value.
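To make the four points above concrete, here is a minimal sketch of how validation on assignment could work. It is not the implementation from this PR: `sketch_strict_dataclass`, `sketch_validated_field` and `check_type` are hypothetical names, the type check only handles plain classes, `List`, `Dict` and `Union`/`Optional`, and it assumes annotations are real types rather than strings.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Callable, Dict, List, Optional, Union, get_args, get_origin


def check_type(value: Any, expected: Any) -> None:
    """Raise TypeError if `value` does not match the annotation `expected`."""
    if expected is Any:
        return
    origin, args = get_origin(expected), get_args(expected)
    if origin is Union:  # also covers Optional[X], i.e. Union[X, None]
        for arg in args:
            try:
                check_type(value, arg)
                return  # one matching member is enough
            except TypeError:
                continue
        raise TypeError(f"expected {expected}, got {type(value).__name__}")
    if origin in (list, dict):
        if not isinstance(value, origin):
            raise TypeError(f"expected {expected}, got {type(value).__name__}")
        if origin is list and args:
            for item in value:
                check_type(item, args[0])
        elif origin is dict and args:
            for key, item in value.items():
                check_type(key, args[0])
                check_type(item, args[1])
        return
    if not isinstance(value, expected):
        raise TypeError(f"expected {expected}, got {type(value).__name__}")


def sketch_validated_field(validator: Callable[[Any], None], **kwargs: Any) -> Any:
    """Hypothetical helper: store a custom validator in the field metadata."""
    return field(metadata={"validator": validator}, **kwargs)


def sketch_strict_dataclass(cls: type) -> type:
    """Hypothetical decorator: validate every declared field on assignment."""
    cls = dataclass(cls)
    specs = {f.name: f for f in fields(cls)}

    def __setattr__(self: Any, name: str, value: Any) -> None:
        spec = specs.get(name)
        if spec is not None:
            check_type(value, spec.type)  # type check from the annotation
            validator = spec.metadata.get("validator")
            if validator is not None:
                validator(value)  # custom check, raises on invalid values
        object.__setattr__(self, name, value)  # only reached if valid

    # The generated __init__ assigns with `self.x = x`, so it goes through this too.
    cls.__setattr__ = __setattr__
    return cls


def positive_int(value: int) -> None:
    if value < 0:
        raise ValueError(f"Value must be positive, got {value}")


@sketch_strict_dataclass
class User:
    name: str
    age: int = sketch_validated_field(positive_int)
    scores: Dict[str, List[Optional[int]]] = field(default_factory=dict)


user = User(name="John", age=30)
user.scores = {"math": [18, None]}  # matches the nested annotation
```

In this sketch, `user.age = "31"` raises `TypeError` and `user.age = -1` raises `ValueError`, and in both cases the stored value is left untouched. The PR itself wraps such errors in `StrictDataclassFieldValidationError`, which the sketch omits.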

What doesn't it do (yet)?

  • It doesn't have the concept of a "class validator" to check that all fields are coherent with each other. To implement them, we would need to execute them once in __post_init__ and then "on-demand" with a .validate() method(?). We cannot run them on each field assignment as that would prevent modifying related values (if values A and B must be coherent, we want to be able to change both A and B and then validate).
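The "class validator" idea can be sketched with a plain `@dataclass`. This is only an illustration of the design under discussion, not something this PR provides: the `Range` class and its `validate()` method are hypothetical, and the cross-field check runs once in `__post_init__` and then only when called explicitly.

```python
from dataclasses import dataclass


@dataclass
class Range:
    low: int
    high: int

    def __post_init__(self) -> None:
        # Run the cross-field check once, right after initialization.
        self.validate()

    def validate(self) -> None:
        # Cross-field rule: the two values must be coherent with each other.
        if self.low > self.high:
            raise ValueError(f"low ({self.low}) must not exceed high ({self.high})")


r = Range(low=0, high=10)
r.low = 20   # temporarily incoherent (20 > 10), no error raised here
r.high = 30  # coherent again
r.validate()  # on-demand check passes
```

Had the check run on every assignment, `r.low = 20` would have failed even though the final state is valid, which is the reason given above for validating on demand.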

Why not choose pydantic? (or attrs? or marshmallow_dataclass?)

  • See the discussion in huggingface/transformers#36329 ("Question for community: We're considering adding pydantic as a base requirement to 🤗 transformers") about adding pydantic as a new dependency. It would be a heavy addition and would require careful logic to support both v1 and v2.
  • we do not want most of pydantic's features, especially the ones related to automatic casting, jsonschema, serialization, aliases, ...
  • we do not need to be able to instantiate a class from a dictionary
  • we do not want to mutate data. In this PR, "validation" refers to "checking if a value is valid". In pydantic, "validation" refers to "casting a value, possibly mutating it, and then checking if it is valid".
  • we do not need blazing fast validation. @strict_dataclass is not meant for heavy loads where performance is critical. The common use case will be to validate a model configuration (only done once and negligible compared to running a model). This allows us to keep the code minimal.
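The difference between the two meanings of "validation" can be illustrated in plain Python. Both functions below are hypothetical stand-ins: `check_int` mirrors the check-only approach of this PR, while `coerce_int` only imitates casting-style validation and is not pydantic's actual implementation.

```python
from typing import Any


def check_int(value: Any) -> int:
    """Check-only validation: reject anything that is not already an int."""
    if not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__} (value: {value!r})")
    return value  # returned unchanged, never mutated


def coerce_int(value: Any) -> int:
    """Casting-style validation: convert first, so the stored value may differ."""
    return int(value)


coerced = coerce_int("31")  # the string is silently turned into an int

try:
    check_int("31")
    rejected = False
except TypeError:
    rejected = True  # the string is refused instead of being converted
```

With the check-only approach the caller learns about the wrong type immediately, at the cost of having to convert the input explicitly.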

Plan:

  • test it on real use cases (typically transformers @gante)
  • iterate on the design until we have something satisfying
  • (optional) find a good design to define "class validators".
  • (optional) add a set of generic validators that could be reused downstream
  • document it, add tests, etc.
  • merge?

We won't push for it / release it until we are sure at least the transformers use case is covered.

Notes:

This @strict_dataclass might be useful in huggingface_hub itself in the future but that's not its primary goal for now.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
