33. Data Classes for Simple Structured Data

In Chapter 30, we learned how to create classes to define our own types. We wrote __init__ methods to initialize instances, __repr__ methods to display them, and __eq__ methods to compare them. While this approach works perfectly, it involves writing a lot of repetitive code, especially when a class primarily exists to store data.

Python's data classes provide a cleaner, more concise way to create classes that are mainly containers for data. By using the @dataclass decorator, Python automatically generates common methods like __init__, __repr__, and __eq__ based on the class attributes you define. This reduces boilerplate code and makes your intentions clearer.

33.1) What Data Classes Are and When to Use Them

A data class is a class designed primarily to store data values. Instead of manually writing initialization and comparison methods, you define the attributes your class should have, and Python generates the necessary methods automatically.

Why Data Classes Matter

Consider a regular class for representing a book:

python

class Book:
    def __init__(self, title, author, year):
        self.title = title
        self.author = author
        self.year = year
    
    def __repr__(self):
        return f"Book(title={self.title!r}, author={self.author!r}, year={self.year})"
    
    def __eq__(self, other):
        if not isinstance(other, Book):
            return False
        return (self.title == other.title and 
                self.author == other.author and 
                self.year == other.year)
 
book1 = Book("1984", "George Orwell", 1949)
print(book1)  # Output: Book(title='1984', author='George Orwell', year=1949)
 
book2 = Book("1984", "George Orwell", 1949)
print(book1 == book2)  # Output: True

This works, but notice how much code we wrote just to store three pieces of information. The __init__, __repr__, and __eq__ methods follow predictable patterns—they simply handle the attributes we defined.

Data classes eliminate this repetition. They're particularly useful when:

Your class primarily stores data rather than implementing complex behavior
You need standard methods like initialization, string representation, and equality comparison
You want clearer, more maintainable code with less boilerplate
You're creating configuration objects, data transfer objects, or simple records

Data classes don't replace regular classes; they complement them. Regular classes are usually the better fit when a class is mostly about behavior: heavily customized initialization, complex methods, or deep inheritance hierarchies. Use data classes when you mainly need a structured container for related data.

The Relationship Between Data Classes and Regular Classes

Data classes are still regular Python classes. They support all the features we learned in Chapters 30-32: methods, properties, inheritance, and special methods. The @dataclass decorator simply automates the creation of common methods, saving you from writing repetitive code.
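
As a small illustration of that point, the sketch below gives a data class a property and then extends it through inheritance. The names Item, DiscountedItem, and total are invented for this example and are not part of the dataclasses module.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    unit_price: float
    quantity: int

    @property
    def total(self) -> float:
        """Derived value computed on each access, not stored as a field."""
        return self.unit_price * self.quantity


@dataclass
class DiscountedItem(Item):
    # Fields declared in a subclass are added after the inherited ones.
    discount: float = 0.0

    @property
    def total(self) -> float:
        return super().total * (1 - self.discount)


pencils = Item("Pencil", 0.5, 10)
print(pencils.total)  # 5.0

sale = DiscountedItem("Pencil", 0.5, 10, discount=0.5)
print(sale)
print(sale.total)              # 2.5
print(isinstance(sale, Item))  # True
```

The subclass inherits the three fields of Item, so its generated __init__ accepts all four values.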

33.2) Creating Data Classes with @dataclass

To create a data class, you import the dataclass decorator from the dataclasses module and apply it to your class definition. Inside the class, you define class attributes with type annotations that specify what data the class should hold.

Basic Data Class Syntax

python

from dataclasses import dataclass
 
@dataclass
class Student:
    name: str
    student_id: int
    gpa: float
 
# Create instances
alice = Student("Alice Johnson", 12345, 3.8)
bob = Student("Bob Smith", 12346, 3.5)
 
print(alice)  # Output: Student(name='Alice Johnson', student_id=12345, gpa=3.8)
print(bob)    # Output: Student(name='Bob Smith', student_id=12346, gpa=3.5)

Let's break down what the @dataclass decorator does:

@dataclass: Applying this decorator makes Python automatically write the __init__, __repr__, and __eq__ methods for you
Automatic __init__: Python creates an initialization method that accepts name, student_id, and gpa as parameters, in the order they're defined, and assigns them to instance attributes
Automatic __repr__: Python creates a string representation showing the class name and all attribute values
Automatic __eq__: Python creates an equality comparison method that compares all attributes
Turns type annotations into fields: In a regular class, writing name: str in the class body without assigning a value only records an annotation; it does not create any attribute. The @dataclass decorator reads these annotations and treats each one as a field, which becomes a parameter of the generated __init__ and an instance attribute. Each instance gets its own name, student_id, and gpa attributes.

The key difference from regular classes:

python

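from dataclasses import dataclass
 
# Regular class - bare annotations are only recorded in __annotations__;
# no attributes and no __init__ parameters are created
class RegularStudent:
    name: str
    student_id: int
 
# Data class - the same annotations become fields: __init__ parameters
# that are stored as instance attributes (each instance has its own)
@dataclass
class DataStudent:
    name: str
    student_id: int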
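
A quick way to see what each class actually contains is to inspect it. The sketch below repeats the two definitions so that it runs on its own and uses the fields() helper from the dataclasses module.

```python
from dataclasses import dataclass, fields


class RegularStudent:
    name: str
    student_id: int


@dataclass
class DataStudent:
    name: str
    student_id: int


# Bare annotations are recorded, but no attribute called `name` exists.
print(RegularStudent.__annotations__)
print(hasattr(RegularStudent, "name"))  # False

# The plain class has no generated __init__, so it rejects arguments.
try:
    RegularStudent("Alice", 12345)
except TypeError as exc:
    print(f"TypeError: {exc}")

# The data class turned the same annotations into fields.
print([f.name for f in fields(DataStudent)])  # ['name', 'student_id']
student = DataStudent("Alice", 12345)
print(student.name)  # Alice
```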

Understanding Type Annotations in Data Classes

In data classes, type annotations define the attributes and document their expected types:

python

from dataclasses import dataclass
 
@dataclass
class Product:
    name: str
    price: float
    in_stock: bool
 
# Using the correct types as documented
laptop = Product("Laptop", 999.99, True)
print(laptop)  # Output: Product(name='Laptop', price=999.99, in_stock=True)
 
# Python doesn't enforce types - this runs without error
macbook = Product("Macbook", "expensive", True)
print(macbook)  # Output: Product(name='Macbook', price='expensive', in_stock=True)
 
# But using wrong types will cause problems later:
discounted = laptop.price * 0.9     # Works: roughly 899.991 (floating-point result)
discounted = macbook.price * 0.9    # TypeError: can't multiply sequence by non-int of type 'float'
 
tax = laptop.price + 50             # Works: 1049.99
tax = macbook.price + 50            # TypeError: can only concatenate str (not "int") to str

Python won't stop you from passing the wrong types when creating a data class instance. The type annotations are primarily documentation—they tell other programmers (and type-checking tools like mypy) what types you expect, but Python doesn't enforce them at runtime. This is consistent with Python's dynamic typing philosophy.

However, following the type annotations makes your code more predictable and easier to debug. When you use the wrong types, errors will appear later when you try to use the data, making bugs harder to trace. Type-checking tools can catch these mismatches before you run your code, helping you find problems early.

Accessing and Modifying Attributes

Data class instances work exactly like regular class instances. You access and modify attributes using dot notation:

python

from dataclasses import dataclass
 
@dataclass
class Employee:
    name: str
    position: str
    salary: float
 
emp = Employee("Sarah Chen", "Software Engineer", 95000.0)
 
# Access attributes
print(emp.name)      # Output: Sarah Chen
print(emp.position)  # Output: Software Engineer
 
# Modify attributes
emp.salary = 100000.0
emp.position = "Senior Software Engineer"
 
print(emp)  # Output: Employee(name='Sarah Chen', position='Senior Software Engineer', salary=100000.0)

Data classes are mutable by default—you can change their attributes after creation. This is different from tuples or named tuples, which are immutable. If you need immutability, you can configure the data class with frozen=True (we'll explore this in Section 33.4).

33.3) Generated Methods: __init__, __repr__, and __eq__

The @dataclass decorator automatically generates three essential methods. Understanding what these methods do helps you use data classes effectively and know when to customize them.

The Generated __init__ Method

The __init__ method initializes a new instance with the provided values. Python generates it based on the order of your attribute definitions:

python

from dataclasses import dataclass
 
@dataclass
class Rectangle:
    width: float
    height: float
 
# The generated __init__ accepts width and height in that order
rect = Rectangle(10.5, 5.0)
print(rect.width)   # Output: 10.5
print(rect.height)  # Output: 5.0
 
# You can also use keyword arguments
rect2 = Rectangle(height=8.0, width=12.0)
print(rect2.width)   # Output: 12.0
print(rect2.height)  # Output: 8.0

The generated __init__ is equivalent to writing:

python

def __init__(self, width: float, height: float):
    self.width = width
    self.height = height

This automatic generation saves you from writing repetitive initialization code, especially for classes with many attributes.
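
One way to confirm what was generated is to ask Python for the constructor's signature. This sketch uses inspect.signature from the standard library and redefines Rectangle so that it runs on its own.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: float
    height: float


# inspect.signature reports the parameters of the generated __init__.
signature = inspect.signature(Rectangle)
print(list(signature.parameters))  # ['width', 'height']

# Omitting a required argument fails just as it would
# for a hand-written __init__.
try:
    Rectangle(10.5)
except TypeError as exc:
    print(f"TypeError: {exc}")
```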

The Generated __repr__ Method

The __repr__ method provides a string representation of the instance that shows all attribute values. This is invaluable for debugging and logging:

python

from dataclasses import dataclass
 
@dataclass
class Point:
    x: float
    y: float
    label: str
 
point = Point(3.5, 7.2, "A")
print(point)  # Output: Point(x=3.5, y=7.2, label='A')
print(repr(point))  # Output: Point(x=3.5, y=7.2, label='A')

The generated __repr__ follows the convention of showing the class name and all attributes in a format that could be used to recreate the object. This is much more helpful than the default representation you'd get without __repr__: <__main__.Point object at 0x...>.

The Generated __eq__ Method

The __eq__ method enables equality comparison between instances. Two data class instances are considered equal if all their corresponding attributes are equal:

python

from dataclasses import dataclass
 
@dataclass
class Color:
    red: int
    green: int
    blue: int
 
color1 = Color(255, 0, 0)
color2 = Color(255, 0, 0)
color3 = Color(0, 255, 0)
 
print(color1 == color2)  # Output: True (same RGB values)
print(color1 == color3)  # Output: False (different RGB values)
print(color1 is color2)  # Output: False (different objects in memory)

This automatic equality comparison is based on value equality, not identity. Even though color1 and color2 are different objects in memory (as shown by is), they're considered equal because their attributes match.

The generated __eq__ method compares fields in the order they're defined, and only treats two objects as equal when they are instances of the same class:

python

from dataclasses import dataclass
 
@dataclass
class Book:
    title: str
    author: str
    year: int
 
book1 = Book("1984", "George Orwell", 1949)
book2 = Book("1984", "George Orwell", 1949)
book3 = Book("Animal Farm", "George Orwell", 1945)
 
print(book1 == book2)  # Output: True (all attributes match)
print(book1 == book3)  # Output: False (title and year differ)
 
# Comparison with non-Book objects returns False
print(book1 == "1984")  # Output: False
print(book1 == None)    # Output: False
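
The class check is easy to overlook, so here is a minimal sketch of it. Point2D and Size are hypothetical classes that happen to share field names and values.

```python
from dataclasses import dataclass


@dataclass
class Point2D:
    x: int
    y: int


@dataclass
class Size:
    x: int
    y: int


# Same field names and values, but different classes: not equal.
print(Point2D(1, 2) == Size(1, 2))  # False

# Against another type the generated method returns NotImplemented,
# which the == operator then resolves to False.
print(Point2D(1, 2).__eq__("1, 2"))  # NotImplemented
print(Point2D(1, 2) == "1, 2")       # False
```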

Comparing Generated Methods to Manual Implementation

To appreciate what data classes provide, let's compare the data class version with a manual implementation:

python

from dataclasses import dataclass
 
# Data class version (concise)
@dataclass
class PersonData:
    first_name: str
    last_name: str
    age: int
 
# Equivalent manual version (verbose)
class PersonManual:
    def __init__(self, first_name: str, last_name: str, age: int):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age
    
    def __repr__(self):
        return f"PersonManual(first_name={self.first_name!r}, last_name={self.last_name!r}, age={self.age})"
    
    def __eq__(self, other):
        if not isinstance(other, PersonManual):
            return False
        return (self.first_name == other.first_name and
                self.last_name == other.last_name and
                self.age == other.age)
 
# Both work identically
p1 = PersonData("Alice", "Johnson", 30)
p2 = PersonManual("Alice", "Johnson", 30)
 
print(p1)  # Output: PersonData(first_name='Alice', last_name='Johnson', age=30)
print(p2)  # Output: PersonManual(first_name='Alice', last_name='Johnson', age=30)

The data class version achieves the same functionality with significantly less code. This reduction in boilerplate makes your code easier to read, maintain, and modify.

Adding Custom Methods to Data Classes

Data classes can have custom methods just like regular classes. By default, the @dataclass decorator generates only the initialization, representation, and equality methods, so you're free to add any other functionality:

python

from dataclasses import dataclass
 
@dataclass
class Temperature:
    celsius: float
    
    def to_fahrenheit(self):
        """Convert temperature to Fahrenheit."""
        return (self.celsius * 9/5) + 32
    
    def to_kelvin(self):
        """Convert temperature to Kelvin."""
        return self.celsius + 273.15
    
    def is_freezing(self):
        """Check if temperature is at or below freezing point."""
        return self.celsius <= 0
 
temp = Temperature(25.0)
print(temp)  # Output: Temperature(celsius=25.0)
print(f"{temp.celsius}°C = {temp.to_fahrenheit()}°F")  # Output: 25.0°C = 77.0°F
print(f"Kelvin: {temp.to_kelvin()}")  # Output: Kelvin: 298.15
print(f"Freezing: {temp.is_freezing()}")  # Output: Freezing: False
 
cold_temp = Temperature(-5.0)
print(f"Freezing: {cold_temp.is_freezing()}")  # Output: Freezing: True

Data classes handle the repetitive parts (initialization, representation, and comparison) while letting you add custom methods for your specific needs, as shown with the temperature conversion methods above.

33.4) Default Values and Field Options

Data classes support default values for attributes, allowing you to create instances without specifying every parameter. You can also use the field() function to configure advanced behaviors like excluding attributes from comparisons or controlling how they appear in the string representation.

Providing Default Values

You can assign default values to attributes directly in the class definition. Attributes with defaults must come after attributes without defaults:

python

from dataclasses import dataclass
 
@dataclass
class User:
    username: str
    email: str
    is_active: bool = True  # Default value
    role: str = "user"      # Default value
 
# Create instances with and without defaults
user1 = User("alice", "alice@example.com")
print(user1)  # Output: User(username='alice', email='alice@example.com', is_active=True, role='user')
 
user2 = User("bob", "bob@example.com", False, "admin")
print(user2)  # Output: User(username='bob', email='bob@example.com', is_active=False, role='admin')
 
# Use keyword arguments to override specific defaults
user3 = User("charlie", "charlie@example.com", role="moderator")
print(user3)  # Output: User(username='charlie', email='charlie@example.com', is_active=True, role='moderator')

The ordering rule (attributes without defaults before attributes with defaults) prevents ambiguity in the generated __init__ method. This is the same requirement as for function parameters with default values, which we learned in Chapter 20.
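
To see the rule enforced, the sketch below defines a class that breaks it and catches the resulting error. BrokenUser and FixedUser are names made up for this illustration. The error message wording varies between Python versions, so it is printed rather than quoted here.

```python
from dataclasses import dataclass

ordering_error = None
try:
    @dataclass
    class BrokenUser:
        username: str
        is_active: bool = True
        email: str  # No default, yet it follows a field that has one
except TypeError as exc:
    ordering_error = exc
    print(f"TypeError: {exc}")


# One fix is to move the field with a default to the end.
@dataclass
class FixedUser:
    username: str
    email: str
    is_active: bool = True


print(FixedUser("alice", "alice@example.com"))
```

The error is raised when the class is defined, not when an instance is created, so the mistake surfaces as soon as the module is imported.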

Mutable Default Values and Why They're Not Allowed

Data classes protect you from a common mistake with mutable defaults. If you try to use a mutable object like a list or dictionary directly as a default, you'll get an error:

python

from dataclasses import dataclass
 
# This will raise an error
@dataclass
class ShoppingCart:
    customer: str
    items: list = []  # ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

This error prevents the same problem we saw with function default arguments in Chapter 20, where every call shared one mutable object. Without the check, every ShoppingCart instance would share the same list.
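
To make the risk concrete, this sketch first shows the sharing bug in a plain class (SharedCart is a made-up name), then shows @dataclass rejecting the equivalent definition.

```python
from dataclasses import dataclass


# A plain class with a list as a class-level default: every instance
# that never assigns its own list ends up using this single one.
class SharedCart:
    items = []

    def __init__(self, customer):
        self.customer = customer


alice = SharedCart("Alice")
bob = SharedCart("Bob")
alice.items.append("Book")
print(bob.items)  # ['Book'] - Bob sees Alice's item

# @dataclass refuses the equivalent definition when the class is created.
mutable_default_error = None
try:
    @dataclass
    class ShoppingCart:
        customer: str
        items: list = []
except ValueError as exc:
    mutable_default_error = exc
    print(f"ValueError: {exc}")
```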

Using field() with default_factory for Mutable Defaults

The solution is to use the field() function with default_factory, which creates a new default value for each instance:

python

from dataclasses import dataclass, field
 
@dataclass
class ShoppingCart:
    customer: str
    items: list = field(default_factory=list)  # Correct: New list per instance
 
# Now each instance gets its own list
cart1 = ShoppingCart("Alice")
cart1.items.append("Book")
print(cart1.items)  # Output: ['Book']
 
cart2 = ShoppingCart("Bob")
print(cart2.items)  # Output: [] - Bob has an empty list
 
cart2.items.append("Laptop")
print(cart1.items)  # Output: ['Book'] - Alice's cart unchanged
print(cart2.items)  # Output: ['Laptop'] - Bob's cart independent

The default_factory parameter takes a function (like list, dict, or set) that will be called to create a new default value each time you create an instance without providing that attribute. For example, default_factory=list means Python will call list() to create a new empty list for each instance.
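
Any callable that takes no arguments can serve as a factory, not only built-in types. The following sketch uses a named function, a built-in type, and a lambda. Article and default_tags are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


def default_tags():
    """Hypothetical factory: returns a fresh starter list on every call."""
    return ["new"]


@dataclass
class Article:
    title: str
    tags: list = field(default_factory=default_tags)
    metadata: dict = field(default_factory=dict)
    scores: list = field(default_factory=lambda: [0, 0, 0])


first = Article("Intro")
second = Article("Follow-up")

first.tags.append("featured")
first.metadata["views"] = 10

print(first.tags)       # ['new', 'featured']
print(second.tags)      # ['new']
print(second.metadata)  # {}
print(first.scores is second.scores)  # False

# Passing a value explicitly skips the factory for that field.
third = Article("Custom", tags=["draft"])
print(third.tags)  # ['draft']
```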

Excluding Fields from Comparison

Sometimes you want certain attributes to be excluded from equality comparisons. Use field(compare=False) for this:

python

from dataclasses import dataclass, field
from datetime import datetime
 
@dataclass
class LogEntry:
    message: str
    level: str
    timestamp: datetime = field(compare=False)  # Don't compare timestamps
 
# Create two log entries with the same message but different times
entry1 = LogEntry("User logged in", "INFO", datetime(2024, 1, 15, 10, 30))
entry2 = LogEntry("User logged in", "INFO", datetime(2024, 1, 15, 10, 35))
 
# They're equal because timestamp is excluded from comparison
print(entry1 == entry2)  # Output: True
 
# But they have different timestamps
print(entry1.timestamp)  # Output: 2024-01-15 10:30:00
print(entry2.timestamp)  # Output: 2024-01-15 10:35:00

This is useful when you have metadata fields (like timestamps, IDs, or internal counters) that shouldn't affect whether two instances are considered equal.

Excluding Fields from Representation

You can also exclude fields from the string representation using field(repr=False):

python

from dataclasses import dataclass, field
 
@dataclass
class Account:
    username: str
    email: str
    password: str = field(repr=False)  # Don't show password in repr
 
account = Account("alice", "alice@example.com", "secret123")
print(account)  # Output: Account(username='alice', email='alice@example.com')
# Password is not shown, but it's still stored
print(account.password)  # Output: secret123

This is particularly useful for sensitive data like passwords, API keys, or large data structures that would clutter the representation.

Making Data Classes Immutable with frozen=True

By default, data class instances are mutable—you can change their attributes after creation. If you want immutable instances (like tuples), use frozen=True:

python

from dataclasses import dataclass
 
@dataclass(frozen=True)
class Point:
    x: float
    y: float
 
point = Point(3.0, 4.0)
print(point)  # Output: Point(x=3.0, y=4.0)
 
# Attempting to modify raises an error
try:
    point.x = 5.0
except AttributeError as e:
    print(f"Error: {e}")  # Output: Error: cannot assign to field 'x'

Frozen data classes are useful when you want to ensure data integrity or use instances as dictionary keys. Dictionary keys must be hashable, and hashing is only safe for objects whose compared values never change. Because a frozen data class also gets a generated __hash__ method (as long as eq is left at its default of True), its instances are hashable:

python

from dataclasses import dataclass
 
@dataclass(frozen=True)
class Coordinate:
    latitude: float
    longitude: float
 
# Frozen instances can be dictionary keys
locations = {
    Coordinate(40.7128, -74.0060): "New York",
    Coordinate(51.5074, -0.1278): "London",
    Coordinate(35.6762, 139.6503): "Tokyo"
}
 
nyc = Coordinate(40.7128, -74.0060)
print(locations[nyc])  # Output: New York
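
The contrast with a non-frozen data class can be sketched as follows. MutablePoint and FrozenPoint are illustrative names. The behavior shown assumes the default eq=True setting.

```python
from dataclasses import dataclass


@dataclass
class MutablePoint:
    x: float
    y: float


@dataclass(frozen=True)
class FrozenPoint:
    x: float
    y: float


# With eq=True and no frozen=True, __hash__ is set to None,
# so instances cannot be hashed.
try:
    hash(MutablePoint(1.0, 2.0))
except TypeError as exc:
    print(f"TypeError: {exc}")

# Equal frozen instances hash the same, so a set keeps only one of them.
a = FrozenPoint(1.0, 2.0)
b = FrozenPoint(1.0, 2.0)
print(hash(a) == hash(b))  # True
print(len({a, b}))         # 1
```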

33.5) Custom Initialization with __post_init__

Sometimes you need to perform additional setup after the generated __init__ method runs. The __post_init__ method is called automatically after initialization, allowing you to validate data, compute derived attributes, or perform other setup tasks.

Basic __post_init__ Usage

The __post_init__ method is called after all attributes have been set by the generated __init__:

python

from dataclasses import dataclass
 
@dataclass
class Rectangle:
    width: float
    height: float
    area: float = 0.0  # Will be computed in __post_init__
    
    def __post_init__(self):
        """Calculate area after initialization."""
        self.area = self.width * self.height
 
rect = Rectangle(5.0, 3.0)
print(rect)  # Output: Rectangle(width=5.0, height=3.0, area=15.0)
print(f"Area: {rect.area}")  # Output: Area: 15.0

The __post_init__ method has access to all the instance attributes that were set during initialization. This is useful for computing derived values that depend on multiple attributes.

Validating Data in __post_init__

A common use of __post_init__ is to validate that the provided data meets certain requirements:

python

from dataclasses import dataclass
 
@dataclass
class BankAccount:
    account_number: str
    balance: float
    
    def __post_init__(self):
        """Validate account data."""
        if self.balance < 0:
            raise ValueError("Balance cannot be negative")
 
# Valid account
account1 = BankAccount("ACC001", 1000.0)
print(account1)  # Output: BankAccount(account_number='ACC001', balance=1000.0)
 
# Invalid account - negative balance
try:
    account2 = BankAccount("ACC002", -500.0)
except ValueError as e:
    print(f"Error: {e}")  # Output: Error: Balance cannot be negative

This validation ensures that instances are in a valid state when they are created. If the data doesn't meet the requirements, __post_init__ raises an exception before the constructor call returns, so the calling code never receives an invalid object.

Using __post_init__ with field(init=False)

Sometimes you want an attribute that's computed in __post_init__ but shouldn't be a parameter in __init__. Use field(init=False) for this:

python

from dataclasses import dataclass, field
import math
 
@dataclass
class Circle:
    radius: float
    area: float = field(init=False)  # Not a parameter in __init__
    circumference: float = field(init=False)
    
    def __post_init__(self):
        """Compute area and circumference from radius."""
        self.area = math.pi * self.radius ** 2
        self.circumference = 2 * math.pi * self.radius
 
# Only radius is required during initialization
circle = Circle(5.0)
print(circle)  # Output: Circle(radius=5.0, area=78.53981633974483, circumference=31.41592653589793)
print(f"Area: {circle.area:.2f}")  # Output: Area: 78.54
print(f"Circumference: {circle.circumference:.2f}")  # Output: Circumference: 31.42

This pattern is useful when you have attributes that are always computed from other attributes and should never be set directly during initialization.
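
Two consequences of this pattern are worth checking. The sketch below uses a shortened Circle with only the area field: passing area to the constructor is rejected, and because __post_init__ runs only once, the stored value is not refreshed when radius changes afterwards.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Circle:
    radius: float
    area: float = field(init=False)

    def __post_init__(self):
        self.area = math.pi * self.radius ** 2


# area is not an __init__ parameter, so passing it is rejected.
init_error = None
try:
    Circle(5.0, area=1.0)
except TypeError as exc:
    init_error = exc
    print(f"TypeError: {exc}")

# __post_init__ runs once, at construction time. Changing radius
# later does not recompute area.
circle = Circle(1.0)
circle.radius = 2.0
print(circle.area == math.pi)  # True: still the area for radius 1.0
```

If a derived value has to track later changes, a property that computes it on each access is one alternative.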

Data classes represent a modern Python feature that reduces boilerplate while maintaining the full power of classes. They're particularly valuable for creating clean, readable code when working with structured data. As you continue learning Python, you'll find data classes becoming a natural choice for many data-centric programming tasks, complementing the regular classes you learned in Chapters 30-32.
