33. Data Classes for Simple Structured Data
In Chapter 30, we learned how to create classes to define our own types. We wrote __init__ methods to initialize instances, __repr__ methods to display them, and __eq__ methods to compare them. While this approach works perfectly, it involves writing a lot of repetitive code, especially when a class primarily exists to store data.
Python's data classes provide a cleaner, more concise way to create classes that are mainly containers for data. By using the @dataclass decorator, Python automatically generates common methods like __init__, __repr__, and __eq__ based on the class attributes you define. This reduces boilerplate code and makes your intentions clearer.
33.1) What Data Classes Are and When to Use Them
A data class is a class designed primarily to store data values. Instead of manually writing initialization and comparison methods, you define the attributes your class should have, and Python generates the necessary methods automatically.
Why Data Classes Matter
Consider a regular class for representing a book:
class Book:
    def __init__(self, title, author, year):
        self.title = title
        self.author = author
        self.year = year

    def __repr__(self):
        return f"Book(title={self.title!r}, author={self.author!r}, year={self.year})"

    def __eq__(self, other):
        if not isinstance(other, Book):
            return False
        return (self.title == other.title and
                self.author == other.author and
                self.year == other.year)

book1 = Book("1984", "George Orwell", 1949)
print(book1)  # Output: Book(title='1984', author='George Orwell', year=1949)
book2 = Book("1984", "George Orwell", 1949)
print(book1 == book2)  # Output: True

This works, but notice how much code we wrote just to store three pieces of information. The __init__, __repr__, and __eq__ methods follow predictable patterns: they simply handle the attributes we defined.
Data classes eliminate this repetition. They're particularly useful when:
- Your class primarily stores data rather than implementing complex behavior
- You need standard methods like initialization, string representation, and equality comparison
- You want clearer, more maintainable code with less boilerplate
- You're creating configuration objects, data transfer objects, or simple records
Data classes don't replace regular classes; they complement them. Prefer regular classes when behavior dominates: elaborate initialization logic, complex methods, or deep inheritance hierarchies. Use data classes when you mainly need a structured container for related data.
The Relationship Between Data Classes and Regular Classes
Data classes are still regular Python classes. They support all the features we learned in Chapters 30-32: methods, properties, inheritance, and special methods. The @dataclass decorator simply automates the creation of common methods, saving you from writing repetitive code.
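As a small preview of that point (the decorator syntax itself is explained in Section 33.2), the sketch below combines a data class with an ordinary method, a property, and inheritance. The Shape and Square classes are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    name: str

    def describe(self) -> str:
        # An ordinary method, written by hand next to the generated ones
        return f"{self.name} with area {self.area}"

    @property
    def area(self) -> float:
        # A generic shape has no size in this sketch
        return 0.0


@dataclass
class Square(Shape):
    side: float

    @property
    def area(self) -> float:
        # Overrides the base property, as in any regular subclass
        return self.side ** 2


sq = Square("unit square", 2.0)
print(sq)                     # Square(name='unit square', side=2.0)
print(sq.describe())          # unit square with area 4.0
print(isinstance(sq, Shape))  # True
```

The property is not annotated, so it is not treated as a field and does not appear in the generated __init__ or __repr__.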
33.2) Creating Data Classes with @dataclass
To create a data class, you import the dataclass decorator from the dataclasses module and apply it to your class definition. Inside the class, you define class attributes with type annotations that specify what data the class should hold.
Basic Data Class Syntax
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    student_id: int
    gpa: float

# Create instances
alice = Student("Alice Johnson", 12345, 3.8)
bob = Student("Bob Smith", 12346, 3.5)
print(alice)  # Output: Student(name='Alice Johnson', student_id=12345, gpa=3.8)
print(bob)  # Output: Student(name='Bob Smith', student_id=12346, gpa=3.5)

Let's break down what the @dataclass decorator does:
- @dataclass: Applying this decorator makes Python automatically write the __init__, __repr__, and __eq__ methods for you
- Automatic __init__: Python creates an initialization method that accepts these three parameters in the order they're defined and assigns them to instance attributes
- Automatic __repr__: Python creates a string representation showing the class name and all attribute values
- Automatic __eq__: Python creates an equality comparison method that compares all attributes
- Converts type annotations into instance attributes: In a regular class, writing name: str in the class body without a value only records an annotation; it does not create an attribute. The @dataclass decorator reads these annotations and generates an __init__ that sets each one on the instance, so each instance gets its own name, student_id, and gpa attributes.
The key difference from regular classes:
# Regular class - bare annotations are only recorded in __annotations__;
# no attributes are created and no __init__ is generated
class RegularStudent:
    name: str
    student_id: int

# Data class - the annotations become fields that the generated __init__
# sets on every instance (each instance has its own values)
@dataclass
class DataStudent:
    name: str
    student_id: int

Understanding Type Annotations in Data Classes
In data classes, type annotations define the attributes and document their expected types:
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    in_stock: bool

# Using the correct types as documented
laptop = Product("Laptop", 999.99, True)
print(laptop)  # Output: Product(name='Laptop', price=999.99, in_stock=True)

# Python doesn't enforce types - this runs without error
macbook = Product("Macbook", "expensive", True)
print(macbook)  # Output: Product(name='Macbook', price='expensive', in_stock=True)

# But using wrong types will cause problems later
# (each failing line below raises on its own):
discounted = laptop.price * 0.9  # Works: about 899.99
discounted = macbook.price * 0.9  # TypeError: can't multiply sequence by non-int of type 'float'
tax = laptop.price + 50  # Works: about 1049.99
tax = macbook.price + 50  # TypeError: can only concatenate str (not "int") to str

Python won't stop you from passing the wrong types when creating a data class instance. The type annotations are primarily documentation: they tell other programmers (and type-checking tools like mypy) what types you expect, but Python doesn't enforce them at runtime. This is consistent with Python's dynamic typing philosophy.
However, following the type annotations makes your code more predictable and easier to debug. When you use the wrong types, errors will appear later when you try to use the data, making bugs harder to trace. Type-checking tools can catch these mismatches before you run your code, helping you find problems early.
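Because the annotations are kept on the class, you can also inspect them at runtime. The sketch below uses dataclasses.fields() to compare each value against its declared type. The type_mismatches helper is a hypothetical name, and it assumes every annotation is a plain class such as str or float; it would not handle annotations like list[int].

```python
from dataclasses import dataclass, fields


@dataclass
class Product:
    name: str
    price: float
    in_stock: bool


def type_mismatches(instance):
    """Return the names of fields whose values don't match their annotations.

    Hypothetical helper: only works when each annotation is a plain class.
    """
    return [
        f.name
        for f in fields(instance)
        if not isinstance(getattr(instance, f.name), f.type)
    ]


laptop = Product("Laptop", 999.99, True)
macbook = Product("Macbook", "expensive", True)

print(type_mismatches(laptop))   # []
print(type_mismatches(macbook))  # ['price']
```

This check is strict: an integer price such as 1000 would also be reported, because an int is not an instance of float. A static type checker is usually the more practical tool.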
Accessing and Modifying Attributes
Data class instances work exactly like regular class instances. You access and modify attributes using dot notation:
from dataclasses import dataclass

@dataclass
class Employee:
    name: str
    position: str
    salary: float

emp = Employee("Sarah Chen", "Software Engineer", 95000.0)

# Access attributes
print(emp.name)  # Output: Sarah Chen
print(emp.position)  # Output: Software Engineer

# Modify attributes
emp.salary = 100000.0
emp.position = "Senior Software Engineer"
print(emp)  # Output: Employee(name='Sarah Chen', position='Senior Software Engineer', salary=100000.0)

Data classes are mutable by default: you can change their attributes after creation. This is different from tuples or named tuples, which are immutable. If you need immutability, you can configure the data class with frozen=True (we'll explore this in Section 33.4).
33.3) Generated Methods: __init__, __repr__, and __eq__
The @dataclass decorator automatically generates three essential methods. Understanding what these methods do helps you use data classes effectively and know when to customize them.
The Generated __init__ Method
The __init__ method initializes a new instance with the provided values. Python generates it based on the order of your attribute definitions:
from dataclasses import dataclass

@dataclass
class Rectangle:
    width: float
    height: float

# The generated __init__ accepts width and height in that order
rect = Rectangle(10.5, 5.0)
print(rect.width)  # Output: 10.5
print(rect.height)  # Output: 5.0

# You can also use keyword arguments
rect2 = Rectangle(height=8.0, width=12.0)
print(rect2.width)  # Output: 12.0
print(rect2.height)  # Output: 8.0

The generated __init__ is equivalent to writing:
def __init__(self, width: float, height: float):
    self.width = width
    self.height = height

This automatic generation saves you from writing repetitive initialization code, especially for classes with many attributes.
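One way to confirm what was generated is to ask Python for the constructor's signature. This is a small sketch using the standard inspect module:

```python
import inspect
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: float
    height: float


# The class signature reflects the generated __init__ (without self)
sig = inspect.signature(Rectangle)
print(list(sig.parameters))            # ['width', 'height']

# The method really is defined on the class itself, not inherited
print("__init__" in vars(Rectangle))   # True
```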
The Generated __repr__ Method
The __repr__ method provides a string representation of the instance that shows all attribute values. This is invaluable for debugging and logging:
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str

point = Point(3.5, 7.2, "A")
print(point)  # Output: Point(x=3.5, y=7.2, label='A')
print(repr(point))  # Output: Point(x=3.5, y=7.2, label='A')

The generated __repr__ follows the convention of showing the class name and all attributes in a format that could be used to recreate the object. This is much more helpful than the default representation you'd get without __repr__: <__main__.Point object at 0x...>.
The Generated __eq__ Method
The __eq__ method enables equality comparison between instances. Two data class instances are considered equal if all their corresponding attributes are equal:
from dataclasses import dataclass

@dataclass
class Color:
    red: int
    green: int
    blue: int

color1 = Color(255, 0, 0)
color2 = Color(255, 0, 0)
color3 = Color(0, 255, 0)
print(color1 == color2)  # Output: True (same RGB values)
print(color1 == color3)  # Output: False (different RGB values)
print(color1 is color2)  # Output: False (different objects in memory)

This automatic equality comparison is based on value equality, not identity. Even though color1 and color2 are different objects in memory (as shown by is), they're considered equal because their attributes match.
The generated __eq__ method compares attributes in the order they're defined:
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    year: int

book1 = Book("1984", "George Orwell", 1949)
book2 = Book("1984", "George Orwell", 1949)
book3 = Book("Animal Farm", "George Orwell", 1945)
print(book1 == book2)  # Output: True (all attributes match)
print(book1 == book3)  # Output: False (title and year differ)

# Comparison with non-Book objects returns False
print(book1 == "1984")  # Output: False
print(book1 == None)  # Output: False

Comparing Generated Methods to Manual Implementation
To appreciate what data classes provide, let's compare the data class version with a manual implementation:
from dataclasses import dataclass

# Data class version (concise)
@dataclass
class PersonData:
    first_name: str
    last_name: str
    age: int

# Equivalent manual version (verbose)
class PersonManual:
    def __init__(self, first_name: str, last_name: str, age: int):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __repr__(self):
        return f"PersonManual(first_name={self.first_name!r}, last_name={self.last_name!r}, age={self.age})"

    def __eq__(self, other):
        if not isinstance(other, PersonManual):
            return False
        return (self.first_name == other.first_name and
                self.last_name == other.last_name and
                self.age == other.age)

# Both work identically
p1 = PersonData("Alice", "Johnson", 30)
p2 = PersonManual("Alice", "Johnson", 30)
print(p1)  # Output: PersonData(first_name='Alice', last_name='Johnson', age=30)
print(p2)  # Output: PersonManual(first_name='Alice', last_name='Johnson', age=30)

The data class version achieves the same functionality with significantly less code. This reduction in boilerplate makes your code easier to read, maintain, and modify.
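One subtle difference is worth knowing. The generated __eq__ only compares instances of exactly the same class, while a hand-written isinstance check also accepts subclasses. The sketch below illustrates this with two hypothetical subclasses, EmployeeData and EmployeeManual, that add no fields of their own.

```python
from dataclasses import dataclass


@dataclass
class PersonData:
    first_name: str
    last_name: str
    age: int


@dataclass
class EmployeeData(PersonData):
    """Hypothetical subclass with the same fields."""


class PersonManual:
    def __init__(self, first_name, last_name, age):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __eq__(self, other):
        # isinstance() accepts subclasses of PersonManual too
        if not isinstance(other, PersonManual):
            return False
        return (self.first_name, self.last_name, self.age) == (
            other.first_name, other.last_name, other.age)


class EmployeeManual(PersonManual):
    """Hypothetical subclass that inherits everything."""


data_equal = PersonData("Alice", "Johnson", 30) == EmployeeData("Alice", "Johnson", 30)
manual_equal = PersonManual("Alice", "Johnson", 30) == EmployeeManual("Alice", "Johnson", 30)

print(data_equal)    # False: the classes differ, so the generated __eq__ declines
print(manual_equal)  # True: the isinstance check lets the subclass through
```

Neither behavior is wrong; which one you want depends on whether a subclass instance should count as "the same kind of thing" in your program.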
Adding Custom Methods to Data Classes
Data classes can have custom methods just like regular classes. The @dataclass decorator only generates the initialization, representation, and equality methods—you're free to add any other functionality:
from dataclasses import dataclass

@dataclass
class Temperature:
    celsius: float

    def to_fahrenheit(self):
        """Convert temperature to Fahrenheit."""
        return (self.celsius * 9/5) + 32

    def to_kelvin(self):
        """Convert temperature to Kelvin."""
        return self.celsius + 273.15

    def is_freezing(self):
        """Check if temperature is at or below freezing point."""
        return self.celsius <= 0

temp = Temperature(25.0)
print(temp)  # Output: Temperature(celsius=25.0)
print(f"{temp.celsius}°C = {temp.to_fahrenheit()}°F")  # Output: 25.0°C = 77.0°F
print(f"Kelvin: {temp.to_kelvin()}")  # Output: Kelvin: 298.15
print(f"Freezing: {temp.is_freezing()}")  # Output: Freezing: False

cold_temp = Temperature(-5.0)
print(f"Freezing: {cold_temp.is_freezing()}")  # Output: Freezing: True

Data classes handle the repetitive parts (initialization, representation, and comparison) while letting you add custom methods for your specific needs, as shown with the temperature conversion methods above.
33.4) Default Values and Field Options
Data classes support default values for attributes, allowing you to create instances without specifying every parameter. You can also use the field() function to configure advanced behaviors like excluding attributes from comparisons or controlling how they appear in the string representation.
Providing Default Values
You can assign default values to attributes directly in the class definition. Attributes with defaults must come after attributes without defaults:
from dataclasses import dataclass

@dataclass
class User:
    username: str
    email: str
    is_active: bool = True  # Default value
    role: str = "user"  # Default value

# Create instances with and without defaults
user1 = User("alice", "alice@example.com")
print(user1)  # Output: User(username='alice', email='alice@example.com', is_active=True, role='user')

user2 = User("bob", "bob@example.com", False, "admin")
print(user2)  # Output: User(username='bob', email='bob@example.com', is_active=False, role='admin')

# Use keyword arguments to override specific defaults
user3 = User("charlie", "charlie@example.com", role="moderator")
print(user3)  # Output: User(username='charlie', email='charlie@example.com', is_active=True, role='moderator')

The ordering rule (attributes without defaults before attributes with defaults) prevents ambiguity in the generated __init__ method. This is the same requirement as for function parameters with default values, which we learned in Chapter 20.
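To see the ordering rule in action, the sketch below deliberately breaks it. The BrokenUser class is a hypothetical example; the error is raised while the class is being defined, before any instance is created.

```python
from dataclasses import dataclass

error = None
try:
    @dataclass
    class BrokenUser:
        username: str
        is_active: bool = True
        email: str  # No default, yet it follows a field that has one
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
print(error)  # The message names the non-default field that is out of place
```

The exact wording of the message can vary between Python versions, so the sketch prints it rather than assuming it.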
Mutable Default Values and Why They're Not Allowed
Data classes protect you from a common mistake with mutable defaults. If you try to use a mutable object like a list or dictionary directly as a default, you'll get an error:
from dataclasses import dataclass

# This will raise an error
@dataclass
class ShoppingCart:
    customer: str
    items: list = []  # ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

This error prevents the same problem we saw with function default arguments in Chapter 20, where all instances would share the same mutable object.
Using field() with default_factory for Mutable Defaults
The solution is to use the field() function with default_factory, which creates a new default value for each instance:
from dataclasses import dataclass, field

@dataclass
class ShoppingCart:
    customer: str
    items: list = field(default_factory=list)  # Correct: New list per instance

# Now each instance gets its own list
cart1 = ShoppingCart("Alice")
cart1.items.append("Book")
print(cart1.items)  # Output: ['Book']

cart2 = ShoppingCart("Bob")
print(cart2.items)  # Output: [] - Bob has an empty list
cart2.items.append("Laptop")
print(cart1.items)  # Output: ['Book'] - Alice's cart unchanged
print(cart2.items)  # Output: ['Laptop'] - Bob's cart independent

The default_factory parameter takes a function (like list, dict, or set) that will be called to create a new default value each time you create an instance without providing that attribute. For example, default_factory=list means Python will call list() to create a new empty list for each instance.
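The factory does not have to be a built-in type; any function that takes no arguments will do. As a sketch, the hypothetical starter_items function below gives every new cart a non-empty starting list.

```python
from dataclasses import dataclass, field


def starter_items():
    """Hypothetical factory: every new cart starts with a free sample."""
    return ["Free sample"]


@dataclass
class GiftCart:
    customer: str
    items: list = field(default_factory=starter_items)
    quantities: dict = field(default_factory=dict)


cart1 = GiftCart("Alice")
cart2 = GiftCart("Bob")
cart1.items.append("Book")

print(cart1.items)                 # ['Free sample', 'Book']
print(cart2.items)                 # ['Free sample']
print(cart1.items is cart2.items)  # False: the factory ran once per instance
```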
Excluding Fields from Comparison
Sometimes you want certain attributes to be excluded from equality comparisons. Use field(compare=False) for this:
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class LogEntry:
    message: str
    level: str
    timestamp: datetime = field(compare=False)  # Don't compare timestamps

# Create two log entries with the same message but different times
entry1 = LogEntry("User logged in", "INFO", datetime(2024, 1, 15, 10, 30))
entry2 = LogEntry("User logged in", "INFO", datetime(2024, 1, 15, 10, 35))

# They're equal because timestamp is excluded from comparison
print(entry1 == entry2)  # Output: True

# But they have different timestamps
print(entry1.timestamp)  # Output: 2024-01-15 10:30:00
print(entry2.timestamp)  # Output: 2024-01-15 10:35:00

This is useful when you have metadata fields (like timestamps, IDs, or internal counters) that shouldn't affect whether two instances are considered equal.
Excluding Fields from Representation
You can also exclude fields from the string representation using field(repr=False):
from dataclasses import dataclass, field

@dataclass
class Account:
    username: str
    email: str
    password: str = field(repr=False)  # Don't show password in repr

account = Account("alice", "alice@example.com", "secret123")
print(account)  # Output: Account(username='alice', email='alice@example.com')

# Password is not shown, but it's still stored
print(account.password)  # Output: secret123

This is particularly useful for sensitive data like passwords, API keys, or large data structures that would clutter the representation.
Making Data Classes Immutable with frozen=True
By default, data class instances are mutable—you can change their attributes after creation. If you want immutable instances (like tuples), use frozen=True:
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

point = Point(3.0, 4.0)
print(point)  # Output: Point(x=3.0, y=4.0)

# Attempting to modify raises an error
# (FrozenInstanceError, a subclass of AttributeError)
try:
    point.x = 5.0
except AttributeError as e:
    print(f"Error: {e}")  # Output: Error: cannot assign to field 'x'

Frozen data classes are useful when you want to ensure data integrity or use instances as dictionary keys (since dictionary keys must be hashable). When a data class is frozen and equality comparison is left enabled, as it is by default, Python also generates a __hash__ method based on the field values, making instances hashable:
from dataclasses import dataclass

@dataclass(frozen=True)
class Coordinate:
    latitude: float
    longitude: float

# Frozen instances can be dictionary keys
locations = {
    Coordinate(40.7128, -74.0060): "New York",
    Coordinate(51.5074, -0.1278): "London",
    Coordinate(35.6762, 139.6503): "Tokyo"
}

nyc = Coordinate(40.7128, -74.0060)
print(locations[nyc])  # Output: New York

33.5) Custom Initialization with __post_init__
Sometimes you need to perform additional setup after the generated __init__ method runs. The __post_init__ method is called automatically after initialization, allowing you to validate data, compute derived attributes, or perform other setup tasks.
Basic __post_init__ Usage
The __post_init__ method is called after all attributes have been set by the generated __init__:
from dataclasses import dataclass

@dataclass
class Rectangle:
    width: float
    height: float
    area: float = 0.0  # Will be computed in __post_init__

    def __post_init__(self):
        """Calculate area after initialization."""
        self.area = self.width * self.height

rect = Rectangle(5.0, 3.0)
print(rect)  # Output: Rectangle(width=5.0, height=3.0, area=15.0)
print(f"Area: {rect.area}")  # Output: Area: 15.0

The __post_init__ method has access to all the instance attributes that were set during initialization. This is useful for computing derived values that depend on multiple attributes.
Validating Data in __post_init__
A common use of __post_init__ is to validate that the provided data meets certain requirements:
from dataclasses import dataclass

@dataclass
class BankAccount:
    account_number: str
    balance: float

    def __post_init__(self):
        """Validate account data."""
        if self.balance < 0:
            raise ValueError("Balance cannot be negative")

# Valid account
account1 = BankAccount("ACC001", 1000.0)
print(account1)  # Output: BankAccount(account_number='ACC001', balance=1000.0)

# Invalid account - negative balance
try:
    account2 = BankAccount("ACC002", -500.0)
except ValueError as e:
    print(f"Error: {e}")  # Output: Error: Balance cannot be negative

This validation ensures that instances are always in a valid state. If the data doesn't meet requirements, the constructor raises an exception and no instance is returned, preventing invalid objects from existing in your program.
Using __post_init__ with field(init=False)
Sometimes you want an attribute that's computed in __post_init__ but shouldn't be a parameter in __init__. Use field(init=False) for this:
from dataclasses import dataclass, field
import math

@dataclass
class Circle:
    radius: float
    area: float = field(init=False)  # Not a parameter in __init__
    circumference: float = field(init=False)

    def __post_init__(self):
        """Compute area and circumference from radius."""
        self.area = math.pi * self.radius ** 2
        self.circumference = 2 * math.pi * self.radius

# Only radius is required during initialization
circle = Circle(5.0)
print(circle)  # Output: Circle(radius=5.0, area=78.53981633974483, circumference=31.41592653589793)
print(f"Area: {circle.area:.2f}")  # Output: Area: 78.54
print(f"Circumference: {circle.circumference:.2f}")  # Output: Circumference: 31.42

This pattern is useful when you have attributes that are always computed from other attributes and should never be set directly during initialization.
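One design consideration: a value computed in __post_init__ is calculated once, so on a mutable data class it does not follow later changes to the attributes it was derived from. The sketch below contrasts that with a property, which recomputes on every access. StoredCircle and PropertyCircle are hypothetical names used only for this comparison.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StoredCircle:
    radius: float
    area: float = field(init=False)

    def __post_init__(self):
        # Computed once, when the instance is created
        self.area = math.pi * self.radius ** 2


@dataclass
class PropertyCircle:
    radius: float

    @property
    def area(self) -> float:
        # Recomputed from the current radius on every access
        return math.pi * self.radius ** 2


stored = StoredCircle(1.0)
live = PropertyCircle(1.0)
stored.radius = 2.0
live.radius = 2.0

print(stored.area == math.pi)      # True: still the area for radius 1.0
print(live.area == math.pi * 4.0)  # True: reflects the new radius
```

The trade-off is that a property is not a field, so it does not appear in the generated __repr__ or take part in __eq__. Combining __post_init__ with frozen=True is another way to keep a stored value from going stale, though assigning inside __post_init__ on a frozen class needs extra care.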
Data classes represent a modern Python feature that reduces boilerplate while maintaining the full power of classes. They're particularly valuable for creating clean, readable code when working with structured data. As you continue learning Python, you'll find data classes becoming a natural choice for many data-centric programming tasks, complementing the regular classes you learned in Chapters 30-32.