Using Model Refinement¶

The refine_data_model feature allows you to modify existing Pydantic models using natural language instructions. This powerful capability lets you evolve your data models as requirements change without having to manually rewrite them.

Basic Usage¶

from structx import Extractor
from pydantic import BaseModel

# Initialize the extractor
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key"
)

# Original model
class UserProfile(BaseModel):
    name: str
    email: str
    age: int

# Refine the model
refined_model = extractor.refine_data_model(
    model=UserProfile,
    instructions="""
    1. Add a 'phone_number' field as a string in the format '123-456-7890'
    2. Change 'age' to 'birth_date' using datetime type
    3. Add validation to ensure email contains '@'
    """,
    model_name="EnhancedUserProfile"  # Optional custom name
)

# Create an instance of the refined model
user = refined_model(
    name="John Doe",
    email="[email protected]",
    birth_date="1990-01-01",
    phone_number="123-456-7890"
)

print(user)
# EnhancedUserProfile(name='John Doe', email='[email protected]', birth_date=datetime.datetime(1990, 1, 1, 0, 0), phone_number='123-456-7890')

How It Works¶

The refine_data_model method:

Takes an existing Pydantic model and natural language instructions
Analyzes the model structure and the requested changes
Generates a new model with the specified modifications
Ensures proper validation rules are applied
Returns the refined model ready for use

Example Use Cases¶

Adding New Fields¶

class Product(BaseModel):
    name: str
    price: float
    category: str

enhanced_product = extractor.refine_data_model(
    model=Product,
    instructions="Add an 'in_stock' boolean field and a 'tags' field that accepts a list of strings"
)

# Result: Product with name, price, category, in_stock, and tags fields

Modifying Field Types and Validation¶

class Order(BaseModel):
    id: str
    items: List[str]
    total: float

validated_order = extractor.refine_data_model(
    model=Order,
    instructions="""
    1. Make 'id' follow the pattern 'ORD-' followed by 6 digits
    2. Change 'items' to accept a list of dictionaries with 'product_id' and 'quantity' fields
    3. Ensure 'total' is always positive
    """
)

# Result: Order with validated id, structured items list, and positive total

Removing Fields¶

class UserSettings(BaseModel):
    user_id: str
    preferences: Dict[str, str]
    last_login: datetime
    created_at: datetime
    updated_at: datetime

simplified_settings = extractor.refine_data_model(
    model=UserSettings,
    instructions="Remove the 'created_at' and 'updated_at' fields"
)

# Result: UserSettings without created_at and updated_at fields

Complex Transformations¶

class SimpleAddress(BaseModel):
    street: str
    city: str
    country: str

detailed_address = extractor.refine_data_model(
    model=SimpleAddress,
    instructions="""
    1. Split 'street' into 'street_name' and 'street_number'
    2. Add a 'postal_code' field with appropriate validation for postal codes
    3. Add a 'state_province' field that's required for US and Canada but optional otherwise
    4. Make 'country' use a two-letter country code format
    """
)

# Result: A more detailed address model with proper validation

Best Practices¶

Provide Clear Instructions¶

Be specific about what changes you want to make. Include:

Fields to add, modify, or remove
Validation requirements
Type changes
Default values if needed

Validate the Results¶

Always check the generated model to ensure it meets your requirements:

# Print the model schema
print(refined_model.model_json_schema())

# Test with valid and invalid data
try:
    invalid_user = refined_model(
        name="John Doe",
        email="invalid-email",  # Missing '@'
        birth_date="1990-01-01",
        phone_number="123-456-7890"
    )
except ValueError as e:
    print(f"Validation works: {e}")

Provide Context When Needed¶

For complex refinements, providing context helps the model understand your intent:

medical_record = extractor.refine_data_model(
    model=PatientRecord,
    instructions="""
    Context: We're updating our medical records system to comply with new regulations.

    Changes needed:
    1. Add a 'consent_given' boolean field that defaults to False
    2. Make 'patient_id' follow the new format 'PAT-' followed by 8 digits
    3. Add an 'emergency_contact' field with name and phone number
    4. Ensure all dates use ISO format
    """
)

Advanced Features¶

Custom Model Names¶

You can specify a custom name for the refined model:

admin_user = extractor.refine_data_model(
    model=User,
    instructions="Add admin-specific fields like 'permissions' and 'access_level'",
    model_name="AdminUser"
)

Working with Nested Models¶

The refinement process can handle nested models:

class Address(BaseModel):
    street: str
    city: str
    country: str

class Person(BaseModel):
    name: str
    address: Address

enhanced_person = extractor.refine_data_model(
    model=Person,
    instructions="""
    1. Add 'email' and 'phone' fields to the Person model
    2. Add 'postal_code' to the nested Address model
    """
)

Limitations¶

Complex custom validators might need manual adjustment
Very specialized domain-specific validations may require additional guidance
The quality of the refinement depends on the clarity of your instructions

Conclusion¶

Model refinement provides a powerful way to evolve your data models using natural language. It's particularly useful for:

Rapid prototyping
Adapting models to changing requirements
Adding validation without manual coding
Exploring different model structures

By combining the flexibility of natural language with the type safety of Pydantic, refine_data_model helps you maintain robust data models with minimal effort.

Next Steps¶

Learn about Token Usage Tracking to monitor resource consumption
Explore Custom Models for creating specialized extraction models
Try Multiple Queries for complex extraction scenarios
See how to use Async Operations for better performance