Custom Models¶

While structx dynamically generates models based on your queries, you can also use your own custom Pydantic models for extraction.

Custom Model Processing Flow¶

View Custom Model Processing Flow Diagram

graph LR
    A[Custom Model] --> B[Model Analysis]
    B --> C[Field Mapping]
    C --> D[Guide Generation]
    D --> E[Direct Extraction]
    E --> F[Type Validation]
    F --> G[Result Object]

    subgraph "Model Analysis"
        B1[Extract Schema] --> B2[Analyze Fields]
        B2 --> B3[Identify Types]
        B3 --> B4[Parse Descriptions]
    end

    subgraph "Field Mapping"
        C1[Data Column Detection] --> C2[Field Alignment]
        C2 --> C3[Type Compatibility]
        C3 --> C4[Mapping Strategy]
    end

    subgraph "Extraction Benefits"
        H[Skip Query Analysis]
        I[Skip Model Generation]
        J[Direct Type Safety]
        K[Validation Included]
    end

    B --> B1
    C --> C1
    E --> H

Using Custom Models¶

Define Your Model¶

from pydantic import BaseModel, Field
from typing import List, Optional, date

class Party(BaseModel):
    name: str = Field(description="Name of the party")
    role: str = Field(description="Role of the party (e.g., Client, Consultant)")

class ConsultancyAgreement(BaseModel):
    parties: List[Party] = Field(description="The parties to the agreement")
    effective_date: date = Field(description="The effective date of the agreement")
    governing_law: str = Field(description="The governing law of the agreement")

class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="The total amount of the invoice")
    issue_date: date = Field(description="The date the invoice was issued")

Extract with Custom Model¶

# Extract from a legal document
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx",
    model=ConsultancyAgreement
)

# Access the extracted data
for agreement in result.data:
    print(f"Agreement effective date: {agreement.effective_date}")
    for party in agreement.parties:
        print(f"- Party: {party.name} ({party.role})")
    print(f"Governing Law: {agreement.governing_law}")

# Extract from an invoice
result_invoice = extractor.extract(
    data="scripts/example_input/S0305SampleInvoice.pdf",
    model=Invoice
)

for invoice in result_invoice.data:
    print(f"Invoice Number: {invoice.invoice_number}")
    print(f"Total Amount: {invoice.total_amount}")
    print(f"Issue Date: {invoice.issue_date}")

Reusing Generated Models¶

You can also reuse models generated from previous extractions:

# First extraction generates a model for a contract
result1 = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx",
    query="extract parties and effective date"
)

# Reuse the model for another contract
result2 = extractor.extract(
    data="another_contract.docx",
    model=result1.model
)

Generating Models Without Extraction¶

You can generate a model without performing extraction using get_schema:

# Generate a model based on a query and a sample from a legal document
LegalClauseModel = extractor.get_schema(
    query="extract the termination clause, including notice period and conditions",
    data="scripts/example_input/free-consultancy-agreement.docx"
)

# Inspect the model
print(LegalClauseModel.model_json_schema())

# Use the model for extraction
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx",
    model=LegalClauseModel
)

Extending Generated Models¶

You can extend generated models with additional fields or validation:

For more advanced model modifications, you can also use the Model Refinement feature to update your models using natural language instructions:

# Generate a base model
ContractModel = extractor.get_schema(
    query="extract parties and effective date",
    data="scripts/example_input/free-consultancy-agreement.docx"
)

# Refine it with natural language
EnhancedContractModel = extractor.refine_data_model(
    model=ContractModel,
    refinement_instructions="""
    1. Add a 'governing_law' field of type string.
    2. Add a 'termination_notice_days' field of type integer.
    3. Make the 'parties' field a list of strings.
    """
)

# check token usage
usage = EnhancedContractModel.usage
print(f"Total tokens used: {usage.total_tokens}")
print(f"By step: {[(s.name, s.tokens) for s in usage.steps]}")

Model Validation¶

Pydantic models provide built-in validation:

# Create an instance with validation
try:
    agreement = ConsultancyAgreement(
        parties=[{"name": "Client Corp", "role": "Client"}, {"name": "Consultant LLC", "role": "Consultant"}],
        effective_date="2025-01-01",
        governing_law="State of Delaware"
    )
    print("Valid agreement:", agreement)
except Exception as e:
    print("Validation error:", e)

Best Practices¶

Add Field Descriptions: Always include descriptions for your fields to guide the extraction
Use Type Hints: Proper type hints help ensure correct extraction
Set Default Values: Use defaults for optional fields
Add Validation: Include validation rules for better data quality
Keep Models Focused: Create models that focus on specific extraction tasks

Next Steps¶

Learn about Model Refinement for updating models with natural language
Explore Unstructured Text handling
See how to use Multiple Queries for complex extractions
Try Async Operations for better performance