# Custom Models

While structx dynamically generates models based on your queries, you can also use your own custom Pydantic models for extraction.
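The examples on this page assume an extractor and a DataFrame are already available. A minimal setup might look like the following sketch, based on the quick-start pattern; the model name, API key, and sample data are placeholders, so adjust them to your environment:

```python
import pandas as pd

from structx import Extractor

# Placeholder provider settings; substitute your own model and credentials
extractor = Extractor.from_litellm(
    model="gpt-4o-mini",
    api_key="your-api-key",
)

# Hypothetical sample data used by the examples below
df = pd.DataFrame(
    {
        "description": [
            "System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
        ]
    }
)
```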
## Using Custom Models

### Define Your Model
```python
from pydantic import BaseModel, Field
from typing import List, Optional


class Metric(BaseModel):
    name: str = Field(description="Name of the metric")
    value: float = Field(description="Value of the metric")
    unit: Optional[str] = Field(default=None, description="Unit of measurement")


class Incident(BaseModel):
    timestamp: str = Field(description="When the incident occurred")
    system: str = Field(description="Affected system")
    severity: str = Field(description="Severity level")
    metrics: List[Metric] = Field(default_factory=list, description="Related metrics")
    resolution: Optional[str] = Field(default=None, description="Resolution steps")
```
### Extract with Custom Model
```python
result = extractor.extract(
    data=df,
    query="extract incident information including timestamp, system, severity, metrics, and resolution",
    model=Incident,
)

# Access the extracted data
for item in result.data:
    print(f"Incident at {item.timestamp} on {item.system}")
    print(f"Severity: {item.severity}")
    for metric in item.metrics:
        print(f"- {metric.name}: {metric.value} {metric.unit or ''}")
    if item.resolution:
        print(f"Resolution: {item.resolution}")
```
## Reusing Generated Models
You can also reuse models generated from previous extractions:
```python
# First extraction generates a model
result1 = extractor.extract(
    data=df1,
    query="extract incident dates and affected systems",
)

# Reuse the model for another extraction
result2 = extractor.extract(
    data=df2,
    query="extract incident information",
    model=result1.model,
)
```
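Because `result1.model` is a regular Pydantic class, you can inspect it before reusing it. A quick sanity check, using plain Pydantic rather than anything structx-specific:

```python
# Inspect the generated model before reusing it
print(result1.model.__name__)
print(result1.model.model_json_schema())
```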
## Generating Models Without Extraction

You can generate a model without performing extraction using `get_schema`:
```python
# Generate a model based on a query and sample text
IncidentModel = extractor.get_schema(
    query="extract incident dates, affected systems, and resolution steps",
    sample_text="System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
)

# Inspect the model
print(IncidentModel.model_json_schema())

# Use the model for extraction
result = extractor.extract(
    data=df,
    query="extract incident information",
    model=IncidentModel,
)
```
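Since the returned model is a standard Pydantic class, you can also persist its JSON schema, for example for review or versioning. This sketch uses only plain Pydantic and the standard library:

```python
import json

# Save the generated schema alongside your pipeline code
with open("incident_schema.json", "w") as f:
    json.dump(IncidentModel.model_json_schema(), f, indent=2)
```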
## Extending Generated Models

You can extend generated models with additional fields or validation, as in the sketch below.
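One straightforward approach is plain Pydantic subclassing. This sketch builds on the `IncidentModel` generated earlier; the `reviewed_by` field and the `affected_systems` field name are illustrative, since the exact fields of a generated model depend on your query:

```python
from typing import List

from pydantic import Field, field_validator


class ExtendedIncidentModel(IncidentModel):
    # New field not present in the generated model (illustrative)
    reviewed_by: str = Field(default="", description="Analyst who reviewed the incident")

    # check_fields=False because 'affected_systems' is a hypothetical field
    # name; the validator is simply skipped if the field does not exist
    @field_validator("affected_systems", check_fields=False)
    @classmethod
    def non_empty_systems(cls, v: List[str]) -> List[str]:
        if not v:
            raise ValueError("affected_systems must not be empty")
        return v
```

The extended model can then be passed to `extractor.extract()` like any other custom model.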
For more advanced model modifications, you can also use the Model Refinement feature to update your models using natural language instructions:
```python
# Generate a base model
IncidentModel = extractor.get_schema(
    query="extract incident dates and affected systems",
    sample_text="Sample text here",
)

# Refine it with natural language
EnhancedIncidentModel = extractor.refine_data_model(
    model=IncidentModel,
    instructions="""
    1. Add a 'severity' field with allowed values: 'low', 'medium', 'high'
    2. Make incident_date a required field
    3. Add validation for affected_systems to ensure it's a non-empty list
    """,
)

# Check token usage
usage = EnhancedIncidentModel.usage
print(f"Total tokens used: {usage.total_tokens}")
print(f"By step: {[(s.name, s.tokens) for s in usage.steps]}")
```
## Model Validation
Pydantic models provide built-in validation:
```python
# Create an instance with validation
try:
    incident = Incident(
        timestamp="2024-01-15 14:30",
        system="server-01",
        severity="high",
        metrics=[
            {"name": "CPU Usage", "value": 92, "unit": "%"}
        ],
    )
    print("Valid incident:", incident)
except Exception as e:
    print("Validation error:", e)
```
## Best Practices
- Add Field Descriptions: Always include descriptions for your fields to guide the extraction
- Use Type Hints: Proper type hints help ensure correct extraction
- Set Default Values: Use defaults for optional fields
- Add Validation: Include validation rules for better data quality (see the sketch after this list)
- Keep Models Focused: Create models that focus on specific extraction tasks
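For example, a validation rule can be attached directly to a custom model with a Pydantic validator. This minimal sketch builds on the `Incident` model defined above:

```python
from pydantic import field_validator


class StrictIncident(Incident):
    # Restrict severity to a known set of values
    @field_validator("severity")
    @classmethod
    def check_severity(cls, v: str) -> str:
        allowed = {"low", "medium", "high"}
        if v.lower() not in allowed:
            raise ValueError(f"severity must be one of {sorted(allowed)}")
        return v.lower()
```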
## Next Steps
- Learn about Model Refinement for updating models with natural language
- Explore Unstructured Text handling
- See how to use Multiple Queries for complex extractions
- Try Async Operations for better performance