Basic Extraction¶
This guide covers the fundamentals of data extraction with structx
.
Extraction Process¶
When you use structx
to extract data, the following happens:
- Query Refinement: The query is expanded and refined for better extraction
- Model Generation: A Pydantic model is dynamically generated based on the query
- Data Extraction: The model is used to extract structured data from the text
- Result Collection: Results are collected and returned as an
ExtractionResult
object
Standard Extraction Flow¶
View Standard Extraction Flow Diagram
graph LR
A[Raw Query] --> B[Query Refinement]
B --> C[Schema Generation]
C --> D[Model Creation]
D --> E[Data Extraction]
E --> F[Result Collection]
F --> G[ExtractionResult]
subgraph "LLM Operations"
B1[Analyze Query] --> B2[Generate Context]
B2 --> B3[Refine Query]
C1[Infer Schema] --> C2[Generate Fields]
C2 --> C3[Define Types]
end
B --> B1
C --> C1
Simplified Workflow with Provided Model¶
When you provide a data model, the workflow is optimized:
- Model-Driven Guide Generation: A guide is generated based on the model structure and available data columns
- Data Extraction: The provided model is used to extract structured data
- Result Collection: Results are collected and returned as an
ExtractionResult
object
This workflow is more efficient as it skips the query analysis step and uses the model structure to guide the extraction process. The guide generation focuses on mapping model fields to available data columns and ensuring proper data type handling.
Custom Model Flow¶
View Custom Model Flow Diagram
graph LR
A[Custom Model] --> B[Model Analysis]
B --> C[Field Mapping]
C --> D[Guide Generation]
D --> E[Direct Extraction]
E --> F[ExtractionResult]
subgraph "Model Processing"
B1[Schema Analysis] --> B2[Field Discovery]
B2 --> B3[Type Inference]
C1[Column Mapping] --> C2[Field Alignment]
C2 --> C3[Validation Rules]
end
B --> B1
C --> C1
from pydantic import BaseModel, Field
from typing import List
class Invoice(BaseModel):
invoice_number: str = Field(..., description="The invoice number")
total_amount: float = Field(..., description="The total amount of the invoice")
line_items: List[str] = Field(..., description="A list of line items from the invoice")
# Extract using the provided model
result = extractor.extract(
data="scripts/example_input/S0305SampleInvoice.pdf",
model=Invoice
)
Extraction Query¶
The extraction query is a natural language description of what you want to extract. Be as specific as possible:
# Simple query for a legal document
query = "extract the parties, effective date, and governing law"
# More specific query for an invoice
query = """
extract the following details from the invoice:
- invoice number
- total amount due
- list of all line items with their descriptions and costs
- payment due date
"""
Input Data Types¶
structx
supports various input data types, but for legal and financial
documents, you'll primarily use file paths.
Files¶
# PDF invoice
result = extractor.extract(
data="scripts/example_input/S0305SampleInvoice.pdf",
query="extract invoice number, total amount, and line items"
)
# DOCX contract
result = extractor.extract(
data="scripts/example_input/free-consultancy-agreement.docx",
query="extract the parties, effective date, and payment terms"
)
Output Formats¶
Model Instances (Default)¶
result = extractor.extract(
data="scripts/example_input/S0305SampleInvoice.pdf",
query="extract invoice number and total amount",
)
# Access as list of model instances
for item in result.data:
print(f"Invoice Number: {item.invoice_number}")
print(f"Total Amount: {item.total_amount}")
DataFrame¶
result = extractor.extract(
data="scripts/example_input/S0305SampleInvoice.pdf",
query="extract invoice number and total amount",
return_df=True
)
# Access as DataFrame
print(result.data.head())
Nested Data¶
For nested data structures, you can choose to flatten them:
result = extractor.extract(
data=df,
query="extract incident dates and affected systems",
return_df=True,
expand_nested=True # Flatten nested structures
)
Working with Results¶
The extract
method returns an ExtractionResult
object with:
data
: Extracted data (DataFrame or list of model instances)failed
: DataFrame with failed extractionsmodel
: Generated or provided model classsuccess_count
: Number of successful extractionsfailure_count
: Number of failed extractionssuccess_rate
: Success rate as a percentage
# Check extraction statistics
print(f"Extracted {result.success_count} items")
print(f"Failed {result.failure_count} items")
print(f"Success rate: {result.success_rate:.1f}%")
# Access the model schema
print(result.model.model_json_schema())
Error Handling¶
Failed extractions are collected in the failed
DataFrame:
Next Steps¶
- Learn about Custom Models for specific extraction needs
- Try Model Refinement to modify data models with natural language
- Learn about retry mechanisms for robust error handling
- Explore Unstructured Text handling for documents
- See how to use Multiple Queries for complex extractions