Basic Extraction¶
This guide covers the fundamentals of data extraction with structx.
Extraction Process¶
When you use structx to extract data, the following happens:
1. Query Refinement: The query is expanded and refined for better extraction
2. Model Generation: A Pydantic model is dynamically generated based on the query
3. Data Extraction: The model is used to extract structured data from the text
4. Result Collection: Results are collected and returned as an ExtractionResult object
Simplified Workflow with Provided Model¶
When you provide a data model, the workflow is optimized:
1. Model-Driven Guide Generation: A guide is generated based on the model structure and available data columns
2. Data Extraction: The provided model is used to extract structured data
3. Result Collection: Results are collected and returned as an ExtractionResult object
This workflow is more efficient as it skips the query analysis step and uses the model structure to guide the extraction process. The guide generation focuses on mapping model fields to available data columns and ensuring proper data type handling.
from pydantic import BaseModel

class Incident(BaseModel):
    date: str
    system: str
    severity: str

# Extract using the provided model
result = extractor.extract(
    data=df,
    query="extract incident details",
    model=Incident
)
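Extracted items are plain Pydantic instances of the model you provide, so they behave like any other Pydantic object. A quick standalone illustration with the Incident model above (constructed by hand here, not via the extractor; the field values are made up):

```python
from pydantic import BaseModel

class Incident(BaseModel):
    date: str
    system: str
    severity: str

# What one extracted row might look like as a model instance
incident = Incident(date="2024-01-15", system="server-01", severity="high")
print(incident.severity)      # plain attribute access
print(incident.model_dump())  # dict form, convenient for building DataFrames
```
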
Extraction Query¶
The extraction query is a natural language description of what you want to extract. Be as specific as possible:
# Simple query
query = "extract incident dates and affected systems"
# More specific query
query = """
extract incident information including:
- date and time of occurrence
- affected system components
- severity level (high, medium, low)
- resolution steps taken
"""
Input Data Types¶
structx supports various input data types:
DataFrames¶
import pandas as pd
df = pd.DataFrame({
    "description": [
        "System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
        "Database backup failure occurred on 2024-01-20 03:00."
    ]
})

result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems"
)
Files¶
# CSV files
result = extractor.extract(
    data="logs.csv",
    query="extract incident dates and affected systems"
)

# Excel files
result = extractor.extract(
    data="reports.xlsx",
    query="extract incident dates and affected systems"
)

# JSON files
result = extractor.extract(
    data="data.json",
    query="extract incident dates and affected systems"
)
Raw Text¶
text = """
System check on 2024-01-15 detected high CPU usage (92%) on server-01.
Database backup failure occurred on 2024-01-20 03:00.
"""
result = extractor.extract(
    data=text,
    query="extract incident dates and affected systems"
)
Output Formats¶
Model Instances (Default)¶
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=False  # Default
)

# Access as list of model instances
for item in result.data:
    print(f"Date: {item.incident_date}")
    print(f"System: {item.affected_system}")
DataFrame¶
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=True
)

# Access as DataFrame
print(result.data.head())
Nested Data¶
For nested data structures, you can choose to flatten them:
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=True,
    expand_nested=True  # Flatten nested structures
)
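Conceptually, flattening turns nested fields into dotted column names. pandas' `json_normalize` does something very similar and illustrates the idea (this is only an illustration of the concept, not structx's internal code; the records are made up):

```python
import pandas as pd

# Nested records, shaped like serialized extraction results might be
records = [
    {"date": "2024-01-15", "incident": {"system": "server-01", "severity": "high"}},
    {"date": "2024-01-20", "incident": {"system": "db-backup", "severity": "medium"}},
]

# Flatten nested dicts into columns such as "incident.system"
flat = pd.json_normalize(records)
print(flat.columns.tolist())  # ['date', 'incident.system', 'incident.severity']
```
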
Working with Results¶
The extract method returns an ExtractionResult object with:
- data: Extracted data (DataFrame or list of model instances)
- failed: DataFrame with failed extractions
- model: Generated or provided model class
- success_count: Number of successful extractions
- failure_count: Number of failed extractions
- success_rate: Success rate as a percentage
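The counts and the rate fit together in the obvious way; a one-line illustration of the relationship (with made-up counts, not structx's implementation):

```python
# Hypothetical counts from an extraction run
success_count = 9
failure_count = 1

# Success rate as a percentage of all attempted rows
success_rate = 100 * success_count / (success_count + failure_count)
print(f"{success_rate:.1f}%")  # 90.0%
```
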
# Check extraction statistics
print(f"Extracted {result.success_count} items")
print(f"Failed {result.failure_count} items")
print(f"Success rate: {result.success_rate:.1f}%")
# Access the model schema
print(result.model.model_json_schema())
Error Handling¶
Failed extractions are collected in the failed DataFrame, so you can inspect what went wrong and decide how to handle those rows.
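A sketch of how you might inspect such a DataFrame. The column names below are hypothetical placeholders, since the actual columns in `result.failed` depend on structx:

```python
import pandas as pd

# Stand-in for result.failed; treat these column names as assumptions
failed = pd.DataFrame({
    "description": ["Corrupted log entry ###"],
    "error": ["Could not parse incident date"],
})

# Report each failed row before deciding whether to clean and retry it
if not failed.empty:
    for _, row in failed.iterrows():
        print(f"Failed row: {row['description']!r} ({row['error']})")
```
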
Next Steps¶
- Learn about Custom Models for specific extraction needs
- Try Model Refinement to modify data models with natural language
- Learn about retry mechanisms for robust error handling
- Explore Unstructured Text handling for documents
- See how to use Multiple Queries for complex extractions