Basic Extraction
This guide covers the fundamentals of data extraction with structx.
Extraction Process
When you use structx to extract data, the following happens:
- Query Analysis: The system analyzes your query to determine what to extract
- Query Refinement: The query is expanded and refined for better extraction
- Model Generation: A Pydantic model is dynamically generated based on the query
- Data Extraction: The model is used to extract structured data from the text
- Result Collection: Results are collected and returned as an ExtractionResult object
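To make the model generation step more concrete, the sketch below builds a record class at runtime from a list of field names. It uses the standard library's `dataclasses.make_dataclass` as a stand-in for Pydantic model creation, so it only illustrates the idea and is not how structx implements it. The field names and the `IncidentRecord` class name are hypothetical.

```python
from dataclasses import asdict, field, make_dataclass
from typing import Optional

# Hypothetical field list, such as a refined query might describe.
field_specs = [
    ("incident_date", Optional[str]),
    ("affected_system", Optional[str]),
]

# Build the record class at runtime; every field defaults to None so that
# a partially filled record can still be created.
IncidentRecord = make_dataclass(
    "IncidentRecord",
    [(name, typ, field(default=None)) for name, typ in field_specs],
)

record = IncidentRecord(incident_date="2024-01-15", affected_system="server-01")
print(asdict(record))
```

A real Pydantic model would also validate field types, which a plain dataclass does not.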
Extraction Query
The extraction query is a natural language description of what you want to extract. Be as specific as possible:
# Simple query
query = "extract incident dates and affected systems"
# More specific query
query = """
extract incident information including:
- date and time of occurrence
- affected system components
- severity level (high, medium, low)
- resolution steps taken
"""
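When the same kind of specific query is needed in several places, it can be assembled from a list of field hints. The helper below is a minimal sketch: `build_query` is a hypothetical name and not part of structx, and it only produces a string in the same shape as the specific query above.

```python
def build_query(subject: str, fields: list[str]) -> str:
    """Compose a multi-line extraction query from a subject and field hints."""
    lines = [f"extract {subject} including:"]
    lines += [f"- {hint}" for hint in fields]
    return "\n".join(lines)


query = build_query(
    "incident information",
    [
        "date and time of occurrence",
        "affected system components",
        "severity level (high, medium, low)",
        "resolution steps taken",
    ],
)
print(query)
```

The resulting string can be passed as the `query` argument in the same way as a hand-written query.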
Input Data Types
structx supports various input data types:
DataFrames
import pandas as pd

# `extractor` is assumed to be an already configured structx extractor
df = pd.DataFrame({
    "description": [
        "System check on 2024-01-15 detected high CPU usage (92%) on server-01.",
        "Database backup failure occurred on 2024-01-20 03:00."
    ]
})
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems"
)
Files
# CSV files
result = extractor.extract(
    data="logs.csv",
    query="extract incident dates and affected systems"
)
# Excel files
result = extractor.extract(
    data="reports.xlsx",
    query="extract incident dates and affected systems"
)
# JSON files
result = extractor.extract(
    data="data.json",
    query="extract incident dates and affected systems"
)
Raw Text
text = """
System check on 2024-01-15 detected high CPU usage (92%) on server-01.
Database backup failure occurred on 2024-01-20 03:00.
"""
result = extractor.extract(
    data=text,
    query="extract incident dates and affected systems"
)
Output Formats
Model Instances (Default)
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=False  # Default
)
# Access as list of model instances.
# The field names depend on the model generated for the query.
for item in result.data:
    print(f"Date: {item.incident_date}")
    print(f"System: {item.affected_system}")
DataFrame
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=True
)
# Access as DataFrame
print(result.data.head())
Nested Data
For nested data structures, you can choose to flatten them:
result = extractor.extract(
    data=df,
    query="extract incident dates and affected systems",
    return_df=True,
    expand_nested=True  # Flatten nested structures
)
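As an illustration of what flattening a nested structure means, the sketch below turns a nested dictionary into a single level of keys that could serve as DataFrame columns. The `flatten` helper, the underscore separator, and the sample record are assumptions made for this example; the column names that structx actually produces with `expand_nested=True` may differ.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dictionaries into a single-level dict."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Descend into nested dicts, prefixing keys with the parent name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record for illustration.
nested = {
    "incident_date": "2024-01-15",
    "system": {"name": "server-01", "metric": {"cpu": 92}},
}
flat = flatten(nested)
print(flat)
```

Each leaf value ends up under a key built from its path, so one nested record maps to one flat row.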
Working with Results
The extract method returns an ExtractionResult object with:
- data: Extracted data (DataFrame or list of model instances)
- failed: DataFrame with failed extractions
- model: Generated or provided model class
- success_count: Number of successful extractions
- failure_count: Number of failed extractions
- success_rate: Success rate as a percentage
# Check extraction statistics
print(f"Extracted {result.success_count} items")
print(f"Failed {result.failure_count} items")
print(f"Success rate: {result.success_rate:.1f}%")
# Access the model schema
print(result.model.model_json_schema())
Error Handling
Failed extractions are collected in the failed DataFrame, available as result.failed.
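One way to act on failures is to check the result statistics after each run and flag runs that fall below a threshold. The sketch below is an assumption-laden example: `summarize_failures` and the 90% threshold are hypothetical, and a `SimpleNamespace` stands in for an ExtractionResult so that the code runs without structx. It relies only on the `failed`, `failure_count`, and `success_rate` attributes described above.

```python
from types import SimpleNamespace


def summarize_failures(result, min_success_rate: float = 90.0) -> str:
    """Build a one-line summary and flag runs below a success-rate threshold."""
    status = "OK" if result.success_rate >= min_success_rate else "CHECK FAILURES"
    return (
        f"{status}: {result.failure_count} failed, "
        f"{len(result.failed)} rows in result.failed, "
        f"success rate {result.success_rate:.1f}%"
    )


# Stand-in for an ExtractionResult. In a real result, `failed` is a DataFrame,
# and len() of a DataFrame gives its row count. The "text" key is hypothetical.
fake_result = SimpleNamespace(
    failed=[{"text": "unparseable row"}],
    failure_count=1,
    success_count=3,
    success_rate=75.0,
)
summary = summarize_failures(fake_result)
print(summary)
```

With a real result, printing `result.failed.head()` is a quick way to see which rows need attention.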
Next Steps
- Learn about Custom Models for specific extraction needs
- Explore Unstructured Text handling for documents
- See how to use Multiple Queries for complex extractions