Examples¶
This document contains examples of using the structx library for structured data extraction.
Generated on: 2025-03-05 12:56:41
Setup¶
from structx import Extractor
import pandas as pd
import os
# Initialize the extractor
extractor = Extractor.from_litellm(
model="gpt-4o-mini",
api_key="your-api-key"
)
Sample Data¶
The following examples use this synthetic dataset of technical system logs: | Log ID | Description | |---------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 1 | System check on 2024-01-15 detected high CPU usage (92%) on server-01. Alert triggered at 14:30. Investigation revealed memory leak in application A. Patch applied on 2024-01-16 08:00, confirmed resolution at 09:15. | | 2 | Database backup failure occurred on 2024-01-20 03:00. Root cause: insufficient storage space. Emergency cleanup performed at 04:30. Backup reattempted successfully at 05:45. Added monitoring alert for storage capacity. | | 3 | Network connectivity drops reported on 2024-02-01 between 10:00-10:45. Affected: 3 application servers. Initial diagnosis at 10:15 identified router misconfiguration. Applied fix at 10:30, confirmed full restoration at 10:45. | | 4 | Two distinct performance issues on 2024-02-05: cache invalidation errors at 09:00 and slow query responses at 14:00. Cache system restarted at 09:30. Query optimization implemented at 15:00. Both issues resolved by EOD. | | 5 | Configuration update on 2024-02-10 09:00 caused unexpected API behavior. Detected through monitoring at 09:15. Immediate rollback initiated at 09:20. Root cause analysis completed at 11:00. New update scheduled with additional testing. |
Example 1: Extracting Incident Timing¶
In this example, we extract the main incident date and time, along with any additional timestamps and their significance.
# Extract incident timing
query = "extract the main incident date and time, and any additional timestamps with their significance"
result = extractor.extract(df, query)
# Access the extraction results
print(f"Extracted {result.success_count} items with {result.success_rate:.1f}% success rate")
print(result.data)
Results:¶
Extracted 5 items with 100.0% success rate
[
{
"main_incident_datetime": "2024-01-15 14:30:00",
"additional_timestamps": [
{
"timestamp": "2024-01-16 08:00:00",
"significance": "Patch applied to resolve memory leak in application A."
},
{
"timestamp": "2024-01-16 09:15:00",
"significance": "Resolution of the issue confirmed."
}
]
},
{
"main_incident_datetime": "2024-02-05 09:00:00",
"additional_timestamps": [
{
"timestamp": "2024-02-05 09:30:00",
"significance": "Cache system restarted."
},
{
"timestamp": "2024-02-05 14:00:00",
"significance": "Slow query responses issue."
},
{
"timestamp": "2024-02-05 15:00:00",
"significance": "Query optimization implemented."
}
]
},
{
"main_incident_datetime": "2024-02-01 10:00:00",
"additional_timestamps": [
{
"timestamp": "2024-02-01 10:15:00",
"significance": "Initial diagnosis identified router misconfiguration."
},
{
"timestamp": "2024-02-01 10:30:00",
"significance": "Applied fix to the router."
},
{
"timestamp": "2024-02-01 10:45:00",
"significance": "Confirmed full restoration of network connectivity."
}
]
},
{
"main_incident_datetime": "2024-01-20 03:00:00",
"additional_timestamps": [
{
"timestamp": "2024-01-20 04:30:00",
"significance": "Emergency cleanup performed due to insufficient storage space."
},
{
"timestamp": "2024-01-20 05:45:00",
"significance": "Backup reattempted successfully after cleanup."
},
{
"timestamp": "2024-01-20 05:45:00",
"significance": "Added monitoring alert for storage capacity."
}
]
},
{
"main_incident_datetime": "2024-02-10 09:00:00",
"additional_timestamps": [
{
"timestamp": "2024-02-10 09:15:00",
"significance": "Detected unexpected API behavior through monitoring."
},
{
"timestamp": "2024-02-10 09:20:00",
"significance": "Immediate rollback initiated."
},
{
"timestamp": "2024-02-10 11:00:00",
"significance": "Root cause analysis completed."
}
]
}
]
Generated Model:¶
Model name: IncidentTimestampExtraction
{
"$defs": {
"IncidentTimestampExtractionadditional_timestampsItem": {
"description": "A collection of additional timestamps related to the incident, each with its significance.",
"properties": {
"timestamp": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The specific date and time of the additional event.",
"title": "Timestamp"
},
"significance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "A description of the significance of the additional timestamp.",
"title": "Significance"
}
},
"title": "IncidentTimestampExtractionadditional_timestampsItem",
"type": "object"
}
},
"description": "Schema for extracting incident timestamps and their significance from reports.",
"properties": {
"main_incident_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The main date and time of the incident, clearly identified.",
"title": "Main Incident Datetime"
},
"additional_timestamps": {
"anyOf": [
{
"items": {
"$ref": "#/$defs/IncidentTimestampExtractionadditional_timestampsItem"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "A collection of additional timestamps related to the incident, each with its significance.",
"title": "Additional Timestamps"
}
},
"title": "IncidentTimestampExtraction",
"type": "object"
}
Example 2: Extracting Technical Details¶
This example extracts more complex information about each incident including system components, issue types, severity, and resolution steps.
# Extract technical details
query = """
extract incident information including:
- system component affected
- issue type
- severity
- resolution steps
"""
result = extractor.extract(df, query)
Results:¶
Extracted 5 items with 100.0% success rate
[
{
"incidents": [
{
"system_component_affected": "server-01",
"issue_type": "high CPU usage",
"severity": "High",
"resolution_steps": [
[
"Investigated the issue"
],
[
"Identified memory leak in application A"
],
[
"Applied patch"
],
[
"Confirmed resolution"
]
]
},
{
"system_component_affected": "application A",
"issue_type": "memory leak",
"severity": "High",
"resolution_steps": [
[
"Investigated the issue"
],
[
"Identified memory leak"
],
[
"Applied patch"
],
[
"Confirmed resolution"
]
]
}
]
},
{
"incidents": [
{
"system_component_affected": "Cache System",
"issue_type": "Cache Invalidation Error",
"severity": "High",
"resolution_steps": [
[
"Restarted cache system at 09:30"
],
[
"Resolved by end of day"
]
]
}
]
},
{
"incidents": [
{
"system_component_affected": "application servers",
"issue_type": "Network connectivity drops",
"severity": "High",
"resolution_steps": [
[
"Initial diagnosis at 10:15 identified router misconfiguration"
],
[
"Applied fix at 10:30"
],
[
"Confirmed full restoration at 10:45"
]
]
}
]
},
{
"incidents": [
{
"system_component_affected": "API",
"issue_type": "Unexpected Behavior",
"severity": "High",
"resolution_steps": [
[
"Rollback initiated",
"2024-02-10T09:20:00"
],
[
"Root cause analysis completed",
"2024-02-10T11:00:00"
],
[
"New update scheduled with additional testing",
"2024-02-10T11:00:00"
]
]
}
]
},
{
"incidents": [
{
"system_component_affected": "Database",
"issue_type": "Backup Failure",
"severity": "High",
"resolution_steps": [
[
"Identified root cause: insufficient storage space"
],
[
"Performed emergency cleanup"
],
[
"Reattempted backup successfully"
],
[
"Added monitoring alert for storage capacity"
]
]
}
]
}
]
Example 3: Complex Nested Extraction¶
This example demonstrates extracting complex nested structures with relationships between different elements.
# Extract complex nested structures
query = """
extract structured information where:
- each incident has a system component and timestamp
- actions have timing and outcome
- metrics include before and after values if available
"""
result = extractor.extract(df, query)
Results:¶
Extracted 6 items with 100.0% success rate
[
{
"incidents": [
{
"system_component": "Cache System",
"timestamp": "2024-02-05 09:00:00"
}
],
"actions": [
{
"timing": "2024-02-05 09:30:00",
"outcome": "Cache system restarted"
}
],
"metrics": []
},
{
"incidents": [
{
"system_component": "Database",
"timestamp": "2024-02-05 14:00:00"
}
],
"actions": [
{
"timing": "2024-02-05 15:00:00",
"outcome": "Query optimization implemented"
}
],
"metrics": []
},
{
"incidents": [
{
"system_component": "application servers",
"timestamp": "2024-02-01 10:00:00"
}
],
"actions": [
{
"timing": "2024-02-01 10:15:00",
"outcome": "initial diagnosis identified router misconfiguration"
},
{
"timing": "2024-02-01 10:30:00",
"outcome": "applied fix"
},
{
"timing": "2024-02-01 10:45:00",
"outcome": "confirmed full restoration"
}
],
"metrics": []
},
{
"incidents": [
{
"system_component": "Database",
"timestamp": "2024-01-20 03:00:00"
}
],
"actions": [
{
"timing": "2024-01-20 04:30:00",
"outcome": "Emergency cleanup performed"
},
{
"timing": "2024-01-20 05:45:00",
"outcome": "Backup reattempted successfully"
},
{
"timing": "2024-01-20 05:45:00",
"outcome": "Added monitoring alert for storage capacity"
}
],
"metrics": []
},
{
"incidents": [
{
"system_component": "server-01",
"timestamp": "2024-01-15 14:30:00"
}
],
"actions": [
{
"timing": "2024-01-16 08:00:00",
"outcome": "Patch applied"
},
{
"timing": "2024-01-16 09:15:00",
"outcome": "Confirmed resolution"
}
],
"metrics": [
{
"before_value": 92.0,
"after_value": null
}
]
},
{
"incidents": [
{
"system_component": "API",
"timestamp": "2024-02-10 09:15:00"
}
],
"actions": [
{
"timing": "2024-02-10 09:20:00",
"outcome": "Rollback initiated"
},
{
"timing": "2024-02-10 11:00:00",
"outcome": "Root cause analysis completed"
}
],
"metrics": []
}
]
Example 4: Preview Generated Schema¶
This example shows how to generate and inspect a schema without performing extraction.
# Generate schema without extraction
query = "extract system component, issue details, and resolution steps"
sample_text = df["Description"].iloc[0]
DataModel = extractor.get_schema(query=query, sample_text=sample_text)
# Print schema
print(DataModel.model_json_schema())
# Create an instance manually
instance = DataModel(
system_component="API Server",
issue_details="High latency in requests",
resolution_steps="Scaled up server resources and optimized database queries"
)
print(instance.model_dump_json())
Generated Schema:¶
{
"$defs": {
"SystemComponentIssueResolutionissue": {
"description": "Details about the issue encountered.",
"properties": {
"description": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Description of the issue.",
"title": "Description"
},
"severity_level": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Severity level of the issue (e.g., low, medium, high).",
"title": "Severity Level"
},
"timestamp": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Timestamp of when the issue was detected, in ISO 8601 format.",
"title": "Timestamp"
}
},
"title": "SystemComponentIssueResolutionissue",
"type": "object"
},
"SystemComponentIssueResolutionresolution": {
"description": "Details about the resolution of the issue.",
"properties": {
"action_taken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Action taken to resolve the issue.",
"title": "Action Taken"
},
"responsible_personnel": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Personnel responsible for the resolution.",
"title": "Responsible Personnel"
},
"time_taken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Time taken for resolution, in a standard format (e.g., HH:MM).",
"title": "Time Taken"
}
},
"title": "SystemComponentIssueResolutionresolution",
"type": "object"
},
"SystemComponentIssueResolutionsystem_component": {
"description": "Details about the system component.",
"properties": {
"name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the system component.",
"title": "Name"
},
"type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Type of the system component.",
"title": "Type"
},
"specifications": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Specifications of the system component.",
"title": "Specifications"
}
},
"title": "SystemComponentIssueResolutionsystem_component",
"type": "object"
}
},
"description": "Schema for extracting details about system components, issues, and resolutions.",
"properties": {
"system_component": {
"anyOf": [
{
"$ref": "#/$defs/SystemComponentIssueResolutionsystem_component"
},
{
"type": "null"
}
],
"default": null,
"description": "Details about the system component."
},
"issue": {
"anyOf": [
{
"$ref": "#/$defs/SystemComponentIssueResolutionissue"
},
{
"type": "null"
}
],
"default": null,
"description": "Details about the issue encountered."
},
"resolution": {
"anyOf": [
{
"$ref": "#/$defs/SystemComponentIssueResolutionresolution"
},
{
"type": "null"
}
],
"default": null,
"description": "Details about the resolution of the issue."
}
},
"title": "SystemComponentIssueResolution",
"type": "object"
}