Error Handling¶

structx provides comprehensive error handling for extraction processes.

Error Types¶

ExtractionError¶

The base exception for extraction errors:

try:
    result = extractor.extract(
        data=df,
        query="extract key information"
    )
except ExtractionError as e:
    print(f"Extraction failed: {e}")

ConfigurationError¶

Raised when there's an issue with the configuration:

try:
    extractor = Extractor.from_litellm(
        model="gpt-4o-mini",
        api_key="your-api-key",
        config="invalid_config.yaml"
    )
except ConfigurationError as e:
    print(f"Configuration error: {e}")

ValidationError¶

Raised when there's a validation issue with the extracted data:

try:
    result = extractor.extract(
        data=df,
        query="extract key information"
    )
except ValidationError as e:
    print(f"Validation error: {e}")

ModelGenerationError¶

Raised when there's an issue generating the extraction model:

try:
    model = extractor.get_schema(
        query="extract key information",
        sample_text="Sample text"
    )
except ModelGenerationError as e:
    print(f"Model generation error: {e}")

FileError¶

Raised when there's an issue with file operations:

try:
    result = extractor.extract(
        data="nonexistent_file.pdf",
        query="extract key information"
    )
except FileError as e:
    print(f"File error: {e}")

Handling Failed Extractions¶

Even when the overall extraction succeeds, individual items may fail. These are collected in the failed DataFrame:

result = extractor.extract(
    data=df,
    query="extract key information"
)

if result.failure_count > 0:
    print(f"Failed extractions: {result.failure_count}")
    print(result.failed)

Retry Mechanism¶

structx includes an automatic retry mechanism for handling transient failures:

extractor = Extractor.from_litellm(
    model="gpt-4o-mini",
    api_key="your-api-key",
    max_retries=5,      # Maximum number of retry attempts
    min_wait=2,         # Minimum seconds to wait between retries
    max_wait=30         # Maximum seconds to wait between retries
)

The retry mechanism uses exponential backoff, meaning the wait time between retries increases exponentially (but is capped at max_wait).

Logging¶

structx uses loguru for logging. You can configure the logging level:

from loguru import logger

# Set logging level
logger.remove()
logger.add(sys.stderr, level="INFO")  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

For detailed debugging:

logger.remove()
logger.add(sys.stderr, level="DEBUG")

Best Practices¶

Always Check Failures: Always check result.failure_count and result.failed for failed extractions
Use Try/Except: Wrap extraction calls in try/except blocks
Configure Retries: Adjust retry settings based on your API stability
Log Errors: Enable appropriate logging levels for debugging
Validate Results: Validate extracted data before using it