structx¶

Structured Data Extraction

Extract structured data from unstructured text using LLMs

Dynamic Model Generation

Automatically generate type-safe Pydantic models

Multiple File Formats

Support for CSV, Excel, JSON, PDF, TXT, and more

Async Support

Asynchronous operations for high-throughput processing

Overview¶

structx is a powerful Python library that extracts structured data from text using Large Language Models (LLMs). It dynamically generates type-safe data models and provides consistent, structured extraction with support for complex nested data structures.

Whether you're analyzing incident reports, processing documents, or extracting metrics from unstructured text, structx provides a simple, consistent interface with powerful capabilities.

Key Features¶

🔄 Dynamic model generation from natural language queries
🎯 Automatic schema inference and generation
📊 Support for complex nested data structures
🔄 Model refinement with natural language instructions
🚀 Multi-threaded processing for large datasets
⚡ Async support
🔧 Configurable extraction using OmegaConf
📁 Support for multiple file formats (CSV, Excel, JSON, Parquet, PDF, TXT, and more)
📄 Support for unstructured text and document processing
🏗️ Type-safe data models using Pydantic
🎮 Easy-to-use interface
🔌 Support for multiple LLM providers through litellm
🔄 Automatic retry mechanism with exponential backoff

Installation¶

pip install structx-llm

For PDF support:

pip install structx-llm[pdf]

For DOCX support:

pip install structx-llm[docx]

For all document formats:

pip install structx-llm[docs]

Quick Example¶

from structx import Extractor

# Initialize extractor
extractor = Extractor.from_litellm(
    model="gpt-4o-mini",
    api_key="your-api-key"
)

# Extract structured data
result = extractor.extract(
    data="incident_report.txt",
    query="extract incident dates, affected systems, and resolution steps"
)

# Access the extracted data
print(f"Extracted {result.success_count} items")
for item in result.data:
    print(f"Date: {item.incident_date}")
    print(f"System: {item.affected_system}")
    print(f"Resolution: {item.resolution_steps}")

License¶

This project is licensed under the MIT License - see the LICENSE file for details.