Getting Started

This guide will help you get started with structx for structured data extraction.

Installation

Install the core package:

pip install structx-llm

For complete document processing capabilities (recommended):

# Install with full document support including PDF conversion
pip install structx-llm[docs]

# Individual format support
pip install structx-llm[pdf]   # PDF processing and conversion
pip install structx-llm[docx]  # Advanced DOCX support

What You Get

  • Core Package: Basic structured data extraction from CSV, JSON, and Excel files
  • [docs] Extra: Advanced unstructured document processing with multimodal PDF support:
      • Automatic document-to-PDF conversion
      • Instructor's multimodal vision capabilities
      • Enhanced extraction quality for all document types
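If you are unsure which extras are installed, you can probe whether their underlying modules are importable before relying on them. A minimal sketch using only the standard library; the module names you would probe depend on your installation, so `json` below is just a stdlib stand-in for illustration:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if a module can be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None

print(has_module("json"))                # stdlib module, always available: True
print(has_module("no_such_module_xyz"))  # not installed: False
```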

Basic Usage

Initialize the Extractor

from structx import Extractor

# Using litellm (supports multiple providers)
extractor = Extractor.from_litellm(
    model="gpt-4o",  # or any other model supported by litellm
    api_key="your-api-key"
)

# Or with a custom client
import instructor
from openai import AzureOpenAI

client = instructor.patch(AzureOpenAI(
    api_key="your-api-key",
    api_version="2024-02-15-preview",
    azure_endpoint="your-endpoint"
))

extractor = Extractor(
    client=client,
    model_name="your-model-deployment"
)

API Requirements

Important: All extractor methods require keyword arguments. You cannot use positional arguments.

# ✅ Correct - using keyword arguments
result = extractor.extract(data="file.pdf", query="extract information")

# ❌ Incorrect - using positional arguments
result = extractor.extract("file.pdf", "extract information")  # Raises TypeError

This applies to all methods:

  • extract(*, data, query, ...)
  • extract_async(*, data, query, ...)
  • extract_queries(*, data, queries, ...)
  • get_schema(*, data, query, ...)
  • refine_data_model(*, model, refinement_instructions, ...)
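The `*` in the signatures above is Python's keyword-only marker: everything after it must be passed by name. A toy stand-in (not the real structx API) showing why the positional call fails:

```python
# A bare "*" in the parameter list makes data and query keyword-only,
# mirroring the extractor method signatures listed above.
def toy_extract(*, data, query):
    return {"data": data, "query": query}

# Keyword arguments work
print(toy_extract(data="file.pdf", query="extract information"))

# Positional arguments raise TypeError
try:
    toy_extract("file.pdf", "extract information")
except TypeError as exc:
    print(f"TypeError: {exc}")
```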

Extract Structured Data

# From a file (automatically detects format and uses optimal processing)
# Process a PDF invoice
result = extractor.extract(
    data="scripts/example_input/S0305SampleInvoice.pdf",         # Unstructured: multimodal PDF processing
    query="extract invoice number, total amount, and line items"
)

# Process a DOCX contract
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx", # Unstructured: converted to PDF then multimodal
    query="extract the parties, effective date, and payment terms"
)

Access Results

# Check extraction statistics
print(f"Extracted {result.success_count} items")
print(f"Failed {result.failure_count} items")
print(f"Success rate: {result.success_rate:.1f}%")

# Access as list of model instances
for item in result.data:
    print(item.model_dump_json(indent=2))

# Or convert to DataFrame
import pandas as pd
df = pd.DataFrame([item.model_dump() for item in result.data])
print(df)

# Access the generated model
print(f"Model: {result.model.__name__}")
print(result.model.model_json_schema())

Check Token Usage

structx automatically tracks token usage for all operations, helping you monitor costs:

# Check token usage
usage = result.get_token_usage()
print(f"Total tokens used: {usage.total_tokens}")

# See usage breakdown by step
for step in usage.steps:
    print(f"{step.name}: {step.tokens} tokens")
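The per-step counts make simple cost accounting straightforward. A minimal sketch, assuming a steps list shaped like the one above; the step names, token counts, and the per-1K-token rate are all illustrative placeholders, not real structx output or current model pricing:

```python
# Illustrative step data mirroring the (name, tokens) pairs printed above.
steps = [("schema_generation", 1200), ("extraction", 3400)]

def estimate_cost(steps, usd_per_1k_tokens: float) -> float:
    """Estimate spend from per-step token counts at a flat per-1K rate."""
    total = sum(tokens for _, tokens in steps)
    return total / 1000 * usd_per_1k_tokens

print(estimate_cost(steps, usd_per_1k_tokens=0.005))  # 4600 tokens at $0.005/1K
```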

Configure Extraction

# With a YAML file
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    config="config.yaml"
)

# With a dictionary
config = {
    "refinement": {
        "temperature": 0.1,
        "top_p": 0.05
    },
    "extraction": {
        "temperature": 0.0,
        "top_p": 0.1
    }
}

extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    config=config
)

# With retry settings
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    max_retries=5,      # Maximum retry attempts
    min_wait=2,         # Minimum seconds between retries
    max_wait=30         # Maximum seconds between retries
)
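One way to read these settings is as a clamped exponential backoff: each retry doubles the wait, bounded by min_wait and max_wait. This is a hedged sketch of that interpretation, not structx's actual retry implementation, which may differ:

```python
def wait_seconds(attempt: int, min_wait: float = 2, max_wait: float = 30) -> float:
    """Exponential backoff starting at min_wait, capped at max_wait."""
    return min(max_wait, min_wait * (2 ** attempt))

print([wait_seconds(a) for a in range(5)])  # [2, 4, 8, 16, 30]
```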

Next Steps