Getting Started

This guide covers installing structx, creating an extractor, and running a first structured data extraction.

Installation

Package rename notice (PyPI)

The PyPI distribution has been renamed from structx-llm to structx (September 2025).

  • Imports are unchanged: import structx
  • Extras are unchanged: structx[docs], structx[pdf], structx[docx]
  • To upgrade:

    pip uninstall -y structx-llm
    pip install -U structx
    

If you pinned structx-llm in requirements or lock files, replace it with structx.
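If the old name is pinned in several requirements files, the replacement can be scripted. The following is a minimal sketch, not part of structx: the helper names are hypothetical, and it assumes plain requirements-style lines (one requirement per line). It keeps extras, version specifiers, and trailing comments intact, and leaves unrelated packages whose names merely start with structx-llm alone.

```python
import re

# Matches the old distribution name at the start of a requirement line.
# The separator may be "-", "_" or ".", which pip treats as equivalent.
# The lookahead requires the name to end there (extras, specifier,
# marker, comment, or end of line follow), so "structx-llm-tools" is skipped.
_OLD_NAME = re.compile(
    r"^(\s*)structx[-_.]llm(?=\s*(?:\[|[=<>!~;#@]|$))",
    re.IGNORECASE,
)


def migrate_requirement_line(line: str) -> str:
    """Rename structx-llm to structx on a single requirement line."""
    return _OLD_NAME.sub(r"\1structx", line)


def migrate_requirements(text: str) -> str:
    """Apply the rename to every line of a requirements file's contents."""
    return "\n".join(migrate_requirement_line(line) for line in text.split("\n"))
```

Lock files produced by other tools have their own formats, so regenerating them after the rename is usually safer than editing them by hand.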

Install the core package:

pip install structx

For complete document processing capabilities (recommended):

# Install with full document support including PDF conversion
# (the quotes keep shells such as zsh from expanding the brackets)
pip install "structx[docs]"

# Individual format support
pip install "structx[pdf]"   # PDF processing and conversion
pip install "structx[docx]"  # Advanced DOCX support

What You Get

  • Core Package: Basic structured data extraction from CSV, JSON, Excel
  • [docs] Extra: Advanced unstructured document processing with multimodal PDF support
      • Automatic document-to-PDF conversion
      • Instructor's multimodal vision capabilities
      • Enhanced extraction quality for all document types

Basic Usage

Initialize the Extractor

from structx import Extractor

# Using litellm (supports multiple providers)
extractor = Extractor.from_litellm(
    model="gpt-4o",  # or any other model supported by litellm
    api_key="your-api-key"
)

# Or with a custom client
import instructor
from openai import AzureOpenAI

client = instructor.patch(AzureOpenAI(
    api_key="your-api-key",
    api_version="2024-02-15-preview",
    azure_endpoint="your-endpoint"
))

extractor = Extractor(
    client=client,
    model_name="your-model-deployment"
)
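The "your-api-key" strings above are placeholders. One common way to avoid hardcoding a real key is to read it from an environment variable. The sketch below is an illustration only: the helper read_api_key and the variable name STRUCTX_API_KEY are hypothetical and are not defined by structx.

```python
import os


def read_api_key(var_name: str = "STRUCTX_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The default variable name is a made-up choice for this sketch;
    use whichever name your environment already defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The result can then be passed wherever the examples show a key, for instance Extractor.from_litellm(model="gpt-4o", api_key=read_api_key()).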

API Requirements

Important: All extractor methods take keyword-only arguments. Passing arguments positionally raises a TypeError.

# ✅ Correct - using keyword arguments
result = extractor.extract(data="file.pdf", query="extract information")

# ❌ Incorrect - using positional arguments
result = extractor.extract("file.pdf", "extract information")  # Raises TypeError

This applies to all methods:

  • extract(*, data, query, ...)
  • extract_async(*, data, query, ...)
  • extract_queries(*, data, queries, ...)
  • get_schema(*, data, query, ...)
  • refine_data_model(*, model, refinement_instructions, ...)
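The bare * at the start of each signature is standard Python syntax for keyword-only parameters. The sketch below uses a stand-in function with the same shape (it is not the structx implementation) to show what happens with each calling style, and how to pass arguments held in a dictionary.

```python
def extract(*, data, query):
    """Stand-in with the same keyword-only shape as the signatures listed above."""
    return {"data": data, "query": query}


# Keyword arguments are accepted.
ok = extract(data="file.pdf", query="extract information")

# Positional arguments are rejected before the function body runs.
try:
    extract("file.pdf", "extract information")
except TypeError as exc:
    error_message = str(exc)

# Arguments stored in a dict can be unpacked with ** to satisfy the rule.
params = {"data": "file.pdf", "query": "extract information"}
same = extract(**params)
```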

Extract Structured Data

# From a file (automatically detects format and uses optimal processing)
# Process a PDF invoice
result = extractor.extract(
    data="scripts/example_input/S0305SampleInvoice.pdf",         # Unstructured: multimodal PDF processing
    query="extract invoice number, total amount, and line items"
)

# Process a DOCX contract
result = extractor.extract(
    data="scripts/example_input/free-consultancy-agreement.docx", # Unstructured: converted to PDF then multimodal
    query="extract the parties, effective date, and payment terms"
)

Access Results

# Check extraction statistics
print(f"Extracted {result.success_count} items")
print(f"Failed {result.failure_count} items")
print(f"Success rate: {result.success_rate:.1f}%")

# Access as list of model instances
for item in result.data:
    print(item.model_dump_json(indent=2))

# Or convert to DataFrame
import pandas as pd
df = pd.DataFrame([item.model_dump() for item in result.data])
print(df)

# Access the generated model
print(f"Model: {result.model.__name__}")
print(result.model.model_json_schema())

Check Token Usage

structx automatically tracks token usage for all operations, helping you monitor costs:

# Check token usage
usage = result.get_token_usage()
print(f"Total tokens used: {usage.total_tokens}")

# See usage breakdown by step
for step in usage.steps:
    print(f"{step.name}: {step.tokens} tokens")
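To turn the per-step counts into a rough cost figure, the totals can be multiplied by a price per thousand tokens. The sketch below is an assumption-laden illustration: StepUsage is a stand-in that models only the name and tokens attributes used in the loop above, and the step names, token counts, and rate are made up. Real providers usually price input and output tokens differently, so a single blended rate is only an approximation.

```python
from dataclasses import dataclass


@dataclass
class StepUsage:
    # Stand-in for one entry of usage.steps; only the two attributes
    # read in the loop above are modelled here.
    name: str
    tokens: int


def estimate_cost(steps, usd_per_1k_tokens):
    """Return (total_tokens, estimated_cost_usd) for a list of steps."""
    total = sum(step.tokens for step in steps)
    return total, total / 1000 * usd_per_1k_tokens


# Made-up step names and counts, for illustration only.
steps = [StepUsage("refinement", 1200), StepUsage("extraction", 2800)]

# Made-up blended rate; look up the actual pricing for your model.
total, cost = estimate_cost(steps, usd_per_1k_tokens=0.01)
print(f"{total} tokens, about ${cost:.2f}")
```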

Configure Extraction

# With a YAML file
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    config="config.yaml"
)

# With a dictionary
config = {
    "refinement": {
        "temperature": 0.1,
        "top_p": 0.05
    },
    "extraction": {
        "temperature": 0.0,
        "top_p": 0.1
    }
}

extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    config=config
)

# With retry settings
extractor = Extractor.from_litellm(
    model="gpt-4o",
    api_key="your-api-key",
    max_retries=5,      # Maximum retry attempts
    min_wait=2,         # Minimum seconds between retries
    max_wait=30         # Maximum seconds between retries
)
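To get a feel for how the three retry settings interact, it can help to tabulate the waits they would produce. How structx spaces its retries between min_wait and max_wait is not stated here, so the sketch below assumes one common policy, exponential backoff that starts at min_wait, doubles on each attempt, and is capped at max_wait. The function backoff_schedule is hypothetical and not part of structx.

```python
def backoff_schedule(max_retries, min_wait, max_wait):
    """Waits in seconds before each retry under an ASSUMED policy:
    start at min_wait, double every attempt, never exceed max_wait."""
    return [min(max_wait, min_wait * 2 ** attempt) for attempt in range(max_retries)]


# Same values as the retry settings shown above.
schedule = backoff_schedule(max_retries=5, min_wait=2, max_wait=30)
worst_case_wait = sum(schedule)
```

Under this assumption the waits are 2, 4, 8, 16, and 30 seconds, so a call that exhausts all five retries would spend up to 60 seconds waiting, on top of the time taken by the requests themselves.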

Next Steps