Unstructured Text¶
structx supports extracting structured data from unstructured text sources, including PDF documents, Word documents, text files, and raw text strings.
Supported File Formats¶
Format | Extension | Requirements |
---|---|---|
CSV | .csv | Built-in |
Excel | .xlsx, .xls | Built-in |
JSON | .json | Built-in |
Parquet | .parquet | Built-in |
Feather | .feather | Built-in |
Text | .txt, .md, .log, etc. | Built-in |
PDF | .pdf | pip install structx-llm[pdf] |
Word | .docx, .doc | pip install structx-llm[docx] |
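If you want to check up front whether a file needs an optional extra, the table can be turned into a small lookup. This is only a sketch: `OPTIONAL_EXTRAS` and `install_hint` are hypothetical names used for illustration and are not part of structx.

```python
from pathlib import Path
from typing import Optional

# Extras taken from the table above; every other listed format is built in.
OPTIONAL_EXTRAS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "docx",
}


def install_hint(path: str) -> Optional[str]:
    """Return the pip command a file's format needs, or None if it is built in."""
    extra = OPTIONAL_EXTRAS.get(Path(path).suffix.lower())
    return f"pip install structx-llm[{extra}]" if extra else None


print(install_hint("report.PDF"))  # pip install structx-llm[pdf]
print(install_hint("notes.txt"))   # None
```

The lookup lowercases the suffix, so `report.PDF` and `report.pdf` are treated the same way.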
Working with Text Files¶
# Extract from a text file
# (`extractor` is assumed to be an already configured structx extractor)
result = extractor.extract(
    data="document.txt",
    query="extract key information"
)
Working with PDF Documents¶
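First, install the PDF dependencies:
pip install structx-llm[pdf]
Then extract data from PDF files:
result = extractor.extract(
    data="document.pdf",
    query="extract key information"
)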
Working with Word Documents¶
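First, install the DOCX dependencies:
pip install structx-llm[docx]
Then extract data from Word files:
result = extractor.extract(
    data="document.docx",
    query="extract key information"
)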
Working with Raw Text¶
text = """
System check on 2024-01-15 detected high CPU usage (92%) on server-01.
Database backup failure occurred on 2024-01-20 03:00.
"""
result = extractor.extract(
    data=text,
    query="extract incident dates and affected systems"
)
Text Chunking¶
For large documents, structx automatically splits the text into chunks so that each piece can be processed effectively. You can control the chunking behavior with these parameters:
result = extractor.extract(
    data="large_document.pdf",
    query="extract key information",
    chunk_size=2000,  # Size of text chunks
    overlap=200       # Overlap between chunks
)
Chunking Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
chunk_size | int | 1000 | Size of text chunks |
overlap | int | 100 | Overlap between chunks |
encoding | str | 'utf-8' | Text encoding for file reading |
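To make `chunk_size` and `overlap` concrete, here is a minimal sketch of overlapping chunking. It is not structx's actual implementation: `chunk_text` is a hypothetical helper, and it assumes sizes are measured in characters, which this page does not state.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last `overlap` characters of the previous one, so text that straddles a boundary is still seen in context. A larger overlap means more chunks for the same document.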
Handling Multi-Page Documents¶
For PDF documents, structx processes each page and keeps track of the page each extraction came from:
result = extractor.extract(
    data="multi_page.pdf",
    query="extract key information"
)
# Access page information (if return_df=True)
if 'page' in result.data.columns:
    page_counts = result.data['page'].value_counts()
    print("Extractions by page:", page_counts)
Best Practices¶
- Be Specific in Queries: For unstructured text, specific queries yield better results
- Adjust Chunk Size: For very dense or sparse text, adjust the chunk size accordingly
- Use Appropriate Overlap: Set the overlap large enough that information spanning a chunk boundary still appears with its context in at least one chunk
- Check Failed Extractions: Unstructured text may have more failures due to format variations
Next Steps¶
- Learn about Multiple Queries for extracting different types of information
- Explore Async Operations for processing large documents efficiently
- See the API Reference for all available options