Supported Formats¶
structx
supports a variety of file formats for data extraction.
Built-in Support¶
These formats are supported without additional dependencies:
Format | Extension | Description |
---|---|---|
CSV | .csv | Comma-separated values |
Excel | .xlsx, .xls | Microsoft Excel spreadsheets |
JSON | .json | JavaScript Object Notation |
Parquet | .parquet | Columnar storage format |
Feather | .feather | Fast on-disk format for data frames |
Text | .txt, .md, .log, etc. | Plain text files |
Optional Dependencies¶
These formats require additional dependencies:
Format | Extension | Dependencies |
---|---|---|
pip install structx-llm[pdf] |
||
Word | .docx, .doc | pip install structx-llm[docx] |
Input Types¶
structx
can extract data from:
- File Paths:
- DataFrames:
import pandas as pd
df = pd.DataFrame({"text": ["Sample text 1", "Sample text 2"]})
result = extractor.extract(
data=df,
query="extract key information"
)
- Lists of Dictionaries:
data = [
{"text": "Sample text 1"},
{"text": "Sample text 2"}
]
result = extractor.extract(
data=data,
query="extract key information"
)
- Raw Text:
text = """
Sample text with information to extract.
More text with additional information.
"""
result = extractor.extract(
data=text,
query="extract key information"
)
File Reading Options¶
When reading files, you can provide additional options:
result = extractor.extract(
data="document.pdf",
query="extract key information",
chunk_size=2000, # Size of text chunks
overlap=200, # Overlap between chunks
encoding="utf-8" # Text encoding
)
CSV Options¶
result = extractor.extract(
data="data.csv",
query="extract key information",
delimiter=";", # Custom delimiter
encoding="latin1", # Custom encoding
skiprows=1 # Skip header row
)
Excel Options¶
result = extractor.extract(
data="data.xlsx",
query="extract key information",
sheet_name="Sheet2", # Specific sheet
skiprows=3 # Skip header rows
)
PDF Options¶
result = extractor.extract(
data="document.pdf",
query="extract key information",
chunk_size=2000, # Size of text chunks
overlap=200 # Overlap between chunks
)
Output Types¶
structx
can return data in different formats:
- Model Instances (default):
result = extractor.extract(
data=df,
query="extract key information",
return_df=False # Default
)
# Access as list of model instances
for item in result.data:
print(item.field_name)
- DataFrame:
result = extractor.extract(
data=df,
query="extract key information",
return_df=True
)
# Access as DataFrame
print(result.data.head())
Nested Data Handling¶
For nested data structures, you can choose to flatten them: