Unstructured Document Analysis with Tensorlake and MotherDuck
2025/11/19 - 6 min read
Most business-critical data is trapped in PDFs, including SEC filings, contracts, invoices, and reports that contain valuable data but can't be queried directly with SQL. Until now.
Tensorlake cracks open the wide world of documents, turning verbose text into structured data with 91-98% accuracy. Combined with MotherDuck's serverless data warehouse, data teams can instantly query complex documents using friendly, familiar SQL. Tensorlake is a unified runtime for data-centric agents and workflows, with a best-in-class document ingestion API. Companies like Sixt and BindHQ use Tensorlake to power critical business workflows and agentic applications.
Tensorlake's state-of-the-art document ingestion, combined with MotherDuck's serverless analytics, creates a powerful platform for extracting insights from unstructured data. In a recent benchmark, Tensorlake delivered best-in-class accuracy for document processing, achieving a 91.7% F1 score on complex JSON extraction, outperforming Azure, Textract, Gemini, and open-source document AI tools.
Tutorial: AI Risk Analysis
Let's try a simple example of extracting and querying data from unstructured documents, on a very in-vogue topic: AI risk. AI-related risk is fundamentally reshaping global economic outlooks. Specifically, how are publicly traded companies talking about the risks of AI to their businesses?
We can start to answer this question by reading SEC filings, but most of the data within is unstructured and inaccessible. Let’s use Tensorlake and MotherDuck to extract, classify, and analyze this data using Python and SQL.
In this tutorial, we'll walk you through a complete workflow: classifying pages in SEC filings that discuss AI risks, extracting structured data, loading it into MotherDuck, and querying trends across companies, all with just a few lines of code.
You can follow along using the Colab notebook here, where you’ll find all the code required for this tutorial. You’ll also need a Tensorlake API key and a MotherDuck token, both of which you can obtain as part of the products’ free plans.
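Before running any code, put both credentials where the clients can find them. As a minimal sketch (assuming the Tensorlake SDK reads TENSORLAKE_API_KEY and the MotherDuck connector reads motherduck_token from the environment, which you should verify against each product's docs):

import os

# Assumption: these are the environment variable names the Tensorlake SDK
# and the DuckDB/MotherDuck connector look for; adjust per the official docs.
os.environ["TENSORLAKE_API_KEY"] = "..."   # your Tensorlake API key
os.environ["motherduck_token"] = "..."     # your MotherDuck token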

First, source SEC filings for relevant NYSE companies
Before we can analyze AI risks, we need the source documents. We'll start by collecting SEC filings from major NYSE-listed companies. The 10-K (annual) and 10-Q (quarterly) reports describe a company’s financial performance and operations, including key risk factors.
Follow along in the Colab notebook for an example.
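Outside the notebook, here's a minimal sketch of one way to collect filing URLs using the SEC's public submissions API. The recent_filings helper is our own illustration, not part of either product, and the User-Agent string is a placeholder you must replace with your contact info, as the SEC requires:

import requests

# The SEC asks for a descriptive User-Agent with contact info (placeholder).
HEADERS = {"User-Agent": "your-name your-email@example.com"}

def recent_filings(cik: str, forms=("10-K", "10-Q")) -> list[str]:
    """Return URLs of a company's recent 10-K/10-Q primary documents on EDGAR."""
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    recent = requests.get(url, headers=HEADERS).json()["filings"]["recent"]
    urls = []
    # The submissions API returns parallel arrays of filing metadata
    for form, acc, doc in zip(recent["form"], recent["accessionNumber"], recent["primaryDocument"]):
        if form in forms:
            urls.append(
                f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc.replace('-', '')}/{doc}"
            )
    return urls

# Build the list the rest of the tutorial iterates over (CIK 320193 is Apple Inc.).
# Note: primary documents are typically HTML; Tensorlake also ingests HTML,
# or substitute your own PDF URLs here.
sec_filings = recent_filings("320193")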
Classify pages where AI risks are mentioned
SEC filings can be hundreds of pages long, but AI risks are typically only discussed in a few specific sections where companies disclose material risks, including emerging concerns around AI. Rather than processing entire documents, we'll use Tensorlake's semantic page classification to find pages mentioning AI-related risks. This saves processing time and tokens, and ensures we're extracting data from the most relevant content.
We'll define a classification schema that looks for risk factor pages, then run it across all our SEC filings to build a map of where AI risks appear in each document.
from tensorlake.documentai import DocumentAI, PageClassConfig  # import path may vary by SDK version

doc_ai = DocumentAI()  # reads TENSORLAKE_API_KEY from the environment
document_ai_risk_pages = {}  # file URL -> page numbers that mention AI risks
# sec_filings: the list of SEC filing URLs collected above

# Create a PageClassConfig object to describe classification rules
page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    )
]

# Call the Tensorlake Page Classification API on each filing
for file_url in sec_filings:
    parse_id = doc_ai.classify(
        file_url=file_url,
        page_classifications=page_classifications
    )
    result = doc_ai.wait_for_completion(parse_id=parse_id)
    # Save the page numbers where AI risk factors appear for each file
    for page_class in result.page_classes:
        if page_class.page_class == "risk_factors":
            document_ai_risk_pages[file_url] = page_class.page_numbers
Extract AI risk factors from each document
With the right pages identified, it's time to extract structured data. We'll define a schema that captures risk category, description, severity, and citation, then let Tensorlake's Document Ingestion API turn risk disclosures into queryable JSON.
import json
import os
from typing import List, Optional

from pydantic import BaseModel, Field
from tensorlake.documentai import StructuredExtractionOptions

# Define our data schema
class AIRiskMention(BaseModel):
    """Individual AI-related risk mention"""
    risk_category: str = Field(
        description="Category: Operational, Regulatory, Competitive, Ethical, Security, Liability"
    )
    risk_description: str = Field(description="Description of the AI risk")
    severity_indicator: Optional[str] = Field(None, description="Severity level if mentioned")
    citation: str = Field(description="Page reference")

class AIRiskExtraction(BaseModel):
    """Complete AI risk data from a filing"""
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention] = []
    num_ai_risk_mentions: int = 0
    ai_strategy_mentioned: bool = False
    ai_investment_mentioned: bool = False
    ai_competition_mentioned: bool = False
    regulatory_ai_risk: bool = False

doc_ai = DocumentAI()
results = {}
for file_url, page_numbers in document_ai_risk_pages.items():
    print(f"File URL: {file_url}")
    page_number_str_list = ",".join(str(i) for i in page_numbers)
    print(f"Page Numbers: {page_number_str_list}")
    result = doc_ai.parse_and_wait(
        file=file_url,
        page_range=page_number_str_list,
        structured_extraction_options=[
            StructuredExtractionOptions(
                schema_name="AIRiskExtraction",
                json_schema=AIRiskExtraction
            )
        ]
    )
    results[file_url] = result
    # Save results to a JSON file named after the source PDF
    json_filename = os.path.basename(file_url).replace('.pdf', '.json')
    with open(json_filename, 'w') as f:
        json.dump(result.structured_data[0].data, f, indent=2, default=str)
Load structured data into MotherDuck
Now we have structured JSON for each company's AI risks. Let's load this data into MotherDuck's serverless warehouse. Once it's in MotherDuck, we can query across all companies using SQL.
import glob
import json

import duckdb

# Load into MotherDuck (the connector reads motherduck_token from the environment)
con = duckdb.connect('md:ai_risk_analytics')

rows = []
for filename in glob.glob('*.json'):  # the JSON files written in the previous step
    # Load JSON
    with open(filename, 'r') as f:
        data = json.load(f)
    # Convert ai_risk_mentions to a JSON string so SQL can unnest it later
    data['ai_risk_mentions'] = json.dumps(data.get('ai_risk_mentions', []))
    rows.append(data)
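The loop above collects the rows but stops short of writing a table. Here's a minimal sketch of one way to finish the load, assuming the table name ai_risk_filings that the queries below reference; DuckDB's replacement scan lets SQL reference a local pandas DataFrame by variable name:

import pandas as pd

# Flatten the per-filing dicts into a DataFrame; the SQL below can
# reference the local variable `df` directly.
df = pd.DataFrame(rows)
con.execute("CREATE OR REPLACE TABLE ai_risk_filings AS SELECT * FROM df")

# Sanity check: expect one row per filing
print(con.execute("SELECT COUNT(*) FROM ai_risk_filings").fetchone())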
Analyze with SQL
With our data in MotherDuck, we can run SQL queries to uncover patterns across companies. Which risk categories are most common? How do tech giants describe operational AI risks differently from financial services firms? Let's explore.
For example, you can extract risk category distribution across all companies with a single query:
risk_categories = con.execute("""
    WITH parsed_risks AS (
        SELECT
            company_name,
            unnest(CAST(json(ai_risk_mentions) AS JSON[])) AS risk_item
        FROM ai_risk_filings
    )
    SELECT
        risk_item->>'risk_category' AS risk_category,
        COUNT(*) AS total_mentions,
        COUNT(DISTINCT company_name) AS companies_mentioning
    FROM parsed_risks
    WHERE risk_item->>'risk_category' IS NOT NULL
    GROUP BY risk_category
    ORDER BY total_mentions DESC
""").fetchdf()
print(risk_categories)
This query returns a table of risk categories ranked by total mentions, along with how many companies mention each category.
Or, query operational risks to see how different companies frame execution challenges:
# Query: operational AI risks for each company
operational_risks = con.execute("""
    WITH parsed_risks AS (
        SELECT
            company_name,
            ticker,
            unnest(CAST(json(ai_risk_mentions) AS JSON[])) AS risk_item
        FROM ai_risk_filings
    ),
    operational_only AS (
        SELECT
            company_name,
            ticker,
            risk_item->>'risk_description' AS risk_description,
            risk_item->>'citation' AS citation
        FROM parsed_risks
        WHERE risk_item->>'risk_category' = 'Operational'
    )
    SELECT
        company_name,
        ticker,
        risk_description,
        citation
    FROM operational_only
    ORDER BY company_name
""").fetchdf()
print(operational_risks)
The query returns each company's operational AI risk descriptions, with a page citation for each mention.
In conclusion
Most of the effort in document analytics is getting the data into a database. Critical business information in financial services, logistics, and other industries still lives inside documents. Once that information is reliably extracted into a structured form, the analytics layer becomes dramatically simpler.
Document extraction, however, requires more than OCR: page classification, layout understanding, and schema-driven structured extraction. Tensorlake's Document Ingestion API bundles these capabilities together.
Once the data is structured, DuckDB makes analysis effortless. Its query engine runs analytical queries over semi-structured JSON from documents using familiar SQL, and MotherDuck's serverless architecture scales that to large workloads instantly.
Together, Tensorlake and MotherDuck turn unstructured documents into analytics-ready datasets. Beyond PDFs, Tensorlake also ingests Word, HTML, PowerPoint, and Excel files, unlocking even more enterprise data sources for DuckDB’s ecosystem.