Unstructured Document Analysis with Tensorlake and MotherDuck

2025/11/19 - 6 min read

Most business-critical data is trapped in PDFs, including SEC filings, contracts, invoices, and reports that contain valuable data but can't be queried directly with SQL. Until now.

Tensorlake cracks open the wide world of documents, turning verbose text into structured data with 91-98% accuracy. Combined with MotherDuck’s serverless data warehouse, data teams can instantly query complex documents using friendly, familiar SQL. Tensorlake is a unified runtime for data-centric agents, workflows, with a best-in-class document ingestion API. Companies like Sixt and BindHQ use Tensorlake to power critical business workflows and agentic applications.

Tensorlake's state-of-the-art document ingestion, combined with MotherDuck's serverless analytics, creates a powerful platform for extracting insights from unstructured data. In a recent benchmark, Tensorlake delivers best-in-class accuracy for document processing, achieving a 91.7% F1 score on complex JSON extraction–outperforming Azure, Textract, Gemini, and open-source document AI tools.

Tutorial: AI Risk Analysis

Let’s try a simple example for extracting and querying data from unstructured documents with a very in-vogue topic: AI risk. AI-related risk is fundamentally reshaping global economic outlooks. Specifically, how are publicly traded companies talking about the risks of AI to their businesses?

We can start to answer this question by reading SEC filings, but most of the data within is unstructured and inaccessible. Let’s use Tensorlake and MotherDuck to extract, classify, and analyze this data using Python and SQL.

In this tutorial, we'll walk you through a complete workflow: classifying pages in SEC filings that discuss AI risks, extracting structured data, loading it into MotherDuck, and querying trends across companies —all with just a few lines of code.

You can follow along using the Colab notebook here, where you’ll find all the code required for this tutorial. You’ll also need a Tensorlake API key and a MotherDuck token, both of which you can obtain as part of the products’ free plans.

First, source SEC filings for relevant NYSE companies

Before we can analyze AI risks, we need the source documents. We'll start by collecting SEC filings from major NYSE-listed companies. The 10-K (annual) and 10-Q (quarterly) reports describe a company’s financial performance and operations, including key risk factors.

Follow along in the Colab notebook for an example.

Classify pages where AI risks are mentioned

SEC filings can be hundreds of pages long, but AI risks are typically only discussed in a few specific sections where companies disclose material risks, including emerging concerns around AI. Rather than processing entire documents, we'll use Tensorlake's semantic page classification to find pages mentioning AI-related risks. This saves processing time and tokens, and ensures we're extracting data from the most relevant content.

We'll define a classification schema that looks for risk factor pages, then run it across all our SEC filings to build a map of where AI risks appear in each document.

Copy code
sec_filings = [] # URLs of SEC Filings PDFs 

# Create a PageClassConfig object to describe classfication rules
page_classifications = [
  PageClassConfig(
    name="risk_factors",
    description="Pages that contain risk factors related to AI."
  )
]

# Call Tensorlake Page Classification API
for file_url in sec_filings:
  parse_id = doc_ai.classify(
    file_url=file_url,
    page_classifications=page_classifications
  )

  result = doc_ai.wait_for_completion(parse_id=parse_id)

  # Save the page numbers where AI risk factors are for each file
  for page_class in result.page_classes:
    if(page_class.page_class == "risk_factors"):
    	document_ai_risk_pages[file_url] = page_class.page_numbers

Extract AI risk factors from each document

With the right pages identified, it's time to extract structured data. We'll define a schema that captures risk category, description, severity, and citation, then let Tensorlake's Document Ingestion API turn risk disclosures into queryable JSON.

Copy code
# Define our data schema
class AIRiskMention(BaseModel):
    """Individual AI-related risk mention"""
    risk_category: str = Field(
        description="Category: Operational, Regulatory, Competitive, Ethical, Security, Liability"
    )
    risk_description: str = Field(description="Description of the AI risk")
    severity_indicator: Optional[str] = Field(None, description="Severity level if mentioned")
    citation: str = Field(description="Page reference")

class AIRiskExtraction(BaseModel):
    """Complete AI risk data from a filing"""
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention] = []
    num_ai_risk_mentions: int = 0
    ai_strategy_mentioned: bool = False
    ai_investment_mentioned: bool = False
    ai_competition_mentioned: bool = False
    regulatory_ai_risk: bool = False

doc_ai = DocumentAI()

results = {}

for file_url, page_numbers in document_ai_risk_pages.items():
  print(f"File URL: {file_url}")
  page_number_str_list = ",".join(str(i) for i in page_numbers)
  print(f"Page Numbers: {page_number_str_list}")

  result = doc_ai.parse_and_wait(
      file=file_url,
      page_range=page_number_str_list,
      structured_extraction_options=[
          StructuredExtractionOptions(
              schema_name="AIRiskExtraction",
              json_schema=AIRiskExtraction
          )
      ]
  )
  results[file_url] = result

  # Save results to a json file
  filename = os.path.basename(file_url).replace('.pdf', '.json')
  with open(json_filename, 'w') as f:
    json.dump(result.structured_data[0].data, f, indent=2, default=str)

Load structured data into MotherDuck

Now we have structured JSON for each company's AI risks. Let's load this data into MotherDuck's serverless warehouse. Once it's in MotherDuck, we can query across all companies using SQL.

Copy code
# Load into MotherDuck
con = duckdb.connect('md:ai_risk_analytics')

for filename in json_files:
    # Load JSON
    with open(filename, 'r') as f:
        data = json.load(f)
    
    # Convert ai_risk_mentions to JSON string
    data['ai_risk_mentions'] = json.dumps(data.get('ai_risk_mentions', []))

Analyze with SQL

With our data in MotherDuck, we can run SQL queries to uncover patterns across companies. Which risk categories are most common? How do tech giants describe operational AI risks differently from financial services firms? Let's explore.

For example, you can extract risk category distribution across all companies with a single query:

Copy code
risk_categories = con.execute("""
    WITH parsed_risks AS (
        SELECT 
            company_name,
            unnest(CAST(json(ai_risk_mentions) AS JSON[])) as risk_item
        FROM ai_risk_factors.ai_risk_filings
    )
    SELECT 
        risk_item->>'risk_category' as risk_category,
        COUNT(*) as total_mentions,
        COUNT(DISTINCT company_name) as companies_mentioning
    FROM parsed_risks
    WHERE risk_item->>'risk_category' IS NOT NULL
    GROUP BY risk_category
    ORDER BY total_mentions DESC
""").fetchdf()

print(risk_categories)

With this query, you get output like:

Or, query operational risks to see how different companies frame execution challenges:

Copy code
# Query: Extract one operational AI risk per company
operational_risks = con.execute("""
    WITH parsed_risks AS (
        SELECT 
            company_name,
            ticker,
            unnest(CAST(json(ai_risk_mentions) AS JSON[])) as risk_item
        FROM ai_risk_factors.ai_risk_filings
    ),
    operational_only AS (
        SELECT 
            company_name,
            ticker,
            risk_item->>'risk_description' as risk_description,
            risk_item->>'citation' as citation
        FROM parsed_risks
        WHERE risk_item->>'risk_category' = 'Operational'
    )
    SELECT 
        company_name,
        ticker,
        risk_description,
        citation
    FROM operational_only
    ORDER BY company_name
""").fetchdf()

print(operational_risks)

The query will return output like:

In conclusion

Most of the effort in document analytics is getting the data into a database. Critical business information in financial services, logistics, and other industries still lives inside documents. Once that information is reliably extracted into a structured form, the analytics layer becomes dramatically simpler.

Document extraction, however, requires more than OCR - namely, page classification, layout understanding, and schema-driven structured extraction. Tensorlake’s Document Ingestion API bundles these capabilities into a single API.

Once the data is structured, DuckDB makes analysis effortless. Its query engine allows analytics queries over semi-structured JSON from documents using familiar SQL, and MotherDuck’s serverless architecture scales that to large workloads instantly.

Together, Tensorlake and MotherDuck turn unstructured documents into analytics-ready datasets. Beyond PDFs, Tensorlake also ingests Word, HTML, PowerPoint, and Excel files, unlocking even more enterprise data sources for DuckDB’s ecosystem.

Tutorial: AI Risk Analysis

In conclusion

Start using MotherDuck now!

Try 21 Days Free