Build Bigger With Small AI: Running Small Models Locally
2024/09/24

What Are Small AI Models?
Small AI models are compact versions of large language models that can run on ordinary hardware like laptops or even phones. While cloud-based models from providers like OpenAI or Anthropic typically have hundreds of billions or even trillions of parameters, small models range from 0.5 to 70 billion parameters and are only a few gigabytes in size.
These models share the same underlying architecture and research foundations as their larger counterparts - they're based on the transformer architecture that powers most modern AI systems. The key difference is their size, which makes them practical for local deployment without expensive GPU clusters.
Why Small Models Matter for Local Development
Faster Performance Through Local Execution
Running models locally provides surprising speed advantages. Small models execute faster because inference cost grows with parameter count - every generated token has to touch every weight, so a model with 1 billion parameters runs dramatically faster than one with hundreds of billions. On top of that, eliminating the network round trip to a cloud API removes that latency entirely, making the whole inference loop feel remarkably quick.
Data Privacy and Freedom to Experiment
Local models keep data on your machine, eliminating concerns about sharing sensitive information with cloud providers. This isn't just about privacy paranoia - it liberates developers to experiment freely without worrying about security controls, approval processes, or compliance requirements. Teams can prototype and test ideas without the friction of corporate security reviews.
Cost Structure Benefits
While local models aren't free (you still need hardware), they avoid the per-token pricing of cloud APIs. Modern hardware is increasingly optimized for AI workloads - Intel claims 100 million AI-capable computers will ship within a year, and Apple Silicon dedicates roughly a third of its chip area to neural processing. This hardware investment pays dividends across all your AI experiments without ongoing API costs.
Getting Started with Ollama
Ollama provides an easy way to run these models locally. Here's a simple example using Python:
import ollama

# Ask a locally running model a question
response = ollama.chat(model='llama3.1', messages=[
    {'role': 'user', 'content': 'What is DuckDB? Keep it to two sentences.'}
])

print(response['message']['content'])
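If you prefer to see tokens appear as they are generated instead of waiting for the full reply, the same call can stream the response. A minimal sketch, assuming a recent version of the ollama Python client that supports the stream flag:

import ollama

# Stream the answer chunk by chunk instead of waiting for the whole response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is DuckDB? Keep it to two sentences.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()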
The models run entirely on your local machine, providing responses as fast as or faster than cloud providers. Popular models include:
- Llama 3.1 from Meta
- Gemma from Google
- Phi from Microsoft Research
- Qwen 2.5 from Alibaba
Combining Small Models with Local Data
Retrieval Augmented Generation (RAG)
Small models excel when combined with existing factual data through a technique called Retrieval Augmented Generation. Since smaller models may hallucinate when asked about specific facts, RAG compensates by providing relevant data snippets at runtime.
The process involves:
- Pre-processing your data into a vector store
- When queried, retrieving relevant data snippets
- Augmenting the model's prompt with this factual information
- Getting accurate responses grounded in your actual data
Tools like LlamaIndex and LangChain simplify implementing RAG patterns, allowing models to answer questions about your specific datasets accurately.
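To make the loop concrete, here is a minimal, hand-rolled sketch of the same idea using Ollama's embedding endpoint and a plain Python list as the "vector store". The embedding model name and the snippet texts are assumptions for illustration only; a real setup would use LlamaIndex, LangChain, or a dedicated vector database:

import ollama

# Toy document store: a handful of facts we want the model to ground on
documents = [
    "DuckDB is an in-process SQL OLAP database management system.",
    "MotherDuck is a managed cloud service built around DuckDB.",
    "DuckDB can query Parquet and CSV files directly without loading them first.",
]

def embed(text):
    # nomic-embed-text is one commonly used local embedding model; swap in your own
    return ollama.embeddings(model='nomic-embed-text', prompt=text)['embedding']

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Pre-process: embed every document once
index = [(doc, embed(doc)) for doc in documents]

question = "What file formats can DuckDB read directly?"
q_vec = embed(question)

# Retrieve the most relevant snippet and stuff it into the prompt
best_doc, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))

response = ollama.chat(model='llama3.1', messages=[
    {'role': 'user', 'content': f"Answer using this context:\n{best_doc}\n\nQuestion: {question}"}
])
print(response['message']['content'])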
Tool Calling and Small Agents
A newer capability enables models to decide when and how to fetch data themselves through "tool calling." Instead of pre-building all the plumbing to connect data sources, the model can:
- Analyze what information it needs
- Write and execute database queries
- Interpret results in context
- Provide natural language responses
For example, newer models like Qwen 2.5 Coder can write SQL queries against DuckDB databases, execute them, and interpret the results - all based on natural language prompts.
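The sketch below shows the shape of that loop with Ollama's tool-calling interface and DuckDB. The database file, table, and question are invented for illustration, and the exact tool-call fields can vary between client versions:

import duckdb
import ollama

# Hypothetical local database used purely for illustration
con = duckdb.connect('analytics.duckdb')

def run_sql(query: str) -> str:
    """Execute a SQL query against DuckDB and return the rows as text."""
    return str(con.execute(query).fetchall())

# Describe the tool so the model knows it can ask us to execute SQL
tools = [{
    'type': 'function',
    'function': {
        'name': 'run_sql',
        'description': 'Run a SQL query against the local DuckDB database.',
        'parameters': {
            'type': 'object',
            'properties': {'query': {'type': 'string', 'description': 'SQL to execute'}},
            'required': ['query'],
        },
    },
}]

messages = [{'role': 'user', 'content': 'How many orders did we ship last month?'}]
response = ollama.chat(model='qwen2.5-coder', messages=messages, tools=tools)

# If the model decided it needs data, run its query and hand the result back
if response['message'].get('tool_calls'):
    messages.append(response['message'])
    for call in response['message']['tool_calls']:
        result = run_sql(call['function']['arguments']['query'])
        messages.append({'role': 'tool', 'content': result})
    response = ollama.chat(model='qwen2.5-coder', messages=messages)

print(response['message']['content'])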
Practical Use Cases
Internal Tooling and Back Office Applications
The most successful deployments start with internal tools rather than customer-facing applications:
- IT help desk automation
- Security questionnaire processing
- Data engineering and reporting tasks
- Engineering productivity tools for issue management and code review
Combining Small and Large Models
Small models don't replace cloud-scale models entirely. Like hot and cold data storage, you can use small models for most queries and escalate to larger models when needed. Apple and Microsoft already use this hybrid approach in their AI products.
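A simple way to put this into practice is an escalation rule: answer with the local model by default and only call a cloud model when a request trips some heuristic. A rough sketch, where the heuristic and the cloud call are placeholders you would replace with your own logic and provider:

import ollama

def call_cloud_model(prompt: str) -> str:
    # Stand-in for a hosted API such as OpenAI or Anthropic (not shown here)
    raise NotImplementedError("wire up your cloud provider of choice")

def needs_big_model(prompt: str, draft: str) -> bool:
    # Placeholder heuristic: escalate very long requests or uncertain drafts
    return len(prompt) > 4000 or "I don't know" in draft

def answer(prompt: str) -> str:
    # Hot path: the small local model handles the request first
    draft = ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': prompt}])
    text = draft['message']['content']
    return call_cloud_model(prompt) if needs_big_model(prompt, text) else text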
The Future is Small and Local
Open source models are rapidly improving, with performance gaps between small and large models shrinking. New models emerge weekly with better capabilities, specialized functions, and permissive licenses. Combined with hardware improvements specifically targeting AI workloads, local model deployment is becoming increasingly practical for production use cases.
The ecosystem around small models continues to expand, with thousands of fine-tuned variants available for specific tasks. As these models improve and hardware accelerates, the ability to run powerful AI locally transforms from a nice-to-have into a competitive advantage for development teams.