Build Bigger With Small AI: Running Small Models Locally
2024/09/24

What Are Small AI Models?
Small AI models are compact versions of large language models that can run on ordinary hardware like laptops or even phones. While cloud-based models from providers like OpenAI or Anthropic typically have hundreds of billions or even trillions of parameters, small models range from 0.5 to 70 billion parameters and are only a few gigabytes in size.
These models share the same underlying architecture and research foundations as their larger counterparts - they're based on the transformer architecture that powers most modern AI systems. The key difference is their size, which makes them practical for local deployment without expensive GPU clusters.
Why Small Models Matter for Local Development
Faster Performance Through Local Execution
Running models locally provides surprising speed advantages. Small models execute faster because inference cost grows with parameter count - every generated token has to touch every weight, so a model with 1 billion parameters runs dramatically faster than one with hundreds of billions. On top of that, eliminating the network round trip to a cloud API removes that latency entirely, making the whole inference loop feel remarkably quick.
Data Privacy and Freedom to Experiment
Local models keep data on your machine, eliminating concerns about sharing sensitive information with cloud providers. This isn't just about privacy paranoia - it liberates developers to experiment freely without worrying about security controls, approval processes, or compliance requirements. Teams can prototype and test ideas without the friction of corporate security reviews.
Cost Structure Benefits
While local models aren't free (you still need hardware), they avoid the per-token pricing of cloud APIs. Modern hardware is increasingly optimized for AI workloads - Intel claims 100 million AI-capable computers will ship within a year, and Apple Silicon dedicates roughly a third of its chip area to neural processing. This hardware investment pays dividends across all your AI experiments without ongoing API costs.
Getting Started with Ollama
Ollama provides an easy way to run these models locally. Here's a simple example using Python:
import ollama

# Ask a locally running model a question
response = ollama.chat(model='llama3.1', messages=[
    {'role': 'user', 'content': 'What is DuckDB? Keep it to two sentences.'}
])

print(response['message']['content'])
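If you prefer to see tokens appear as they are generated instead of waiting for the full reply, the same call can stream the response. A minimal sketch, assuming a recent version of the ollama Python client that supports the stream flag:

import ollama

# Stream the answer chunk by chunk instead of waiting for the whole response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is DuckDB? Keep it to two sentences.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()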
The models run entirely on your local machine, providing responses as fast as or faster than cloud providers. Popular models include:
- Llama 3.1 from Meta
- Gemma from Google
- Phi from Microsoft Research
- Qwen 2.5 from Alibaba
Combining Small Models with Local Data
Retrieval Augmented Generation (RAG)
Small models excel when combined with existing factual data through a technique called Retrieval Augmented Generation. Since smaller models may hallucinate when asked about specific facts, RAG compensates by providing relevant data snippets at runtime.
The process involves:
- Pre-processing your data into a vector store
- When queried, retrieving relevant data snippets
- Augmenting the model's prompt with this factual information
- Getting accurate responses grounded in your actual data
Tools like LlamaIndex and LangChain simplify implementing RAG patterns, allowing models to answer questions about your specific datasets accurately.
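To make the loop concrete, here is a minimal, hand-rolled sketch of the same idea using Ollama's embedding endpoint and a plain Python list as the "vector store". The embedding model name and the snippet texts are assumptions for illustration only; a real setup would use LlamaIndex, LangChain, or a dedicated vector database:

import ollama

# Toy document store: a handful of facts we want the model to ground on
documents = [
    "DuckDB is an in-process SQL OLAP database management system.",
    "MotherDuck is a managed cloud service built around DuckDB.",
    "DuckDB can query Parquet and CSV files directly without loading them first.",
]

def embed(text):
    # nomic-embed-text is one commonly used local embedding model; swap in your own
    return ollama.embeddings(model='nomic-embed-text', prompt=text)['embedding']

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Pre-process: embed every document once
index = [(doc, embed(doc)) for doc in documents]

question = "What file formats can DuckDB read directly?"
q_vec = embed(question)

# Retrieve the most relevant snippet and stuff it into the prompt
best_doc, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))

response = ollama.chat(model='llama3.1', messages=[
    {'role': 'user', 'content': f"Answer using this context:\n{best_doc}\n\nQuestion: {question}"}
])
print(response['message']['content'])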
Tool Calling and Small Agents
A newer capability enables models to decide when and how to fetch data themselves through "tool calling." Instead of pre-building all the plumbing to connect data sources, the model can:
- Analyze what information it needs
- Write and execute database queries
- Interpret results in context
- Provide natural language responses
For example, newer models like Qwen 2.5 Coder can write SQL queries against DuckDB databases, execute them, and interpret the results - all based on natural language prompts.
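The sketch below shows the shape of that loop with Ollama's tool-calling interface and DuckDB. The database file, table, and question are invented for illustration, and the exact tool-call fields can vary between client versions:

import duckdb
import ollama

# Hypothetical local database used purely for illustration
con = duckdb.connect('analytics.duckdb')

def run_sql(query: str) -> str:
    """Execute a SQL query against DuckDB and return the rows as text."""
    return str(con.execute(query).fetchall())

# Describe the tool so the model knows it can ask us to execute SQL
tools = [{
    'type': 'function',
    'function': {
        'name': 'run_sql',
        'description': 'Run a SQL query against the local DuckDB database.',
        'parameters': {
            'type': 'object',
            'properties': {'query': {'type': 'string', 'description': 'SQL to execute'}},
            'required': ['query'],
        },
    },
}]

messages = [{'role': 'user', 'content': 'How many orders did we ship last month?'}]
response = ollama.chat(model='qwen2.5-coder', messages=messages, tools=tools)

# If the model decided it needs data, run its query and hand the result back
if response['message'].get('tool_calls'):
    messages.append(response['message'])
    for call in response['message']['tool_calls']:
        result = run_sql(call['function']['arguments']['query'])
        messages.append({'role': 'tool', 'content': result})
    response = ollama.chat(model='qwen2.5-coder', messages=messages)

print(response['message']['content'])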
Practical Use Cases
Internal Tooling and Back Office Applications
The most successful deployments start with internal tools rather than customer-facing applications:
- IT help desk automation
- Security questionnaire processing
- Data engineering and reporting tasks
- Engineering productivity tools for issue management and code review
Combining Small and Large Models
Small models don't replace cloud-scale models entirely. Like hot and cold data storage, you can use small models for most queries and escalate to larger models when needed. Apple and Microsoft already use this hybrid approach in their AI products.
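A simple way to put this into practice is an escalation rule: answer with the local model by default and only call a cloud model when a request trips some heuristic. A rough sketch, where the heuristic and the cloud call are placeholders you would replace with your own logic and provider:

import ollama

def call_cloud_model(prompt: str) -> str:
    # Stand-in for a hosted API such as OpenAI or Anthropic (not shown here)
    raise NotImplementedError("wire up your cloud provider of choice")

def needs_big_model(prompt: str, draft: str) -> bool:
    # Placeholder heuristic: escalate very long requests or uncertain drafts
    return len(prompt) > 4000 or "I don't know" in draft

def answer(prompt: str) -> str:
    # Hot path: the small local model handles the request first
    draft = ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': prompt}])
    text = draft['message']['content']
    return call_cloud_model(prompt) if needs_big_model(prompt, text) else text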
The Future is Small and Local
Open source models are rapidly improving, with performance gaps between small and large models shrinking. New models emerge weekly with better capabilities, specialized functions, and permissive licenses. Combined with hardware improvements specifically targeting AI workloads, local model deployment is becoming increasingly practical for production use cases.
The ecosystem around small models continues to expand, with thousands of fine-tuned variants available for specific tasks. As these models improve and hardware accelerates, the ability to run powerful AI locally transforms from a nice-to-have into a competitive advantage for development teams.