The Knowledge Base
DuckDB provides two data types that are convenient for storing vector embeddings: a fixed-size array and a variable-size list. LlamaIndex has a DuckDB integration that lets you store your compiled knowledge base and save it to disk for future use.
Let’s start building our knowledge base by importing the necessary dependencies:
from llama_index.core import (
    StorageContext,
    ServiceContext,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.vector_stores.duckdb import DuckDBVectorStore
In this project, we will load documents from a folder called local_documents using SimpleDirectoryReader.
With the ServiceContext object, we can define the chunking strategy for the text in the documents. Here, each chunk is limited to 512 tokens:
documents = SimpleDirectoryReader("./local_documents").load_data()
documents_service_context = ServiceContext.from_defaults(chunk_size=512)
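To make the effect of chunk_size more concrete, here is a minimal sketch of fixed-size chunking. It is not LlamaIndex’s splitter: it assumes whitespace-separated words stand in for tokens, and the function name chunk_words and its overlap parameter are hypothetical, introduced only for illustration.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Assumption: words approximate tokens. A real splitter counts tokens with a
    tokenizer and usually tries to respect sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"word{i}" for i in range(1200))
    print([len(chunk.split()) for chunk in chunk_words(sample, chunk_size=512)])
```

Smaller chunks tend to give more precise retrieval hits but carry less surrounding context, so the right value depends on the documents being indexed.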
It’s finally time to build our knowledge base. Initializing the DuckDBVectorStore and passing it to the StorageContext tells LlamaIndex to use DuckDB for storage and retrieval.
By passing the documents, the embedding model, the DuckDB storage context, and the documents’ service context to VectorStoreIndex, we can create our knowledge base.
In the following code snippet, the DuckDBVectorStore is initialized with a database name and the directory in which to persist your knowledge base:
vector_store = DuckDBVectorStore(
    database_name="knowledge_base",
    persist_dir="./",
    embed_dim=384,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
knowledge_base = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    service_context=documents_service_context,
)
This means that a database file named knowledge_base will be created in the specified directory, which here is the current directory. The database file can be reused, so you can add new documents to it later. You can learn more about this here.
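As a rough sketch of what reusing the file could look like, the helper below reopens the persisted database and wraps it in an index. The function name load_knowledge_base is hypothetical, and the path ./knowledge_base is an assumption based on the database_name and persist_dir used above. It relies on DuckDBVectorStore.from_local and VectorStoreIndex.from_vector_store, so check both against the LlamaIndex version you have installed.

```python
def load_knowledge_base(embed_model, database_path="./knowledge_base"):
    """Reopen a persisted DuckDB vector store and wrap it in an index.

    Assumptions: `database_path` points at the file created earlier, and
    `embed_model` is the same embedding model used to build the knowledge base.
    """
    # Imported inside the function so this module can be imported
    # even where LlamaIndex is not installed.
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.duckdb import DuckDBVectorStore

    # Open the existing database file instead of creating a new one.
    vector_store = DuckDBVectorStore.from_local(database_path)

    # Build an index on top of the stored embeddings without re-embedding them.
    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
```

With the index reloaded, new documents can then be added one at a time, for example with knowledge_base.insert(document).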
Note: It is important to specify the dimensions of the vector embeddings (embed_dim=384 here), because this value determines the data type of the embedding field when the table that stores the embeddings is created.
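The sketch below illustrates why the dimension matters. It assumes the embedding field is a fixed-size DuckDB array such as FLOAT[384]; the exact column definition the integration generates is not shown in this article. The helper names embedding_column_type and check_embedding are hypothetical.

```python
from typing import List, Sequence

EMBED_DIM = 384  # same value as embed_dim passed to DuckDBVectorStore


def embedding_column_type(embed_dim: int) -> str:
    """Return a fixed-size DuckDB array type, e.g. FLOAT[384]."""
    if embed_dim <= 0:
        raise ValueError("embed_dim must be a positive integer")
    return f"FLOAT[{embed_dim}]"


def check_embedding(vector: Sequence[float], embed_dim: int = EMBED_DIM) -> List[float]:
    """Reject vectors whose length does not match the declared dimension."""
    if len(vector) != embed_dim:
        raise ValueError(
            f"expected {embed_dim} dimensions, got {len(vector)}; "
            "the embedding model and embed_dim must agree"
        )
    return [float(x) for x in vector]


if __name__ == "__main__":
    print(embedding_column_type(EMBED_DIM))
    print(len(check_embedding([0.0] * EMBED_DIM)))
```

If you switch to an embedding model with a different output size, embed_dim has to change with it, and a table created with the old dimension would no longer fit the new vectors.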