What is a Primary Key?

Back to DuckDB Data Engineering Glossary

Definition

A primary key is a column or set of columns in a database table that uniquely identifies each row, ensuring no two rows can have the same values in the primary key columns. Primary keys help maintain data integrity and create relationships between tables, acting as a foundation for database design and queries.

Implementation in DuckDB

In DuckDB, you can define a primary key when creating a table using either column-level or table-level syntax. The database will automatically create an index for the primary key and enforce uniqueness. This indexing is critical for performance when tables are joined via foreign keys.

Column-level syntax:

Copy code

CREATE TABLE users ( id INTEGER PRIMARY KEY, username VARCHAR NOT NULL );

Table-level syntax, useful for composite primary keys:

Copy code

CREATE TABLE order_items ( order_id INTEGER, item_id INTEGER, quantity INTEGER, PRIMARY KEY (order_id, item_id) );

Key Characteristics

Primary keys must contain unique values and cannot be NULL. DuckDB will automatically enforce these constraints by rejecting any insertions or updates that would violate them. While some databases allow nullable primary keys, DuckDB follows the SQL standard strictly in this regard.

Primary Key vs. Unique Key

While both primary keys and unique keys enforce uniqueness within a database table, they serve different operational purposes:

  • Nullability: A primary key strictly cannot contain NULL values. A unique key can typically allow one or more NULL values (depending on the database system).
  • Quantity: A table can only have one primary key (though it can be a composite key made of multiple columns), but it can have multiple unique keys.
  • Purpose: The primary key acts as the definitive identifier for a row and is often referenced by foreign keys in other tables. Unique keys are used to prevent duplicate entries in alternate columns (such as an email column).

Best Practices

When choosing a primary key, consider using:

  • Auto-incrementing integers (or identity columns) for simple tables. These are fast and require minimal storage (4-8 bytes).
  • UUIDs (Universally Unique Identifiers) for distributed databases or when merging data from multiple sources. While they take more storage (16 bytes), they prevent ID conflicts.
  • Natural unique identifiers like product codes or standard identifiers, but only if they are strictly guaranteed to remain unique and never change.
  • Composite keys when a single column isn't sufficient to uniquely identify rows (e.g., combining order_id and item_id).

For example:

Copy code

-- Using a sequence for auto-incrementing IDs CREATE SEQUENCE user_id_seq; CREATE TABLE users ( id INTEGER PRIMARY KEY DEFAULT nextval('user_id_seq'), email VARCHAR UNIQUE NOT NULL );

Troubleshooting Primary Key Constraint Violations

When inserting or updating data, you may encounter a "Violation of PRIMARY KEY constraint" or "Duplicate key" error. This happens when a query attempts to insert a value that already exists in the primary key column.

How to resolve it:

  1. Check for duplicates: Run a SELECT query grouped by your primary key column using HAVING COUNT(*) > 1 to identify the conflicting rows.
  2. Use UPSERTs: If you intentionally want to overwrite existing data, use an INSERT ... ON CONFLICT statement (supported in DuckDB and PostgreSQL) to update the record instead of failing.
  3. Verify sequences: If you are using an auto-incrementing integer, ensure the underlying sequence or identity column hasn't fallen out of sync with the table's maximum ID.