
Python Faker for DuckDB Fake Data Generation

2023/01/31


Why generate data?

There is a plethora of interesting public data out there. The DuckDB community regularly uses the NYC Taxi Data to demonstrate and test features as it’s a reasonably large set of data (billions of records) and it’s data the public understands. We’re very lucky to have this dataset, but like many data sources, the data is in need of cleaning.

You can see here that some taxi trips were taken seriously far in the future. Based on the fare_amount for the following 5-person trip in 2098, I’d say we can safely conclude that inflation will be on a downward or lateral trend over the next 60 years.

┌──────────────────────┬──────────┬─────────────────┬─────────────┐
│ tpep_pickup_datetime │ VendorID │ passenger_count │ fare_amount │
│      timestamp       │  int64   │      int64      │   double    │
├──────────────────────┼──────────┼─────────────────┼─────────────┤
│ 2098-09-11 02:23:31  │        2 │               5 │        22.5 │
│ 2090-12-31 06:41:26  │        2 │               2 │        52.0 │
│ 2088-01-24 00:25:39  │        2 │               1 │        14.5 │
│ 2088-01-24 00:15:42  │        2 │               1 │         4.5 │
│ 2084-11-04 12:32:24  │        2 │               1 │        10.0 │

Interestingly, all trips with dates in the future are posted from a single vendor (see data dictionary). Others have documented additional issues with dirty data. Of course, I could clean these up, but using these records as-is frequently makes me question my SQL skills. I’d rather use generated data where analysts can focus on how ducking awesome DuckDB is instead of how unclean the data is.

As a bonus, using generated data allows us to create data that’s better aligned with real-world use cases for the average analyst, as Anna Geller requests in a recent tweet.

Using Python Faker

Faker is a Python package for generating fake data, with a large number of providers for generating different types of data, such as people, credit cards, dates/times, cars, phone numbers, etc. Many of the included and community providers are even localized for different regions [where bank accounts, phone numbers, etc. are different].
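For example, passing a locale when constructing the Faker instance (Italian here, purely as an illustration) yields region-appropriate names, phone numbers and companies:

from faker import Faker

# Locale-specific generator; 'it_IT' is just an illustrative choice
fake_it = Faker('it_IT')

print(fake_it.name())          # Italian-style name
print(fake_it.phone_number())  # Italian-format phone number
print(fake_it.company())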

Keep in mind that the data we generate won’t be perfect [distributions, values, etc.] unless we tune the out-of-the-box code. But oftentimes you just need someone who looks and quacks like a duck, but is not an actual duck.


Here’s a simple example of using Python Faker to generate a person record, with a name, email, company, etc.:

import random
from faker import Faker

fake = Faker()

person = {}
# person['id'] = fake.ssn()
person['id'] = random.randrange(1000,9999999999999)
person['first_name'] = fake.first_name()
person['last_name'] = fake.last_name()
person['email'] = fake.unique.ascii_email()
person['company'] = fake.company()
person['phone'] = fake.phone_number()

You’ll notice I commented out generating the ID as a US Social Security Number (SSN), because that’s just scary and bad practice. Instead, I generated a random number in a specified range using random. This value is not guaranteed to be unique, so you might want to check for uniqueness in your Python code. Alternatively, you could add a UNIQUE or PRIMARY KEY constraint in DuckDB (here are the internals and examples), but that could add significant overhead when loading large amounts of data.
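If you do want uniqueness without a database constraint, a minimal sketch (not from my actual generator) is to track the IDs you’ve already issued in memory:

import random

# Keep a set of issued IDs so the same value is never handed out twice.
# Fine for moderate volumes; for billions of rows this set gets very large.
used_ids = set()

def unique_random_id():
  while True:
    candidate = random.randrange(1000, 9999999999999)
    if candidate not in used_ids:
      used_ids.add(candidate)
      return candidate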

Additional caveat: there is no guarantee of consistency between company name, email, and the first/last name – you could easily end up with:

{'id': 7529464536979,
 'first_name': 'Vanessa',
 'last_name': 'Snyder',
 'email': 'margaret91@yahoo.com',
 'company': 'Garcia, James and Fisher',
 'phone': '902-906-4495x0016'}
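If that consistency matters for your use case, you can derive dependent fields from values you’ve already generated. Here’s a rough sketch (not what my generator does) that builds the email from the generated name with Faker’s free_email_domain provider:

from faker import Faker

fake = Faker()

person = {}
person['first_name'] = fake.first_name()
person['last_name'] = fake.last_name()
# Derive the email from the name so the record is self-consistent
person['email'] = '%s.%s@%s' % (person['first_name'].lower(),
                                person['last_name'].lower(),
                                fake.free_email_domain())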

Inserting Data into DuckDB

There are at least four different ways to insert this generated data into DuckDB:

  • SQL prepared statements
  • Pandas DataFrames inserted directly into DuckDB
  • CSV files copied into DuckDB
  • Parquet files copied into DuckDB

As it’s the least efficient way and not recommended, I’m not going to demonstrate how to use prepared statements with executemany. The DuckDB documentation explicitly warns against this method.

Pandas DataFrames directly into DuckDB

DuckDB in Python has access to Pandas DataFrame objects in the current scope. In order to insert the person dict into a DuckDB database, you can create a DataFrame from the dict and execute a DuckDB SQL query like CREATE TABLE persons AS SELECT * from df.

Here’s an example of inserting 10 generated persons into a table of the same name in DuckDB:

import random
import duckdb
import pandas as pd
from faker import Faker
import fastparquet
import sys

fake = Faker()

def get_person():
  person = {}
  person['id'] = random.randrange(1000,9999999999999)
  person['first_name'] = fake.first_name()
  person['last_name'] = fake.last_name()
  person['email'] = fake.unique.ascii_email()
  person['company'] = fake.company()
  person['phone'] = fake.phone_number()
  return person

personlist = []
for x in range(10):
  personlist.append(get_person())

df = pd.DataFrame.from_dict(personlist)

con = duckdb.connect()
con.execute("CREATE TABLE persons AS SELECT * FROM df")
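Continuing with the same connection, a quick sanity check (not part of the original script) confirms the rows landed:

# Expect a count of 10, matching the generated records
print(con.execute("SELECT count(*) FROM persons").fetchone())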

CSV files copied into DuckDB

If you’ve worked with CSV files in Python, you’re probably already familiar with the csv module and perhaps the CSV DictWriter constructor.

Once we have a person it’s quite easy to write that person to a CSV file. You’ll notice that I append the first command-line argument as a suffix on the filename. You’ll understand the reason for this when we go to parallelize the execution of this code later in the post.

import csv
import sys

# Write each generated person to a CSV shard named with the command-line suffix
pcsv = open('out/persons_%s.csv' % sys.argv[1], 'w')

pwriter = csv.DictWriter(pcsv, fieldnames=['id','first_name','last_name','email','company','phone'])

pwriter.writeheader()
pwriter.writerow(person)
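To then copy those CSV shards into DuckDB, read_csv_auto with a glob pattern does the trick. Here’s a minimal sketch using the Python API, assuming the out/persons_*.csv files written above:

import duckdb

# Load every CSV shard written by the generator into a persistent database file
con = duckdb.connect('./people.ddb')
con.execute("CREATE TABLE persons AS SELECT * FROM read_csv_auto('out/persons_*.csv')")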

Parquet files copied into DuckDB

Pandas DataFrames are actually a great way to create parquet files which can then be loaded into DuckDB. After we create a DataFrame containing 10 person records (in the code above), we can use the fastparquet library to write them to a parquet file:

# Write out pandas DataFrame df to parquet, using suffix passed on command-line
fastparquet.write('outfile_%s.parquet' % sys.argv[1], df)

Instead of only generating 10 records at a time, I changed the code to generate 100k person records and save them into a parquet file:

personlist = []
for x in range(100000):
  personlist.append(get_person())

df = pd.DataFrame.from_dict(personlist)
fastparquet.write('outfile_%s.parquet' % sys.argv[1], df)

This only took 26 seconds to execute:

# time the execution of my python code
# use 6 as the suffix for the parquet file as i already have outfile_[1-5]
time python generate.py 6
python generate.py 6  28.35s user 0.22s system 108% cpu 26.241 total

Of course, we probably want to check that this worked well, and that’s super easy to do by pointing DuckDB at the parquet files with a simple glob pattern:

$ echo "SELECT id,first_name,last_name FROM 'outfile_*.parquet'" | duckdb
┌───────────────┬─────────────┬───────────┐
│      id       │ first_name  │ last_name │
│     int64     │   varchar   │  varchar  │
├───────────────┼─────────────┼───────────┤
│ 6161138431505 │ Michael     │ Kane      │
│ 1867355902434 │ Jordan      │ Jarvis    │
│ 5655135874036 │ Arthur      │ Haley     │
│ 8004712047366 │ Kim         │ Welch     │
│       ·       │   ·         │   ·       │
│       ·       │   ·         │   ·       │
│       ·       │   ·         │   ·       │
│ 7479524472455 │ Justin      │ Carey     │
│ 1347469827969 │ Randy       │ Rosario   │
│ 7555403134688 │ Jessica     │ Morris    │
├───────────────┴─────────────┴───────────┤
│ 100000 rows (40 shown)        3 columns │
└─────────────────────────────────────────┘

Now we’re ready to load our parquet files into DuckDB. You can do this in one line in the shell:

echo "CREATE TABLE persons AS SELECT * FROM 'outfile_*.parquet'" | duckdb ./people.ddb

At the time that I ran this query, I had 10M rows in the parquet files and it loaded in 2.8 seconds.

Easy Parallelization using GNU Parallel

I’m a big fan of using shell utilities – everything from sed and grep to my all-time favorite ParallelSSH, which I used almost 20 years ago to maintain a fleet of machines. In this case though (as long as we have fast enough machines, including I/O), we don’t need parallel execution across many machines, but can use GNU Parallel to execute the same python code many times in parallel.

The following code will execute generate.py 10 times, resulting in parquet files with a total of 1M person records, sharded into 10 different files. It takes only 33 seconds of wall-clock time on a 10-core machine.

# pass numbers 1 to 10 on separate lines to GNU parallel
seq 10 | time parallel python generate.py
parallel python generate.py  307.06s user 5.72s system 938% cpu 33.313 total

Note that this will result in calling:

python generate.py 1
python generate.py 2
…
python generate.py 10

I tried increasing the seq 10 to seq 100 and it scaled linearly. By default, GNU Parallel only runs 1 job in parallel for each CPU core. If you suspect that it’ll work faster for your use case, you can actually launch more than 1 job per core and let the scheduler optimize. Here’s how you launch 2 jobs for every core:

seq 100 | time parallel -j 200% python generate.py

For this particular job, it actually takes slightly longer to do this since, from the looks of top and the size of the generated files, we’re already maxing out a CPU core for each Python process.

Generating 1 Billion People

I used the GNU Parallel technique discussed above on a hefty m6i.32xlarge instance on Amazon EC2 to generate a billion people across 1,000 parquet files. This took about 2 hours. After generation, executing a full table scan query (SELECT SUM(id) FROM '*.parquet' WHERE email LIKE '%gmail.com') took only 6 seconds.

I then loaded the data into DuckDB's native storage, producing a 36GB DuckDB file in about 2 minutes. My first full table scan query took 10.82 seconds, but subsequent queries (with different values, so not benefiting from caching) took only 1.03 seconds. Whoa!

Next Steps

In this example, we only generated fake records for person objects. In my code, I actually generated a person, along with a corresponding bank account and address. I then generated a random number of additional accounts and addresses for each person. These were populated into different tables in DuckDB, related to each other by the ID generated for a person.

My code initially used a CSV format for testing loading performance of CSVs. In other scenarios, I’d likely choose generating parquet files as they’re much more efficient on disk.

# generate 1.5B people and 1.5B+ addresses and accounts
records = 1500000000

print("Generating %s random people records" % records)

# get_fake_data() returns a (person, address, account) tuple built with Faker;
# pwriter, awriter and bwriter are csv.DictWriter objects for the persons,
# addresses and accounts files (their setup is omitted here).
for y in range(records):
  (person, address, bacct) = get_fake_data()
  pwriter.writerow(person)
  awriter.writerow(address)
  bwriter.writerow(bacct)

  # give each person 0 or 1 additional addresses...
  for x in range(random.randrange(0, 2)):
    address = get_fake_address(person['id'])
    awriter.writerow(address)

  # ...and 0 to 2 additional bank accounts, all linked by the person's id
  for x in range(random.randrange(0, 3)):
    bacct = get_fake_account(person['id'])
    bwriter.writerow(bacct)

What generated data will you make using Faker? Let us know on Twitter via @motherduck.

