Python Faker: How to Generate Synthetic Data

Looking for synthetic data for your next project? Learn how to use the Faker Python library to generate realistic synthetic data.

Whether you’re a developer or a data professional, you’ll often need to experiment with synthetic data when working on a project. This could mean bootstrapping a database or creating a pandas dataframe to run a sample analysis.

In this tutorial, we’ll learn all about generating synthetic data with the Faker library. We’ll start by installing Faker in our working environment. Then, we’ll dive into the basics of data generation with Faker. We’ll also look at two practical examples of synthetic data generation:

Populating a database table with records
Creating a pandas dataframe for analysis

For all of this and more, let’s get started!

Introduction to Python Faker

Faker is a Python library for synthetic data generation. You can install Faker in your development environment using pip:

$ pip install Faker

As a good practice, install the library in a virtual environment instead of in the global system environment.

Now that we have installed Faker, let’s see how we can use it to generate synthetic data.

Note: Because the synthetic data generation is random, you may get different results when you run the following code snippets. If you want reproducibility across all runs of your program, set the seed like so: Faker.seed(random_seed).

Generating Basic Personal Info

First, let’s learn to generate basic information about a person, such as their name, address, and contact information.

Let’s instantiate a Faker object called fake. To generate fake data, call the following methods on the Faker object:

fake.name() for a fake name
fake.address() to get a fake address
fake.phone_number() to get a fake phone number
fake.email() to get a fake email address

That’s how intuitive it is to generate synthetic data with Faker. These fields are very useful when you want to create databases with profile information of clients, customers, and more.

from faker import Faker

fake = Faker()

# Generate a fake name
fake_name = fake.name()
print("Fake Name:", fake_name)

# Generate a fake address
fake_address = fake.address()
print("Fake Address:", fake_address)

# Generate fake contact information
fake_email = fake.email()
fake_phone_number = fake.phone_number()
print("Fake Email:", fake_email)
print("Fake Phone Number:", fake_phone_number)

Here’s a sample output:

# Sample output

Fake Name: Shannon Martin
Fake Address: Unit 3135 Box 5789
DPO AA 86734
Fake Email: brownlarry@example.com
Fake Phone Number: +1-830-443-3886x793

Generating Dates and Times

You can also generate dates and times with Faker. This can be helpful when you need data to represent:

Product purchase dates
Time of placing an order
Date and time (for timestamp data)

The following code snippet shows how you can generate sample dates, times, and date time entries using Faker:

from faker import Faker

fake = Faker()

# Generate a fake date (here, a date of birth)
fake_date = fake.date_of_birth()
print("Date of Birth:", fake_date)

# Generate a fake time
fake_time = fake.time()
print("Time:", fake_time)

# Generate a fake date and time
fake_datetime = fake.date_time()
print("Date and Time:", fake_datetime)

Here’s a sample output:

# Sample output

Date of Birth: 2012-08-04
Time: 12:39:37
Date and Time: 2008-05-03 20:41:44.007498

Generating Geographical Data

Faker also lets you generate geographical data. You can generate latitudes and longitudes as shown:

from faker import Faker

fake = Faker()

# Generate fake latitude and longitude
fake_latitude = fake.latitude()
fake_longitude = fake.longitude()

print("Latitude:", fake_latitude)
print("Longitude:", fake_longitude)

Running this snippet will give you a latitude and longitude.

# Sample output
Latitude: -61.8984755
Longitude: 52.984726

You can use this to spin up geographical data.

Now, let’s code a couple of practical examples to see how synthetic data generation with Faker is helpful.

Practical Examples of Python Faker

Example 1: Populating a Database with Faker

Now, let’s learn how to use Python Faker to populate a database table with records and run a sample query on it.

This includes the following steps:

Setting up the database and the database table
Generating and inserting fake data
Querying the database

Step 1: Set Up the Database and the Database Table

First, let’s create a database that we can connect to. We’ll use SQLite because Python’s built-in sqlite3 module lets us work with SQLite databases without installing anything else.

We can use the connect() function to connect to the database (fake_data.db). This will create fake_data.db if it does not exist already.

In the database, we create a users table with the following fields:

id
name
email
address

import sqlite3

# Create or connect to the SQLite database
conn = sqlite3.connect('fake_data.db')

# Create a cursor object to interact with the database
cursor = conn.cursor()

# Create the 'users' table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
''')

# Commit the changes and close the connection
conn.commit()
conn.close()

To make persistent changes to the database, we commit the transaction.
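
If you want to confirm that the table was actually created, you can inspect SQLite's schema. The sketch below repeats the table definition against an in-memory database (":memory:") so that it does not touch fake_data.db. That substitution is an assumption made to keep the example self-contained.

```python
import sqlite3

# In-memory database, used here instead of fake_data.db for illustration
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
''')
conn.commit()

# sqlite_master lists every table in the database
cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
    ("users",),
)
table_exists = cursor.fetchone() is not None

# PRAGMA table_info returns one row per column; index 1 holds the column name
cursor.execute("PRAGMA table_info(users)")
column_names = [row[1] for row in cursor.fetchall()]

conn.close()

print("Table exists:", table_exists)
print("Columns:", column_names)
```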

Step 2: Generate and Insert Fake Data

Next, we’ll generate synthetic records with Faker and insert them into the users table in the fake_data.db database.

from faker import Faker
import sqlite3

# Create a Faker instance
fake = Faker()

# Connect to the SQLite database
conn = sqlite3.connect('fake_data.db')
cursor = conn.cursor()

# Generate and insert fake data into the 'users' table
for _ in range(10):  # Insert 10 fake records as an example
    name = fake.name()
    email = fake.email()
    address = fake.address()
    
    cursor.execute('''
        INSERT INTO users (name, email, address) VALUES (?, ?, ?)
    ''', (name, email, address))

# Commit the changes and close the connection
conn.commit()
conn.close()

As in the previous step, we call conn.commit() so that the inserted records persist in the database.

Step 3: Query the Database

Now, let’s run a simple select query to retrieve all the records from the users table.

We can pass in the query string as the argument to execute().
To fetch the results of the query, we can call the fetchall() method on the cursor object.

Run the following code:

import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('fake_data.db')
cursor = conn.cursor()

# Example SQL query to retrieve all users
cursor.execute('SELECT * FROM users')

# Fetch and display the results
for row in cursor.fetchall():
    print(f"ID: {row[0]}, Name: {row[1]}, Email: {row[2]}, Address: {row[3]}")

# Close the connection
conn.close()

This returns a result set with all the records in the users table:

# Sample Output
ID: 1, Name: Dennis Montgomery, Email: djohnson@example.com, Address: Unit 8516 Box 0119
DPO AA 34640
ID: 2, Name: Megan Jones, Email: timothyshepherd@example.com, Address: 163 Christopher Trafficway
Torresberg, CT 12056
ID: 3, Name: Martin Cowan, Email: jason14@example.com, Address: 481 Theresa Port Apt. 855
Alisonside, MA 04239
ID: 4, Name: Russell Spears, Email: lloydcarrie@example.net, Address: 3809 Mary Road
Port Markchester, AS 35796
ID: 5, Name: Mrs. Stephanie Davis MD, Email: orobertson@example.net, Address: 42189 Joseph Summit
Montesfurt, AK 49743
ID: 6, Name: Nathaniel Knox, Email: crystal52@example.org, Address: 77419 Bill Heights Apt. 392
South Victor, IL 36820
ID: 7, Name: Brenda Moore, Email: jeffreycarlson@example.com, Address: Unit 1415 Box 0020
DPO AE 19235
ID: 8, Name: Troy Robinson, Email: christinanguyen@example.net, Address: Unit 4696 Box 1935
DPO AE 12468
ID: 9, Name: Karla Kelly, Email: andersonheather@example.net, Address: 70435 Sabrina Ville Suite 049
Davisfurt, NY 04768
ID: 10, Name: Lindsay Wood, Email: david58@example.com, Address: 100 Gabriella Plaza
Powellland, NY 54270
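
Each record in the output above spans two lines because the addresses returned by fake.address() contain a newline character. The sketch below shows one way to handle that, along with two optional refinements: accessing columns by name through sqlite3.Row, and filtering with a parameterized WHERE clause. It uses an in-memory database seeded with three rows copied from the sample output, so the data is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can now be indexed by column name

conn.execute('''
    CREATE TABLE users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
''')

# Three rows taken from the sample output above
sample_rows = [
    ("Megan Jones", "timothyshepherd@example.com",
     "163 Christopher Trafficway\nTorresberg, CT 12056"),
    ("Russell Spears", "lloydcarrie@example.net",
     "3809 Mary Road\nPort Markchester, AS 35796"),
    ("Lindsay Wood", "david58@example.com",
     "100 Gabriella Plaza\nPowellland, NY 54270"),
]
conn.executemany(
    "INSERT INTO users (name, email, address) VALUES (?, ?, ?)",
    sample_rows,
)

# Parameterized filter: only users with an example.com email address
cursor = conn.execute(
    "SELECT id, name, address FROM users WHERE email LIKE ? ORDER BY id",
    ("%@example.com",),
)

matches = []
for row in cursor:
    # Collapse the two-line address into a single line for display
    one_line_address = row["address"].replace("\n", ", ")
    matches.append((row["id"], row["name"]))
    print(f"ID: {row['id']}, Name: {row['name']}, Address: {one_line_address}")

conn.close()
```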

Example 2: Populating a DataFrame with Faker

Let’s wrap up by discussing another practical example: creating a dataframe for analysis with Faker.

Create a Pandas Dataframe with Records

We’ll create a pandas dataframe of employee records, with each record containing the following fields:

name
age
department
salary

We’ll then use Faker to create and insert 100 records into the employees dataframe:

from faker import Faker
import pandas as pd

# Create a Faker instance and set the seed for reproducibility
fake = Faker()
Faker.seed(27)

# Generate fake employee data
num_employees = 100
employee_data = []

for _ in range(num_employees):
    name = fake.name()
    age = fake.random_int(min=22, max=60)
    department = fake.random_element(elements=('HR', 'Finance', 'Engineering', 'IT', 'Sales', 'Marketing'))
    salary = fake.random_int(min=30000, max=120000)
    
    employee_data.append([name, age, department, salary])

# Create a DataFrame with the generated data
columns = ['Name', 'Age', 'Department', 'Salary']
employee_df = pd.DataFrame(employee_data, columns=columns)

Because we set the seed, we’ll get the same records whenever we run this code with the same version of Faker. We can call the head() method with 10 as the argument to inspect the first 10 rows of the employees dataframe:

employee_df.head(10)

To get descriptive statistics of the numeric columns (Age and Salary), we can call the describe() method as shown:

employee_df.describe()

Visualizing Age Distribution

First, let’s plot a histogram to understand the distribution of ages of employees.

import matplotlib.pyplot as plt

# Create a histogram for age distribution
plt.figure(figsize=(8, 6), dpi=150)
employee_df['Age'].plot(kind='hist', bins=20, title='Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Visualizing Department Distribution

We can also create a bar plot to understand the number of employees in each department. This helps get an idea of the representation of each department within the organization.

# Create a bar chart for department distribution
plt.figure(figsize=(8, 6),dpi=150)
department_counts = employee_df['Department'].value_counts()
department_counts.plot(kind='bar', title='Department Distribution')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Visualizing Salary Distribution by Department

Next, let’s visualize the salary distribution by department. We’ll use a seaborn box plot because it shows the spread of values within each department. If the spread is too large, you can switch the y-axis to a log scale, as the snippet does with set_yscale("log").

import matplotlib.pyplot as plt
import seaborn as sns

# Create a box plot to visualize salary distribution by department
plt.figure(figsize=(8, 6), dpi=150)
plt.title('Salary Distribution by Department')
ax = sns.boxplot(x='Department', y='Salary', data=employee_df)
ax.set_yscale("log")  
plt.xlabel('Department')
plt.ylabel('Salary (Log Scale)')
plt.xticks(rotation=45)
plt.show()

Conclusion

In this tutorial, we learned all about generating synthetic data with Faker.

We started by learning how to install Faker and generate different types of data, including personal information, dates and times, and geographical coordinates. Then, we looked at using Faker to populate databases and pandas dataframes with sample data.

So, are you ready to use Faker in your next project?

Bala Priya C
Contributor
Bala Priya is an experienced developer and technical writer who loves giving back to the developer community through writing technical tutorials, how-to guides, etc. Knowing the intricacies of the tech world, she is an active contributor who provides her readers with simple-to-follow content on extremely technical topics.