Looking for synthetic data for your next project? Learn how to use the Faker Python library to generate realistic synthetic data.
Whether you’re a developer or a data professional, you’ll often need to experiment with synthetic data when working on a project. This could mean bootstrapping a database or creating a pandas dataframe to run a sample analysis.
In this tutorial, we’ll learn all about generating synthetic data with the Faker library. We’ll start by installing Faker in our working environment. Then, we’ll dive into the basics of data generation with Faker. We’ll also look at two practical examples of synthetic data generation:
- Populating a database table with records
- Creating a pandas dataframe for analysis
For all of this and more, let’s get started!
Introduction to Python Faker
Faker is a Python library for synthetic data generation. You can install Faker in your development environment using pip:
$ pip install Faker
As a good practice, install the library in a virtual environment instead of in the global system environment.
Now that we have installed Faker, let’s see how we can use it to generate synthetic data.
Note: Because the synthetic data generation is random, you may get different results when you run the following code snippets. If you want reproducibility across all runs of your program, set the seed like so: Faker.seed(random_seed).
Generating Basic Personal Info
First, let’s learn to generate basic information about a person, such as their name, address, and contact information.
Let’s instantiate a Faker object called fake. To generate fake data, call the following methods on the Faker object:
- fake.name() for a fake name
- fake.address() to get a fake address
- fake.phone_number() to get a fake phone number
- fake.email() to get a fake email address
That’s how intuitive it is to generate synthetic data with Faker. These fields are very useful when you want to create databases with profile information of clients, customers, and more.
from faker import Faker
fake = Faker()
# Generate a fake name
fake_name = fake.name()
print("Fake Name:", fake_name)
# Generate a fake address
fake_address = fake.address()
print("Fake Address:", fake_address)
# Generate fake contact information
fake_email = fake.email()
fake_phone_number = fake.phone_number()
print("Fake Email:", fake_email)
print("Fake Phone Number:", fake_phone_number)
Here’s a sample output:
# Sample output
Fake Name: Shannon Martin
Fake Address: Unit 3135 Box 5789
DPO AA 86734
Fake Email: brownlarry@example.com
Fake Phone Number: +1-830-443-3886x793
Generating Dates and Times
You can also generate dates and times with Faker. This can be helpful when you need data to represent:
- Product purchase dates
- Time of placing an order
- Date and time (for timestamp data)
The following code snippet shows how you can generate sample dates, times, and datetime entries using Faker:
from faker import Faker
fake = Faker()
# Generate a fake date of birth
fake_date = fake.date_of_birth()
print("Date of Birth:", fake_date)
# Generate a fake time
fake_time = fake.time()
print("Time:", fake_time)
# Generate a fake date and time
fake_datetime = fake.date_time()
print("Date and Time:", fake_datetime)
Here’s a sample output:
# Sample output
Date of Birth: 2012-08-04
Time: 12:39:37
Date and Time: 2008-05-03 20:41:44.007498
Generating Geographical Data
Faker also lets you generate geographical data. You can generate latitudes and longitudes as shown:
from faker import Faker
fake = Faker()
# Generate fake latitude and longitude
fake_latitude = fake.latitude()
fake_longitude = fake.longitude()
print("Latitude:", fake_latitude)
print("Longitude:", fake_longitude)
Running this snippet will give you a latitude and longitude.
# Sample output
Latitude: -61.8984755
Longitude: 52.984726
You can use this to spin up geographical data.
Now, let’s code a couple of practical examples to see how synthetic data generation with Faker is helpful.
Practical Examples of Python Faker
Example 1: Populating a Database with Faker
Now, let’s learn how to use Python Faker to populate a database table with records and run a sample query on it.
This includes the following steps:
- Setting up the database and the database table
- Generating and inserting fake data
- Querying the database
Step 1: Set Up the Database and the Database Table
First, let’s create a database that we can connect to. We’ll use SQLite because Python ships with the built-in sqlite3 module, so we can work with SQLite databases without installing anything else.
We can use the connect() function to connect to the database (fake_data.db). This will create fake_data.db if it does not exist already.
In the database, we create a users table with the following fields:
- id
- name
- email
- address
import sqlite3
# Create or connect to the SQLite database
conn = sqlite3.connect('fake_data.db')
# Create a cursor object to interact with the database
cursor = conn.cursor()
# Create the 'users' table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
''')
# Commit the changes and close the connection
conn.commit()
conn.close()
To make persistent changes to the database, we commit the transaction.
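To confirm that the table has the columns we expect, we can ask SQLite for the table’s schema. The sketch below is a minimal check using PRAGMA table_info. It runs against an in-memory database (an assumption made here so the example does not touch fake_data.db), but the same query works on a file-based database.

```python
import sqlite3

# Use an in-memory database so this check leaves fake_data.db untouched
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
""")
conn.commit()

# Each row of PRAGMA table_info describes one column; index 1 is the column name
column_names = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(column_names)

conn.close()
```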
Step 2: Generate and Insert Fake Data
Next, we’ll generate synthetic records with Faker and insert them into the users table in the fake_data.db database.
from faker import Faker
import sqlite3
# Create a Faker instance
fake = Faker()
# Connect to the SQLite database
conn = sqlite3.connect('fake_data.db')
cursor = conn.cursor()
# Generate and insert fake data into the 'users' table
for _ in range(10):  # Insert 10 fake records as an example
    name = fake.name()
    email = fake.email()
    address = fake.address()
    cursor.execute('''
        INSERT INTO users (name, email, address) VALUES (?, ?, ?)
    ''', (name, email, address))
# Commit the changes and close the connection
conn.commit()
conn.close()
As with the previous step, we commit the transaction using conn.commit() so that the inserted records persist in the database.
Step 3: Query the Database
Now, let’s run a simple select query to retrieve all the records from the users table.
- We can pass in the query string as the argument to execute().
- To fetch the results of the query, we can call the fetchall() method on the cursor object.
Run the following code:
import sqlite3
# Connect to the SQLite database
conn = sqlite3.connect('fake_data.db')
cursor = conn.cursor()
# Example SQL query to retrieve all users
cursor.execute('SELECT * FROM users')
# Fetch and display the results
for row in cursor.fetchall():
    print(f"ID: {row[0]}, Name: {row[1]}, Email: {row[2]}, Address: {row[3]}")
# Close the connection
conn.close()
This returns a result set with all the records in the users table:
# Sample Output
ID: 1, Name: Dennis Montgomery, Email: djohnson@example.com, Address: Unit 8516 Box 0119
DPO AA 34640
ID: 2, Name: Megan Jones, Email: timothyshepherd@example.com, Address: 163 Christopher Trafficway
Torresberg, CT 12056
ID: 3, Name: Martin Cowan, Email: jason14@example.com, Address: 481 Theresa Port Apt. 855
Alisonside, MA 04239
ID: 4, Name: Russell Spears, Email: lloydcarrie@example.net, Address: 3809 Mary Road
Port Markchester, AS 35796
ID: 5, Name: Mrs. Stephanie Davis MD, Email: orobertson@example.net, Address: 42189 Joseph Summit
Montesfurt, AK 49743
ID: 6, Name: Nathaniel Knox, Email: crystal52@example.org, Address: 77419 Bill Heights Apt. 392
South Victor, IL 36820
ID: 7, Name: Brenda Moore, Email: jeffreycarlson@example.com, Address: Unit 1415 Box 0020
DPO AE 19235
ID: 8, Name: Troy Robinson, Email: christinanguyen@example.net, Address: Unit 4696 Box 1935
DPO AE 12468
ID: 9, Name: Karla Kelly, Email: andersonheather@example.net, Address: 70435 Sabrina Ville Suite 049
Davisfurt, NY 04768
ID: 10, Name: Lindsay Wood, Email: david58@example.com, Address: 100 Gabriella Plaza
Powellland, NY 54270
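Beyond selecting every row, you can filter the fake records just as you would real data. The sketch below is one possible example: it loads three of the rows from the sample output above into an in-memory database (an assumption made to keep the example self-contained) and uses a parameterized LIKE query to find users whose email ends in example.com.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        email TEXT,
        address TEXT
    )
""")

# Three rows copied from the sample output above
sample_rows = [
    ("Megan Jones", "timothyshepherd@example.com",
     "163 Christopher Trafficway\nTorresberg, CT 12056"),
    ("Russell Spears", "lloydcarrie@example.net",
     "3809 Mary Road\nPort Markchester, AS 35796"),
    ("Lindsay Wood", "david58@example.com",
     "100 Gabriella Plaza\nPowellland, NY 54270"),
]
conn.executemany(
    "INSERT INTO users (name, email, address) VALUES (?, ?, ?)", sample_rows
)

# Pass the pattern as a parameter instead of formatting it into the SQL string
cursor = conn.execute(
    "SELECT name, email FROM users WHERE email LIKE ? ORDER BY name",
    ("%@example.com",),
)
matches = cursor.fetchall()

for name, email in matches:
    print(f"Name: {name}, Email: {email}")

conn.close()
```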
Example 2: Populating a DataFrame with Faker
Let’s wrap up by discussing another practical example: creating a dataframe for analysis with Faker.
Create a Pandas Dataframe with Records
We’ll create a pandas dataframe of employee records, with each record containing the following fields:
- name
- age
- department
- salary
We’ll then use Faker to generate 100 records and load them into the employee_df dataframe:
from faker import Faker
import pandas as pd
# Create a Faker instance and set the seed for reproducibility
fake = Faker()
Faker.seed(27)
# Generate fake employee data
num_employees = 100
employee_data = []
for _ in range(num_employees):
    name = fake.name()
    age = fake.random_int(min=22, max=60)
    department = fake.random_element(elements=('HR', 'Finance', 'Engineering', 'IT', 'Sales', 'Marketing'))
    salary = fake.random_int(min=30000, max=120000)
    employee_data.append([name, age, department, salary])
# Create a DataFrame with the generated data
columns = ['Name', 'Age', 'Department', 'Salary']
employee_df = pd.DataFrame(employee_data, columns=columns)
Because we set the seed, we’ll get the same records whenever we run this code with the same version of Faker. We can call the head() method with 10 as the argument to inspect the first 10 rows of the employee_df dataframe:
employee_df.head(10)
To get descriptive statistics of the numeric columns (Age and Salary), we can call the describe() method as shown:
employee_df.describe()
Visualizing Age Distribution
First, let’s plot a histogram to understand the distribution of ages of employees.
import matplotlib.pyplot as plt
# Create a histogram for age distribution
plt.figure(figsize=(8, 6), dpi=150)
employee_df['Age'].plot(kind='hist', bins=20, title='Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Visualizing Department Distribution
We can also create a bar plot to understand the number of employees in each department. This helps get an idea of the representation of each department within the organization.
import matplotlib.pyplot as plt
# Create a bar chart for department distribution
plt.figure(figsize=(8, 6), dpi=150)
department_counts = employee_df['Department'].value_counts()
department_counts.plot(kind='bar', title='Department Distribution')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
Visualizing Salary Distribution by Department
Next, let’s visualize the salary distribution by department. We’ll use a box plot because it shows the spread of values within each department. If the spread is too large, you can use a log scale by setting the y-axis scale to “log”, as we do here.
import matplotlib.pyplot as plt
import seaborn as sns
# Create a box plot to visualize salary distribution by department
plt.figure(figsize=(8, 6), dpi=150)
plt.title('Salary Distribution by Department')
ax = sns.boxplot(x='Department', y='Salary', data=employee_df)
ax.set_yscale("log")
plt.xlabel('Department')
plt.ylabel('Salary (Log Scale)')
plt.xticks(rotation=45)
plt.show()
Conclusion
In this tutorial, we learned all about generating synthetic data with Faker.
We started by learning how to install Faker and generate different types of data, including profile information, dates, times, and geographical coordinates. Then, we looked at using Faker to populate databases and pandas dataframes with sample data.
So, are you ready to use Faker in your next project?