How to Create a Pandas DataFrame [With Examples]

Learn the basics of working with pandas DataFrames: the basic data structure in pandas, a powerful data manipulation library.

If you’d like to get started with data analysis in Python, pandas is one of the first libraries you should learn to work with. From importing data from multiple sources, such as CSV files and databases, to handling missing data and analyzing it to gain insights, pandas lets you do all of the above.

To start analyzing data with pandas, you should understand the fundamental data structure in pandas: data frames.

In this tutorial, you’ll learn the basics of pandas dataframes and common methods to create dataframes. You’ll then learn how to select rows and columns from the dataframe for retrieving subsets of data.

For all this and more, let’s get started.

Installing and Importing Pandas

As pandas is a third-party data analysis library, you should first install it. It’s recommended to install external packages in a virtual environment for your project.

If you use the Anaconda distribution of Python, you can use conda for package management.

conda install pandas

You can also install pandas using pip:

pip install pandas

⚙ The pandas library requires NumPy as a dependency. So if NumPy is not already installed, it will also be installed during the installation process.

After installing pandas, you can import it into your working environment. In general, pandas is imported under the alias pd:

import pandas as pd

What Is a DataFrame in Pandas?

The fundamental data structure in pandas is the data frame. A data frame is a two-dimensional array of data with a labeled index and named columns. Each column in the data frame, called a pandas series, shares a common index.

Here’s an example data frame that we’ll create from scratch over the next few minutes. This data frame contains data on how much six students spend in four weeks.

(Image: the example data frame, with student names as row labels and columns Week1 to Week4.)

The names of the students are the row labels. And the columns are named ‘Week1’ to ‘Week4’. Notice that all the columns share the same set of row labels, also called the index.
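To make the idea of columns as series concrete, here is a minimal sketch. The two-student frame, its values, and the name mini_df are made up for illustration and are not part of the tutorial’s data set.

```python
import pandas as pd

# Hypothetical two-student frame, used only to illustrate the structure
mini_df = pd.DataFrame(
    {"Week1": [10, 20], "Week2": [30, 40]},
    index=["Amy", "Bob"],
)

week1 = mini_df["Week1"]

# Each column is a pandas Series...
print(type(week1).__name__)
# ...and every column carries the same row labels as the frame itself
print(week1.index.equals(mini_df.index))
print(mini_df["Week2"].index.equals(mini_df.index))
```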

How to Create a Pandas DataFrame

There are several ways to create a pandas data frame. In this tutorial, we will discuss the following methods:

  • Creating a data frame from NumPy arrays
  • Creating a data frame from a Python dictionary
  • Creating a data frame by reading in CSV files

From NumPy Arrays

Let’s create a data frame from a NumPy array.

First, create the data array of shape (6,4), assuming that in any given week, each student spends anywhere between $0 and $100. The randint() function from NumPy’s random module returns an array of random integers in the half-open interval [low, high), so we set high to 101 to include 100.

import numpy as np
np.random.seed(42)  # fix the seed so the numbers are reproducible
data = np.random.randint(0,101,(6,4))  # random integers from 0 to 100
data
# array([[51, 92, 14, 71],
#        [60, 20, 82, 86],
#        [74, 74, 87, 99],
#        [23,  2, 21, 52],
#        [ 1, 87, 29, 37],
#        [ 1, 63, 59, 20]])
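Because the upper bound is exclusive, passing 101 is what allows 100 to appear as a possible value. A quick sanity check, as a sketch (the name sample is ours, not the tutorial’s):

```python
import numpy as np

np.random.seed(42)
sample = np.random.randint(0, 101, (6, 4))

# One row per student, one column per week
print(sample.shape)  # (6, 4)

# Every value lies in the closed range 0..100
print(sample.min() >= 0 and sample.max() <= 100)  # True
```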

To create a pandas data frame, you can use the DataFrame constructor and pass in the NumPy array as the data argument, as shown:

students_df = pd.DataFrame(data=data)

Now we can call the built-in type() function to check the type of students_df. We see that it is a DataFrame object.

type(students_df)
# pandas.core.frame.DataFrame
print(students_df)

We see that by default, the row labels form a range index that goes from 0 to numRows - 1, and the column labels are 0, 1, 2, …, numCols - 1. However, these numeric labels are hard to read. It will help to add descriptive column names and row labels to the data frame.
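You can inspect these default labels directly. The sketch below uses an array of zeros in place of the spending data, since only the labels matter here; default_df is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Any 6x4 array works here; zeros keep the focus on the labels
default_df = pd.DataFrame(data=np.zeros((6, 4), dtype=int))

print(default_df.index)          # RangeIndex(start=0, stop=6, step=1)
print(default_df.columns)        # RangeIndex(start=0, stop=4, step=1)
print(list(default_df.columns))  # [0, 1, 2, 3]
```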

Let’s create two lists: one to store students’ names and another to store the column labels.

students = ['Amy','Bob','Chris','Dave','Evelyn','Fanny']
cols = ['Week1','Week2','Week3','Week4']

When calling the DataFrame constructor, you can set the index and columns to the lists of row labels and column labels to use, respectively.

students_df = pd.DataFrame(data=data, index=students, columns=cols)

We now have the students_df data frame with descriptive row and column labels.

print(students_df)

To get some basic information on the data frame, such as missing values and data types, you can call the info() method on the data frame object.

students_df.info()
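The info() method prints a report. If you would rather have the same facts as values you can work with, the shape and dtypes attributes and the isna() method are one way to get them. The sketch below rebuilds the frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's frame so the sketch runs on its own
np.random.seed(42)
data = np.random.randint(0, 101, (6, 4))
students = ['Amy', 'Bob', 'Chris', 'Dave', 'Evelyn', 'Fanny']
cols = ['Week1', 'Week2', 'Week3', 'Week4']
students_df = pd.DataFrame(data=data, index=students, columns=cols)

print(students_df.shape)         # (number of rows, number of columns)
print(students_df.dtypes)        # data type of each column
print(students_df.isna().sum())  # count of missing values per column
```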

From a Python Dictionary

You can also create a pandas data frame from a Python dictionary.

Here, data_dict is the dictionary containing student data:

  • The names of the students are the keys.
  • Each value is a list of how much each student spends from weeks one through four.

data_dict = {}
students = ['Amy','Bob','Chris','Dave','Evelyn','Fanny']
# Pair each student's name with the matching row of the data array
for student, student_data in zip(students, data):
    data_dict[student] = student_data

To create a data frame from a Python dictionary, use the from_dict() method, as shown below. The first argument is the dictionary containing the data (data_dict). By default, the keys are used as the column names of the data frame. As we’d like to set the keys as the row labels, set orient='index'.

students_df = pd.DataFrame.from_dict(data_dict,orient='index')
print(students_df)

To change the column names to the week number, we set columns to the cols list:

students_df = pd.DataFrame.from_dict(data_dict,orient='index',columns=cols)
print(students_df)
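To see what orient changes, it can help to build the frame both ways and compare. This is a small sketch using a two-student dictionary with the first two rows of the data shown earlier; by_column and by_row are names chosen for this illustration.

```python
import pandas as pd

# Two-student dictionary: names as keys, weekly spending as values
data_dict = {'Amy': [51, 92, 14, 71], 'Bob': [60, 20, 82, 86]}

by_column = pd.DataFrame.from_dict(data_dict)               # default: keys become columns
by_row = pd.DataFrame.from_dict(data_dict, orient='index')  # keys become row labels

print(by_column.shape)          # (4, 2): one column per student
print(by_row.shape)             # (2, 4): one row per student
print(list(by_column.columns))  # ['Amy', 'Bob']
print(list(by_row.index))       # ['Amy', 'Bob']
```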

Read in a CSV File Into a Pandas DataFrame

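Suppose the student data is available in a CSV file. You can use the read_csv() function to read the data from the file into a pandas data frame. pd.read_csv('file-path') is the general syntax, where file-path is the path to the CSV file. We can set the names parameter to the list of column names to use. Passing names tells pandas that the file has no header row; if your file does have one, also pass header=0 so that the header is replaced rather than read as data.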

students_df = pd.read_csv('/content/students.csv',names=cols)
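The layout of students.csv is not shown here, so the sketch below assumes one: no header row, with the student’s name in the first field. It reads from an in-memory string instead of a file path so that it runs anywhere. The 'Student' column name and the csv_text contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, student name in the first field
csv_text = "Amy,51,92,14,71\nBob,60,20,82,86\n"

cols = ['Week1', 'Week2', 'Week3', 'Week4']

# io.StringIO stands in for a file path.
# names labels every field; index_col=0 turns the first field into row labels.
csv_df = pd.read_csv(io.StringIO(csv_text), names=['Student'] + cols, index_col=0)

print(csv_df)
```

With a real file, you would pass its path in place of io.StringIO(csv_text).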

Now that we know how to create a data frame, let’s learn how to select rows and columns.

Select Columns from a Pandas DataFrame

There are several built-in methods that you can use to select rows and columns from a data frame. This tutorial will go over the most common ways to select columns, rows, and both rows and columns from a data frame.

Selecting a Single Column

To select a single column, you can use df_name[col_name] where col_name is the string denoting the name of the column.

Here, we select only the ‘Week1’ column. Selecting a single column this way returns a pandas series rather than a data frame.

week1_df = students_df['Week1']
print(week1_df)
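The kind of brackets you use decides what comes back. As a sketch on a small frame (mini_df is made up for this illustration), single brackets with one name give a series, while a list containing one name gives a one-column data frame:

```python
import pandas as pd

# Small illustrative frame
mini_df = pd.DataFrame({'Week1': [51, 60], 'Week2': [92, 20]}, index=['Amy', 'Bob'])

as_series = mini_df['Week1']   # single brackets: one-dimensional Series
as_frame = mini_df[['Week1']]  # list with one name: one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```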

Selecting Multiple Columns

To select multiple columns from the data frame, pass in the list of all the column names to select.

odd_weeks = students_df[['Week1','Week3']]
print(odd_weeks)

In addition to this method, you can also use the .iloc and .loc indexers to select columns. We’ll code an example later.

Select Rows from a Pandas DataFrame

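Using the .iloc Indexer

To select rows by position with .iloc, pass the integer positions of the rows as a list inside square brackets. Note that .iloc is an indexer rather than a method, so it takes square brackets, not parentheses.

In this example, we select the rows at the odd positions 1, 3, and 5.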

odd_index_rows = students_df.iloc[[1,3,5]]
print(odd_index_rows)

Next, we select a subset of the data frame containing the rows at positions 0 to 2. The end point, 3, is excluded by default.

slice1 = students_df.iloc[0:3]
print(slice1)
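Position-based slices follow the same rules as Python list slices, so steps and negative positions work too. A short sketch, using a frame filled with placeholder numbers (np.arange) since only the row labels matter here:

```python
import numpy as np
import pandas as pd

students = ['Amy', 'Bob', 'Chris', 'Dave', 'Evelyn', 'Fanny']
cols = ['Week1', 'Week2', 'Week3', 'Week4']
# Placeholder values 0..23; the row labels are what we look at
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=students, columns=cols)

print(list(df.iloc[0:3].index))   # ['Amy', 'Bob', 'Chris']
print(list(df.iloc[1::2].index))  # ['Bob', 'Dave', 'Fanny'], same rows as iloc[[1, 3, 5]]
print(list(df.iloc[-2:].index))   # ['Evelyn', 'Fanny']
```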

Using the .loc Indexer

To select the rows of a data frame using .loc, you should specify the labels corresponding to the rows you’d like to select.

some_rows = students_df.loc[['Bob','Dave','Fanny']]
print(some_rows)

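If the rows of the data frame use the default labels 0, 1, 2, up to numRows - 1, then passing a list of integers to .loc or .iloc selects the same rows. Slices still differ: .loc[0:3] includes the row labeled 3, while .iloc[0:3] stops before position 3.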
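Here is a brief sketch of how label-based and position-based selection compare on a frame with default row labels. The frame holds placeholder numbers, and default_df is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Default row labels 0..5, placeholder values 0..23
default_df = pd.DataFrame(np.arange(24).reshape(6, 4))

# A list of integers picks the same rows either way
print(default_df.loc[[1, 3]].equals(default_df.iloc[[1, 3]]))  # True

# Slices differ: .loc includes the end label, .iloc excludes the end position
print(len(default_df.loc[0:3]))   # 4
print(len(default_df.iloc[0:3]))  # 3
```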

Select Rows and Columns from a Pandas DataFrame

So far, you have learned how to select either rows or columns from a pandas data frame. However, you may sometimes need to select a subset of both rows and columns. To do so, you can use the .iloc and .loc indexers we’ve discussed, passing the row selection first and the column selection second.

For example, in the code snippet below, we select all rows, and the columns at index 2 and 3 (Week3 and Week4).

subset_df1 = students_df.iloc[:,[2,3]]
print(subset_df1)

Using start:stop creates a slice from start up to but not including stop. When you omit both the start and the stop values, the slice starts at the beginning and extends to the end of the data frame, selecting all the rows.

When using .loc, you pass in the labels of the rows and the columns you’d like to select, as shown:

subset_df2 = students_df.loc[['Amy','Evelyn'],['Week1','Week3']]
print(subset_df2)

Here, the data frame subset_df2 contains Amy’s and Evelyn’s records for Week1 and Week3.
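The same subset can be reached by position. The sketch below rebuilds the frame so that it runs on its own, then checks that the label-based and position-based selections match. The names by_label and by_position are ours.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's frame so the sketch runs on its own
np.random.seed(42)
data = np.random.randint(0, 101, (6, 4))
students = ['Amy', 'Bob', 'Chris', 'Dave', 'Evelyn', 'Fanny']
cols = ['Week1', 'Week2', 'Week3', 'Week4']
students_df = pd.DataFrame(data=data, index=students, columns=cols)

by_label = students_df.loc[['Amy', 'Evelyn'], ['Week1', 'Week3']]
by_position = students_df.iloc[[0, 4], [0, 2]]
print(by_label.equals(by_position))  # True

# Index.get_loc translates a label into its integer position
print(students_df.index.get_loc('Evelyn'))   # 4
print(students_df.columns.get_loc('Week3'))  # 2
```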

Conclusion

Here is a quick review of what you have learned in this tutorial:

  • After installing pandas, you can import it under the alias pd. To create a pandas data frame object, you can use the pd.DataFrame(data) constructor, where data refers to a NumPy array (at most two-dimensional), an iterable, or a dictionary containing the data. You can specify the row labels and the column labels by setting the optional index and columns parameters, respectively.
  • Using pd.read_csv(path-to-the-file) reads the contents of the file into a data frame.
  • You can call the info() method on the data frame object to get information on the columns, the number of non-null values, the data types, and the memory usage of the data frame.
  • To select a single column, use df_name[col_name], and to select multiple columns, use df_name[[col1,col2,...,coln]].
  • You can also select columns and rows using the .loc and .iloc indexers.
  • While .iloc takes in the integer positions (or a position slice) of the rows and columns to select, .loc takes in the row and column labels.

You can find the examples used in this tutorial in this Colab notebook.

Next, check out this listicle of collaborative data science notebooks.
