Managing Data

Understanding Pandas DataFrame

A DataFrame is a data structure constructed with rows and columns, similar to a database or Excel spreadsheet. It consists of a dictionary of lists in which the lists each have their own identifiers or keys, such as ‘last name’ or ‘GPA’ seen below.

Creating DataFrames

Let's begin by entering a dictionary of lists into the DataFrame() parameters. If you didn't enable the object inspector to display the documentation of a function when the function is called and would like to have this feature, go to Basic Syntax to see how to do it.

Student Last, Student First, Graduation, and GPA are the keys (columns), and each case (row) has assigned values; if a cell is empty, it is said to be null.

import pandas as pd 
import numpy as np

df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'], 
                    'Student First': ['John', 'Mary'], 
                    'Graduation': [2020, 2019],
                    'GPA': [3.45, 3.20]})

Note:

A dictionary is a list of values linked to keys. The keys are separated from their values with colons and brackets as shown below. In this case, the dictionary keys will become the column names for the DataFrame.
DataFrames will automatically be indexed 0 to n, with n being the number of values in the dictionary. We can override this indexing by using the “index = “ parameter after our dictionary in order to manually set what the row headers for our data will be.

df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'], 
                    'Student First': ['John', 'Mary'], 
                    'Graduation': [2020, 2019],
                    'GPA': [3.45, 3.20]}, 
                    index = ['Student 1', 'Student 2'])

Similarly, we can specify the order of the columns using the “columns =” parameter. If this is not specified, python will automatically order the columns in alphabetical order.

Rename Columns

If you would like to rename the columns, simply enter the desired names into the columns() function and it will update the names for the DataFrame.

df1.columns = ["Avg GPA", "Graduation Year","First Name", "Last Name"]
print(df1)

Output:  
Avg  GPA    Graduation Year  First Name  Last Name
0    3.45             2020        John      Smith
1    3.20             2019        Mary    Johnson

Read Values in Dataframes

One great feature of Pandas dataframes is the ability to selectively view parts of a dataframe. This can be extremely helpful if you’d like to only use part of the data for some operation or would simply like to view specific parts of the data to get a better feel for the values you are working with. Let’s pull up our student database with indices again.

import pandas as pd

student_index = ['Student 1', 'Student 2', 'Student 3', 'Student 4']
df = pd.DataFrame({'Student Last': ['Smith', 'Johnson','Evans', 'Waters'],
                  'Student First': ['John', 'Mary', 'Henry', 'Kate'],
                  'Graduation': [2020, 2019, 2017, 2020],
                  'GPA': [3.45, 3.20, 2.80, 3.15]},
                 index=student_index
                 )

First, let’s learn how to retrieve all the values from a single column. Say you would like to view only the GPA’s of all students. Just call the name of the dataframe, in this case, ‘df’, and then specify which column you would like to choose.

GPAs = df['GPA']
print(GPAs)

Output:
Student 1   3.45
Student 2   3.20
Student 3   2.80
Student 4   3.15
Name: GPA, dtype: float64

Next, let’s see how to get multiple columns. Follow the same process and just add the other column name you would like to see. The only difference is you must wrap another set of brackets around the column names because we are now creating a list of objects and, therefore, must be contained within brackets.

GPAs_Last = df[['GPA', 'Student Last']]
print(GPAs_Last)

Output             
             GPA  Student Last
Student 1   3.45         Smith
Student 2   3.20       Johnson
Student 3   2.80         Evans
Student 4   3.15        Waters

Finally, selecting parts of columns and rows instead of all of the values is a tad more complex. Let’s start by only selecting the first three out of four student records. You do so using the following syntax:

three_students = df[:3]
print(three_students)

Output:
             GPA  Graduation Student First Student Last
Student 1   3.45        2020          John        Smith
Student 2   3.20        2019          Mary      Johnson
Student 3   2.80        2017         Henry        Evans

Note: This can be interpreted as “select all record from dataframe ‘df’ up to but not including index 3”. Beginning the statement with the colon means that you will start from the beginning of the data.

Now, say we would only like the middle two records. In this case we specify the first index we would like to grab, colon, and then the ‘up to but not including index’.

middle_two = df[1:3]
print(middle_two)

Output:            
            GPA   Graduation Student First Student Last
Student 2   3.2         2019          Mary      Johnson
Student 3   2.8         2017         Henry        Evans

Lastly, grabbing a single value from a dataframe is also very important. Let’s find the GPA of the third student in our dataframe. Start by calling the column that contains the desired value just as we did before. Then, specify the index that you would like to call from the list of dataframe indices.

third_GPA = df['GPA'][ student_index[2] ]
print("{}'s GPA = {}".format(student_index[0], third_GPA))

Output:
Student 1's GPA = 2.8

Note: The print statement used here looks much more complex than the standard we have been using because it offers more customization of the printing format. You can learn about this type of print.

Change Data Types

There may be times when working with data where you would like to convert an array of values to a certain data type. Python provides a useful method called astype() that allows for easy data type changing.

Notice the initial type of the ‘GPA’ values below is float64, which is a floating point numeric value.

df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'], 
                    'Student First': ['John', 'Mary'], 
                    'Graduation': [2020, 2019],
                    'GPA': [3.45, 3.20]}) 
print(df1.dtypes)

Output:
GPA               float64
Graduation          int64
Student First      object
Student Last       object
dtype: object

Using the astype() method we can convert these to integers, strings, or any other data type.

Combining Datasets

df1['GPA'] = df1['GPA'].astype(np.int64)
print(df1.dtypes)

Output:
GPA                 int64
Graduation          int64
Student First      object
Student Last       object
dtype: object

df1['GPA'] = df1['GPA'].astype(np.str)
print(df1.dtypes)

Output:
GPA                object
Graduation          int64
Student First      object
Student Last       object
dtype: object

There will be times when you will have to combine two sets of data, and pandas makes this very easy with three functions: concatenate, merge, and join.

Concatenate

Concatenating two dataframes in python is essentially forcing the two sets to combine either vertically or horizontally. If the datasets have the same column headers, then it will group the data from each set into its respective column. Below we combine the records for four students.

df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'], 
                    'Student First': ['John', 'Mary'], 
                    'Graduation': [2020, 2019],
                    'GPA': [3.45, 3.20]}, 
                    index = ['Student 1', 'Student 2']) 

# construct second DataFrame
df2 = pd.DataFrame({'Student Last': ['Evans', 'Waters'],
                    'Student First': ['Henry', 'Kate'], 
                    'Graduation': [2017, 2020],
                    'GPA': [2.80, 3.15]}, 
                    index = ['Student 3', 'Student 4']) 

result = pd.concat([df1, df2]) # combines data on axis 0 (groups by columns)
print(result)

Output:
             GPA  Graduation Student First Student Last
Student 1   3.45        2020          John        Smith
Student 2   3.20        2019          Mary      Johnson
Student 3   2.80        2017         Henry        Evans
Student 4   3.15        2020          Kate       Waters

Note: If you do not specify how you would like to concatenate the data, python will stack the datasets horizontally by default.

However, if you would like the data to be grouped side-by-side, then specifying “axis = 1” in the concatenation function will simply put the second dataset to the right of the first.

result = pd.concat([df1, df2], axis = 1)  # combines data on axis 1 (groups by rows) 
print(result)

If there are some duplicate rows in one dataframe, you may want to merge two dataframes without duplicate rows. Here is a way to do it.

pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

The reset_index(drop=True) is used to fix up the index after the concat and drop_duplicates.

Joint Dataframes

Suppose you’re given a dataset such as the ones below where series list is a full record, unlike the previous datasets that had a different series for each variable which we then combined them on.

students_A = pd.DataFrame({'Student A': ['Smith', 'John', 2020, 3.45],
                           'Student B': ['Johnson', 'Mary', 2019, 3.20]},
                            index = ['last name', 'first name', 'graduation year', 'GPA'])

students_B = pd.DataFrame({'Student C': ['Evans', 'Henry', 2017, 2.80],
                           'Student D': ['Waters', 'Kate', 2020, 3.15]},
                            index = ['last name', 'first name', 'graduation year', 'GPA'])

The join method combines data sets that share the same indexes (row headers). In the example below, both data sets have the same four indexes so they are able to be combined into one set, resulting in vertically-oriented student records.

results = students_A.join(students_B)
print(results)

Output:
                Student A Student B Student C Student D
last name           Smith   Johnson     Evans    Waters
first name           John      Mary     Henry      Kate
graduation year      2020      2019      2017      2020
GPA                  3.45       3.2       2.8      3.15

Merge Dataframes

If you would like to join two data sets that share one distinct key, rather than every identifier, the merge() method allows you to specify the sets and the variable you would like to join them on. In the example below, two datasets are being combined using the shared key ‘student ID’.

df3 = pd.DataFrame({'student ID': [55678, 58679], # primary key 
                    'academic year': ['Freshman', 'Sohpomore']})

df4 = pd.DataFrame({'student ID': [55678, 58679], # primary key
                    'Residence Hall': ['McBride', 'Daniels']})

results = pd.merge(df3, df4, on = 'student ID') # uses the primary key to match data with a specific student ID
print(results)

Output:
  academic year  student ID Residence Hall
0      Freshman       55678        McBride
1     Sohpomore       58679        Daniels

Notice how the program recognizes the duplicate student ID values in the datasets and combines the two data points associated with each ID number.

Sort Data

Another very useful function when operating dataframes is the sort function. Sorting your data may not be necessary when performing functions on it, but is a good practice to follow to get a better feel for the data. Sorting data using python’s ‘sort()’ functions allows you to specify specific columns or all of the dataframe and then sort the values in each by either ascending or descending order. Besides understanding the data, sorting after performing functions on a dataframe can help to give more organized and readable results.

Here is the definition of the sort function with its parameters.

Definition : sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

For convenience, we use the dataframe df above to demonstrate how the sort function works. For example, if we would like to sort the data by GPA or by Graduation and then by GPA.

df.sort(columns="GPA")

Output:             
             GPA  Graduation Student First Student Last
Student 3   2.80        2017         Henry        Evans
Student 4   3.15        2020          Kate       Waters
Student 2   3.20        2019          Mary      Johnson
Student 1   3.45        2020          John        Smith

df.sort(columns=["Graduation","GPA"])

Output:             
             GPA  Graduation Student First Student Last
Student 3   2.80        2017         Henry        Evans
Student 2   3.20        2019          Mary      Johnson
Student 4   3.15        2020          Kate       Waters
Student 1   3.45        2020          John        Smith

Note: Sorting a dataframe won't change the dataframe itself unless we set the parameter inplace = True in the sort function.

Center for Analytics and Data Science

165 McVey Data Science Building

105 Tallawanda Rd

Oxford, OH 45056

513-529-2279 cads@MiamiOH.edu

Managing Data

Understanding Pandas DataFrame

Creating DataFrames

Rename Columns

Read Values in Dataframes

Change Data Types

Combining Datasets

Concatenate

Joint Dataframes

Merge Dataframes

Sort Data

Center for Analytics and Data Science

Contact Us

Initiatives

Find

Managing Data

Managing Data

Understanding Pandas DataFrame

Creating DataFrames

Rename Columns

Read Values in Dataframes

Change Data Types

Combining Datasets

Concatenate

Joint Dataframes

Merge Dataframes

Sort Data

Center for Analytics and Data Science

Contact Us

Follow Us

Initiatives

Find

Follow Us