Managing Data
Managing Data
Understanding Pandas DataFrame
A DataFrame is a data structure constructed with rows and columns, similar to a database or Excel spreadsheet. It consists of a dictionary of lists in which the lists each have their own identifiers or keys, such as ‘last name’ or ‘GPA’ seen below.
Creating DataFrames
Let's begin by entering a dictionary of lists into the DataFrame() parameters. If you didn't enable the object inspector to display the documentation of a function when the function is called and would like to have this feature, go to Basic Syntax to see how to do it.
- Student Last, Student First, Graduation, and GPA are the keys (columns), and each case (row) has assigned values; if a cell is empty, it is said to be null.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'],
'Student First': ['John', 'Mary'],
'Graduation': [2020, 2019],
'GPA': [3.45, 3.20]})
Note:
- A dictionary is a list of values linked to keys. The keys are separated from their values with colons and brackets as shown below. In this case, the dictionary keys will become the column names for the DataFrame.
- DataFrames will automatically be indexed 0 to n, with n being the number of values in the dictionary. We can override this indexing by using the “index = “ parameter after our dictionary in order to manually set what the row headers for our data will be.
df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'],
'Student First': ['John', 'Mary'],
'Graduation': [2020, 2019],
'GPA': [3.45, 3.20]},
index = ['Student 1', 'Student 2'])
Similarly, we can specify the order of the columns using the “columns =” parameter. If this is not specified, python will automatically order the columns in alphabetical order.
Rename Columns
If you would like to rename the columns, simply enter the desired names into the columns() function and it will update the names for the DataFrame.
df1.columns = ["Avg GPA", "Graduation Year","First Name", "Last Name"]
print(df1)
Output:
Avg GPA Graduation Year First Name Last Name
0 3.45 2020 John Smith
1 3.20 2019 Mary Johnson
Read Values in Dataframes
One great feature of Pandas dataframes is the ability to selectively view parts of a dataframe. This can be extremely helpful if you’d like to only use part of the data for some operation or would simply like to view specific parts of the data to get a better feel for the values you are working with. Let’s pull up our student database with indices again.
import pandas as pd
student_index = ['Student 1', 'Student 2', 'Student 3', 'Student 4']
df = pd.DataFrame({'Student Last': ['Smith', 'Johnson','Evans', 'Waters'],
'Student First': ['John', 'Mary', 'Henry', 'Kate'],
'Graduation': [2020, 2019, 2017, 2020],
'GPA': [3.45, 3.20, 2.80, 3.15]},
index=student_index
)
First, let’s learn how to retrieve all the values from a single column. Say you would like to view only the GPA’s of all students. Just call the name of the dataframe, in this case, ‘df’, and then specify which column you would like to choose.
GPAs = df['GPA']
print(GPAs)
Output:
Student 1 3.45
Student 2 3.20
Student 3 2.80
Student 4 3.15
Name: GPA, dtype: float64
Next, let’s see how to get multiple columns. Follow the same process and just add the other column name you would like to see. The only difference is you must wrap another set of brackets around the column names because we are now creating a list of objects and, therefore, must be contained within brackets.
GPAs_Last = df[['GPA', 'Student Last']]
print(GPAs_Last)
Output
GPA Student Last
Student 1 3.45 Smith
Student 2 3.20 Johnson
Student 3 2.80 Evans
Student 4 3.15 Waters
Finally, selecting parts of columns and rows instead of all of the values is a tad more complex. Let’s start by only selecting the first three out of four student records. You do so using the following syntax:
three_students = df[:3]
print(three_students)
Output:
GPA Graduation Student First Student Last
Student 1 3.45 2020 John Smith
Student 2 3.20 2019 Mary Johnson
Student 3 2.80 2017 Henry Evans
Note: This can be interpreted as “select all record from dataframe ‘df’ up to but not including index 3”. Beginning the statement with the colon means that you will start from the beginning of the data.
Now, say we would only like the middle two records. In this case we specify the first index we would like to grab, colon, and then the ‘up to but not including index’.
middle_two = df[1:3]
print(middle_two)
Output:
GPA Graduation Student First Student Last
Student 2 3.2 2019 Mary Johnson
Student 3 2.8 2017 Henry Evans
Lastly, grabbing a single value from a dataframe is also very important. Let’s find the GPA of the third student in our dataframe. Start by calling the column that contains the desired value just as we did before. Then, specify the index that you would like to call from the list of dataframe indices.
third_GPA = df['GPA'][ student_index[2] ]
print("{}'s GPA = {}".format(student_index[0], third_GPA))
Output:
Student 1's GPA = 2.8
Note: The print statement used here looks much more complex than the standard we have been using because it offers more customization of the printing format. You can learn about this type of print.
Change Data Types
There may be times when working with data where you would like to convert an array of values to a certain data type. Python provides a useful method called astype() that allows for easy data type changing.
- Notice the initial type of the ‘GPA’ values below is float64, which is a floating point numeric value.
df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'],
'Student First': ['John', 'Mary'],
'Graduation': [2020, 2019],
'GPA': [3.45, 3.20]})
print(df1.dtypes)
Output:
GPA float64
Graduation int64
Student First object
Student Last object
dtype: object
Using the astype() method we can convert these to integers, strings, or any other data type.
Combining Datasets
df1['GPA'] = df1['GPA'].astype(np.int64)
print(df1.dtypes)
Output:
GPA int64
Graduation int64
Student First object
Student Last object
dtype: object
df1['GPA'] = df1['GPA'].astype(np.str)
print(df1.dtypes)
Output:
GPA object
Graduation int64
Student First object
Student Last object
dtype: object
There will be times when you will have to combine two sets of data, and pandas makes this very easy with three functions: concatenate, merge, and join.
Concatenate
Concatenating two dataframes in python is essentially forcing the two sets to combine either vertically or horizontally. If the datasets have the same column headers, then it will group the data from each set into its respective column. Below we combine the records for four students.
df1 = pd.DataFrame({'Student Last': ['Smith', 'Johnson'],
'Student First': ['John', 'Mary'],
'Graduation': [2020, 2019],
'GPA': [3.45, 3.20]},
index = ['Student 1', 'Student 2'])
# construct second DataFrame
df2 = pd.DataFrame({'Student Last': ['Evans', 'Waters'],
'Student First': ['Henry', 'Kate'],
'Graduation': [2017, 2020],
'GPA': [2.80, 3.15]},
index = ['Student 3', 'Student 4'])
result = pd.concat([df1, df2]) # combines data on axis 0 (groups by columns)
print(result)
Output:
GPA Graduation Student First Student Last
Student 1 3.45 2020 John Smith
Student 2 3.20 2019 Mary Johnson
Student 3 2.80 2017 Henry Evans
Student 4 3.15 2020 Kate Waters
Note: If you do not specify how you would like to concatenate the data, python will stack the datasets horizontally by default.
However, if you would like the data to be grouped side-by-side, then specifying “axis = 1” in the concatenation function will simply put the second dataset to the right of the first.
result = pd.concat([df1, df2], axis = 1) # combines data on axis 1 (groups by rows)
print(result)
If there are some duplicate rows in one dataframe, you may want to merge two dataframes without duplicate rows. Here is a way to do it.
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
The reset_index(drop=True) is used to fix up the index after the concat and drop_duplicates.
Joint Dataframes
Suppose you’re given a dataset such as the ones below where series list is a full record, unlike the previous datasets that had a different series for each variable which we then combined them on.
students_A = pd.DataFrame({'Student A': ['Smith', 'John', 2020, 3.45],
'Student B': ['Johnson', 'Mary', 2019, 3.20]},
index = ['last name', 'first name', 'graduation year', 'GPA'])
students_B = pd.DataFrame({'Student C': ['Evans', 'Henry', 2017, 2.80],
'Student D': ['Waters', 'Kate', 2020, 3.15]},
index = ['last name', 'first name', 'graduation year', 'GPA'])
The join method combines data sets that share the same indexes (row headers). In the example below, both data sets have the same four indexes so they are able to be combined into one set, resulting in vertically-oriented student records.
results = students_A.join(students_B)
print(results)
Output:
Student A Student B Student C Student D
last name Smith Johnson Evans Waters
first name John Mary Henry Kate
graduation year 2020 2019 2017 2020
GPA 3.45 3.2 2.8 3.15
Merge Dataframes
If you would like to join two data sets that share one distinct key, rather than every identifier, the merge() method allows you to specify the sets and the variable you would like to join them on. In the example below, two datasets are being combined using the shared key ‘student ID’.
df3 = pd.DataFrame({'student ID': [55678, 58679], # primary key
'academic year': ['Freshman', 'Sohpomore']})
df4 = pd.DataFrame({'student ID': [55678, 58679], # primary key
'Residence Hall': ['McBride', 'Daniels']})
results = pd.merge(df3, df4, on = 'student ID') # uses the primary key to match data with a specific student ID
print(results)
Output:
academic year student ID Residence Hall
0 Freshman 55678 McBride
1 Sohpomore 58679 Daniels
Notice how the program recognizes the duplicate student ID values in the datasets and combines the two data points associated with each ID number.
Sort Data
Another very useful function when operating dataframes is the sort function. Sorting your data may not be necessary when performing functions on it, but is a good practice to follow to get a better feel for the data. Sorting data using python’s ‘sort()’ functions allows you to specify specific columns or all of the dataframe and then sort the values in each by either ascending or descending order. Besides understanding the data, sorting after performing functions on a dataframe can help to give more organized and readable results.
Here is the definition of the sort function with its parameters.
Definition : sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
For convenience, we use the dataframe df above to demonstrate how the sort function works. For example, if we would like to sort the data by GPA or by Graduation and then by GPA.
df.sort(columns="GPA")
Output:
GPA Graduation Student First Student Last
Student 3 2.80 2017 Henry Evans
Student 4 3.15 2020 Kate Waters
Student 2 3.20 2019 Mary Johnson
Student 1 3.45 2020 John Smith
df.sort(columns=["Graduation","GPA"])
Output:
GPA Graduation Student First Student Last
Student 3 2.80 2017 Henry Evans
Student 2 3.20 2019 Mary Johnson
Student 4 3.15 2020 Kate Waters
Student 1 3.45 2020 John Smith
Note: Sorting a dataframe won't change the dataframe itself unless we set the parameter inplace = True in the sort function.