A DataFrame is the primary data structure of the Pandas library in Python and is commonly used for storing and working with tabular data. A common operation on such data is to apply a function to two columns of a DataFrame in order to derive new information from them.
To start working with Pandas, we first need to import it:
import pandas as pd
Let us understand this operation with the help of an example. Consider a DataFrame containing 3 students named A, B, and C and their marks (out of 10) in two subjects, Mathematics and Physics.
The following code snippet generates this DataFrame:
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Printing the DataFrame
print(df)
Here, data is a dictionary we created to initialize the DataFrame. We pass it to the DataFrame() constructor of the Pandas library, which uses the dictionary keys as column labels and the lists as column values, and returns the required DataFrame.
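As a quick check of how the dictionary maps onto the DataFrame, the short sketch below inspects the column labels, the shape, and the row index. The variable names columns, shape, and index_labels are chosen here only for illustration.

```python
import pandas as pd

data = {'Name': ['A', 'B', 'C'], 'Mathematics': [8, 5, 10], 'Physics': [7, 9, 8]}
df = pd.DataFrame(data)

# Dictionary keys become the column labels, in insertion order
columns = list(df.columns)

# Each list becomes one column, so 3 students x 3 columns
shape = df.shape

# No index was given, so rows get the default integer labels
index_labels = list(df.index)

print(columns)       # ['Name', 'Mathematics', 'Physics']
print(shape)         # (3, 3)
print(index_labels)  # [0, 1, 2]
```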
Now, let's say we need the total marks of each student across the two subjects, Mathematics and Physics. We do this by applying a function to the two columns that adds their values row by row and assigns the results to a new column named Total. For the data above, Total should hold 15, 14, and 18 for students A, B, and C respectively.
Let us look at different ways of performing this operation on a given DataFrame:
The first method uses the apply() function to compute the total of the two columns, Mathematics and Physics.
We pass apply() a lambda function as its first argument. A second parameter, axis, is set to 1 so that the lambda is called once per row rather than once per column. Inside the lambda, x is a single row, and x['Mathematics'] and x['Physics'] access that row's values in the two columns.
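To make the role of axis more concrete, here is a minimal sketch that swaps the lambda for a named function. The name total_marks is hypothetical and used only for illustration. It also contrasts axis=1 with axis=0 on the two numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Mathematics': [8, 5, 10],
                   'Physics': [7, 9, 8]})

def total_marks(row):
    # With axis=1, `row` is a Series holding one student's values,
    # indexed by the column labels
    return row['Mathematics'] + row['Physics']

# One call per row: one total per student
totals = df.apply(total_marks, axis=1)
print(totals.tolist())  # [15, 14, 18]

# With axis=0 the function receives a whole column at a time instead,
# so we get one sum per subject rather than one per student
column_sums = df[['Mathematics', 'Physics']].apply(lambda col: col.sum(), axis=0)
print(column_sums.tolist())  # [23, 24]
```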
Let us take a look at the corresponding code snippet and generated output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Performing the operation
df['Total'] = df.apply(lambda x: x['Mathematics'] + x['Physics'], axis=1)
# Printing
print(df)
Output:
   Name  Mathematics  Physics  Total
0     A            8        7     15
1     B            5        9     14
2     C           10        8     18
Instead of using x['Physics'] to access the Physics value of a row as shown earlier, we can also use x.Physics and get the same result, provided the column name is a valid Python identifier that does not clash with an existing attribute or method. The same applies to the Mathematics column. Let us take a look at the corresponding code snippet and generated output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Performing the operation
df['Total'] = df.apply(lambda x: x.Mathematics + x.Physics, axis=1)
# Printing
print(df)
Output:
   Name  Mathematics  Physics  Total
0     A            8        7     15
1     B            5        9     14
2     C           10        8     18
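One caveat with the dot notation is worth a short sketch: it only works when the column label is a valid Python identifier and is not already the name of a Series attribute or method. The column names 'Physics Lab' and 'count' below are hypothetical and were picked only to show the two cases where bracket access is still needed.

```python
import pandas as pd

# 'Physics Lab' and 'count' are hypothetical column names for illustration
df = pd.DataFrame({'Mathematics': [8, 5, 10],
                   'Physics Lab': [7, 9, 8],
                   'count': [1, 2, 3]})

# A label containing a space cannot be written as x.Physics Lab,
# but bracket access works for any label
bracket_total = df.apply(lambda x: x['Mathematics'] + x['Physics Lab'], axis=1)
print(bracket_total.tolist())  # [15, 14, 18]

# On a row, .count refers to the Series.count method, not the 'count' column
row = df.iloc[0]
is_method = callable(row.count)
count_value = row['count']
print(is_method)    # True
print(count_value)  # 1
```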
The second method uses the itertuples() function to add up the values in the two columns and save the total in another column named Total.
We first select the two columns with df[['Mathematics', 'Physics']]. Calling itertuples() on this selection yields one tuple per row, and the built-in sum() function adds up the values in each tuple.
The index parameter is set to False so that the row index is left out of each tuple and does not get added to the total. Let us take a look at the corresponding code snippet and generated output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Performing the operation
df['Total'] = [sum(row) for row in df[['Mathematics', 'Physics']].itertuples(index=False)]
# Printing
print(df)
Output:
   Name  Mathematics  Physics  Total
0     A            8        7     15
1     B            5        9     14
2     C           10        8     18
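To see why the index argument matters in the itertuples() approach, the sketch below looks at the tuples themselves. It is only an illustration on the same data. The variable names are chosen here and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Mathematics': [8, 5, 10],
                   'Physics': [7, 9, 8]})
marks = df[['Mathematics', 'Physics']]

# Without the index: each row is a namedtuple of just the two marks
rows = list(marks.itertuples(index=False))
fields = rows[0]._fields
print(fields)        # ('Mathematics', 'Physics')
print(sum(rows[1]))  # 14, i.e. 5 + 9 for student B

# With the default index=True, the row label comes first in each tuple,
# so sum() would silently add it to the total
rows_with_index = list(marks.itertuples())
fields_with_index = rows_with_index[0]._fields
print(fields_with_index)        # ('Index', 'Mathematics', 'Physics')
print(sum(rows_with_index[1]))  # 15, i.e. 1 + 5 + 9
```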
In this topic, we have learned how to apply a function to two columns of an existing Pandas DataFrame, following a running example of students' test scores in different subjects, which gives an intuition for how this concept can be applied in real-world situations. Feel free to reach out to info.javaexercise@gmail.com with any suggestions.