A DataFrame is the primary data structure of the Pandas library and is commonly used for storing and working with tabular data. A common operation on such data is changing the data type of a column, for example converting numbers that were stored as strings into integers.
To start working with Pandas, we first need to add this import statement to the Python code :
import pandas as pd
Let us understand this operation with the help of an example. Consider a DataFrame containing 3 students with names A, B and C and their corresponding marks (out of 10) for two subjects, Mathematics and Physics.
Python code snippet for generating this DataFrame :
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Printing the DataFrame
print(df)
Here, data is a dictionary we created to initialize the DataFrame. We pass it to the pd.DataFrame() constructor of the Pandas library, which returns the required DataFrame.
The difference is not visible in the printed output, but it is important to note that the 7 and 9 in the Physics column are strings, not integers. Because the column mixes strings with an integer, Pandas stores it with the object data type.
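Since the printed DataFrame hides this difference, one way to confirm it is to look at the Python type of each value. The following is a minimal sketch that rebuilds the same example data; the variable names value_types and n_strings are only illustrative.

```python
import pandas as pd

data = {'Name': ['A', 'B', 'C'], 'Mathematics': [8, 5, 10], 'Physics': ["7", "9", 8]}
df = pd.DataFrame(data)

# dtype of the column as a whole (object, because the values are mixed)
print(df['Physics'].dtype)

# Python type of each individual value in the column
value_types = df['Physics'].map(type)
print(value_types)

# Count how many of the values are strings
n_strings = df['Physics'].map(lambda v: isinstance(v, str)).sum()
print("String values:", n_strings)
```

Here the first two values are reported as str and the last one as int, which is why the column needs a conversion.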
Now, let’s say that we need to convert the marks in the Physics column, which currently has the object data type because it contains strings, to the integer data type for the sake of uniformity.
The resulting DataFrame would look like this :
Let us look at different ways of performing this operation on a given DataFrame :
The first method is straightforward and commonly used. In this method, we use the pd.to_numeric() function to convert the required column to a numeric data type, int here.
Here, df['Physics'] is used to access the column with label Physics in the DataFrame. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original data types
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df['Physics'] = pd.to_numeric(df['Physics'])
# Printing the updated DataFrame
print(df)
# Printing the updated data types
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
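The conversion above assumes that every value in the column can be parsed as a number. As a hedged sketch of what happens otherwise, the example below uses a hypothetical column containing the string "absent" (not part of the original data) and the errors parameter of pd.to_numeric().

```python
import pandas as pd

# Hypothetical column with one entry that cannot be parsed as a number
marks = pd.Series(["7", "absent", 8], name='Physics')

# By default (errors='raise'), an unparseable value raises a ValueError
raised = False
try:
    pd.to_numeric(marks)
except ValueError as exc:
    raised = True
    print("Conversion failed:", exc)

# errors='coerce' replaces unparseable values with NaN instead of raising
converted = pd.to_numeric(marks, errors='coerce')
print(converted)
print(converted.dtype)
```

Note that the coerced column comes back as float64 rather than int, because NaN is a floating-point value.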
Another way to perform the same operation is to use df.Physics instead of df['Physics'] to access the column labeled Physics. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original data types
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df.Physics = pd.to_numeric(df.Physics)
# Printing the updated DataFrame
print(df)
# Printing the updated data types
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
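Attribute access such as df.Physics has a caveat that a short sketch can illustrate: it can update a column that already exists, but it cannot create a new one. The column name Chemistry and the variable created_by_attribute below are hypothetical and used only for illustration.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Physics': ["7", "9", 8]})

# Assigning through an attribute works for a column that already exists
df.Physics = pd.to_numeric(df.Physics)
print(df.dtypes)

# Assigning to a new attribute name does not create a column; pandas
# warns and sets a plain Python attribute on the object instead
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.Chemistry = [6, 7, 8]

created_by_attribute = 'Chemistry' in df.columns
print("Column created by attribute assignment:", created_by_attribute)

# Bracket notation creates the column as expected
df['Chemistry'] = [6, 7, 8]
print("Column created by bracket assignment:", 'Chemistry' in df.columns)
```

Attribute access also fails for labels that are not valid Python identifiers or that clash with DataFrame method names, so bracket notation is the safer default.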
The second method is an alternative to the previous one. In this method, we use the Series.apply() function, which calls a given function on each value of the column, to convert the data type of the column Physics from object to numeric, here int.
Here, df['Physics'] is used to access the column with label Physics in the DataFrame. The function to be applied, pd.to_numeric from the previous method, is passed as a parameter. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original data types
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df['Physics'] = df['Physics'].apply(pd.to_numeric)
# Printing the updated DataFrame
print(df)
# Printing the updated data types
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
Another way to perform the same operation is to use df.Physics instead of df['Physics'] to access the column labeled Physics. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original data types
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df.Physics = df.Physics.apply(pd.to_numeric)
# Printing the updated DataFrame
print(df)
# Printing the updated data types
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
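The apply() approach also extends to several columns at once. As a sketch, and assuming we want both marks columns to be numeric, DataFrame.apply() can pass each selected column to pd.to_numeric in turn; the list name cols is only illustrative.

```python
import pandas as pd

data = {'Name': ['A', 'B', 'C'], 'Mathematics': [8, 5, 10], 'Physics': ["7", "9", 8]}
df = pd.DataFrame(data)

# Columns we want to convert
cols = ['Mathematics', 'Physics']

# DataFrame.apply calls pd.to_numeric once per selected column
df[cols] = df[cols].apply(pd.to_numeric)

print(df)
print(df.dtypes)
```

Mathematics was already numeric, so it is left as it was, while Physics changes from object to an integer type.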
The third method is another alternative to the previous ones. In this method, we use the Series.astype() function to convert the data type of the column Physics from object to numeric, here int.
Here, df['Physics'] is used to access the column with label Physics in the DataFrame. The required data type, int here, is passed as a parameter. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original data types
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df['Physics'] = df['Physics'].astype(int)
# Printing the updated DataFrame
print(df)
# Printing the updated data types
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
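astype() is also available on the whole DataFrame, where it accepts a mapping from column label to data type. The sketch below assumes we want Physics as 64-bit integers and, purely for illustration, Mathematics as floats; the name converted is hypothetical.

```python
import pandas as pd

data = {'Name': ['A', 'B', 'C'], 'Mathematics': [8, 5, 10], 'Physics': ["7", "9", 8]}
df = pd.DataFrame(data)

# DataFrame.astype takes a {column label: dtype} mapping
# and returns a new DataFrame
converted = df.astype({'Physics': 'int64', 'Mathematics': 'float64'})
print(converted.dtypes)

# The original DataFrame is left unchanged
print(df.dtypes)
```

Spelling the type as 'int64' instead of int makes the resulting width explicit, whatever the platform.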
Another way to perform the same operation is to use df.Physics instead of df['Physics'] to access the column labeled Physics. The conversion is not done in place, so we need to assign the result back to the column.
Let us look at the Python 3 code and corresponding output for this method:
# Importing pandas
import pandas as pd
# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : ["7", "9", 8]}
# DataFrame for the dictionary
df = pd.DataFrame(data)
# Print original datatypes
print("Original data types :\n")
print(df.dtypes, "\n")
# Performing the operation
df.Physics = df.Physics.astype(int)
# Printing the updated DataFrame
print(df)
# Printing the updated datatypes
print("\nUpdated datatypes :\n")
print(df.dtypes)
Output :
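Like the other methods, astype(int) only works when every value can be interpreted as an integer. The following sketch uses a hypothetical column with the entry "absent" (not part of the original data) to show the failure, followed by one possible workaround based on the nullable Int64 data type.

```python
import pandas as pd

# Hypothetical column with one entry that is not a valid integer
marks = pd.Series(["7", "absent", 8], name='Physics')

# astype(int) raises a ValueError when a value cannot be converted
astype_failed = False
try:
    marks.astype(int)
except ValueError as exc:
    astype_failed = True
    print("astype(int) failed:", exc)

# One possible workaround: turn bad values into NaN first, then use the
# nullable 'Int64' dtype, which stores integers alongside missing values
cleaned = pd.to_numeric(marks, errors='coerce').astype('Int64')
print(cleaned)
```

Whether a missing value is an acceptable replacement for a bad entry depends on the data, so treat this as one option rather than a rule.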
In this topic, we have learned how to change the data type of a column in an existing Pandas DataFrame using pd.to_numeric(), Series.apply() and Series.astype(). We followed a running example of students' test scores in different subjects, which gives an intuition of how this concept could be applied in real-world situations.