Javaexercise.com

Combine Two Columns Of Text In Pandas Dataframe

A DataFrame is the primary data structure of the Pandas library in Python and is commonly used for storing and working with tabular data.

A common operation on such data is combining two text columns of a DataFrame into a single column.

To start working with Pandas, we first need to import it:

import pandas as pd

Running Example

Let us understand this operation with the help of an example. Consider a DataFrame containing 3 students named A, B and C and their marks (out of 10) in two subjects, Mathematics and Physics.

Code snippet for generating this DataFrame:

import pandas as pd

# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}

# DataFrame for the dictionary
df = pd.DataFrame(data)

# Printing the DataFrame
print(df)

Here, data is a dictionary that we created to initialize the DataFrame. We pass it to the pd.DataFrame() constructor, which turns each key into a column name and each list into that column's values.
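To see what the constructor produced, the short sketch below inspects the shape, column names and column types of the same DataFrame. It only uses the data from the running example; the variable name marks_are_ints is introduced here purely for illustration.

```python
import pandas as pd

# Same dictionary as in the running example
data = {'Name': ['A', 'B', 'C'], 'Mathematics': [8, 5, 10], 'Physics': [7, 9, 8]}
df = pd.DataFrame(data)

# Each dictionary key becomes a column; each list supplies that column's rows
print(df.shape)            # (3, 3)
print(list(df.columns))    # ['Name', 'Mathematics', 'Physics']

# Name holds text, while the two subject columns hold integers
print(df.dtypes)
marks_are_ints = pd.api.types.is_integer_dtype(df['Physics'])
print(marks_are_ints)      # True
```

The fact that the Physics column is numeric matters whenever it has to be joined with text.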

Now, let’s say we need to combine the contents of the Name column and the Physics column and store the result in the Name column. The Name column would then contain A7, B9 and C10.

Let us look at different ways of performing this operation on a given DataFrame:

1. Using the + operator

In this method, we use the + operator to combine the contents of two columns of a pandas DataFrame. We use square brackets to access a column, as in df['Physics'].

The same is done for the Name column. Note that the Physics column holds integers, so it must first be converted to strings with the .astype() method, passing str as the argument. The + operator returns a new Series rather than modifying the column in place, so the result must be assigned back to df['Name'].

Let us take a look at the corresponding code snippet and generated output for this method:

# Importing required libraries
import pandas as pd

# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}

# DataFrame for the dictionary
df = pd.DataFrame(data=data)

# Performing the operation
df['Name'] = df['Name'] + df['Physics'].astype(str)

# Printing 
print(df)

Output: the Name column now contains A7, B9 and C10; the Mathematics and Physics columns are unchanged.

Instead of using df['Physics'] to access the Physics column, we could also use attribute (dot) access in the form of df.Physics to get the same result. The same applies to the Name column. Dot access works only for existing columns whose names are valid Python identifiers and do not clash with a DataFrame attribute or method.

Let us take a look at the corresponding code snippet and generated output for this method:

# Importing required libraries
import pandas as pd

# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}

# DataFrame for the dictionary
df = pd.DataFrame(data=data)

# Performing the operation
df.Name = df.Name + df.Physics.astype(str)

# Printing 
print(df)

Output: the Name column again contains A7, B9 and C10.

2. Using the .agg() function

In this method, we use the .agg() method to combine the contents of two columns of a pandas DataFrame. We first convert the Physics column to strings using .astype(str), because str.join accepts only strings. We then pass ''.join to .agg() together with axis=1, so that the values in each row are joined; the empty string means that no separator is placed between them.

.agg() returns a new Series, so the result must be assigned back to the Name column. Let us take a look at the corresponding code snippet and generated output for this method:

# Importing required libraries
import pandas as pd

# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}

# DataFrame for the dictionary
df = pd.DataFrame(data=data)

# Performing the operation
df['Physics'] = df['Physics'].astype(str)
df['Name'] = df[['Name', 'Physics']].agg(''.join, axis = 1)

# Printing 
print(df)

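Output: the Name column contains A7, B9 and C10. The Physics column is now stored as strings, although its values look the same when printed.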

As with the + operator, dot access (df.Physics, df.Name) can replace the bracket syntax for the existing columns. Selecting the two columns together with df[['Name', 'Physics']] still requires brackets.

Let us take a look at the corresponding code snippet and generated output for this method:

# Importing required libraries
import pandas as pd

# Dictionary for our data
data = {'Name' : ['A', 'B', 'C'], 'Mathematics' : [8, 5, 10], 'Physics' : [7, 9, 8]}

# DataFrame for the dictionary
df = pd.DataFrame(data=data)

# Performing the operation
df.Physics = df.Physics.astype(str)
df.Name = df[['Name', 'Physics']].agg(''.join, axis = 1)

# Printing 
print(df)

Output: the same as for the bracket version of this method.

Conclusion

In this topic, we learned two ways to combine the contents of two columns of an existing Pandas DataFrame: the + operator and the .agg() method. The running example of students' test scores in different subjects shows how the operation could be applied in real-world situations. Feel free to reach out to info.javaexercise@gmail.com with any suggestions.