Python Pandas-II

Shubham Saket
Star Gazers
Published in
4 min readMay 1, 2021

--

Learn efficiently

If you have not read, the Pandas-Starter post in the series, here is the link.

A large chunk of a data scientist’s work revolves around data transformation. Pandas library provides various functions to transform data, aggregate data, pivot data etc. Thus making life of data scientists easier.

Let us have a look at the data we will be using as example in this post.

import pandas as pd

data = pd.read_csv("cm31MAR2021bhav.csv")
data = data.drop('Unnamed: 13',axis=1)
print("\n",data.columns)
print("\n",data.shape)
print("\n",data.head())

The data represents stock details of various companies for the date “31-Mar-2021”. It has 2019 rows and 13 columns.

for col in data.columns: #checking the data type for each column 
print(col,data[col].dtype)

Let us append another similar data set for date 01-APR-2021.

import pandas as pd

data = pd.read_csv("cm31MAR2021bhav.csv")
data = data.drop('Unnamed: 13',axis=1)
data2 = pd.read_csv("cm01APR2021bhav.csv")
data2 = data2.drop('Unnamed: 13',axis=1)

data = data.append(data2)
print(data.shape)

Let’s explore some pandas data transformation functions:

  1. Pandas groupby: This function helps us group same data points together. In the below example we grouped data points having the same symbol. The groupby function is a generator which returns a tuple of value of column used to group and a data frame with grouped data points.
for idx,data in data.groupby('SYMBOL'): 
print(idx,"\n",data)

2. Pandas aggregate function: It is used to apply operation over an axis. It can be row wise or column wise. In the example below we apply sum, max, min functions over OPEN and CLOSE columns in our dataset.

print(data.agg({'OPEN':["sum","min","max"],"CLOSE":["sum","min","max"]})

Applying custom function in aggregation pipeline:

# simple summation function
def sum_of_close(series):
sums = 0
for i in series:
sums += i
return sums

print(data.groupby('SYMBOL').agg({'CLOSE':['sum',sum_of_close]}))

In the example, a simple summation function over a series has been applied over the ‘CLOSE’ column along with python built-in sum function. Both columns result in same values confirming the success of the custom function .

3. Pandas join function: It is used to join two data frames together on basis of an axis and is similar to SQL joins.

import pandas as pd

data = pd.read_csv("cm31MAR2021bhav.csv")
data = data.drop('Unnamed: 13',axis=1)
data2 = pd.read_csv("cm01APR2021bhav.csv")
data2 = data2.drop('Unnamed: 13',axis=1)
data = data[['SYMBOL','OPEN']].join(data2[['SYMBOL','OPEN']].set_index('SYMBOL'),on='SYMBOL',how='left',lsuffix='_31MAR',rsuffix='_01APR')
print(data)

Note:- We used .set_index(“SYMBOL”) on the right dataset. It is needed because the join operation will happen on “SYMBOL” column of “data” and index column of “data2”. Since both are of different data types it will throw an error :
“ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat”

4. Pandas merge function: It is similar to join function but here columns of both frames are used instead of indexes.

import pandas as pd

data = pd.read_csv("cm31MAR2021bhav.csv")
data = data.drop('Unnamed: 13',axis=1)
data2 = pd.read_csv("cm01APR2021bhav.csv")
data2 = data2.drop('Unnamed: 13',axis=1)
data = data[['SYMBOL','OPEN']].merge(data2[['SYMBOL','OPEN']],on='SYMBOL',how='left',suffixes=['_31MAR','_01APR'])
print(data)

We can see that we have achieved the same result with merge function as with join function.

5. Pandas pivot function: It used to reshape the frame on basis of the column . It is similar to MS Excel pivot table.

data = data.pivot(columns='SYMBOL')
print(data)

Pandas has various other useful functions and their number keeps growing owing to the large opensource community. Also many other libraries, using pandas, try to automate the process of preliminary data analysis one such example would be pandas-profiling, you can find it here.

Stay tuned…

--

--