Python - Pandas Starter

Code effortlessly.

Pandas is a Python library mainly used for working with tabular data. It is an important part of the data science toolkit. Its APIs cover a wide range of functionality and are simple to use.

Pandas can be installed via pip:

pip install pandas

The two main data structures in the pandas library are the Series and the DataFrame. A Series can be thought of as a column in a table, while a DataFrame is the table itself.

Let us try to create a dataframe from scratch:

import pandas as pd

dc = {
    "col1": [1, 2, 3, 4, 5],
    "col2": [2, 3, 4, 5, 6]
}

df = pd.DataFrame(dc)
print(df)

If we check the data type of the ‘df’ object, we will see that it is a pandas DataFrame.

print(type(df))

Let us select the first column and check its data type:

print(df['col1'])
print('\n',type(df['col1']))

From the above exercise we can see that every column in a DataFrame is a Series. Unlike NumPy arrays, a pandas Series can hold more than one type of data.
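
As a quick illustration, here is a minimal sketch of a Series that mixes integers, strings and floats in one column:

import pandas as pd

# the values keep their own Python types; the Series dtype becomes 'object'
s = pd.Series([1, "two", 3.0])
print(s)
print(s.dtype)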

We can also create a dataframe from Python lists:

import pandas as pd

dc = [[1,2,3,4,5],[2,3,4,5,6]]

df = pd.DataFrame(dc).T
df.columns = ["col1","col2"]

print(df)

In the above example, pd.DataFrame(dc) would have created a 2 x 5 table, but by adding .T we transpose it into a 5 x 2 frame.
df.columns holds the column labels; assigning a list of names to it renames the columns.
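
A quick check of the shapes before and after transposing makes the effect of .T visible:

import pandas as pd

dc = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]

print(pd.DataFrame(dc).shape)    # (2, 5) before transposing
print(pd.DataFrame(dc).T.shape)  # (5, 2) after transposing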

We can call a column of a dataframe by:

print(df.loc[:,"col1"]) # df.loc[rows,columns]
print(df["col1"])

.loc is used when we want to select rows and columns by their labels (names).
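
As a small self-contained sketch, .loc can also pick specific rows and several columns by label (note that label slices include both endpoints):

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [2, 3, 4, 5, 6]})

# rows with index labels 0 to 2 (inclusive) and both columns, selected by name
print(df.loc[0:2, ["col1", "col2"]])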

We can also use the .iloc method to select columns by their integer positions.

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.iloc[:,0])
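
The same positional indexing works for rows and columns together; a small self-contained sketch:

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [2, 3, 4, 5, 6]})

# .iloc slices by integer position and, like Python slices, excludes the end:
# first three rows of the first column
print(df.iloc[0:3, 0:1])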

1. Loading data:

Pandas has several APIs to load data from different types of files (CSV, Excel, JSON).

a. Reading from CSV:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
print(df)

Looking at the data, we can see that there is an extra column.

print(df.columns)

We can check the unique values in the ‘Unnamed: 13’ column by using the unique method.

print(df['Unnamed: 13'].unique())

‘Unnamed: 13’ is a column that sometimes appears while loading data from a file, typically because of a trailing delimiter or an unlabelled column in the source file.

We can get rid of such columns with the following statement:

df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

print(df.columns)

b. Reading from JSON:

import pandas as pd

df = pd.read_json('example.json')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
print(df)

c. Reading from Excel:

import pandas as pd

df = pd.read_excel('example.xlsx')  # needs an Excel engine such as openpyxl installed
df = df.drop('Unnamed: 0',axis=1)
print(df)

2. Description of data:

a. Getting information:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.info())

b. Getting correlation:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
# numeric_only=True limits the correlation to numeric columns
# (needed in newer pandas versions because columns such as SYMBOL contain text)
print(df.corr(numeric_only=True))

3. Viewing data:

a. First n rows:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.head(2))

b. Last n rows:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.tail(2))

4. Appending dataframes:

a. We can append different dataframes together, one on top of another; values are aligned according to column names. The old DataFrame.append method has been removed in recent versions of pandas (2.0 onwards), so we use pd.concat for this.

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.shape)
df = pd.concat([df, df])  # stack df on top of itself
print(df.shape)

To experiment further, we can create a copy of the dataframe and rename one of its columns.

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.shape)
df2 = df.copy()
df2.rename(columns={'SYMBOL': 'SYM'}, inplace=True)
df = pd.concat([df, df2])  # columns are aligned by name while stacking
print(df.shape)
print(df.columns)
print(df.head(4))
print('\n',df.tail(4))

If a column exists in only one of the dataframes, it is created in the other and filled with null (NaN) values, so that the dataframes line up before being stacked together.
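
A minimal sketch of that behaviour with two tiny dataframes (the column names 'x' and 'y' are made up for illustration):

import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4], "y": [5, 6]})

# 'y' does not exist in 'a', so those positions are filled with NaN
print(pd.concat([a, b]))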

b. Attaching dataframes horizontally: one dataframe beside another.

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df.shape)
df2 = df.copy()
df = pd.concat([df, df2], axis=1)
print(df.shape)
print(df.columns)
print(df.head(4))
print('\n',df.tail(4))

The concatenation happens according to the row index; if a row index is missing from one of the dataframes, it is created there and filled with null values.

We can also stack dataframes on top of one another with concat by changing the axis value (axis=0, which is the default).
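
For example, a sketch of the vertical case on the same file, with ignore_index=True to rebuild the row index instead of repeating the original labels:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df2 = df.copy()

# axis=0 stacks the frames vertically
stacked = pd.concat([df, df2], axis=0, ignore_index=True)
print(stacked.shape)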

5. Applying a Python function to a column of a dataframe:

import pandas as pd

def change_number(x):
    return 2 * x + 3

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df['OPEN'].head())
df['OPEN'] = df['OPEN'].apply(change_number)
print(df['OPEN'].head())
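
The same transformation can also be written inline with a lambda, which is handy for one-off tweaks (this reuses the ‘df’ loaded above):

df['OPEN'] = df['OPEN'].apply(lambda x: 2 * x + 3)  # equivalent to change_number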

6. Sum of a column:

import pandas as pd


df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df['OPEN'].sum())
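
Other column aggregations follow the same pattern; for example, reusing the ‘df’ loaded above:

print(df['OPEN'].mean())  # average of the column
print(df['OPEN'].max())   # largest value in the column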

7. Imputation:

Imputation is the process of replacing null values with non-null values.

a. To check which columns have null values:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
df2 = df.copy()
df2.rename(columns={'SYMBOL':'SYM'},inplace=True)
df = pd.concat([df, df2])
print(df.columns[df.isnull().any()])

b. Filling Null values with other values:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
df2 = df.copy()
df2.rename(columns={'SYMBOL':'SYM'},inplace=True)
df = pd.concat([df, df2])
print(df.columns[df.isnull().any()])

df['SYM'] = df['SYM'].fillna('VAL')  # assign back rather than using chained inplace fillna
print(df.columns[df.isnull().any()])
print(df['SYM'])

We can see that after imputation the ‘SYM’ column no longer appears in the list of columns with null values, and the nulls have been replaced with the value “VAL”.

There are various other strategies for deciding what value should replace the nulls; it really depends on the use case and the data at hand.
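
As one sketch of such a strategy, a numeric column can be filled with its own mean (the ‘OPEN’ column of the ‘df’ above is used purely as an example here and may not actually contain nulls in this file):

df['OPEN'] = df['OPEN'].fillna(df['OPEN'].mean())
print(df.columns[df.isnull().any()])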

8. Filtering data:

We can filter the data to get the subset of the dataframe that is useful for our use case. Let us say we want the rows where the value of the “OPEN” column is at least 101. We can write this as below:

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df[df['OPEN'] >= 101])

If more than one condition has to be applied, on the same column or on separate columns, we can combine them with the & (‘and’) and | (‘or’) operators.

import pandas as pd

df = pd.read_csv('cm31MAR2021bhav.csv')
df = df.drop('Unnamed: 13',axis=1)
print(df[(df['OPEN'] >= 101) & (df['OPEN'] <= 150)])
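
Conditions on separate columns combine in exactly the same way; for example, reusing the ‘df’ above and assuming the file also has a numeric ‘CLOSE’ column:

print(df[(df['OPEN'] >= 101) | (df['CLOSE'] < 100)])  # OPEN at least 101 OR CLOSE below 100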

We have only explored the tip of what the pandas library can do. It has made life easier for many data scientists.

Be friends with pandas.

That’s all folks.
