Pandas

A common python package that makes everything easy when you’re dealing with csv/excel file

Load files

pd.read_{filetype} is a function that used to load the csv/exel files to the workspace as a data type called DataFrame (df in short)

import pandas as pd
# 1. load csv file
df = pd.read_csv('data.csv')
# 2. load excel file
df = pd.read_excel('data.xlsx')

Display data

Quick peek to the whole data distribution.

# 1. display first n data
df.head(n)
# 2. display last n data
df.tail(n)
# 3. display the column*length of dataframe
df.shape
# 4. display the name of the columns
df.columns
# 5. display the type of the data
df.dtypes
# 6. display the general info of dataframe
df.info()
# 7. display the statistics of the data
df.describe()

Data filtering

Similar to the filter function in excel!

# 1. Filtering colunms
df[['column1', 'column2']]
# 2. Filtering specific criterion
df[df['column'] == value]
# 3. Filtering with multiple criterions
df[(df['column1'] == value1) & (df['column2'] == value2)]

Data addition/removal

# 1. Concatenate multiple dataframes
df_merged = pd.concat([df1, df2, df3])
# 2. Remove column(s)
df.drop(columns=['column'])
# 3. Rename column(s)
df.rename(columns={'old_column': 'new_column'})

Data processing

# 1. sort the dataframe by value
df.sort_values(by='column')
# 2. Data grouping
df.groupby('column').mean()
# 3. Change the data type to time
df['date'] = pd.to_datetime(df['column'])
# 4.Make a new column based on processing other column
df['month'] = df['date'].apply(lambda x: x.month)

Data cleaning

# 1. Check any null data in dataframe
df.isnull().sum()
# 2. Remove all null data in dataframe
df.dropna(axis=1, inplace=True)
# 3. Remove duplicates
df.drop_duplicates(inplace=True)
# 4-1 Replace data null value in dataframe with maen value
df.fillna(df.mean(), inplace=True)
# 4-2 Replace data null value in dataframe with previous value
df.fillna(method='ffill', inplace=True)
# 4-3 Replace data null value in dataframe with next value
df.fillna(method='bfill', inplace=True)

Export DataFrame

# 1. Export to csv file
df.to_csv('output.csv', index=False)
# 2. Export to excel file
df.to_excel('output.xlsx', index=False)