Pandas

Series is like a column (has only one dimension), and a DataFrame (has two dimensions) is like a table or a spreadsheet with rows and columns

Create panda series

MethodSyntaxDescriptionExample
From a listpd.Series(data=list)Series from a Python listdata = [1, 2, 3] s = pd.Series(data) pd.Series([2]*5)
From a NumPy arraypd.Series(data=np.array)Series from a NumPy arraydata = np.array([1, 2, 3]) s = pd.Series(data)
From a dictionarypd.Series(data=dict)Series from a dictionary (keys become index)data = {‘a’: 1, ‘b’: 2, ‘c’: 3} s = pd.Series(data)
From a scalar valuepd.Series(data=scalar, index=index)Series with a single value and specified indexs = pd.Series([5,6,7], index=[‘a’, ‘b’, ‘c’])
String  pd.Series(list(“String”))
arrange  pd.Series(np.arange(1,2))
random  pd.Series(np.random.normal(size=3))

Panda series operations

MethodDescriptionExample
LengthLengthx=pd.Series(np.arange(1,10)) len(x)
ShapeSizex.shape (9,)
ValuesValuesx.values
IndexAccess indexx.index o/p: RangeIndex(start=0, stop=9, step=1)
Unique valuesValue countsz.value_counts() z.value_counts().head(2) z.value_counts().head(2).values

DataFrames

Series object contains two components – a set of values, and index labels

DataFrame has three components – column, index, and a NumPy array object that contains the values

MethodSyntaxDescriptionExample
From a dictionarypd.DataFrame(data=dict)from a dictionarydata = {‘col1’: [1, 2], ‘col2’: [3, 4]} df = pd.DataFrame(data)
From a list of listspd.DataFrame(data=list_of_lists, columns=column_names)from a list of listsdata = [[1, 2], [3, 4]] df = pd.DataFrame(data, columns=[‘col1’, ‘col2’])
From a NumPy arraypd.DataFrame(data=np_array, columns=column_names)from a NumPy arraydata = np.array([[1, 2], [3, 4]]) df = pd.DataFrame(data, columns=[‘col1’, ‘col2’]) pd.DataFrame(np.arange(1,9).reshape(2,4))
Tuples  combined_ages=pd.DataFrame([(22,24,20),(40,50,45)],columns=[‘class 1′,’class 2′,’class 3’])
From a Seriespd.DataFrame(data=series)from a Series (single column)s = pd.Series([1, 2, 3]) df = pd.DataFrame(s, columns=[‘col1’])
Combine series  student_ages=pd.Series([22,24,20]) #series 1 teacher_ages=pd.Series([40,50,45])#series 2   combined_ages=pd.DataFrame([student_ages, teacher_ages]) #DataFrame   combined_ages.columns=[‘class 1′,’class 2’, ‘class 3’]#naming columns  
From a CSV filepd.read_csv(‘file.csv’)Reads data from a CSVdf = pd.read_csv(‘data.csv’)
From an Excel filepd.read_excel(‘file.xlsx’)Reads data from an Exceldf = pd.read_excel(‘data.xlsx’)
Json, html  df = pd.read_json(‘data.json’) url=”https://www.@.com” table=pd.read_html(url) table[0]

Accessing attributes in a DataFrame

a_ages=pd.DataFrame({c1′:[10,2],c2′:[2,6],c3′:[1,8]})

  • index attribute: a.index
  • columns attribute: a.columns
  • value: a.values

Modifying Dataframe

OperationSyntaxExample
Adding a columndf[‘new_column’] = valuesdf = pd.DataFrame({‘A’: [1, 2, 3]}) df[‘B’] = [4, 5, 6]df.insert(2,’c0′,[18,35]) #here 2 is column numberdf.loc[:,’c4]=[20,40] #add new column at enddf = pd.Series([1,2]) df=pd.concat([df,C5],axis=1)
Adding a rowdf.loc[index] = valuesdf.loc[3] = [7, 8]df=df.append({‘C1′:11,’C2′: 22,’C3’:33},ignore_index=True)new_row=pd.DataFrame([{‘C1′:11,’C2′: 22,’C3’:33}]); pd.concat([df,new_row])
Removing a columndel df[‘column_name’] or df.drop(‘column_name’, axis=1)del df[‘B’]df.ages.pop(‘C2′)df.drop([‘C1’],axis=1,inplace=True)
Removing a rowdf.drop(index, axis=0)df.drop(2, axis=0)df[~(df.values<10)] #using boolean selection
Renaming columnsdf.rename(columns={‘old_name’: ‘new_name’})df.rename(columns={‘A’: ‘ColumnA’}, inplace=True)
   
Modifying valuesdf.loc[row_index, column_name] = new_valuedf.loc[0, ‘A’] = 10
Filtering rowsdf[condition]filtered_df = df[df[‘A’] > 2]
Sorting DataFramedf.sort_values(by=’column_name’)df.sort_values(by=’A’, ascending=False)
Replace df.replace({8:15})

Indexing

OperationDescriptionSyntaxExample
Creating a Custom IndexSets a new index for a DataFramedf.set_index(‘column_name’, inplace=True)df.set_index(‘Name’, inplace=True)
Searching Using an IndexAccesses data based on the indexdf.loc[index_value]df.loc[‘Alice’]
Searching Without Using an IndexAccesses data based on positiondf.iloc[position]df.iloc[0]
Immutability of an IndexModifying the index creates a new DataFramedf.index = new_indexdf.index = df.index + 1
Alignment of IndexesAligns DataFrames based on their indexesdf1.align(df2, join=’inner’, axis=0)df1.align(df2, join=’outer’, axis=1)
Set Operations on IndexesPerforms set operations on indexesdf1.index.union(df2.index)df1.index.intersection(df2.index)
Symmetric Difference OperationReturns elements in either set but not bothdf1.index.symmetric_difference(df2.index)df1.index ^ df2.index

import pandas as pd

data = {‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’],

        ‘Age’: [25, 30, 28],

        ‘City’: [‘New York’, ‘Los Angeles’, ‘Chicago’]}

df = pd.DataFrame(data)

 NameAgeCity
0Alice25New York
1Bob30Los Angeles
2Charlie28Chicago

# Creating a custom index

df.set_index(‘Name’, inplace=True)

df

 AgeCity
Name  
Alice25New York
Bob30Los Angeles
Charlie28Chicago

# Searching using the index

print(df.loc[‘Alice’])

 AgeCity
Alice25New York

# Searching without using the index

print(df.iloc[1])

 AgeCity
Bob30Los Angeles

# Immutability of index

new_index = df.index + ‘_new’

df_new.index = new_index

print(df_new)

 AgeCity
Name  
Alice_new25New York
Bob_new30Los Angeles
Charlie_new28Chicago

# Alignment of indexes

# Create two DataFrames with different indexes

df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]}, index=[‘x’, ‘y’])

A B
x 1 3
y 2 4

df2 = pd.DataFrame({‘C’: [5, 6], ‘D’: [7, 8]}, index=[‘x’, ‘z’])

C D
x 5 7
z 6 8

# Inner join

print(df1.join(df2, how=’inner’))

A B C D
x 1 3 5 7

# Outer join

print(df1.join(df2, how=’outer’))

A B C D
x 1.0 3.0 5.0 7.0
y 2.0 4.0 NaN NaN
z NaN NaN 6.0 8.0

# Left join

print(df1.join(df2, how=’left’))

A B C D
x 1 3 5.0 7.0
y 2 4 NaN NaN

# Right join

print(df1.join(df2, how=’right’))

A B C D
x 1.0 3.0 5 7
z NaN NaN 6 8

# Set operations on indexes

# Create two DataFrames with different indexes

df1 = pd.DataFrame({‘A’: [1, 2, 3]}, index=[‘x’, ‘y’, ‘z’])

df2 = pd.DataFrame({‘B’: [4, 5]}, index=[‘x’, ‘w’])

print(df1,”\n”,df2)

A
x 1
y 2
z 3


B
x 4
w 5

# Perform set operations on indexes

union_index = df1.index.union(df2.index)

intersection_index = df1.index.intersection(df2.index)

difference_index = df1.index.difference(df2.index)

symmetric_difference_index = df1.index.symmetric_difference(df2.index)

print(“Union:”, union_index)

Union: Index([‘w’, ‘x’, ‘y’, ‘z’], dtype=’object’)

print(“Intersection:”, intersection_index)

Intersection: Index([‘x’], dtype=’object’)

print(“Difference:”, difference_index)

Difference: Index([‘y’, ‘z’], dtype=’object’)

print(“Symmetric Difference:”, symmetric_difference_index)

Symmetric Difference: Index([‘w’, ‘y’, ‘z’], dtype=’object’)

Group by and Aggregation

grouped.agg(func, axis=0)

OperationDescriptionSyntaxExample
Group by one columnGroup data by a single columndf.groupby(‘column_name’)df.groupby(‘Department’)
Group by multiple columnsGroup data by multiple columnsdf.groupby([‘column1’, ‘column2’])df.groupby([‘Department’, ‘Location’])
Apply aggregation functionApply a function to each groupgrouped.agg(func)grouped.agg(‘mean’)
Apply multiple aggregation functionsApply multiple functions to each groupgrouped.agg([‘sum’, ‘mean’])grouped.agg({‘Sales’: [‘sum’, ‘mean’], ‘Profit’: ‘max’})

Aggregate functions

  • sum(): Calculates the sum.
  • mean(): Calculates the mean.
  • min(): Calculates the minimum.
  • max(): Calculates the maximum.
  • count(): Counts the number of non-null values.
  • std(): Calculates the standard deviation.
  • var(): Calculates the variance.
  • first(): Returns the first value.
  • last(): Returns the last value.
  • describe(): Generates descriptive statistics.
OperationDescriptionSyntaxExample
get_group()Retrieves a specific group from a GroupBy objectgrouped.get_group(name)grouped.get_group(‘Sales’)
Filtering groupsFilters groups based on conditionsgrouped.filter(lambda x: condition)grouped.filter(lambda x: x[‘Salary’].mean() > 50000)
.transform()Applies a function to each group and returns a Series or DataFrame with the same shape as the original groupsgrouped.transform(func)grouped[‘Salary’].transform(lambda x: x – x.mean())
.apply()Applies a function to each group and returns a DataFrame with the transformed valuesgrouped.apply(func)grouped.apply(lambda x: x.max() – x.min())
OperationDescriptionSyntaxExample
.append()Appends rows from another DataFrame or Seriesdf.append(other, ignore_index=False, verify_integrity=False)df = df.append(new_data, ignore_index=True)
pd.concat()Concatenates objects along a particular axispd.concat(objs, axis=0, join=’outer’, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)combined_df   1. stackoverflow.com stackoverflow.com = pd.concat([df1, df2], axis=1)
pd.join()Joins columns of two DataFrames based on their indexesdf1.join(df2, how=’inner’, lsuffix=”, rsuffix=”, sort=False)merged_df = df1.join(df2, how=’left’)
pd.merge()Merges DataFrames based on one or more columnspd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)merged_df = pd.merge(df1, df2, on=’key_column’, how=’outer’)

data = {‘Department’: [‘Sales’, ‘Marketing’, ‘Sales’, ‘Marketing’, ‘HR’],

        ‘Salary’: [50000, 60000, 45000, 55000, 35000]}

df = pd.DataFrame(data)

 DepartmentSalary
0Sales50000
1Marketing60000
2Sales45000
3Marketing55000
4HR35000

# Groupby and get a group

grouped = df.groupby(‘Department’)

sales_group = grouped.get_group(‘Sales’)

 DepartmentSalary
0Sales50000
2Sales45000

# Filter groups

high_salary_groups = grouped.filter(lambda x: x[‘Salary’].mean() > 50000)

 DepartmentSalary
1Marketing60000
3Marketing55000

# Transform and apply

salary_diff = grouped[‘Salary’].transform(lambda x: x – x.mean())

summary_stats = grouped.apply(lambda x: pd.Series({‘min’: x.min(), ‘max’: x.max()}))

 minmax 
Department  Department
HRDepartment HR Salary 35000 dtype:…Department HR Salary 35000 dtype:…HR
MarketingDepartment Marketing Salary 5500…Department Marketing Salary 6000…Marketing
SalesDepartment Sales Salary 45000 dtype:…Department Sales Salary 50000 dtype:…Sales

# Append, concat (use concat instead of append)

new_data = {‘Department’: [‘IT’], ‘Salary’: [65000]}

df_new = pd.DataFrame(new_data)

df_appended = pd.concat([df, df_new], ignore_index=True)

 DepartmentSalary
0Sales50000
1Marketing60000
2Sales45000
3Marketing55000
4HR35000
5IT65000

df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]}, index=[‘x’, ‘y’])

df2 = pd.DataFrame({‘C’: [5, 6], ‘D’: [7, 8]}, index=[‘x’, ‘z’])

   A  B

x  1  3

y  2  4

    C  D

x  5  7

z  6  8

df_joined = df1.join(df2, how=’outer’)

     A    B    C    D

x  1.0  3.0  5.0  7.0

y  2.0  4.0  NaN  NaN

z  NaN  NaN  6.0  8.0

df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how=’inner’)

     A    B    C    D

x    1    3    5    7

# Pivot table

pivot_table = df.pivot_table(index=’Department’, values=’Salary’, aggfunc=’mean’)

print(pivot_table)

Venu Kumar M
Venu Kumar M