Pandas
Series is like a column (has only one dimension), and a DataFrame (has two dimensions) is like a table or a spreadsheet with rows and columns
Create panda series
Method | Syntax | Description | Example |
From a list | pd.Series(data=list) | Series from a Python list | data = [1, 2, 3] s = pd.Series(data) pd.Series([2]*5) |
From a NumPy array | pd.Series(data=np.array) | Series from a NumPy array | data = np.array([1, 2, 3]) s = pd.Series(data) |
From a dictionary | pd.Series(data=dict) | Series from a dictionary (keys become index) | data = {‘a’: 1, ‘b’: 2, ‘c’: 3} s = pd.Series(data) |
From a scalar value | pd.Series(data=scalar, index=index) | Series with a single value and specified index | s = pd.Series([5,6,7], index=[‘a’, ‘b’, ‘c’]) |
String | pd.Series(list(“String”)) | ||
arrange | pd.Series(np.arange(1,2)) | ||
random | pd.Series(np.random.normal(size=3)) |
Panda series operations
Method | Description | Example |
Length | Length | x=pd.Series(np.arange(1,10)) len(x) |
Shape | Size | x.shape (9,) |
Values | Values | x.values |
Index | Access index | x.index o/p: RangeIndex(start=0, stop=9, step=1) |
Unique values | Value counts | z.value_counts() z.value_counts().head(2) z.value_counts().head(2).values |
DataFrames
Series object contains two components – a set of values, and index labels
DataFrame has three components – column, index, and a NumPy array object that contains the values
Method | Syntax | Description | Example |
From a dictionary | pd.DataFrame(data=dict) | from a dictionary | data = {‘col1’: [1, 2], ‘col2’: [3, 4]} df = pd.DataFrame(data) |
From a list of lists | pd.DataFrame(data=list_of_lists, columns=column_names) | from a list of lists | data = [[1, 2], [3, 4]] df = pd.DataFrame(data, columns=[‘col1’, ‘col2’]) |
From a NumPy array | pd.DataFrame(data=np_array, columns=column_names) | from a NumPy array | data = np.array([[1, 2], [3, 4]]) df = pd.DataFrame(data, columns=[‘col1’, ‘col2’]) pd.DataFrame(np.arange(1,9).reshape(2,4)) |
Tuples | combined_ages=pd.DataFrame([(22,24,20),(40,50,45)],columns=[‘class 1′,’class 2′,’class 3’]) | ||
From a Series | pd.DataFrame(data=series) | from a Series (single column) | s = pd.Series([1, 2, 3]) df = pd.DataFrame(s, columns=[‘col1’]) |
Combine series | student_ages=pd.Series([22,24,20]) #series 1 teacher_ages=pd.Series([40,50,45])#series 2 combined_ages=pd.DataFrame([student_ages, teacher_ages]) #DataFrame combined_ages.columns=[‘class 1′,’class 2’, ‘class 3’]#naming columns | ||
From a CSV file | pd.read_csv(‘file.csv’) | Reads data from a CSV | df = pd.read_csv(‘data.csv’) |
From an Excel file | pd.read_excel(‘file.xlsx’) | Reads data from an Excel | df = pd.read_excel(‘data.xlsx’) |
Json, html | df = pd.read_json(‘data.json’) url=”https://www.@.com” table=pd.read_html(url) table[0] |
Accessing attributes in a DataFrame
a_ages=pd.DataFrame({c1′:[10,2],c2′:[2,6],c3′:[1,8]})
- index attribute: a.index
- columns attribute: a.columns
- value: a.values
Modifying Dataframe
Operation | Syntax | Example |
Adding a column | df[‘new_column’] = values | df = pd.DataFrame({‘A’: [1, 2, 3]}) df[‘B’] = [4, 5, 6]df.insert(2,’c0′,[18,35]) #here 2 is column numberdf.loc[:,’c4]=[20,40] #add new column at enddf = pd.Series([1,2]) df=pd.concat([df,C5],axis=1) |
Adding a row | df.loc[index] = values | df.loc[3] = [7, 8]df=df.append({‘C1′:11,’C2′: 22,’C3’:33},ignore_index=True)new_row=pd.DataFrame([{‘C1′:11,’C2′: 22,’C3’:33}]); pd.concat([df,new_row]) |
Removing a column | del df[‘column_name’] or df.drop(‘column_name’, axis=1) | del df[‘B’]df.ages.pop(‘C2′)df.drop([‘C1’],axis=1,inplace=True) |
Removing a row | df.drop(index, axis=0) | df.drop(2, axis=0)df[~(df.values<10)] #using boolean selection |
Renaming columns | df.rename(columns={‘old_name’: ‘new_name’}) | df.rename(columns={‘A’: ‘ColumnA’}, inplace=True) |
Modifying values | df.loc[row_index, column_name] = new_value | df.loc[0, ‘A’] = 10 |
Filtering rows | df[condition] | filtered_df = df[df[‘A’] > 2] |
Sorting DataFrame | df.sort_values(by=’column_name’) | df.sort_values(by=’A’, ascending=False) |
Replace | df.replace({8:15}) |
Indexing
Operation | Description | Syntax | Example |
Creating a Custom Index | Sets a new index for a DataFrame | df.set_index(‘column_name’, inplace=True) | df.set_index(‘Name’, inplace=True) |
Searching Using an Index | Accesses data based on the index | df.loc[index_value] | df.loc[‘Alice’] |
Searching Without Using an Index | Accesses data based on position | df.iloc[position] | df.iloc[0] |
Immutability of an Index | Modifying the index creates a new DataFrame | df.index = new_index | df.index = df.index + 1 |
Alignment of Indexes | Aligns DataFrames based on their indexes | df1.align(df2, join=’inner’, axis=0) | df1.align(df2, join=’outer’, axis=1) |
Set Operations on Indexes | Performs set operations on indexes | df1.index.union(df2.index) | df1.index.intersection(df2.index) |
Symmetric Difference Operation | Returns elements in either set but not both | df1.index.symmetric_difference(df2.index) | df1.index ^ df2.index |
import pandas as pd
data = {‘Name’: [‘Alice’, ‘Bob’, ‘Charlie’],
‘Age’: [25, 30, 28],
‘City’: [‘New York’, ‘Los Angeles’, ‘Chicago’]}
df = pd.DataFrame(data)
Name | Age | City | |
0 | Alice | 25 | New York |
1 | Bob | 30 | Los Angeles |
2 | Charlie | 28 | Chicago |
# Creating a custom index
df.set_index(‘Name’, inplace=True)
df
Age | City | |
Name | ||
Alice | 25 | New York |
Bob | 30 | Los Angeles |
Charlie | 28 | Chicago |
# Searching using the index
print(df.loc[‘Alice’])
Age | City | |
Alice | 25 | New York |
# Searching without using the index
print(df.iloc[1])
Age | City | |
Bob | 30 | Los Angeles |
# Immutability of index
new_index = df.index + ‘_new’
df_new.index = new_index
print(df_new)
Age | City | |
Name | ||
Alice_new | 25 | New York |
Bob_new | 30 | Los Angeles |
Charlie_new | 28 | Chicago |
# Alignment of indexes
# Create two DataFrames with different indexes
df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]}, index=[‘x’, ‘y’])
A B
x 1 3
y 2 4
df2 = pd.DataFrame({‘C’: [5, 6], ‘D’: [7, 8]}, index=[‘x’, ‘z’])
C D
x 5 7
z 6 8
# Inner join
print(df1.join(df2, how=’inner’))
A B C D
x 1 3 5 7
# Outer join
print(df1.join(df2, how=’outer’))
A B C D
x 1.0 3.0 5.0 7.0
y 2.0 4.0 NaN NaN
z NaN NaN 6.0 8.0
# Left join
print(df1.join(df2, how=’left’))
A B C D
x 1 3 5.0 7.0
y 2 4 NaN NaN
# Right join
print(df1.join(df2, how=’right’))
A B C D
x 1.0 3.0 5 7
z NaN NaN 6 8
# Set operations on indexes
# Create two DataFrames with different indexes
df1 = pd.DataFrame({‘A’: [1, 2, 3]}, index=[‘x’, ‘y’, ‘z’])
df2 = pd.DataFrame({‘B’: [4, 5]}, index=[‘x’, ‘w’])
print(df1,”\n”,df2)
A
x 1
y 2
z 3
B
x 4
w 5
# Perform set operations on indexes
union_index = df1.index.union(df2.index)
intersection_index = df1.index.intersection(df2.index)
difference_index = df1.index.difference(df2.index)
symmetric_difference_index = df1.index.symmetric_difference(df2.index)
print(“Union:”, union_index)
Union: Index([‘w’, ‘x’, ‘y’, ‘z’], dtype=’object’)
print(“Intersection:”, intersection_index)
Intersection: Index([‘x’], dtype=’object’)
print(“Difference:”, difference_index)
Difference: Index([‘y’, ‘z’], dtype=’object’)
print(“Symmetric Difference:”, symmetric_difference_index)
Symmetric Difference: Index([‘w’, ‘y’, ‘z’], dtype=’object’)
Group by and Aggregation
grouped.agg(func, axis=0)
Operation | Description | Syntax | Example |
Group by one column | Group data by a single column | df.groupby(‘column_name’) | df.groupby(‘Department’) |
Group by multiple columns | Group data by multiple columns | df.groupby([‘column1’, ‘column2’]) | df.groupby([‘Department’, ‘Location’]) |
Apply aggregation function | Apply a function to each group | grouped.agg(func) | grouped.agg(‘mean’) |
Apply multiple aggregation functions | Apply multiple functions to each group | grouped.agg([‘sum’, ‘mean’]) | grouped.agg({‘Sales’: [‘sum’, ‘mean’], ‘Profit’: ‘max’}) |
Aggregate functions
- sum(): Calculates the sum.
- mean(): Calculates the mean.
- min(): Calculates the minimum.
- max(): Calculates the maximum.
- count(): Counts the number of non-null values.
- std(): Calculates the standard deviation.
- var(): Calculates the variance.
- first(): Returns the first value.
- last(): Returns the last value.
- describe(): Generates descriptive statistics.
Operation | Description | Syntax | Example |
get_group() | Retrieves a specific group from a GroupBy object | grouped.get_group(name) | grouped.get_group(‘Sales’) |
Filtering groups | Filters groups based on conditions | grouped.filter(lambda x: condition) | grouped.filter(lambda x: x[‘Salary’].mean() > 50000) |
.transform() | Applies a function to each group and returns a Series or DataFrame with the same shape as the original groups | grouped.transform(func) | grouped[‘Salary’].transform(lambda x: x – x.mean()) |
.apply() | Applies a function to each group and returns a DataFrame with the transformed values | grouped.apply(func) | grouped.apply(lambda x: x.max() – x.min()) |
Operation | Description | Syntax | Example |
.append() | Appends rows from another DataFrame or Series | df.append(other, ignore_index=False, verify_integrity=False) | df = df.append(new_data, ignore_index=True) |
pd.concat() | Concatenates objects along a particular axis | pd.concat(objs, axis=0, join=’outer’, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True) | combined_df 1. stackoverflow.com stackoverflow.com = pd.concat([df1, df2], axis=1) |
pd.join() | Joins columns of two DataFrames based on their indexes | df1.join(df2, how=’inner’, lsuffix=”, rsuffix=”, sort=False) | merged_df = df1.join(df2, how=’left’) |
pd.merge() | Merges DataFrames based on one or more columns | pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None) | merged_df = pd.merge(df1, df2, on=’key_column’, how=’outer’) |
data = {‘Department’: [‘Sales’, ‘Marketing’, ‘Sales’, ‘Marketing’, ‘HR’],
‘Salary’: [50000, 60000, 45000, 55000, 35000]}
df = pd.DataFrame(data)
Department | Salary | |
0 | Sales | 50000 |
1 | Marketing | 60000 |
2 | Sales | 45000 |
3 | Marketing | 55000 |
4 | HR | 35000 |
# Groupby and get a group
grouped = df.groupby(‘Department’)
sales_group = grouped.get_group(‘Sales’)
Department | Salary | |
0 | Sales | 50000 |
2 | Sales | 45000 |
# Filter groups
high_salary_groups = grouped.filter(lambda x: x[‘Salary’].mean() > 50000)
Department | Salary | |
1 | Marketing | 60000 |
3 | Marketing | 55000 |
# Transform and apply
salary_diff = grouped[‘Salary’].transform(lambda x: x – x.mean())
summary_stats = grouped.apply(lambda x: pd.Series({‘min’: x.min(), ‘max’: x.max()}))
min | max | ||
Department | Department | ||
HR | Department HR Salary 35000 dtype:… | Department HR Salary 35000 dtype:… | HR |
Marketing | Department Marketing Salary 5500… | Department Marketing Salary 6000… | Marketing |
Sales | Department Sales Salary 45000 dtype:… | Department Sales Salary 50000 dtype:… | Sales |
# Append, concat (use concat instead of append)
new_data = {‘Department’: [‘IT’], ‘Salary’: [65000]}
df_new = pd.DataFrame(new_data)
df_appended = pd.concat([df, df_new], ignore_index=True)
Department | Salary | |
0 | Sales | 50000 |
1 | Marketing | 60000 |
2 | Sales | 45000 |
3 | Marketing | 55000 |
4 | HR | 35000 |
5 | IT | 65000 |
df1 = pd.DataFrame({‘A’: [1, 2], ‘B’: [3, 4]}, index=[‘x’, ‘y’])
df2 = pd.DataFrame({‘C’: [5, 6], ‘D’: [7, 8]}, index=[‘x’, ‘z’])
A B
x 1 3
y 2 4
C D
x 5 7
z 6 8
df_joined = df1.join(df2, how=’outer’)
A B C D
x 1.0 3.0 5.0 7.0
y 2.0 4.0 NaN NaN
z NaN NaN 6.0 8.0
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how=’inner’)
A B C D
x 1 3 5 7
# Pivot table
pivot_table = df.pivot_table(index=’Department’, values=’Salary’, aggfunc=’mean’)
print(pivot_table)