Statistics for analytics (all in one)

What is statistics?

Statistics is the science of collecting and analyzing data in order to draw conclusions from it

What do we do using statistics?

Descriptive Statistics

We describe the data using statistical measures

  • Collect data
  • Organize data/clean
  • Analyze/summarize data
  • Draw conclusions

Inferential Statistics

We go a step beyond descriptive statistics: using sample data we predict or forecast, test hypotheses, and make decisions about the population

  • Predict/forecast
  • Test hypotheses
  • Draw conclusions

Scales of measurements

  • Nominal scale: categories with no inherent order (e.g., colors, names)
  • Ordinal scale: categories with a meaningful order but unequal spacing (e.g., rankings, ratings)
  • Interval scale: ordered data with equal, meaningful distances between values but no true zero (e.g., temperature in °C)
  • Ratio scale: interval data with a true zero, so ratios and proportions are meaningful (e.g., height, weight)

Samples vs population

Sample (n): portion of population data

Population (N): entire universe of data

Why sample?

  • Convenience
  • Cost
  • Necessity
  • Impossible
  • Not practical

Quartile and Percentile

Quartile:

Ordered data is divided into four equal parts (quarters) by the quartiles Q1, Q2 and Q3

Q1 = Lower quartile = 25th percentile

Q2 = Second quartile = 50th percentile = Median (middle value)

Q3 = Third Quartile = 75th percentile

Percentile

For an ordered set of values, the Pth percentile is the value below which P% of the observations fall

  • Order the data in ascending order
  • Position of the Pth percentile = (n + 1) × P / 100, where n is the number of observations and P is the percentile value

What is the IQR? Interquartile range

IQR = Q3 – Q1

The difference between the third and the first quartile
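A minimal sketch in Python of the (n + 1) × P / 100 position rule and the IQR; numpy is assumed to be available, and its default percentile method uses a slightly different interpolation convention than the rule above:

```python
import numpy as np

data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18])

def percentile_position(data, p):
    """Pth percentile using the (n + 1) * P / 100 position rule."""
    n = len(data)
    pos = (n + 1) * p / 100          # 1-based position, may be fractional
    lo = int(pos)
    frac = pos - lo
    if lo < 1:
        return data[0]
    if lo >= n:
        return data[-1]
    # linear interpolation between the two neighbouring order statistics
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

q1 = percentile_position(data, 25)
q3 = percentile_position(data, 75)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
# numpy's default convention gives slightly different quartiles
print("numpy IQR =", np.percentile(data, 75) - np.percentile(data, 25))
```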

Measures of central tendency

  • Mean (average)
  • Median (middle value)
  • Mode (most frequent)

Measures of variability

  • Variance (Average of the squared difference from the mean)
  • Standard deviation (Sqrt of variance)
  • IQR(Q3-Q1)
  • Range (max-min)

Population variance: the divisor is N, the population size

Sample variance: the divisor is n − 1 instead of n (Bessel’s correction, which reduces the bias of the estimate)
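A quick sketch with Python’s standard statistics module, illustrating the central tendency measures and the N versus n − 1 divisor; the data is invented for illustration:

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean   =", st.mean(data))      # average
print("median =", st.median(data))    # middle value
print("mode   =", st.mode(data))      # most frequent value

# Population variance divides by N; sample variance divides by n - 1
print("population variance =", st.pvariance(data))  # divisor N
print("sample variance     =", st.variance(data))   # divisor n - 1 (Bessel)
print("population std dev  =", st.pstdev(data))
print("range               =", max(data) - min(data))
```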

Other Measures

  • Skewness (Degree of asymmetry)
    • Left skew (tail on the left)
    • Right skew (tail on the right)
    • Symmetric
  • Kurtosis (measure of peakedness/tail weight)
    • Leptokurtic (sharp peak, heavy tails)
    • Mesokurtic (normal)
    • Platykurtic (flat peak, light tails)

Frequency distribution

Frequency: the number of times a value appears in the data

Relative frequency: the frequency as a proportion of the total

Cumulative frequency: the running total of frequencies, adding each row to the previous ones

Value | Frequency | Relative frequency | Cumulative frequency
2     | 2         | 0.4                | 2
1     | 1         | 0.2                | 3
3     | 1         | 0.2                | 4
4     | 1         | 0.2                | 5
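A minimal sketch that rebuilds the table above from the raw data [2, 2, 1, 3, 4] using only the standard library:

```python
from collections import Counter

data = [2, 2, 1, 3, 4]            # the data behind the table above
counts = Counter(data)
total = len(data)

cumulative = 0
print(f"{'Value':>5} {'Freq':>5} {'RelFreq':>8} {'CumFreq':>8}")
for value, freq in counts.most_common():   # most frequent first, as above
    cumulative += freq
    print(f"{value:>5} {freq:>5} {freq/total:>8.1f} {cumulative:>8}")
```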

Relations between the mean and S.D (Empirical rule, Chebyshev’s theorem)

Empirical rule

  • Applies to normal/symmetric distributions
  • 1 S.D from the mean ≈ 68% of the data
  • 2 S.D from the mean ≈ 95% of the data
  • 3 S.D from the mean ≈ 99.7% of the data

Chebyshev’s theorem

  • Applies to any shaped curve
  • At least (1 − 1/k²) of the data lies within k S.D of the mean, for k > 1
  • At least 75% of data lies within 2 S.D
  • At least 89% of data lies within 3 S.D
  • At least 94% of data lies within 4 S.D
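A quick empirical check of the two rules, comparing the observed fraction of data within k standard deviations against Chebyshev’s lower bound; the data is synthetic, drawn from a normal distribution with numpy (assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)   # synthetic, ~normal
mu, sd = data.mean(), data.std()

for k in (2, 3, 4):
    within = np.mean(np.abs(data - mu) <= k * sd)   # observed fraction
    bound = 1 - 1 / k**2                            # Chebyshev's guarantee
    print(f"k={k}: observed {within:.3f}, Chebyshev lower bound {bound:.3f}")
```

For normal data the observed fractions track the empirical rule (0.95, 0.997, …), comfortably above Chebyshev’s distribution-free bounds.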

Set

  • Empty set – a set containing no elements
  • Universal set S – set containing all possible elements
  • Complement (Not) – all elements of S not in A

Intersection (And) – common elements (A ∩ B)

Union (Or) – all elements in A or B or both (A ∪ B)

Probability basics

  • Complement (not A): P(Ā) = 1 − P(A)
  • Intersection (both A and B): P(A ∩ B) = n(A ∩ B)/n(S)
    • where n(S) is the number of elements in the sample space S
    • P(A) = n(A)/n(S); P(B) = n(B)/n(S)
    • n(A ∩ B) = n(A) + n(B) − n(A ∪ B)
  • Mutually exclusive events: P(A ∩ B) = 0 (note: mutually exclusive is not the same as independent; for independent events P(A ∩ B) = P(A) P(B))
  • Union: P(A ∪ B) = P(A) + P(B) − P(A ∩ B); for mutually exclusive events the last term is 0

Conditional Probability

  • Probability of A given B = P(A|B) = P(A ∩ B)/P(B)
  • For independent events, P(A|B) = P(A) and P(A ∩ B) = P(A) P(B)
  • The union rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) still applies
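A minimal sketch estimating these probabilities from raw event data; the events below are made up purely for illustration:

```python
# Each set records which of the events A, B occurred in one trial.
events = [
    {"A", "B"}, {"A"}, {"B"}, {"A", "B"}, {"B"}, set(), {"A", "B"}, {"B"},
]
n = len(events)

p_a = sum("A" in e for e in events) / n
p_b = sum("B" in e for e in events) / n
p_ab = sum("A" in e and "B" in e for e in events) / n

print(f"P(A) = {p_a:.3f}, P(B) = {p_b:.3f}, P(A and B) = {p_ab:.3f}")
print(f"P(A | B)  = {p_ab / p_b:.3f}")              # conditional probability
print(f"P(A or B) = {p_a + p_b - p_ab:.3f}")        # inclusion-exclusion
print("independent?", abs(p_ab - p_a * p_b) < 1e-9) # P(AB) == P(A)P(B)?
```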

Contingency Table

Counts      | TCS | Infosys | Total
Marketing   |  40 |      10 |    50
Finance     |  20 |      30 |    50
Total       |  60 |      40 |   100

Probability | TCS | Infosys | Total
Marketing   | 0.4 |     0.1 |   0.5
Finance     | 0.2 |     0.3 |   0.5
Total       | 0.6 |     0.4 |     1

P(TCS | Marketing) = P(TCS ∩ Marketing) / P(Marketing) = 0.4/0.5 = 80%

Product rules for independent events

  • The probability of the intersection of several independent events is the product of their separate individual probabilities
    • P(A1 ∩ A2 ∩ A3 ∩ A4 ∩… An) = P(A1) P(A2) … P(An)
  • The probability of the union of several independent events  is 1 minus the product of probabilities of their complements
    • P(A1 ∪ A2 ∪ A3 ∪ A4 ∪… An) = 1 – (P(A1 bar) P(A2 bar) … P(An bar))

Factorial

  • 7! =  1 × 2 × 3 × 4 × 5 × 6 × 7

Permutation

  • The number of ordered arrangements of r items chosen from a set of n
  • P(n,r) = n! ÷ (n-r)!

Combination

  • The number of ways of selecting r items from a collection of n when order does not matter
  • nCr = n! / (r! * (n-r)!)
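The standard library covers all three directly (math.perm and math.comb require Python 3.8 or newer):

```python
import math

print(math.factorial(7))    # 5040 = 1*2*3*4*5*6*7
print(math.perm(5, 2))      # 20 ordered arrangements: n!/(n-r)!
print(math.comb(5, 2))      # 10 unordered selections: n!/(r!(n-r)!)
```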

The law of total probability

  • P(A) = P(A ∩ B) + P(A ∩ B̄)
  • More generally, conditioning on a partition {Bᵢ}:
    • P(A) = Σᵢ P(A ∩ Bᵢ) = Σᵢ P(A|Bᵢ) P(Bᵢ)

Bayes’ theorem

  • Bayes’ theorem lets us reverse a conditional probability: knowing P(A|B) plus the marginal probabilities, we can find P(B|A)
  • P(A|B) = P(B|A)P(A) ÷ P(B)
  • P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
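A small worked sketch of Bayes’ theorem with made-up numbers: a test for a condition affecting 1% of a population, with a 95% true-positive rate and a 10% false-positive rate:

```python
p_condition = 0.01              # prior P(condition)
p_pos_given_condition = 0.95    # sensitivity
p_pos_given_healthy = 0.10      # false-positive rate

# Law of total probability gives the denominator P(positive)
p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1 - p_condition))

p_condition_given_pos = p_pos_given_condition * p_condition / p_positive
print(f"P(condition | positive) = {p_condition_given_pos:.3f}")  # ~0.088
```

Even with a fairly accurate test, the low prior keeps the posterior probability under 9%, which is exactly the kind of reversal Bayes’ theorem captures.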

Joint probability table

  • Joint probability table is similar to a contingency table, except that it has probabilities in place of frequencies
Dividing each cell of the contingency table shown earlier by the grand total (100) gives the joint probability table:

Probability | TCS | Infosys | Total
Marketing   | 0.4 |     0.1 |   0.5
Finance     | 0.2 |     0.3 |   0.5
Total       | 0.6 |     0.4 |     1

Row conditional probabilities

  • Represent the likelihood of an employee being in a specific company given the department
  • P(Company | Department) = P(Company and Department) / P(Department)
  • P(TCS | Marketing) = 0.4 / 0.5 = 0.8
  • P(Infosys | Marketing) = 0.1 / 0.5 = 0.2
  • P(TCS | Finance) = 0.2 / 0.5 = 0.4
  • P(Infosys | Finance) = 0.3 / 0.5 = 0.6

Column conditional probabilities

  • Represent the likelihood of an employee being in a specific department given the company
  • P(Department | Company) = P(Company and Department) / P(Company)
  • P(Marketing | TCS) = 0.4 / 0.6 ≈ 0.67
  • P(Finance | TCS) = 0.2 / 0.6 ≈ 0.33
  • P(Marketing | Infosys) = 0.1 / 0.4 = 0.25
  • P(Finance | Infosys) = 0.3 / 0.4 = 0.75
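A minimal sketch computing all of the row and column conditionals from the counts table above, using plain dictionaries:

```python
# Counts from the contingency table: (department, company) -> employees
counts = {
    ("Marketing", "TCS"): 40, ("Marketing", "Infosys"): 10,
    ("Finance",   "TCS"): 20, ("Finance",   "Infosys"): 30,
}

dept_total = {d: sum(v for (dd, _), v in counts.items() if dd == d)
              for d in ("Marketing", "Finance")}
comp_total = {c: sum(v for (_, cc), v in counts.items() if cc == c)
              for c in ("TCS", "Infosys")}

for (dept, comp), n in counts.items():
    print(f"P({comp} | {dept}) = {n / dept_total[dept]:.2f}   "
          f"P({dept} | {comp}) = {n / comp_total[comp]:.2f}")
```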

Binomial distribution

  • Gives the probability of a specific number of successes (x) within a fixed number of independent trials (n), each with the same success probability
  • P(x successes in n trials) = nCx * p^x * (1-p)^(n-x)
    • P(x successes in n trials) represents the probability of getting x successes
    • nCx (n choose x) is the number of ways to choose x successes out of n trials
    • p is the probability of success in each trial (between 0 and 1).
    • (1-p) is the probability of failure in each trial (1 minus the probability of success)
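A direct translation of the formula into Python, applied to a coin-flip example:

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success prob p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. probability of exactly 3 heads in 10 fair coin flips
print(f"{binomial_pmf(3, 10, 0.5):.4f}")   # 0.1172
```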

Poisson distribution

  • The Poisson distribution models the number of events occurring in a specific interval, given a known average rate
  • Being a discrete distribution, it is described by a probability mass function (PMF)
  • P(x events) = (e^(-λ) * λ^x) / x!
    • e – mathematical constant or Euler’s number (approximately 2.71828)
    • λ (lambda) – average rate of events
    • x – number of events (that we are interested in)
    • x! – factorial of x
  • A bakery receives an average of 5 customers per hour (λ = 5). What’s the probability of receiving exactly 3 customers in the next hour?
    • λ = 5 (average customers per hour)
    • x = 3 (number of customers we’re interested in)
    • P(x events) = (e^(-λ) * λ^x) / x!
    • P(3 events) = (e^(-5) × 5^3) / 3! ≈ 0.1404

So, there is roughly a 14% chance of exactly 3 customers visiting in the next hour
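The bakery example verified directly from the formula:

```python
import math

def poisson_pmf(x, lam):
    """P(exactly x events) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

print(f"{poisson_pmf(3, 5):.4f}")   # 0.1404: exactly 3 customers, λ = 5
```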

Normal distribution

  • Bell-shaped and symmetric distributions
  • Because the distribution is symmetric, one-half (.50 or 50%) lies on either side of the mean

Central limit theorem

  • Regardless of the shape of the original data distribution, as sample size increases, the distribution of sample means becomes more normal
  • The mean of the sample means is equal to the population mean
  • The standard deviation of the sample means (standard error) decreases as sample size increases
  • A sample size of 30 is often considered sufficient for the CLT to apply
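A simulation sketch of the CLT using numpy (assumed available): the population is deliberately skewed, yet the means of larger samples concentrate around the population mean with shrinking standard error:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=1_000_000)  # heavily skewed

for n in (2, 30, 200):
    # distribution of the means of 10,000 samples of size n
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean of sample means = {sample_means.mean():.3f}, "
          f"standard error = {sample_means.std():.3f} "
          f"(theory: {population.std()/np.sqrt(n):.3f})")
```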

Z score

  • The empirical rule gives only approximate percentages for a normal distribution; the z-score lets us find an exact percentile
  • It tells us how many standard deviations a particular data point lies above or below the mean
  • z = (x – μ) / σ; x- specific point, μ is mean and σ is S.D
  • Z is +ve: data point is above the mean
  • Z is -ve: data point is below the mean
  • Z is 0: data point is on the mean
  • The percentile can be read from the z-table using the z value
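A minimal sketch computing a z-score and its percentile without a printed z-table, using the standard normal CDF via math.erf; the point, mean, and S.D below are invented:

```python
import math

def z_score(x, mu, sigma):
    return (x - mu) / sigma

def normal_cdf(z):
    """Percentile of a z value under the standard normal (replaces the z-table)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(x=75, mu=60, sigma=10)        # 1.5 S.D above the mean
print(f"z = {z:.2f}, percentile = {normal_cdf(z):.4f}")   # ~0.9332
```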

Student’s T test/ distribution

  • In real-life problems we typically do not know the population standard deviation (σ)
  • Since we cannot compute a z-score without σ, we use Student’s t-test instead
  • The inputs are: μ = hypothesized population mean, n = sample size, x̄ = sample mean, α (alpha) = level of significance (e.g., 0.05), and s = sample standard deviation (calculated from the sample data), which replaces σ
  • Next we calculate the t-score
    • t = (x̄ − μ) / (s / √n)
    • Degrees of freedom = n − 1 (used to look up the critical value in the t-table)
    • Decide on a one-sided or a two-sided test
    • Compare t against the critical value from the t-table
      • If |t| < critical value, we fail to reject the null hypothesis
      • If |t| ≥ critical value, we reject the null hypothesis
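A one-sample t-test sketch with invented data, computing the t-score by hand; for the p-value route, scipy.stats.ttest_1samp performs the same test:

```python
import math
import statistics as st

# Is the true mean 50? (sample values are made up for illustration)
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 47.6]
mu0 = 50.0                                  # hypothesized population mean

n = len(sample)
x_bar = st.mean(sample)
s = st.stdev(sample)                        # sample std dev (n - 1 divisor)

t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f}, degrees of freedom = {n - 1}")
# Compare |t| with the t-table critical value (about 2.365 for df = 7,
# two-sided, alpha = 0.05) to decide whether to reject H0.
```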

Chebyshev’s theorem (worked example)

  • For any distribution, regardless of shape, the proportion of data lying within k standard deviations of the mean (k > 1) is at least 1 − 1/k²
  • The mean time in a women’s 400-meter dash is 52.4 seconds with a standard deviation of 2.2 sec. Apply Chebychev’s theorem for k = 2
  • At least 75% of the women’s 400-meter dash times will fall between 48 and 56.8 seconds

Hypothesis testing

  • A hypothesis is an educated guess or a tentative statement about the relationship between two or more variables
  • Null hypothesis (H0): there is no relationship or difference between the variables being studied
  • Alternative hypothesis (H1): the opposite of the null hypothesis; there is a specific relationship or difference between the variables
  • We either fail to reject the null hypothesis or reject it
  • We reject the null hypothesis when
    • α (alpha) is set at a low value, commonly 0.05 (5%) or 0.01 (1%)
    • the p-value is less than α; we then accept the alternative hypothesis
Possible hypothesis combinations

Null hypothesis (H0)  | Alternative hypothesis (H1)
H0: p = (two-tailed)  | H1: p ≠ (two-tailed)
H0: p ≤ (one-tailed)  | H1: p > (one-tailed)
H0: p ≥ (one-tailed)  | H1: p < (one-tailed)

Errors

What if we accept the wrong hypothesis?

That is an error, and there are two kinds:

Types of error – Type I & Type II errors

            | H0 is true                                    | H0 is false
H0 rejected | Type I error (α – alpha)                      | No error = true positive, probability = 1 − β
H0 accepted | No error = true negative, probability = 1 − α | Type II error (β – beta)

Note: for a fixed sample size, α and β are inversely related (lowering one raises the other)

Ways to reduce Type I & II errors:

  • Set the level of significance appropriately (commonly ~5%); lowering α reduces Type I errors but increases β
  • Increase the sample size of the test, which reduces both error types

Power of test = 1 – β

Confidence interval and Margin of error

  • Confidence Interval (CI): the range of values that is likely to contain the true population parameter (mean, proportion) at a certain level of confidence
  • Margin of Error (ME): the amount of uncertainty (potential error) in our estimate; it is the half-width of the confidence interval
  • CI = Sample statistic ± Margin of error
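A sketch of a 95% confidence interval for a mean, assuming an approximately normal sampling distribution (z = 1.96); the sample values are made up:

```python
import math
import statistics as st

sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0]
n = len(sample)
x_bar = st.mean(sample)
se = st.stdev(sample) / math.sqrt(n)        # standard error of the mean

margin = 1.96 * se                          # margin of error at 95%
print(f"CI = {x_bar:.2f} ± {margin:.2f} "
      f"= ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```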

Comparison of two populations

Choosing the right statistical test

  1. Comparing means (a code sketch of several of these tests follows this list)
    • Independent t-test: compares the means of two independent groups
      • Data should be normally distributed within each group
      • Variances (spread of data) in the two groups should be similar (use Welch’s t-test if they are not)
    • Paired t-test: compares means within the same group measured at two different times
      • Differences between paired observations should be normally distributed
    • Mann-Whitney U test: a non-parametric test that compares two independent groups when the data is not normally distributed
    • Wilcoxon signed-rank test: a non-parametric test for comparing two related groups
  2. Kruskal-Wallis test
    • An alternative to one-way ANOVA when the data does not meet ANOVA’s assumptions; compares three or more independent groups
  3. Friedman test
    • An alternative to repeated-measures ANOVA, used for comparing more than two related groups
  4. Comparing proportions
    • Chi-square tests: test the association between two categorical variables or the goodness of fit between observed and expected frequencies
      • Chi-square test of independence: tests whether two categorical variables are independent
      • Chi-square goodness-of-fit test: tests whether the observed distribution of a categorical variable matches an expected distribution
    • Fisher’s exact test: used instead of the chi-square test when sample sizes are small
    • Z-test for two proportions: compares the proportions of two independent groups
  5. Comparing variances
    • F-test: compares the variances of two populations
    • Levene’s test: tests for equality of variances when data may not be normally distributed
    • Bartlett’s test: tests for equal variances but is sensitive to deviations from normality
  6. Understanding effect size
    • Cohen’s d: measures the difference between two means in terms of standard deviations
    • Odds ratio: compares the odds of an event occurring in one group to the odds of it occurring in another
    • Relative risk: compares the probability of an event occurring in two different groups
  7. Assumptions and diagnostics
    • Normality
      • Q-Q plots: a visual tool that compares your data distribution to a normal distribution
      • Shapiro-Wilk test: a formal statistical test for normality
    • Homogeneity of variances (some tests assume the variances of the two populations are equal; we can check this using)
      • Levene’s test: more robust when data is not normally distributed
      • Bartlett’s test: assumes normality
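A minimal sketch of a few of the tests above using scipy.stats (assumed installed); the two groups are synthetic normal samples generated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=40)        # group A (synthetic)
b = rng.normal(11.0, 2.0, size=40)        # group B (synthetic)

# Independent t-test; equal_var=False gives Welch's t-test
print("t-test:        ", stats.ttest_ind(a, b, equal_var=False))

# Mann-Whitney U: non-parametric alternative for independent groups
print("Mann-Whitney U:", stats.mannwhitneyu(a, b))

# Shapiro-Wilk: formal normality check on one group
print("Shapiro-Wilk:  ", stats.shapiro(a))

# Levene's test: equality of variances across the two groups
print("Levene:        ", stats.levene(a, b))
```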

Additional key concepts for data analysts

  1. ANOVA (Analysis of Variance): used to compare the means of three or more groups to see if at least one differs significantly
    • One-way ANOVA: compares means across one factor (e.g., comparing the average test scores of students from different classes)
    • Two-way ANOVA: compares means across two factors (e.g., comparing test scores across different classes and teaching methods)
    • Repeated-measures ANOVA: used when the same subjects are measured multiple times under different conditions (e.g., testing the effect of different diets on weight loss in the same group of people)
    • Assumptions:
      • The data should be normally distributed
      • Variances should be equal across groups (homogeneity of variances)
      • Observations are independent
  2. Post-hoc tests: if ANOVA shows a significant difference, post-hoc tests (like Tukey’s HSD) determine which groups differ from each other
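A one-way ANOVA sketch with scipy.stats.f_oneway on invented test scores for three classes (Tukey’s HSD is available as scipy.stats.tukey_hsd in recent SciPy versions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
class_a = rng.normal(70, 8, size=30)      # synthetic test scores
class_b = rng.normal(72, 8, size=30)
class_c = rng.normal(78, 8, size=30)

f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 suggests at least one class mean differs; a post-hoc test
# such as Tukey's HSD then tells us which pairs differ.
```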

Linear and Multiple regression

  1. Regression analysis is used to model the relationship between a dependent variable and one or more independent variables
  2. Linear regression:
    • Simple linear regression: involves one independent variable; it predicts the value of the dependent variable from the independent variable (e.g., predicting a person’s weight based on their height)
      • Equation: Y = a + bX + ε, where Y is the dependent variable, X the independent variable, a the intercept, b the slope, and ε the error term
    • Multiple regression:
      • Y = a + b1·X1 + b2·X2 + … + bn·Xn + ε
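A simple linear regression sketch fitting Y = a + bX by least squares with numpy; the height/weight pairs below are invented for illustration:

```python
import numpy as np

height = np.array([150, 160, 165, 170, 175, 180, 185])   # cm (made up)
weight = np.array([52, 58, 63, 67, 72, 77, 83])          # kg (made up)

b, a = np.polyfit(height, weight, deg=1)   # returns slope, then intercept
print(f"weight ≈ {a:.1f} + {b:.2f} * height")
print("predicted weight at 172 cm:", round(a + b * 172, 1))
```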

Correlation

  • Measures the strength and direction of a linear relationship between two continuous variables
  • Pearson correlation coefficient: ranges from −1 to 1, capturing positive, negative, or no linear relationship
  • Spearman Rank Correlation
    • A non-parametric measure of rank correlation, used when data is not normally distributed or the relationship is not linear
  • Scatter Plots: Useful for visualizing the relationship between two variables before calculating correlation
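A sketch computing both coefficients with scipy.stats (assumed installed) on a small made-up dataset:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])   # ~linear in x

pearson = stats.pearsonr(x, y)       # linear relationship, -1 to 1
spearman = stats.spearmanr(x, y)     # rank-based, monotonic relationship
print("Pearson :", pearson)
print("Spearman:", spearman)
```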

Time series forecasting

  1. Moving average: a smoothing technique that averages data points over a specified window to identify trends
  2. Exponential smoothing: gives more weight to recent observations for forecasting
  3. ARIMA (AutoRegressive Integrated Moving Average): a more complex model used to predict future points in a series
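Minimal sketches of the first two techniques on a toy series (ARIMA needs a dedicated library such as statsmodels and is omitted here):

```python
series = [12, 14, 13, 17, 18, 16, 20, 22, 21, 25]   # toy data

def moving_average(xs, window):
    """Average each value with its (window - 1) predecessors."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def exponential_smoothing(xs, alpha):
    """Simple exponential smoothing; higher alpha weights recent points more."""
    smoothed = [xs[0]]
    for x in xs[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print("3-period MA:", moving_average(series, 3))
print("exp smooth :", [round(v, 1) for v in exponential_smoothing(series, 0.5)])
```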

Multivariate analysis

  • Techniques used to analyze data that involves more than two variables simultaneously
  • Multiple Regression: Explains the relationship between one dependent variable and several independent variables
  • Principal Component Analysis (PCA):
    • Reduces the dimensionality of large datasets while retaining as much variance as possible (reducing the number of variables in a dataset while preserving the essential information)
  • Factor Analysis: Identifies underlying relationships between variables, grouping them into factors
  • Cluster Analysis: Groups similar observations into clusters. Used for market segmentation, customer profiling, etc.
  • Discriminant Analysis: Classifies observations into predefined categories based on predictor variables
  • MANOVA (Multivariate Analysis of Variance): extends ANOVA to compare multiple dependent variables simultaneously
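A PCA sketch with scikit-learn (assumed installed), reducing five correlated synthetic variables to two components while reporting how much variance each retains:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic data: 200 observations of 5 variables driven by 2 hidden factors
base = rng.normal(size=(200, 2))
data = base @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)          # project 5 variables onto 2 components
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", reduced.shape)     # (200, 2)
```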
Venu Kumar M