What is statistics?
Statistics is the science of collecting, organizing and analyzing data in order to draw conclusions from it
What do we do using statistics?
Descriptive Statistics
We describe the data using statistical measures
- Collect data
- Organize data/clean
- Analyze/summarize data
- Draw conclusions
Inferential Statistics
Building on descriptive statistics, we generalize from a sample to a population: we predict/forecast, test hypotheses and make decisions
- Predict/forecast
- Test hypothesis
- Draw conclusions
Scales of measurements
- Nominal scale: Categories with no inherent order (e.g., colours, departments)
- Ordinal scale: Categories with a meaningful order but unequal gaps between them (e.g., ratings)
- Interval scale: Ordered data with equal intervals between values but no true zero (e.g., temperature in °C)
- Ratio scale: Ordered data with equal intervals and a true zero, so ratios/proportions are meaningful (e.g., height, weight)
Samples vs population
Sample (n): portion of population data
Population (N): entire universe of data
Why sample?
- Convenience
- Cost
- Necessity
- Impossible
- Not practical
Quartile and Percentile
Quartile:
The ordered data is divided into 4 equal parts; the cut points are Q1, Q2 and Q3
Q1 = Lower quartile = 25th percentile
Q2 = Second quartile = 50th percentile = Median (middle value)
Q3 = Third quartile = 75th percentile
Percentile
For an ordered set of values, the Pth percentile is the value below which P% of the values fall
- Order data in ascending order
- (n + 1)P/100 = Position of Pth percentile
Where n is the number of data values
P is the percentile value
What is IQR? Inter quartile range
IQR = Q3 – Q1;
Difference between the third and the first quartile
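A minimal sketch of quartiles, percentile positions and the IQR in Python (the data values are hypothetical):

```python
import numpy as np

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # hypothetical sample

# Quartiles: the 25th, 50th (median) and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Position of the Pth percentile in the ordered data, using (n + 1) * P / 100
def percentile_position(n, p):
    return (n + 1) * p / 100

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", q3 - q1)
print("Position of 25th percentile:", percentile_position(len(data), 25))
```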
Measures of central tendency
- Mean (average of all values)
- Median (middle value of the ordered data)
- Mode (most frequent value)
Measures of variability
- Variance (Average of the squared difference from the mean)
- Standard deviation (Sqrt of variance)
- IQR(Q3-Q1)
- Range (max-min)
Population variance: divide by N, the population size
Sample variance: divide by n – 1 instead of n (Bessel’s correction, which reduces the bias of the estimate)
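A minimal sketch of the two variance formulas with NumPy (the sample values are hypothetical):

```python
import numpy as np

sample = [4, 8, 6, 5, 3, 7]  # hypothetical data

# Population variance: divide by N (ddof=0, NumPy's default)
pop_var = np.var(sample, ddof=0)

# Sample variance: divide by n - 1 (Bessel's correction, ddof=1)
samp_var = np.var(sample, ddof=1)

print("Population variance:", pop_var)
print("Sample variance:", samp_var)
print("Standard deviations:", np.sqrt(pop_var), np.sqrt(samp_var))
```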
Other Measures
- Skewness (Degree of asymmetry)
- Left skew (negative skew: the tail extends to the left)
- Right skew (positive skew: the tail extends to the right)
- Symmetric (no skew)
- Kurtosis (measure of peakedness / tail weight)
- Leptokurtic (sharp peak, heavy tails)
- Mesokurtic (like the normal distribution)
- Platykurtic (flat peak, light tails)
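A minimal sketch computing skewness and excess kurtosis with scipy.stats (the data values are hypothetical):

```python
from scipy.stats import skew, kurtosis

data = [2, 3, 3, 4, 4, 4, 5, 5, 9, 12]  # hypothetical, right-skewed values

print("Skewness:", skew(data))             # > 0 suggests a right (positive) skew
print("Excess kurtosis:", kurtosis(data))  # ~0 mesokurtic, > 0 leptokurtic, < 0 platykurtic
```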
Frequency distribution
Frequency: number of times a value appears in the data
Relative frequency: frequency as a proportion of the total number of values
Cumulative frequency: running total of frequencies (each row’s frequency added to the previous cumulative value)
Value | Frequency | Relative frequency | Cumulative frequency |
2 | 2 | 0.4 | 2 |
1 | 1 | 0.2 | 3 |
3 | 1 | 0.2 | 4 |
4 | 1 | 0.2 | 5 |
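A minimal sketch that builds the same frequency table in Python:

```python
from collections import Counter

values = [2, 2, 1, 3, 4]  # the data behind the table above

counts = Counter(values)
n = len(values)

cumulative = 0
print("Value | Frequency | Relative frequency | Cumulative frequency")
for value, freq in counts.most_common():
    cumulative += freq
    print(f"{value} | {freq} | {freq / n:.1f} | {cumulative}")
```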
Relations between the mean and S.D (Empirical rule, Chebyshev’s theorem)
Empirical rule
- Applies to normal distribution/symmetric
- 1 S.D from mean ≈ 68% of data
- 2 S.D from mean ≈ 95% of data
- 3 S.D from mean ≈ 99.7% of data
Chebyshev’s theorem
- Applies to any shaped curve
- At least a fraction 1 − 1/k² of the elements lie within k S.D of the mean (for k > 1)
- At least 75% of data lies within 2 S.D
- At least 89% of data lies within 3 S.D
- At least 94% of data lies within 4 S.D
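A minimal sketch that reproduces the Chebyshev percentages above:

```python
# Chebyshev's bound: at least 1 - 1/k^2 of the data lies within k S.D. of the mean
for k in (2, 3, 4):
    print(f"k = {k}: at least {1 - 1 / k**2:.0%} of the data within {k} S.D. of the mean")
```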
Set
- Empty set ∅ – a set containing no elements
- Universal set S – set containing all possible elements
- Complement (Not) – all elements of S not in A
- Intersection (And) – common elements (A ∩ B)
- Union (Or) – all elements of A and B (A ∪ B)
Probability basics
- Complements – Probability of not A = P(A bar)= 1-P(A)
- Intersection – Probability of both A and B = P(A ⋂ B) = n(A ⋂ B)/n(S);
- Where, n(S) is number of elements in set S
- P(A) = n(A)/n(S); P(B) = n(B)/n(S)
- n(A∩B) = n(A) + n(B) – n(A U B)
- Mutually exclusive (disjoint) events: P(A ∩ B) = 0; note this is not the same as independence, where P(A ∩ B) = P(A)·P(B)
- Union: P(A ∪ B) = P(A) + P(B) – P(A ∩ B); for mutually exclusive events this reduces to P(A) + P(B)
Conditional Probability
- Probability of A given B = P(A|B) = P(A ∩ B)/P(B)
- For independent events, P(A|B) = P(A),
- and P(A ∩ B) = P(A)·P(B), so P(A ∪ B) = P(A) + P(B) – P(A)·P(B)
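A minimal sketch of these rules in Python, using a single fair die as the sample space:

```python
# Sample space: rolling one fair die
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even numbers
B = {4, 5, 6}   # numbers greater than 3

def p(event):
    return len(event) / len(S)

p_A, p_B = p(A), p(B)
p_A_and_B = p(A & B)                 # intersection
p_A_or_B = p_A + p_B - p_A_and_B     # union (addition rule)
p_A_given_B = p_A_and_B / p_B        # conditional probability
p_not_A = 1 - p_A                    # complement

print(p_A, p_B, p_A_and_B, p_A_or_B, p_A_given_B, p_not_A)
```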
Contingency Table
Counts | TCS | Infosys | Total |
Marketing | 40 | 10 | 50 |
Finance | 20 | 30 | 50 |
Total | 60 | 40 | 100 |
Probability | TCS | Infosys | Total |
Marketing | 0.4 | 0.1 | 0.5 |
Finance | 0.2 | 0.3 | 0.5 |
Total | 0.6 | 0.4 | 1 |
P (TCS |Marketing) = P (TCS ∩ Marketing)/P (Marketing) = 0.4/0.5 = 80%
Product rules for independent events
- The probability of the intersection of several independent events is the product of their separate individual probabilities
- P(A1 ∩ A2 ∩ A3 ∩ A4 ∩… An) = P(A1) P(A2) … P(An)
- The probability of the union of several independent events is 1 minus the product of probabilities of their complements
- P(A1 ∪ A2 ∪ A3 ∪ A4 ∪… An) = 1 – (P(A1 bar) P(A2 bar) … P(An bar))
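A minimal sketch of both product rules in Python (the event probabilities are hypothetical):

```python
import math

# Hypothetical independent events, e.g. three machines each failing independently
probs = [0.1, 0.2, 0.05]

p_all = math.prod(probs)                      # P(A1 ∩ A2 ∩ A3)
p_any = 1 - math.prod(1 - p for p in probs)   # P(A1 ∪ A2 ∪ A3)

print("P(all occur):", p_all)
print("P(at least one occurs):", p_any)
```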
Factorial
- 7! = 1 × 2 × 3 × 4 × 5 × 6 × 7
Permutation
- Number of ways in which a particular set of data can be arranged
- P(n,r) = n! ÷ (n-r)!
Combination
- Find the number of ways of selecting items from a collection
- nCr = n! / (r! * (n-r)!)
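A minimal sketch of these counts using Python’s math module:

```python
import math

print(math.factorial(7))   # 7! = 5040
print(math.perm(5, 2))     # P(5, 2) = 5!/(5-2)! = 20
print(math.comb(5, 2))     # 5C2 = 5!/(2! * 3!) = 10
```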
The law of total probability
- P(A) = P(A ∩ B) + P(A ∩ B bar)
- In terms of conditional probabilities, for a partition {Bi}:
- P(A) = Σi P(A ∩ Bi) = Σi P(A|Bi)·P(Bi)
Bayes’ theorem
- Bayes’ theorem lets us reverse a conditional probability: knowing P(B|A) and the marginal probabilities, we can find P(A|B)
- P(A|B) = P(B|A)·P(A) ÷ P(B)
- For a partition {Ei}: P(Ei|A) = P(Ei)·P(A|Ei) / Σk P(Ek)·P(A|Ek)
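A minimal sketch of Bayes’ theorem in Python, using a hypothetical diagnostic-test example (all probabilities are made up for illustration):

```python
# Hypothetical disease-test example
p_disease = 0.01                 # P(D)
p_pos_given_disease = 0.95       # P(+ | D)
p_pos_given_no_disease = 0.05    # P(+ | not D)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ≈ 0.161
```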
Joint probability table
- Joint probability table is similar to a contingency table, except that it has probabilities in place of frequencies
Counts | TCS | Infosys | Total |
Marketing | 40 | 10 | 50 |
Finance | 20 | 30 | 50 |
Total | 60 | 40 | 100 |
Probability | TCS | Infosys | Total |
Marketing | 0.4 | 0.1 | 0.5 |
Finance | 0.2 | 0.3 | 0.5 |
Total | 0.6 | 0.4 | 1 |
Row conditional probabilities
- Represent the likelihood of an employee being in a specific company given the department
- P(Company | Department) = P(Company and Department) / P(Department)
- P(TCS | Marketing) = 0.4 / 0.5 = 0.8
- P(Infosys | Marketing) = 0.1 / 0.5 = 0.2
- P(TCS | Finance) = 0.2 / 0.5 = 0.4
- P(Infosys | Finance) = 0.3 / 0.5 = 0.6
Column conditional probabilities
- Represent the likelihood of an employee being in a specific department given the company
- P(Department | Company) = P(Company and Department) / P(Company)
- P(Marketing | TCS) = 0.4 / 0.6 ≈ 0.67
- P(Finance | TCS) = 0.2 / 0.6 ≈ 0.33
- P(Marketing | Infosys) = 0.1 / 0.4 = 0.25
- P(Finance | Infosys) = 0.3 / 0.4 = 0.75
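A minimal sketch that derives the joint, row-conditional and column-conditional tables from the counts above, using pandas:

```python
import pandas as pd

# Counts from the contingency table above
counts = pd.DataFrame({"TCS": [40, 20], "Infosys": [10, 30]},
                      index=["Marketing", "Finance"])

joint = counts / counts.values.sum()                 # joint probability table
row_cond = counts.div(counts.sum(axis=1), axis=0)    # P(Company | Department)
col_cond = counts.div(counts.sum(axis=0), axis=1)    # P(Department | Company)

print(joint)
print(row_cond)   # each row sums to 1
print(col_cond)   # each column sums to 1
```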
Binomial distribution
- Gives the probability of getting a specific number of successes (x) in a fixed number of independent trials (n), each with success probability p
- P(x successes in n trials) = nCx * p^x * (1-p)^(n-x)
- P(x successes in n trials) represents the probability of getting x successes
- nCx (n choose x) is the number of ways to choose x successes out of n trials
- p is the probability of success in each trial (between 0 and 1).
- (1-p) is the probability of failure in each trial (1 minus the probability of success)
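A minimal sketch of the binomial formula in Python (the coin-toss numbers are hypothetical):

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical example: probability of exactly 3 heads in 10 fair-coin tosses
print(binom_pmf(3, 10, 0.5))  # ≈ 0.117
```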
Poisson distribution
- The Poisson distribution models the number of events occurring in a fixed interval of time or space
- The formula below is its probability mass function (PMF), since the number of events is a discrete count
- P(x events) = (e^(-λ) * λ^x) / x!
- e – mathematical constant or Euler’s number (approximately 2.71828)
- λ (lambda) – average rate of events
- x – number of events (that we are interested)
- x! – factorial of x
- A bakery receives an average of 5 customers per hour (λ = 5). What’s the probability of receiving exactly 3 customers in the next hour?
- λ = 5 (average customers per hour)
- x = 3 (number of customers we’re interested in)
- P(x events) = (e^(-λ) * λ^x) / x!
- P(3 events) = (e^(-5) * 5^3) / 3! ≈ 0.140
So, there is roughly a 14% chance of exactly 3 customers visiting in the next hour
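A minimal sketch of the same Poisson calculation in Python:

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x events when the average rate is lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

print(poisson_pmf(3, 5))  # ≈ 0.140
```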
Normal distribution
- Bell-shaped and symmetric distributions
- Because the distribution is symmetric, one-half (.50 or 50%) lies on either side of the mean
Central limit theorem
- Regardless of the shape of the original data distribution, as sample size increases, the distribution of sample means becomes more normal
- The mean of the sample means is equal to the population mean
- The standard deviation of the sample means (standard error) decreases as sample size increases
- A sample size of 30 is often considered sufficient for the CLT to apply
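A small simulation sketch of the CLT with NumPy, assuming a skewed (exponential) population purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population, far from normal
population = rng.exponential(scale=2.0, size=100_000)

# Distribution of means of samples of size 30
sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

print("Population mean:", population.mean())
print("Mean of sample means:", np.mean(sample_means))       # ≈ population mean
print("Std of sample means:", np.std(sample_means))         # ≈ population std / sqrt(30)
print("Population std / sqrt(30):", population.std() / np.sqrt(30))
```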
Z score
- The empirical rule gives only approximate percentages for a normal distribution; the z score lets us look up a more precise percentile for any data point
- It tells us how many standard deviations a particular data point lies above or below the mean
- z = (x – μ) / σ; x- specific point, μ is mean and σ is S.D
- Z is +ve: data point is above the mean
- Z is -ve: data point is below the mean
- Z is 0: data point is on the mean
- The percentile can be read from a z table using the z value
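A minimal sketch that converts a data point to a z score and looks up its percentile with scipy.stats.norm (the score, mean and S.D are hypothetical):

```python
from scipy.stats import norm

x, mu, sigma = 85, 70, 10   # hypothetical score, mean and S.D

z = (x - mu) / sigma
percentile = norm.cdf(z)    # area to the left of z under the standard normal curve

print("z =", z)                                     # 1.5
print("percentile ≈", round(percentile * 100, 1))   # ≈ 93.3
```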
Student’s T test/ distribution
- In real-life problems we typically do not know the population standard deviation,
- so we cannot calculate σ_population or a Z score; we use Student’s t-distribution (t-test) to handle this situation
- Here we typically have: μ = hypothesised population mean, n = sample size, x̄ = sample mean, α (alpha) = level of significance (e.g., 0.05), s = sample standard deviation (calculated from the sample data), used in place of the population standard deviation
- Next we calculate t score
- t = (x̄ – μ) / (s / √n), where s / √n is the standard error of the mean
- Degrees of freedom = n-1 (used to look critical value in t table)
- Decide on 1 sided test or 2 sided test
- We use t-table to get critical value
- If the t statistic is within the critical value (from the table), we accept (fail to reject) the null hypothesis
- If the t statistic exceeds the critical value (from the table), we reject the null hypothesis
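A minimal sketch of a one-sample, two-sided t-test with scipy.stats.ttest_1samp (the measurements and hypothesised mean are hypothetical):

```python
from scipy import stats

sample = [12.1, 11.8, 12.4, 12.9, 11.5, 12.3, 12.7, 11.9]  # hypothetical measurements
mu = 12.0                                                   # hypothesised population mean

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu)     # two-sided by default

alpha = 0.05
print("t =", t_stat, "p =", p_value)
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```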
Chebyshev’s theorem
- For any distribution, regardless of shape, the portion of data lying within k standard deviations (k > 1) of the mean is at least 1 – 1/k²
- The mean time in a women’s 400-meter dash is 52.4 seconds with a standard deviation of 2.2 sec. Apply Chebyshev’s theorem for k = 2
- At least 75% of the women’s 400-meter dash times will fall between 48 and 56.8 seconds
Hypothesis testing
- Hypothesis is an educated guess or a tentative statement about the relationship between two or more variables
- Null hypothesis (H0): There is no relationship or difference between the variables being studied
- Alternative hypothesis (H1): It’s the opposite of the null hypothesis; there is a specific relationship or difference between the variables
- We either accept (fail to reject) the null hypothesis or reject it
- We reject the null hypothesis when:
- α (alpha) is fixed in advance at a low value, commonly 0.05 (5%) or 0.01 (1%)
- If the p-value is less than α (alpha), we reject the null hypothesis and accept the alternative hypothesis
Possible hypothesis combinations | |
Null hypothesis (H0) | Alternative hypothesis (H1) |
H0: p = (two-tailed) | H1: p ≠ (two-tailed) |
H0: p ≤ (one-tailed) | H1: p > (one-tailed) |
H0: p ≥ (one-tailed) | H1: p < (one-tailed) |
Errors
What if we accept the wrong hypothesis?
That is an error; there are two types of error – Type I & Type II errors
Decision | Null hypothesis (H0) = True | Null hypothesis (H0) = False |
Null hypothesis (H0) = Rejected | Type I Error (α – alpha), false positive | No error = True positive, Probability = 1 – β |
Null hypothesis (H0) = Accepted | No error = True negative, Probability = 1 – α | Type II Error (β – beta), false negative |
Note: α and β trade off against each other (decreasing one tends to increase the other)
Ways to control Type I & II errors:
- Choose a lower level of significance (e.g., 5% or 1%) to limit Type I errors
- Increase the sample size of the test, which reduces Type II errors for a fixed α
Power of test = 1 – β
Confidence interval and Margin of error
- Confidence Interval (CI): Range of values that is likely to contain the true population parameter (mean, proportion) at a given level of confidence
- Margin of Error (ME): The amount of uncertainty / potential error in our estimate; half the width of the confidence interval
- CI = Sample statistic ± Margin of error
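A minimal sketch of a t-based confidence interval for a mean, using scipy (the sample values are hypothetical):

```python
import numpy as np
from scipy import stats

sample = [23, 25, 21, 27, 24, 26, 22, 25, 23, 24]   # hypothetical sample
confidence = 0.95

n = len(sample)
mean = np.mean(sample)
sem = stats.sem(sample)                    # standard error of the mean (s / sqrt(n))

# t critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
margin_of_error = t_crit * sem

print("Margin of error:", margin_of_error)
print("CI:", (mean - margin_of_error, mean + margin_of_error))
```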
Comparison of 2 population
Choosing the right statistical test
- Comparing means
- Independent t-test: Compares the means of two independent groups
- Data should be normally distributed within each group
- Variances (spread of data) in the two groups should be similar (use Welch’s t-test if they are not)
- Paired t-test: Compares means within the same group measured at two different times
- Differences between paired observations should be normally distributed
- Mann-Whitney U test: non-parametric test, compares differences between two independent groups when the data is not normally distributed
- Wilcoxon signed-rank test: A non-parametric test for comparing two related groups
- Kruskal-Wallis test
- An alternative to one-way ANOVA when data does not meet ANOVA assumptions. Compares three or more independent groups
- Friedman test
- An alternative to repeated measures ANOVA, used for comparing more than two related groups
- Comparing proportions
- Chi-Square tests: Test the association between two categorical variables, or the goodness of fit between observed and expected frequencies
- Chi-Square test of independence: Tests whether two categorical variables are independent
- Chi-Square goodness of fit test: Tests whether the observed distribution of a categorical variable matches an expected distribution
- Fisher’s exact test: Used instead of the chi-square test when sample sizes are small
- Z-test for two proportions: Compares the proportions of two independent groups
- Comparing variances
- F-test: Compares the variances of two populations
- Levene’s test: Tests for equality of variances when data may not be normally distributed
- Bartlett’s test: Tests for equal variances but is sensitive to deviations from normality
- Understanding effect size
- Cohen’s d: Measures the difference between two means in terms of standard deviations
- Odds Ratio: Compares the odds of an event occurring in one group to the odds of it occurring in another
- Relative Risk: Compares the probability of an event occurring in two different groups
- Assumptions and diagnostics
- Normality
- Q-Q plots: A visual tool that compares your data distribution to a normal distribution
- Shapiro-Wilk test: A formal statistical test for normality
- Homogeneity of variances: some tests assume that the variances of the two populations are equal; we can check this using:
- Levene’s test: More robust when data is not normally distributed
- Bartlett’s Test: Assumes normality
Additional key concepts for data analysts
- ANOVA (Analysis of variance): Used to compare the means of three or more groups to see if at least one differs significantly
- One-Way ANOVA: Compares means across one factor (e.g., comparing the average test scores of students from different classes)
- Two-Way ANOVA: Compares means across two factors (e.g., comparing test scores across different classes and teaching methods)
- Repeated measures ANOVA: Used when the same subjects are measured multiple times under different conditions (e.g., testing the effect of different diets on weight loss in the same group of people)
- Assumptions:
- The data should be normally distributed
- Variances should be equal across groups (homogeneity of variances)
- Observations are independent
- Post-Hoc Tests: If ANOVA shows a significant difference, post-hoc tests (like Tukey’s HSD) determine which groups differ from each other
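A minimal sketch of a one-way ANOVA with scipy.stats.f_oneway (the class scores are hypothetical):

```python
from scipy import stats

# Hypothetical test scores for three classes
class_a = [85, 90, 88, 92, 87]
class_b = [78, 82, 80, 85, 79]
class_c = [90, 94, 91, 89, 93]

f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)
print("F =", f_stat, "p =", p_value)
# If p < 0.05, at least one class mean differs; follow up with a post-hoc test such as Tukey's HSD
```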
Linear and Multiple regression
- Regression analysis is used to model the relationship between a dependent variable and one or more independent variables
- Linear regression:
- Simple linear regression: Involves one independent variable. It aims to predict the value of a dependent variable based on the value of an independent variable (e.g., predicting a person’s weight based on their height)
- Equation:
- Y = a + bX + ε, where Y is the dependent variable, X the independent variable, a the intercept, b the slope and ε the error term
- Multiple regression:
- Y = a + b1·X1 + b2·X2 + … + bn·Xn + ε
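A minimal sketch of simple linear regression using scipy.stats.linregress (the height/weight numbers are made up for illustration):

```python
from scipy import stats

# Hypothetical heights (cm) and weights (kg)
heights = [150, 160, 165, 170, 175, 180]
weights = [50, 56, 61, 65, 70, 75]

result = stats.linregress(heights, weights)
print("intercept (a):", result.intercept)
print("slope (b):", result.slope)
print("R squared:", result.rvalue ** 2)

# Predicted weight for a hypothetical height of 168 cm
print("prediction:", result.intercept + result.slope * 168)
```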
Correlation
- Measures the strength and direction of a linear relationship between two continuous variables
- Pearson correlation coefficient: Ranges from -1 to 1, indicating a negative, positive, or no linear relationship
- Spearman Rank Correlation
- A non-parametric measure of rank correlation, used when data is not normally distributed or the relationship is not linear
- Scatter Plots: Useful for visualizing the relationship between two variables before calculating correlation
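A minimal sketch comparing Pearson and Spearman correlation with scipy.stats (the paired values are hypothetical):

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]   # hypothetical paired data
y = [2, 4, 5, 4, 5, 7]

pearson_r, p1 = stats.pearsonr(x, y)      # linear relationship
spearman_r, p2 = stats.spearmanr(x, y)    # monotonic (rank-based) relationship

print("Pearson r:", pearson_r)
print("Spearman rho:", spearman_r)
```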
Time series forecasting
- Moving Average:
- Smoothing technique that averages data points over a specified period to identify trends
- Exponential Smoothing:
- Gives more weight to recent observations for forecasting
- ARIMA (AutoRegressive Integrated Moving Average):
- A more complex model used to predict future points in a series
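A minimal sketch of a simple moving average and simple exponential smoothing (the sales figures are hypothetical; ARIMA would normally be fitted with a dedicated library such as statsmodels):

```python
import numpy as np

sales = [10, 12, 13, 12, 15, 16, 18, 17, 19, 21]   # hypothetical monthly sales

# 3-period simple moving average
window = 3
moving_avg = np.convolve(sales, np.ones(window) / window, mode="valid")
print("Moving average:", moving_avg)

# Simple exponential smoothing: each value is a weighted mix of the newest
# observation and the previous smoothed value
alpha = 0.3
smoothed = [sales[0]]
for value in sales[1:]:
    smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
print("Exponentially smoothed:", smoothed)
```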
Multivariate analysis
- Techniques used to analyze data that involves more than two variables simultaneously
- Multiple Regression: Explains the relationship between one dependent variable and several independent variables
- Principal Component Analysis (PCA):
- Reduces the dimensionality of large datasets while retaining as much variance as possible (reducing the number of variables in a dataset while preserving the essential information)
- Factor Analysis: Identifies underlying relationships between variables, grouping them into factors
- Cluster Analysis: Groups similar observations into clusters. Used for market segmentation, customer profiling, etc.
- Discriminant Analysis: Classifies observations into predefined categories based on predictor variables
- MANOVA (Multivariate Analysis of Variance): Extends ANOVA to compare groups across multiple dependent variables simultaneously
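As a sketch of one of these techniques, a minimal PCA example using scikit-learn (random data stands in for a real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # hypothetical dataset: 100 observations, 5 variables

pca = PCA(n_components=2)            # keep the 2 components that explain the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```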