Statistics for analytics (all in one)

What is statistics?

Statistics is the science of collecting and analyzing data in order to draw conclusions from it

What do we do using statistics?

Descriptive Statistics

We describe the data using statistical measures

  • Collect data
  • Organize data/clean
  • Analyze/summarize data
  • Draw conclusions

Inferential Statistics

We go a step beyond descriptive statistics: using sample data we predict or forecast, test hypotheses, and make decisions about the population

  • Predict/forecast
  • Test hypotheses
  • Draw conclusions

Scales of measurements

  • Nominal scale: categories with no inherent order (e.g., colors, names)
  • Ordinal scale: categories with a meaningful order but unequal spacing (e.g., rankings, ratings)
  • Interval scale: ordered data with equal, meaningful distances between values but no true zero (e.g., temperature in °C)
  • Ratio scale: interval data with a true zero, so ratios and proportions are meaningful (e.g., height, weight)

Samples vs population

Sample (n): portion of population data

Population (N): entire universe of data

Why sample?

  • Convenience
  • Cost
  • Necessity
  • Impossible
  • Not practical

Quartile and Percentile

Quartile:

Ordered data is divided into four equal parts (quarters) by the quartiles Q1, Q2 and Q3

Q1 = Lower quartile = 25th percentile

Q2 = Second quartile = 50th percentile = Median (middle value)

Q3 = Third Quartile = 75th percentile

Percentile

For an ordered set of values, the Pth percentile is the value below which P% of the observations fall

  • Order the data in ascending order
  • Position of the Pth percentile = (n + 1) × P / 100, where n is the number of observations and P is the percentile value

What is the IQR? Interquartile range

IQR = Q3 – Q1

The difference between the third and the first quartile
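A minimal sketch in Python of the (n + 1) × P / 100 position rule and the IQR; numpy is assumed to be available, and its default percentile method uses a slightly different interpolation convention than the rule above:

```python
import numpy as np

data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18])

def percentile_position(data, p):
    """Pth percentile using the (n + 1) * P / 100 position rule."""
    n = len(data)
    pos = (n + 1) * p / 100          # 1-based position, may be fractional
    lo = int(pos)
    frac = pos - lo
    if lo < 1:
        return data[0]
    if lo >= n:
        return data[-1]
    # linear interpolation between the two neighbouring order statistics
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

q1 = percentile_position(data, 25)
q3 = percentile_position(data, 75)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
# numpy's default convention gives slightly different quartiles
print("numpy IQR =", np.percentile(data, 75) - np.percentile(data, 25))
```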

Measures of central tendency

  • Mean (average)
  • Median (middle value)
  • Mode (most frequent)

Measures of variability

  • Variance (Average of the squared difference from the mean)
  • Standard deviation (Sqrt of variance)
  • IQR(Q3-Q1)
  • Range (max-min)

Population variance: the divisor is N, the population size

Sample variance: the divisor is n − 1 instead of n (Bessel’s correction, which reduces the bias of the estimate)
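A quick sketch with Python’s standard statistics module, illustrating the central tendency measures and the N versus n − 1 divisor; the data is invented for illustration:

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean   =", st.mean(data))      # average
print("median =", st.median(data))    # middle value
print("mode   =", st.mode(data))      # most frequent value

# Population variance divides by N; sample variance divides by n - 1
print("population variance =", st.pvariance(data))  # divisor N
print("sample variance     =", st.variance(data))   # divisor n - 1 (Bessel)
print("population std dev  =", st.pstdev(data))
print("range               =", max(data) - min(data))
```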

Other Measures

  • Skewness (Degree of asymmetry)
    • Left skew (tail on the left)
    • Right skew (tail on the right)
    • Symmetric
  • Kurtosis (measure of peakedness/tail weight)
    • Leptokurtic (sharp peak, heavy tails)
    • Mesokurtic (normal)
    • Platykurtic (flat peak, light tails)

Frequency distribution

Frequency: the number of times a value appears in the data

Relative frequency: the frequency as a proportion of the total

Cumulative frequency: the running total of frequencies, adding each row to the previous ones

Value | Frequency | Relative frequency | Cumulative frequency
2     | 2         | 0.4                | 2
1     | 1         | 0.2                | 3
3     | 1         | 0.2                | 4
4     | 1         | 0.2                | 5
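A minimal sketch that rebuilds the table above from the raw data [2, 2, 1, 3, 4] using only the standard library:

```python
from collections import Counter

data = [2, 2, 1, 3, 4]            # the data behind the table above
counts = Counter(data)
total = len(data)

cumulative = 0
print(f"{'Value':>5} {'Freq':>5} {'RelFreq':>8} {'CumFreq':>8}")
for value, freq in counts.most_common():   # most frequent first, as above
    cumulative += freq
    print(f"{value:>5} {freq:>5} {freq/total:>8.1f} {cumulative:>8}")
```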

Relations between the mean and S.D (Empirical rule, Chebyshev’s theorem)

Empirical rule

  • Applies to normal/symmetric distributions
  • 1 S.D from the mean ≈ 68% of the data
  • 2 S.D from the mean ≈ 95% of the data
  • 3 S.D from the mean ≈ 99.7% of the data

Chebyshev’s theorem

  • Applies to any shaped curve
  • At least (1 − 1/k²) of the data lies within k S.D of the mean, for k > 1
  • At least 75% of data lies within 2 S.D
  • At least 89% of data lies within 3 S.D
  • At least 94% of data lies within 4 S.D
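A quick empirical check of the two rules, comparing the observed fraction of data within k standard deviations against Chebyshev’s lower bound; the data is synthetic, drawn from a normal distribution with numpy (assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)   # synthetic, ~normal
mu, sd = data.mean(), data.std()

for k in (2, 3, 4):
    within = np.mean(np.abs(data - mu) <= k * sd)   # observed fraction
    bound = 1 - 1 / k**2                            # Chebyshev's guarantee
    print(f"k={k}: observed {within:.3f}, Chebyshev lower bound {bound:.3f}")
```

For normal data the observed fractions track the empirical rule (0.95, 0.997, …), comfortably above Chebyshev’s distribution-free bounds.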

Set

  • Empty set – a set containing no elements
  • Universal set S – set containing all possible elements
  • Complement (Not) – all elements of S not in A

Intersection (And) – common elements (A ∩ B)

Union (Or) – all elements in A or B or both (A ∪ B)

Probability basics

  • Complement (not A): P(Ā) = 1 − P(A)
  • Intersection (both A and B): P(A ∩ B) = n(A ∩ B)/n(S)
    • where n(S) is the number of elements in the sample space S
    • P(A) = n(A)/n(S); P(B) = n(B)/n(S)
    • n(A ∩ B) = n(A) + n(B) − n(A ∪ B)
  • Mutually exclusive events: P(A ∩ B) = 0 (note: mutually exclusive is not the same as independent; for independent events P(A ∩ B) = P(A) P(B))
  • Union: P(A ∪ B) = P(A) + P(B) − P(A ∩ B); for mutually exclusive events the last term is 0

Conditional Probability

  • Probability of A given B = P(A|B) = P(A ∩ B)/P(B)
  • For independent events, P(A|B) = P(A) and P(A ∩ B) = P(A) P(B)
  • The union rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) still applies
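A minimal sketch estimating these probabilities from raw event data; the events below are made up purely for illustration:

```python
# Each set records which of the events A, B occurred in one trial.
events = [
    {"A", "B"}, {"A"}, {"B"}, {"A", "B"}, {"B"}, set(), {"A", "B"}, {"B"},
]
n = len(events)

p_a = sum("A" in e for e in events) / n
p_b = sum("B" in e for e in events) / n
p_ab = sum("A" in e and "B" in e for e in events) / n

print(f"P(A) = {p_a:.3f}, P(B) = {p_b:.3f}, P(A and B) = {p_ab:.3f}")
print(f"P(A | B)  = {p_ab / p_b:.3f}")              # conditional probability
print(f"P(A or B) = {p_a + p_b - p_ab:.3f}")        # inclusion-exclusion
print("independent?", abs(p_ab - p_a * p_b) < 1e-9) # P(AB) == P(A)P(B)?
```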

Contingency Table

Counts      | TCS | Infosys | Total
Marketing   |  40 |      10 |    50
Finance     |  20 |      30 |    50
Total       |  60 |      40 |   100

Probability | TCS | Infosys | Total
Marketing   | 0.4 |     0.1 |   0.5
Finance     | 0.2 |     0.3 |   0.5
Total       | 0.6 |     0.4 |     1

P(TCS | Marketing) = P(TCS ∩ Marketing) / P(Marketing) = 0.4/0.5 = 80%

Product rules for independent events

  • The probability of the intersection of several independent events is the product of their separate individual probabilities
    • P(A1 ∩ A2 ∩ A3 ∩ A4 ∩… An) = P(A1) P(A2) … P(An)
  • The probability of the union of several independent events  is 1 minus the product of probabilities of their complements
    • P(A1 ∪ A2 ∪ A3 ∪ A4 ∪… An) = 1 – (P(A1 bar) P(A2 bar) … P(An bar))

Factorial

  • 7! =  1 × 2 × 3 × 4 × 5 × 6 × 7

Permutation

  • The number of ordered arrangements of r items chosen from a set of n
  • P(n,r) = n! ÷ (n-r)!

Combination

  • The number of ways of selecting r items from a collection of n when order does not matter
  • nCr = n! / (r! * (n-r)!)
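The standard library covers all three directly (math.perm and math.comb require Python 3.8 or newer):

```python
import math

print(math.factorial(7))    # 5040 = 1*2*3*4*5*6*7
print(math.perm(5, 2))      # 20 ordered arrangements: n!/(n-r)!
print(math.comb(5, 2))      # 10 unordered selections: n!/(r!(n-r)!)
```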

The law of total probability

  • P(A) = P(A ∩ B) + P(A ∩ B̄)
  • More generally, conditioning on a partition {Bᵢ}:
    • P(A) = Σᵢ P(A ∩ Bᵢ) = Σᵢ P(A|Bᵢ) P(Bᵢ)

Bayes’ theorem

  • Bayes’ theorem lets us reverse a conditional probability: knowing P(A|B) plus the marginal probabilities, we can find P(B|A)
  • P(A|B) = P(B|A)P(A) ÷ P(B)
  • P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
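A small worked sketch of Bayes’ theorem with made-up numbers: a test for a condition affecting 1% of a population, with a 95% true-positive rate and a 10% false-positive rate:

```python
p_condition = 0.01              # prior P(condition)
p_pos_given_condition = 0.95    # sensitivity
p_pos_given_healthy = 0.10      # false-positive rate

# Law of total probability gives the denominator P(positive)
p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1 - p_condition))

p_condition_given_pos = p_pos_given_condition * p_condition / p_positive
print(f"P(condition | positive) = {p_condition_given_pos:.3f}")  # ~0.088
```

Even with a fairly accurate test, the low prior keeps the posterior probability under 9%, which is exactly the kind of reversal Bayes’ theorem captures.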

Joint probability table

  • Joint probability table is similar to a contingency table, except that it has probabilities in place of frequencies
Dividing each cell of the contingency table shown earlier by the grand total (100) gives the joint probability table:

Probability | TCS | Infosys | Total
Marketing   | 0.4 |     0.1 |   0.5
Finance     | 0.2 |     0.3 |   0.5
Total       | 0.6 |     0.4 |     1

Row conditional probabilities

  • Represent the likelihood of an employee being in a specific company given the department
  • P(Company | Department) = P(Company and Department) / P(Department)
  • P(TCS | Marketing) = 0.4 / 0.5 = 0.8
  • P(Infosys | Marketing) = 0.1 / 0.5 = 0.2
  • P(TCS | Finance) = 0.2 / 0.5 = 0.4
  • P(Infosys | Finance) = 0.3 / 0.5 = 0.6

Column conditional probabilities

  • Represent the likelihood of an employee being in a specific department given the company
  • P(Department | Company) = P(Company and Department) / P(Company)
  • P(Marketing | TCS) = 0.4 / 0.6 ≈ 0.67
  • P(Finance | TCS) = 0.2 / 0.6 ≈ 0.33
  • P(Marketing | Infosys) = 0.1 / 0.4 = 0.25
  • P(Finance | Infosys) = 0.3 / 0.4 = 0.75
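A minimal sketch computing all of the row and column conditionals from the counts table above, using plain dictionaries:

```python
# Counts from the contingency table: (department, company) -> employees
counts = {
    ("Marketing", "TCS"): 40, ("Marketing", "Infosys"): 10,
    ("Finance",   "TCS"): 20, ("Finance",   "Infosys"): 30,
}

dept_total = {d: sum(v for (dd, _), v in counts.items() if dd == d)
              for d in ("Marketing", "Finance")}
comp_total = {c: sum(v for (_, cc), v in counts.items() if cc == c)
              for c in ("TCS", "Infosys")}

for (dept, comp), n in counts.items():
    print(f"P({comp} | {dept}) = {n / dept_total[dept]:.2f}   "
          f"P({dept} | {comp}) = {n / comp_total[comp]:.2f}")
```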

Binomial distribution

  • Gives the probability of a specific number of successes (x) within a fixed number of independent trials (n), each with the same success probability
  • P(x successes in n trials) = nCx * p^x * (1-p)^(n-x)
    • P(x successes in n trials) represents the probability of getting x successes
    • nCx (n choose x) is the number of ways to choose x successes out of n trials
    • p is the probability of success in each trial (between 0 and 1).
    • (1-p) is the probability of failure in each trial (1 minus the probability of success)
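A direct translation of the formula into Python, applied to a coin-flip example:

```python
import math

def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success prob p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. probability of exactly 3 heads in 10 fair coin flips
print(f"{binomial_pmf(3, 10, 0.5):.4f}")   # 0.1172
```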

Poisson distribution

  • The Poisson distribution models the number of events occurring in a specific interval, given a known average rate
  • Being a discrete distribution, it is described by a probability mass function (PMF)
  • P(x events) = (e^(-λ) * λ^x) / x!
    • e – mathematical constant or Euler’s number (approximately 2.71828)
    • λ (lambda) – average rate of events
    • x – number of events (that we are interested in)
    • x! – factorial of x
  • A bakery receives an average of 5 customers per hour (λ = 5). What’s the probability of receiving exactly 3 customers in the next hour?
    • λ = 5 (average customers per hour)
    • x = 3 (number of customers we’re interested in)
    • P(x events) = (e^(-λ) * λ^x) / x!
    • P(3 events) = (e^(-5) × 5^3) / 3! ≈ 0.1404

So, there is roughly a 14% chance of exactly 3 customers visiting in the next hour
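The bakery example verified directly from the formula:

```python
import math

def poisson_pmf(x, lam):
    """P(exactly x events) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

print(f"{poisson_pmf(3, 5):.4f}")   # 0.1404: exactly 3 customers, λ = 5
```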

Normal distribution

  • Bell-shaped and symmetric distributions
  • Because the distribution is symmetric, one-half (.50 or 50%) lies on either side of the mean

Central limit theorem

  • Regardless of the shape of the original data distribution, as sample size increases, the distribution of sample means becomes more normal
  • The mean of the sample means is equal to the population mean
  • The standard deviation of the sample means (standard error) decreases as sample size increases
  • A sample size of 30 is often considered sufficient for the CLT to apply
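A simulation sketch of the CLT using numpy (assumed available): the population is deliberately skewed, yet the means of larger samples concentrate around the population mean with shrinking standard error:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=1_000_000)  # heavily skewed

for n in (2, 30, 200):
    # distribution of the means of 10,000 samples of size n
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean of sample means = {sample_means.mean():.3f}, "
          f"standard error = {sample_means.std():.3f} "
          f"(theory: {population.std()/np.sqrt(n):.3f})")
```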

Z score

  • The empirical rule gives only approximate percentages for a normal distribution; the z-score lets us find an exact percentile
  • It tells us how many standard deviations a particular data point lies above or below the mean
  • z = (x – μ) / σ; x- specific point, μ is mean and σ is S.D
  • Z is +ve: data point is above the mean
  • Z is -ve: data point is below the mean
  • Z is 0: data point is on the mean
  • The percentile can be read from the z-table using the z value
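A minimal sketch computing a z-score and its percentile without a printed z-table, using the standard normal CDF via math.erf; the point, mean, and S.D below are invented:

```python
import math

def z_score(x, mu, sigma):
    return (x - mu) / sigma

def normal_cdf(z):
    """Percentile of a z value under the standard normal (replaces the z-table)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(x=75, mu=60, sigma=10)        # 1.5 S.D above the mean
print(f"z = {z:.2f}, percentile = {normal_cdf(z):.4f}")   # ~0.9332
```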

Student’s T test/ distribution

  • In real-life problems we typically do not know the population standard deviation (σ)
  • Since we cannot compute a z-score without σ, we use Student’s t-test instead
  • The inputs are: μ = hypothesized population mean, n = sample size, x̄ = sample mean, α (alpha) = level of significance (e.g., 0.05), and s = sample standard deviation (calculated from the sample data), which replaces σ
  • Next we calculate the t-score
    • t = (x̄ − μ) / (s / √n)
    • Degrees of freedom = n − 1 (used to look up the critical value in the t-table)
    • Decide on a one-sided or a two-sided test
    • Compare t against the critical value from the t-table
      • If |t| < critical value, we fail to reject the null hypothesis
      • If |t| ≥ critical value, we reject the null hypothesis
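A one-sample t-test sketch with invented data, computing the t-score by hand; for the p-value route, scipy.stats.ttest_1samp performs the same test:

```python
import math
import statistics as st

# Is the true mean 50? (sample values are made up for illustration)
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 47.6]
mu0 = 50.0                                  # hypothesized population mean

n = len(sample)
x_bar = st.mean(sample)
s = st.stdev(sample)                        # sample std dev (n - 1 divisor)

t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f}, degrees of freedom = {n - 1}")
# Compare |t| with the t-table critical value (about 2.365 for df = 7,
# two-sided, alpha = 0.05) to decide whether to reject H0.
```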

Chebyshev’s theorem (worked example)

  • For any distribution, regardless of shape, the proportion of data lying within k standard deviations of the mean (k > 1) is at least 1 − 1/k²
  • The mean time in a women’s 400-meter dash is 52.4 seconds with a standard deviation of 2.2 sec. Apply Chebychev’s theorem for k = 2
  • At least 75% of the women’s 400-meter dash times will fall between 48 and 56.8 seconds

Hypothesis testing

  • A hypothesis is an educated guess or a tentative statement about the relationship between two or more variables
  • Null hypothesis (H0): there is no relationship or difference between the variables being studied
  • Alternative hypothesis (H1): the opposite of the null hypothesis; there is a specific relationship or difference between the variables
  • We either fail to reject the null hypothesis or reject it
  • We reject the null hypothesis when
    • α (alpha) is set at a low value, commonly 0.05 (5%) or 0.01 (1%)
    • the p-value is less than α; we then accept the alternative hypothesis
Possible hypothesis combinations

Null hypothesis (H0)  | Alternative hypothesis (H1)
H0: p = (two-tailed)  | H1: p ≠ (two-tailed)
H0: p ≤ (one-tailed)  | H1: p > (one-tailed)
H0: p ≥ (one-tailed)  | H1: p < (one-tailed)

Errors

What if we accept the wrong hypothesis?

That is an error, and there are two kinds:

Types of error – Type I & Type II errors

            | H0 is true                                    | H0 is false
H0 rejected | Type I error (α – alpha)                      | No error = true positive, probability = 1 − β
H0 accepted | No error = true negative, probability = 1 − α | Type II error (β – beta)

Note: for a fixed sample size, α and β are inversely related (lowering one raises the other)

Ways to reduce Type I & II errors:

  • Set the level of significance appropriately (commonly ~5%); lowering α reduces Type I errors but increases β
  • Increase the sample size of the test, which reduces both error types

Power of test = 1 – β

Confidence interval and Margin of error

  • Confidence Interval (CI): the range of values that is likely to contain the true population parameter (mean, proportion) at a certain level of confidence
  • Margin of Error (ME): the amount of uncertainty (potential error) in our estimate; it is the half-width of the confidence interval
  • CI = Sample statistic ± Margin of error
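A sketch of a 95% confidence interval for a mean, assuming an approximately normal sampling distribution (z = 1.96); the sample values are made up:

```python
import math
import statistics as st

sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0]
n = len(sample)
x_bar = st.mean(sample)
se = st.stdev(sample) / math.sqrt(n)        # standard error of the mean

margin = 1.96 * se                          # margin of error at 95%
print(f"CI = {x_bar:.2f} ± {margin:.2f} "
      f"= ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```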

Comparison of two populations

Choosing the right statistical test

  1. Comparing means (a code sketch of several of these tests follows this list)
    • Independent t-test: compares the means of two independent groups
      • Data should be normally distributed within each group
      • Variances (spread of data) in the two groups should be similar (use Welch’s t-test if they are not)
    • Paired t-test: compares means within the same group measured at two different times
      • Differences between paired observations should be normally distributed
    • Mann-Whitney U test: a non-parametric test that compares two independent groups when the data is not normally distributed
    • Wilcoxon signed-rank test: a non-parametric test for comparing two related groups
  2. Kruskal-Wallis test
    • An alternative to one-way ANOVA when the data does not meet ANOVA’s assumptions; compares three or more independent groups
  3. Friedman test
    • An alternative to repeated-measures ANOVA, used for comparing more than two related groups
  4. Comparing proportions
    • Chi-square tests: test the association between two categorical variables or the goodness of fit between observed and expected frequencies
      • Chi-square test of independence: tests whether two categorical variables are independent
      • Chi-square goodness-of-fit test: tests whether the observed distribution of a categorical variable matches an expected distribution
    • Fisher’s exact test: used instead of the chi-square test when sample sizes are small
    • Z-test for two proportions: compares the proportions of two independent groups
  5. Comparing variances
    • F-test: compares the variances of two populations
    • Levene’s test: tests for equality of variances when data may not be normally distributed
    • Bartlett’s test: tests for equal variances but is sensitive to deviations from normality
  6. Understanding effect size
    • Cohen’s d: measures the difference between two means in terms of standard deviations
    • Odds ratio: compares the odds of an event occurring in one group to the odds of it occurring in another
    • Relative risk: compares the probability of an event occurring in two different groups
  7. Assumptions and diagnostics
    • Normality
      • Q-Q plots: a visual tool that compares your data distribution to a normal distribution
      • Shapiro-Wilk test: a formal statistical test for normality
    • Homogeneity of variances (some tests assume the variances of the two populations are equal; we can check this using)
      • Levene’s test: more robust when data is not normally distributed
      • Bartlett’s test: assumes normality
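A minimal sketch of a few of the tests above using scipy.stats (assumed installed); the two groups are synthetic normal samples generated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=40)        # group A (synthetic)
b = rng.normal(11.0, 2.0, size=40)        # group B (synthetic)

# Independent t-test; equal_var=False gives Welch's t-test
print("t-test:        ", stats.ttest_ind(a, b, equal_var=False))

# Mann-Whitney U: non-parametric alternative for independent groups
print("Mann-Whitney U:", stats.mannwhitneyu(a, b))

# Shapiro-Wilk: formal normality check on one group
print("Shapiro-Wilk:  ", stats.shapiro(a))

# Levene's test: equality of variances across the two groups
print("Levene:        ", stats.levene(a, b))
```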

Additional key concepts for data analysts

  1. ANOVA (Analysis of Variance): used to compare the means of three or more groups to see if at least one differs significantly
    • One-way ANOVA: compares means across one factor (e.g., comparing the average test scores of students from different classes)
    • Two-way ANOVA: compares means across two factors (e.g., comparing test scores across different classes and teaching methods)
    • Repeated-measures ANOVA: used when the same subjects are measured multiple times under different conditions (e.g., testing the effect of different diets on weight loss in the same group of people)
    • Assumptions:
      • The data should be normally distributed
      • Variances should be equal across groups (homogeneity of variances)
      • Observations are independent
  2. Post-hoc tests: if ANOVA shows a significant difference, post-hoc tests (like Tukey’s HSD) determine which groups differ from each other
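A one-way ANOVA sketch with scipy.stats.f_oneway on invented test scores for three classes (Tukey’s HSD is available as scipy.stats.tukey_hsd in recent SciPy versions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
class_a = rng.normal(70, 8, size=30)      # synthetic test scores
class_b = rng.normal(72, 8, size=30)
class_c = rng.normal(78, 8, size=30)

f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 suggests at least one class mean differs; a post-hoc test
# such as Tukey's HSD then tells us which pairs differ.
```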

Linear and Multiple regression

  1. Regression analysis is used to model the relationship between a dependent variable and one or more independent variables
  2. Linear regression:
    • Simple linear regression: involves one independent variable; it predicts the value of the dependent variable from the independent variable (e.g., predicting a person’s weight based on their height)
      • Equation: Y = a + bX + ε, where Y is the dependent variable, X the independent variable, a the intercept, b the slope, and ε the error term
    • Multiple regression:
      • Y = a + b1·X1 + b2·X2 + … + bn·Xn + ε
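A simple linear regression sketch fitting Y = a + bX by least squares with numpy; the height/weight pairs below are invented for illustration:

```python
import numpy as np

height = np.array([150, 160, 165, 170, 175, 180, 185])   # cm (made up)
weight = np.array([52, 58, 63, 67, 72, 77, 83])          # kg (made up)

b, a = np.polyfit(height, weight, deg=1)   # returns slope, then intercept
print(f"weight ≈ {a:.1f} + {b:.2f} * height")
print("predicted weight at 172 cm:", round(a + b * 172, 1))
```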

Correlation

  • Measures the strength and direction of a linear relationship between two continuous variables
  • Pearson correlation coefficient: ranges from −1 to 1, capturing positive, negative, or no linear relationship
  • Spearman Rank Correlation
    • A non-parametric measure of rank correlation, used when data is not normally distributed or the relationship is not linear
  • Scatter Plots: Useful for visualizing the relationship between two variables before calculating correlation
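A sketch computing both coefficients with scipy.stats (assumed installed) on a small made-up dataset:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])   # ~linear in x

pearson = stats.pearsonr(x, y)       # linear relationship, -1 to 1
spearman = stats.spearmanr(x, y)     # rank-based, monotonic relationship
print("Pearson :", pearson)
print("Spearman:", spearman)
```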

Time series forecasting

  1. Moving average: a smoothing technique that averages data points over a specified window to identify trends
  2. Exponential smoothing: gives more weight to recent observations for forecasting
  3. ARIMA (AutoRegressive Integrated Moving Average): a more complex model used to predict future points in a series
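Minimal sketches of the first two techniques on a toy series (ARIMA needs a dedicated library such as statsmodels and is omitted here):

```python
series = [12, 14, 13, 17, 18, 16, 20, 22, 21, 25]   # toy data

def moving_average(xs, window):
    """Average each value with its (window - 1) predecessors."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def exponential_smoothing(xs, alpha):
    """Simple exponential smoothing; higher alpha weights recent points more."""
    smoothed = [xs[0]]
    for x in xs[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print("3-period MA:", moving_average(series, 3))
print("exp smooth :", [round(v, 1) for v in exponential_smoothing(series, 0.5)])
```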

Multivariate analysis

  • Techniques used to analyze data that involves more than two variables simultaneously
  • Multiple Regression: Explains the relationship between one dependent variable and several independent variables
  • Principal Component Analysis (PCA):
    • Reduces the dimensionality of large datasets while retaining as much variance as possible (reducing the number of variables in a dataset while preserving the essential information)
  • Factor Analysis: Identifies underlying relationships between variables, grouping them into factors
  • Cluster Analysis: Groups similar observations into clusters. Used for market segmentation, customer profiling, etc.
  • Discriminant Analysis: Classifies observations into predefined categories based on predictor variables
  • MANOVA (Multivariate Analysis of Variance): extends ANOVA to compare multiple dependent variables simultaneously
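A PCA sketch with scikit-learn (assumed installed), reducing five correlated synthetic variables to two components while reporting how much variance each retains:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic data: 200 observations of 5 variables driven by 2 hidden factors
base = rng.normal(size=(200, 2))
data = base @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)          # project 5 variables onto 2 components
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", reduced.shape)     # (200, 2)
```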
Venu Kumar M