Probability and Statistics

Statistics

Types of Data

Nominal/Categorical: no ordering
Ordinal: has a ranking/ordering, but differences between values are not mathematically meaningful (e.g. disagree, neutral, agree)
Interval: meaningful differences but no true zero point, so we can’t say that something is e.g. twice as much.
Ratio: Meaningful differences and true zero point.

Summary statistics

Mean and Median

Geometric Mean

\[\overline{x} = \sqrt[n]{x_1 \cdot x_2 \cdot \ldots \cdot x_n}\]

Uses:

In finance to calculate average growth rate
As a filter to reduce image noise
Matthews Correlation Coefficient (MCC) in deep learning
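
As a minimal sketch of the finance use case, assuming yearly growth is expressed as multiplicative factors (the numbers below are made up for illustration), the geometric mean gives the single factor that, compounded over the same number of periods, reproduces the total growth:

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical yearly growth factors: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]
average_factor = geometric_mean(growth_factors)

# Compounding the average factor three times matches the total growth.
total_growth = math.prod(growth_factors)
print(f"average growth factor: {average_factor:.4f}")
print(f"total growth: {total_growth:.4f}, average^3: {average_factor ** 3:.4f}")
```

The arithmetic mean of the same factors would overstate the compounded growth, which is why the geometric mean is preferred here.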

Harmonic Mean

Uses:

As the F1 score, the harmonic mean of precision and recall, a frequently used metric for classifiers in deep learning
- Note: both recall and precision are needed to evaluate deep learning models; neither is enough on its own
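
The notes give no formula for the harmonic mean; the standard definition is the reciprocal of the arithmetic mean of the reciprocals. A small sketch, with hypothetical confusion-matrix counts (`tp`, `fp`, `fn` are illustrative names, not from the source), shows how F1 falls out of it:

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty list of positive values")
    return len(values) / sum(1.0 / v for v in values)


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return harmonic_mean([precision, recall])


# Hypothetical classifier results: 80 true positives,
# 20 false positives, 40 false negatives.
f1 = f1_score(tp=80, fp=20, fn=40)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is dominated by the smaller value, a classifier cannot get a high F1 by doing well on only one of precision or recall.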

Measures of Variation

Mean Deviation

\[MD = \frac{1}{n}\sum_{i=1}^{n}{|x_i-\overline{x}|}\]

Biased Sample Variance

\[s_n^2 = \frac{1}{n}\sum_{i=1}^{n}{(x_i-\overline{x})^2}\]

Unbiased Sample Variance

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\overline{x})^2}\]

Using $n-1$ instead of $n$ is known as Bessel’s correction.
Motivation: The true population variance ($\sigma^2$) is the scatter of the population around the true population mean ($\mu$). However, we don’t know $\sigma^2$ or $\mu$, so we estimate them from the dataset we have (the sample). The mean of the sample, $\overline{x}$, is our estimate for $\mu$. It’s then natural to calculate the mean of the squared deviations around $\overline{x}$ and call that our estimate for $\sigma^2$. That’s $s_n^2$. But a sample scatters less around its own mean $\overline{x}$ than around $\mu$, so $s_n^2$ underestimates $\sigma^2$ on average. Applying Bessel’s correction removes this bias: averaged over many samples, $s^2$ equals the true population variance.
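
The claim can be checked with a small simulation. This is a sketch under assumed settings (a standard normal population, samples of size 5, and a fixed seed chosen for repeatability), using only Python’s standard library:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 5      # small samples make the bias easy to see
NUM_TRIALS = 20000   # number of repeated samples

biased_estimates = []
unbiased_estimates = []
for _ in range(NUM_TRIALS):
    # Population is a standard normal, so the true variance is 1.0.
    sample = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    biased_estimates.append(statistics.pvariance(sample))   # divides by n
    unbiased_estimates.append(statistics.variance(sample))  # divides by n - 1

biased_avg = statistics.fmean(biased_estimates)
unbiased_avg = statistics.fmean(unbiased_estimates)
print(f"average estimate dividing by n:     {biased_avg:.3f}")
print(f"average estimate dividing by n - 1: {unbiased_avg:.3f}")
```

Dividing by $n$ should average out near $(n-1)/n = 0.8$ here, while dividing by $n-1$ should average out near the true variance of 1.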

Missing Data

In many cases, if so much data is missing that dropping the incomplete rows would bias the dataset, the safe thing to do is to replace the missing values with the mean or the median of the feature. We need to consult the descriptive statistics, a histogram, or a box plot to decide which.
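
A minimal sketch of both options, assuming missing entries are encoded as `None` (the helper name `impute` and the data are hypothetical):

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical feature with two missing entries and one outlier (100.0).
feature = [2.0, 3.0, None, 4.0, 100.0, None]
mean_filled = impute(feature, "mean")
median_filled = impute(feature, "median")
print("mean-filled:  ", mean_filled)
print("median-filled:", median_filled)
```

The outlier drags the mean far from the typical values, so for a skewed feature like this one the median is usually the more representative fill value.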

Correlation

The association between the features in a dataset
Positive: if one goes up, the other tends to go up as well
Negative: if one goes up, the other tends to go down
In traditional ML, highly correlated features were undesirable, as they didn’t add new information
In deep learning, where the network itself learns a new representation of the data, it’s less critical to have uncorrelated features.
- In part, this is why images work well as inputs to deep networks but not to traditional ML models
The Pearson correlation coefficient returns a number $r \in [-1, +1]$, indicating the strength of linear correlation. A correlation of zero means no linear association, and possibly independent features. We say ‘possibly’ because there might be nonlinear dependencies.
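
The ‘possibly’ caveat is easy to demonstrate. The sketch below implements the standard Pearson formula by hand (covariance divided by the product of the standard deviations) and applies it to a perfectly linear and a perfectly quadratic relationship; the data is made up for illustration:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

The quadratic feature is completely determined by `xs`, yet its Pearson correlation with `xs` is zero, because the dependence is symmetric rather than linear.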

Hypothesis testing

To understand if two sets of data are from the same parent distribution or not, we might look into summary statistics.

In hypothesis testing, we have two hypotheses:

The null hypothesis ($H_0$). The two datasets are from the same parent distribution, nothing special to differentiate them.
The alternative hypothesis ($H_a$). The two groups are not from the same distribution.

Hypothesis testing doesn’t tell us definitively whether $H_0$ is true; it only gives us evidence for or against rejecting it.

The t-test

The t-test depends on t, the test statistic. This statistic is compared to the t-distribution and used to generate a p-value, a probability we’ll use to reach a conclusion about the null hypothesis.

The t-test is a parametric test that assumes:

The data is independent and identically distributed (i.i.d), i.e. the data is a fair random sample
The distribution of the data is normal

One suggestion is to use both the t-test and the non-parametric Mann-Whitney U test together to decide whether to reject the null hypothesis. In general, if the non-parametric test shows evidence against the null hypothesis, then we should probably accept that evidence. This process obviously has huge caveats and needs more careful consideration.

The t-test has different versions. Examples:

Welch’s t-test, which does not assume that the variance of the two datasets is the same (unlike Student’s t-test)

The t-score, and an associated variable known as degrees of freedom, select the appropriate t-distribution curve. To get a p-value, we calculate the area under that curve in the tails beyond the t-score.
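
As a sketch of how the t-score and the degrees of freedom are obtained for Welch’s version of the test, the code below uses the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom. The function name and data are illustrative. Turning the result into a p-value needs the t-distribution itself, which is not in Python’s standard library, so that step is left out here (SciPy’s `scipy.stats.ttest_ind(a, b, equal_var=False)` does the whole computation).

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for Welch's two-sample t-test."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Unbiased sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Squared standard error contributed by each group.
    se2_a, se2_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up datasets with different variances.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_score, dof = welch_t(group_a, group_b)
print(f"t = {t_score:.4f}, degrees of freedom = {dof:.4f}")
```

Note that the degrees of freedom are generally not an integer in Welch’s test, which is expected.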

The p-value tells us the probability of seeing the difference between the two means that we observed, or a larger one, if the null hypothesis is true. We typically reject the null hypothesis if the p-value falls below a threshold $\alpha$ we’ve chosen in advance.
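
This definition can be made concrete with a permutation test. Note that this is a different technique from the t-test, swapped in here only because it computes the p-value directly from its definition: if $H_0$ is true, group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference in means is at least as large as the observed one. The function name, parameters, and data are illustrative assumptions.

```python
import random
import statistics


def permutation_p_value(a, b, num_permutations=5000, seed=0):
    """Two-sided p-value for the difference in means via label shuffling."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(statistics.fmean(perm_a) - statistics.fmean(perm_b))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (num_permutations + 1)


# Made-up data: clearly separated groups versus identical groups.
p_far = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
print(f"separated groups: p = {p_far:.4f}")
print(f"identical groups: p = {p_same:.4f}")
```

Clearly separated groups give a small p-value, while identical groups give a p-value of 1, since every shuffle produces a difference at least as large as the observed difference of zero.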

When we reject $H_0$, we say that the difference is statistically significant. A usual (and problematic) threshold is $\alpha = 0.05$. It’s problematic because it’s too generous; $0.001$ could be better. At $p = 0.05$ all we have is a suggestion, and we should repeat the experiment. If repeated experiments all have a p-value $\le 0.05$, then rejecting the null hypothesis might start to make sense.

If the p-value is small, it can have two meanings:

The null hypothesis is false
A random sampling error has given us samples that fall outside what we might expect

Confidence Intervals

The confidence interval gives bounds within which we believe the true population difference in the means will lie. We typically report 95% confidence intervals. Any CI that includes zero signals that we cannot reject the null hypothesis.

The 95% confidence interval is such that if we could draw repeated samples from the distribution that produced the two datasets, 95% of the calculated confidence intervals would contain the true difference between the means. It is not the range that includes the true difference in the means at 95% certainty.
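
The repeated-sampling interpretation can be illustrated with a simulation. This sketch makes several assumptions that are not from the source: normally distributed groups with a known true difference of 0.5, 50 points per group, and a normal (z-based) approximation to the interval rather than the exact t-based one, which is reasonable only for fairly large samples.

```python
import math
import random
import statistics


def diff_of_means_ci(a, b, confidence=0.95):
    """Approximate CI for mean(a) - mean(b), using a normal approximation."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    std_err = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return diff - z * std_err, diff + z * std_err


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_DIFF = 0.5
SAMPLE_SIZE = 50
NUM_TRIALS = 2000

covered = 0
for _ in range(NUM_TRIALS):
    a = [rng.gauss(TRUE_DIFF, 1.0) for _ in range(SAMPLE_SIZE)]
    b = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    low, high = diff_of_means_ci(a, b)
    if low <= TRUE_DIFF <= high:
        covered += 1

coverage = covered / NUM_TRIALS
print(f"fraction of intervals containing the true difference: {coverage:.3f}")
```

The fraction should land close to 0.95. Each individual interval either contains the true difference or it doesn’t; the 95% describes the procedure over many repetitions.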

Usefulness:

Checking whether zero is inside the CI to make a call on $H_0$
Its location and width give information on the magnitude of the effect and on how precisely it is estimated

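A narrow CI implies a tight range encompassing the true effect, i.e. a precise estimate. The width is driven by the sample size and the variability of the data, so a CI that is both narrow and far from zero is the strongest sign of a real effect.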

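A p-value less than $\alpha$ comes with a $(1-\alpha)$ confidence interval that doesn’t include the $H_0$ value (zero difference), as long as both are computed from the same test. The two will not contradict each other.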

Effect size

It’s one thing to have a statistically significant p-value, and another for the difference it represents to be of meaningful magnitude. A popular measure for the effect size is Cohen’s d.
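
The notes don’t give the formula; the common pooled-standard-deviation form of Cohen’s d is the difference in means divided by the pooled standard deviation of the two groups. A minimal sketch under that assumption, with made-up data:

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two groups."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

Unlike the p-value, d does not shrink or grow with the sample size alone: it expresses the difference in units of standard deviation, so it speaks to how large the effect is rather than how detectable it is.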

References

Math for Deep Learning by Ronald T. Kneusel