
# Guides on probability and statistics

Nov 11, 2022

Tags: Probability and Statistics

Check out the **interactive map of data science**!

Heya guys 👋, welcome to our comprehensive probability and statistics series!

Unlike our ML series, this series follows a chronological, textbook-like flow, since concepts in probability and statistics build on one another.

The series is far from complete and we hope to publish comprehensive guides every week. As always, please feel free to email me at isshin@skytowner.com or join our Discord channel if you get stuck!

# Chapter 1 - Basics of statistics

1.1. Population, samples and sampling techniques

This guide covers common techniques to sample from a population such as random, stratified and convenience sampling.

1.2. Measures of central tendency

This guide covers three main measures of central tendency: the mean, median and mode.
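As a quick sketch, all three measures can be computed with Python's built-in `statistics` module (the dataset here is made up for illustration):

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]  # small made-up dataset

print(mean(data))    # 5   -> (2+3+3+5+7+10) / 6
print(median(data))  # 4.0 -> average of the middle two values, (3+5)/2
print(mode(data))    # 3   -> the most frequent value
```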

1.3. Measures of spread

This guide covers three main measures of spread: variance, standard deviation and mean absolute deviation.
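A minimal sketch of the three measures in Python: variance and standard deviation come straight from the `statistics` module, while mean absolute deviation has no stdlib function and is computed by hand (the dataset is made up):

```python
from statistics import pvariance, pstdev, mean

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset with mean 5

# Population variance: average squared deviation from the mean
print(pvariance(data))  # 4

# Population standard deviation: square root of the variance
print(pstdev(data))     # 2.0

# Mean absolute deviation: average absolute deviation from the mean
mu = mean(data)
mad = mean(abs(x - mu) for x in data)
print(mad)              # 1.5
```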

1.4. Quantiles, quartiles and percentiles

A q-quantile divides the data points into q equal portions. Quartiles are 4-quantiles and percentiles are 100-quantiles.
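As a sketch, `statistics.quantiles` computes the cut points of any q-quantile; here we assume `method="inclusive"`, which matches the common textbook interpolation (the default `"exclusive"` method gives different cut points):

```python
from statistics import quantiles

data = list(range(1, 11))  # the values 1 through 10

# Quartiles (4-quantiles): three cut points splitting the data into 4 parts
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 3.25 5.5 7.75

# Percentiles are 100-quantiles; deciles (10-quantiles) work the same way.
# The 9th decile cut point is the 90th percentile:
deciles = quantiles(data, n=10, method="inclusive")
print(deciles[8])  # 9.1
```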

1.5.1 Visualizing data - histogram

A histogram is a diagram that illustrates the distribution of a given set of values.

1.5.2 Visualizing data - boxplot diagrams

A boxplot diagram, or box-whisker diagram, is a popular way to visualize the spread of a dataset using quartiles.

# Chapter 2 - Basics of probability theory

2.1. Basics of set theory and Venn diagrams

Set theory is a branch of mathematics that studies sets, which are collections of distinct elements in which ordering does not matter.

2.2. Counting with permutations

Permutation refers to the number of ways of ordering r elements from a total of n elements.

2.3. Counting with combinations

Combinations refer to the number of ways we can pick a set of k elements from a total of n elements without regard to ordering.
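Both counting rules are available directly in Python's `math` module, as a quick sanity check on hand computations:

```python
from math import perm, comb

# Permutations: ordered selections of r elements from n
print(perm(5, 3))  # 60 -> 5 * 4 * 3

# Combinations: unordered selections of k elements from n
print(comb(5, 3))  # 10 -> 60 / 3!
```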

2.4. Sample space, events and probability axioms

This guide is about sample space, events (simple, compound and disjoint) and the three axioms of probability.

2.5. Conditional probability

Given two events A and B, the conditional probability of B given A is the probability that B occurs given that A has already occurred.
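As an illustration with a fair six-sided die (events chosen just for this example), the definition P(B|A) = P(A ∩ B) / P(A) can be computed directly by counting outcomes:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # sample space: faces of a fair die
A = {2, 4, 6}               # event: roll is even
B = {4, 5, 6}               # event: roll is greater than 3

def prob(event):
    # Equally likely outcomes, so P(E) = |E| / |omega|
    return Fraction(len(event), len(omega))

# P(B|A) = P(A and B) / P(A)
p_b_given_a = prob(A & B) / prob(A)
print(p_b_given_a)  # 2/3
```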

2.6. Multiplication and addition rule

The multiplication rule is the rearranged version of the definition of conditional probability, and the addition rule takes into account double-counting of events.

2.7. Law of total probability

The law of total probability partitions the sample space, allowing us to compute marginal probabilities.

2.8. Bayes' theorem

Bayes' theorem is a mathematical formula for computing conditional probabilities of events.
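As a sketch combining the law of total probability with Bayes' theorem, consider a hypothetical diagnostic test; every number below is made up purely for illustration:

```python
from fractions import Fraction

p_disease = Fraction(1, 100)             # prior P(D): 1% of people have the disease
p_pos_given_disease = Fraction(95, 100)  # P(+|D): test sensitivity
p_pos_given_healthy = Fraction(5, 100)   # P(+|not D): false-positive rate

# Law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 19/118, roughly 0.161
```

Note how a positive result from a seemingly accurate test still leaves only about a 16% chance of disease, because the prior is so small.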

# Chapter 3 - Random variables

3.1. Random variables

A random variable X is a function that associates a value x to every possible outcome in an experiment (sample space).

3.2. Expected value

The expected value of random variable X is a number that tells us the average value of X we expect to see when we perform a large number of independent repetitions of an experiment.
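A minimal sketch using a fair six-sided die as the random variable, computing E[X] directly from its probability mass function:

```python
from fractions import Fraction

# X = outcome of one roll of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of x * P(X = x)
ex = sum(x * p for x, p in pmf.items())
print(ex)  # 7/2, i.e. 3.5
```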

3.3. Properties of expected value

This guide goes over all the main properties of the expected value of random variables along with their proofs.

3.4. Variance

Variance is the average squared distance between a random variable and its mean, measuring the spread of the random variable's distribution.

3.5. Properties of variance

This guide goes over all the main properties of the variance of random variables along with their proofs.

3.6. Covariance

The covariance of two random variables is a measure of the linear relationship between them.

3.7. Correlation

The correlation coefficient is used to determine the linear relationship between two variables. It normalizes covariance values to fall within the range of -1 (strong negative linear relationship) to 1 (strong positive linear relationship).
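A rough sketch of both quantities (population versions) on a made-up paired dataset where y is exactly 2x, so the correlation should come out at 1:

```python
from statistics import mean, pstdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x: a perfect positive linear relationship

mx, my = mean(x), mean(y)

# Population covariance: average product of deviations from the means
cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
print(cov)             # 4

# Correlation: covariance normalized by the standard deviations
corr = cov / (pstdev(x) * pstdev(y))
print(round(corr, 6))  # 1.0
```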

# Chapter 4 - Point estimation

4.1. Sample mean

The sample mean, which is computed as the average of the sample observations, is an unbiased estimator of the population mean.

4.2. Sample variance

Sample variance is an unbiased estimator for the population variance that can be computed by dividing the sum of squared differences from the mean by n-1.
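The `statistics` module exposes both divisors, which makes the n-1 correction easy to see on a made-up sample:

```python
from statistics import variance, pvariance

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5

# Sample variance divides the sum of squared deviations (32) by n - 1,
# which makes it an unbiased estimator of the population variance
print(variance(sample))   # 32/7, roughly 4.571

# Dividing by n instead gives the (biased) population formula
print(pvariance(sample))  # 4 -> 32/8
```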

4.3. Sample covariance

Sample covariance is an unbiased estimator of the population covariance and measures the association between two variables.

4.4. Sample correlation

Sample correlation is a quantity between -1 and 1 that measures the level of association between two variables.

4.5.1. Properties of estimators - bias

The bias of an estimator tells us how off its estimates are on average from the true population parameter.

4.5.2. Properties of estimators - mean squared error

The mean squared error of an estimator represents the average squared difference between the computed estimates and the true population parameter.

4.6. Central limit theorem

The central limit theorem states that regardless of the population distribution, the sampling distribution of the sample mean is approximately normal given a large sample size.
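A quick simulation sketch: we draw repeated samples from a heavily skewed exponential population (mean 1, standard deviation 1) and look at the distribution of the sample means, which the CLT says should concentrate around 1 with spread about sigma / sqrt(n):

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed for a reproducible illustration

def sample_mean(n):
    # Mean of one sample of size n from an exponential population with mean 1
    return mean(random.expovariate(1.0) for _ in range(n))

# Sampling distribution of the mean for samples of size n = 100
means = [sample_mean(100) for _ in range(2000)]

# CLT: the means cluster near the population mean (1) with standard
# deviation close to sigma / sqrt(n) = 1 / 10
print(round(mean(means), 2), round(stdev(means), 2))  # roughly 1.0 and 0.1
```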

# Chapter 5 - Discrete probability distributions

5.1. Probability mass function

The probability mass function (PMF) assigns probabilities to every possible value of a discrete random variable.

5.2. Binomial distribution

The binomial distribution is the discrete probability distribution of obtaining exactly k successes in n independent Bernoulli trials.
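The PMF can be written straight from the formula P(X = k) = C(n, k) p^k (1-p)^(n-k); here we check it on a coin-flip example chosen for illustration:

```python
from math import comb
from fractions import Fraction

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 heads in 5 flips of a fair coin
print(binomial_pmf(3, 5, Fraction(1, 2)))  # 5/16 -> 10/32
```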

5.3. Geometric distribution

The geometric distribution is the discrete distribution of the number of trials to observe the first success in repeated independent Bernoulli trials.

5.4. Negative binomial distribution

The negative binomial distribution is the discrete distribution of the number of trials to observe the first r successes in repeated independent Bernoulli trials.

5.5. Hypergeometric distribution

The hypergeometric distribution is a discrete distribution of the number of successes in a sequence of draws without replacement from a finite population.

5.6. Poisson distribution

The Poisson distribution models the number of events occurring within a given time interval.
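As a sketch straight from the Poisson PMF P(X = k) = λ^k e^(-λ) / k!, with a made-up scenario:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam**k * exp(-lam) / factorial(k)

# If a shop averages 4 customers per hour, the probability of
# seeing exactly 2 customers in a given hour:
p = poisson_pmf(2, 4)
print(round(p, 4))  # 0.1465
```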

Published by Isshin Inada
