# Guides on Machine Learning

May 16, 2022

# What is this series about?

Welcome to our comprehensive machine learning series 👋! The goal of this series is to give you a deeper, more intuitive understanding of important concepts in machine learning. Our guides are filled with simple but insightful examples, a style of teaching we strongly believe in.

We aim to publish a new guide per week, and we also routinely improve our existing guides with more examples and sections. Feel free to register an account to be notified when we do!

# Latest updates

Here are the updates in the last 14 days:

(Apr 8) 🚀 Added a new article: Comprehensive Guide to Principal Component Analysis.

(Apr 4) Added a section about implementing custom score functions for Scikit-learn's cross validation (Comprehensive Guide on Cross Validation).

(Apr 2) Added a detailed section about word embedding in Introduction to Text Vectorization.

(Apr 1) Added a section on implementing one-hot encoding and dummy encoding in Introduction to Text Vectorization.

(Mar 26) Added a worked example under standardisation in Introduction to Feature Scaling.

(Mar 25) Added a new section on TF-IDF (with examples 🙂 of course) in Introduction to Text Vectorization.

# Comprehensive guides

## Machine learning models

Naive Bayes is a simple but powerful model based on Bayes' theorem. It is often used for classification tasks in natural language processing.

Decision trees are a suite of tree-based models used for classification and regression. They are one of the most commonly used models in data science due to their highly robust and transparent predictions.

Random forest is a machine learning model that involves building multiple decision trees in a random manner to perform classification or regression.

The objective of linear regression is to draw a line of best fit that can then be used for predictions and inferences.
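As a minimal sketch of what "line of best fit" means, the snippet below fits a straight line by ordinary least squares using only the standard library. The function name `fit_line` and the sample data are illustrative assumptions, not code taken from the guide itself.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data that lies exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # use the fitted line to predict at x = 5
```

In practice a library such as Scikit-learn would typically be used instead; the hand-rolled version is only meant to show the arithmetic.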

## Feature engineering

Feature scaling is an important preprocessing step in machine learning that can help increase accuracy and training speed.
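One common form of feature scaling is standardisation, which rescales a feature to have zero mean and unit standard deviation. The sketch below assumes the population standard deviation is used; the helper name `standardise` and the sample values are made up for illustration.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Illustrative feature column
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardise(heights_cm)
```

After scaling, the middle value (which equals the mean) maps to 0, and the values are expressed in units of standard deviations from the mean.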

Grid search is a brute-force technique to find the optimal hyper-parameters for model building.
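The brute-force idea can be sketched in a few lines: try every combination of candidate values and keep the best-scoring one. Here `grid_search` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for "train a model with these hyper-parameters and return its validation score".

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(depth, learning_rate):
    # Hypothetical score surface that peaks at depth=3, learning_rate=0.1
    return -((depth - 3) ** 2) - (learning_rate - 0.1) ** 2


param_grid = {"depth": [1, 2, 3, 4], "learning_rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(toy_score, param_grid)
```

The number of evaluations is the product of the list lengths (12 here), which is why grid search becomes expensive as more hyper-parameters are added.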

Machine learning models require numerical input and so we need to transform non-numeric data (e.g. text and categories) into vectors. This step is called text vectorization.

Principal Component Analysis (PCA)

Principal component analysis, or PCA, is one of a family of techniques for dimensionality reduction that uses the dependencies between the features to represent them in a lower-dimensional form while trying to minimize information loss.

## Evaluating machine learning models

A confusion matrix is a simple table used to summarise the performance of a classification algorithm.
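For a binary classifier, the table has four cells: true positives, false positives, false negatives and true negatives. The sketch below counts them by hand; the function name `confusion_counts` and the labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Illustrative true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
```

Metrics such as accuracy, precision and recall can all be derived from these four counts.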

The ROC (Receiver Operating Characteristic) curve is a way to visualise the performance of a binary classifier.

Cross validation is a technique to measure the performance of a model through resampling.
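One widely used variant is k-fold cross validation, where the data is split into k folds and each fold takes a turn as the test set. The sketch below only generates the index splits; `k_fold_indices` is a hypothetical helper, and it deliberately skips shuffling to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=6, k=3))
```

A model would then be trained on each `train` split and scored on the matching `test` split, with the k scores averaged to estimate performance.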

The mean squared error, or MSE, is a performance metric that measures how well your model fits the target values. The mean squared error is defined as the average of all squared differences between the true and predicted values.

Mean absolute error, or MAE, measures the performance of a model, and is defined as the average of all the absolute differences between true and predicted values.

Root Mean Squared Error (RMSE)

The root mean squared error (RMSE) is defined as the square root of the average of the squared differences between the actual and predicted values.
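The three error metrics in this section differ only in how they aggregate the differences between true and predicted values. A minimal sketch, with made-up numbers and helper names chosen for illustration:

```python
from math import sqrt


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return sqrt(mean_squared_error(y_true, y_pred))


# Illustrative true and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
```

Because RMSE takes the square root, it is expressed in the same units as the target, which often makes it easier to interpret than MSE.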

## PySpark

PySpark is an API that allows you to write Python code to interact with Apache Spark, an open-source distributed computing framework for handling big data.

Resilient Distributed Dataset (RDD)

The RDD is the central data structure of Spark, in which the data is partitioned across a number of worker nodes to facilitate parallel operations.

Getting Started with PySpark on Databricks

Databricks offers a platform for gaining hands-on experience with PySpark for free through its community edition.

# Reach out to us

Please feel free to hop onto our Discord if you:

- have any questions about our guides
- have any requests on machine learning topics
- are passionate about data science and want to chill with like-minded people

We'll get back to you as soon as possible 🙂.