Guides on Machine Learning
What is this series about?
Welcome to our comprehensive machine learning series 👋! The goal of this series is to give you a deep, intuitive understanding of important concepts in machine learning. Our guides are filled with simple but insightful examples, a style of teaching we strongly believe in.
We aim to publish a new guide every week, and we also routinely improve our existing guides with more examples and sections. Feel free to register an account to be notified when we do!
Here are the updates in the last 14 days:
(Apr 8) 🚀 Added a new article: Comprehensive Guide to Principal Component Analysis.
(Apr 4) Added a section about implementing custom score functions for Scikit-learn's cross validation (Comprehensive Guide on Cross Validation).
(Apr 2) Added a detailed section about word embedding in Introduction to Text Vectorization.
(Apr 1) Added a section on implementing one-hot encoding and dummy encoding in Introduction to Text Vectorization.
(Mar 26) Added a worked example under standardisation in Introduction to Feature Scaling.
(Mar 25) Added a new section on TF-IDF (with examples 🙂 of course) in Introduction to Text Vectorization.
Machine learning models
Naive Bayes is a simple but powerful model based on Bayes' theorem. It is often used for classification tasks in natural language processing.
Decision trees are a suite of tree-based models used for classification and regression. They are among the most commonly used models in data science because their predictions are transparent and easy to interpret.
Random forest is a machine learning model that builds multiple decision trees with injected randomness and combines their outputs to perform classification or regression.
The objective of linear regression is to draw a line of best fit that can then be used for predictions and inferences.
Feature scaling is an important preprocessing step in machine learning that can help increase accuracy and training speed.
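As a quick taste of what the guide covers, here is a minimal sketch of standardisation, one common scaling method, using made-up numbers rather than data from the guide itself:

```python
import numpy as np

# Hypothetical feature column (toy values for illustration only)
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

# Standardisation: subtract the mean, then divide by the standard deviation,
# so the scaled feature has zero mean and unit variance
scaled = (heights_cm - heights_cm.mean()) / heights_cm.std()
```

After this transformation, every feature lives on a comparable scale, which helps gradient-based training converge faster.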
Grid search is a brute-force technique to find the optimal hyper-parameters for model building.
Machine learning models require numerical input and so we need to transform non-numeric data (e.g. text and categories) into vectors. This step is called text vectorization.
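To make the idea concrete, here is a tiny sketch of one-hot encoding, one of the simplest vectorization schemes, applied to a hypothetical categorical feature (the colour values below are invented for illustration):

```python
# Toy categorical feature (not from the guides themselves)
categories = ["red", "green", "blue", "green"]

# Build a fixed vocabulary of the unique category values
vocab = sorted(set(categories))  # ['blue', 'green', 'red']

# Each value becomes a vector with a single 1 at its vocabulary position
one_hot = [[1 if value == v else 0 for v in vocab] for value in categories]
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each category is now a numeric vector that a model can consume directly.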
Principal component analysis, or PCA, is a dimensionality reduction technique that uses the dependencies between features to represent them in a lower-dimensional form while trying to minimize information loss.
Evaluating machine learning models
A confusion matrix is a simple table used to summarise the performance of a classification algorithm.
The ROC (Receiver Operating Characteristic) curve is a way to visualise the performance of a binary classifier.
Cross validation is a technique to measure the performance of a model through resampling.
The mean squared error, or MSE, is a performance metric that measures how well your model fits the target values. The mean squared error is defined as the average of all squared differences between the true and predicted values.
Mean absolute error, or MAE, measures the performance of a model, and is defined as the average of all the absolute differences between true and predicted values.
The root mean squared error (RMSE) is defined as the square root of the average squared differences between the actual and predicted values.
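The three regression metrics above can be computed in a few lines of NumPy; the true and predicted values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical true and predicted target values
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)     # average of the squared differences
mae = np.mean(np.abs(errors))  # average of the absolute differences
rmse = np.sqrt(mse)            # square root of the MSE
```

Note how MSE penalises large errors more heavily than MAE, while RMSE brings the result back to the same units as the target.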
PySpark
PySpark is an API that allows you to write Python code to interact with Apache Spark, an open-source distributed computing framework for handling big data.
An RDD (Resilient Distributed Dataset) is the central data structure of Spark, in which data is partitioned across a number of worker nodes to facilitate parallel operations.
Databricks offers a platform to gain some hands-on experience with PySpark for free using its Community Edition.
Reach out to us
Please feel free to hop onto our Discord if you:
have any questions about our guides
have any requests on machine learning topics
are passionate about data science and want to chill with like-minded people
We'll get back to you as soon as possible 🙂.