Syllabus Point
- Describe types of algorithms associated with ML
Including:
- linear regression
- logistic regression
- K-nearest neighbour
Key Terms
- Bias: An intercept or offset term; it accounts for the fact that not all models pass through the origin (0,0)
- Feature: An input variable to a ML model - an example consists of one or more features
- Label: The answer or result portion of an example
- Example: The values of one row of features and possibly a label
- Parameter: The weights and biases that a model learns during training
- Weight: A value that a model multiplies by a feature value; it reflects that feature's influence on the prediction
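The key terms above fit together in a single prediction. A minimal sketch (the feature values, weights, and bias below are invented for illustration):

```python
# A model's prediction for one example: weighted sum of features plus bias.
features = [3.0, 2.0]   # inputs (e.g. two measured attributes of one example)
weights = [0.5, -1.0]   # learned parameters: one weight per feature
bias = 4.0              # learned intercept/offset

# prediction = w1*x1 + w2*x2 + b
prediction = sum(w * x for w, x in zip(weights, features)) + bias
print(prediction)  # 0.5*3.0 + (-1.0)*2.0 + 4.0 = 3.5
```

During training, the weights and bias (the parameters) are adjusted so that predictions move closer to the labels.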
Bias-Variance Trade-off
The bias-variance trade-off describes the relationship between a model's complexity and its prediction accuracy on new data.
Overview
- Bias: measures how far off a model's predictions are from the true values
- Variance: measures how much of those predictions fluctuate with different training data
Balance for Good Performance
- Achieving good prediction performance requires balancing bias against variance
- High bias (overly simple models) leads to underfitting and poor accuracy
- High variance (overly complex models) leads to overfitting and poor performance on new data
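The trade-off can be seen by fitting a too-simple and a too-complex model to the same data. A sketch using NumPy (the dataset, noise level, and polynomial degrees are arbitrary choices for illustration):

```python
import numpy as np

# Noisy samples of a nonlinear curve (illustrative data only).
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 20)
y = np.sin(2 * x) + rng.normal(0, 0.2, size=x.size)

# High bias: a straight line cannot follow the curve (underfitting).
simple = np.polyval(np.polyfit(x, y, deg=1), x)
# High variance: a degree-8 polynomial chases the noise (overfitting).
flexible = np.polyval(np.polyfit(x, y, deg=8), x)

mse_simple = np.mean((y - simple) ** 2)
mse_flexible = np.mean((y - flexible) ** 2)
# The flexible model always fits the *training* data at least as well,
# but it is the one that generalises worse to unseen data.
print(mse_simple, mse_flexible)
```

Low training error alone is therefore not evidence of a good model; performance must be checked on data the model was not trained on.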
Linear Regression
A supervised learning algorithm used for predicting continuous values by learning from labelled datasets.
Overview
- Assumes there is a linear relationship (output changes at a constant rate as the input changes)
- Used for predicting values that follow a linear relationship (e.g. house prices, exam marks)
- Models the relationship between a dependent variable (also known as the target variable) and one or more independent variables (also known as features or predictors)
Residuals
- Residual = actual Y minus predicted Y (the prediction error for each data point)
- Small residuals mean the model is accurate
Advantages
- Simple to implement and interpret output coefficients
- Good for linear relationships
- Each variable's effect on the outcome can be seen
Disadvantages
- Outliers have big impact
- Assumes linear relationship and independence between attributes
- Struggles with complex, nonlinear data
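For one feature, the best-fit line has a closed-form least-squares solution. A minimal sketch in plain Python (the sample data is invented, roughly following y = 2x + 1 with noise):

```python
# Simple linear regression via the closed-form least-squares solution:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

predictions = [slope * x + intercept for x in xs]
residuals = [y - p for y, p in zip(ys, predictions)]  # actual minus predicted
print(round(slope, 2), round(intercept, 2))  # 1.94 1.15
```

Small residuals indicate a good fit; with least squares the residuals always sum to (approximately) zero, so their individual sizes, not their sum, measure accuracy.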
Logistic Regression
Logistic regression is a supervised ML algorithm that classifies data into discrete classes by modelling the relationship between one or more independent variables and the probability of an outcome.
Overview
- Used in predictive modelling, where the model estimates the mathematical probability that an instance belongs to a specific category
- Produces an s-shaped curve that maps values between 0 and 1 (sigmoid curve)
Use Cases
- Probability of heart attacks
- Probability of enrolling in university
- Identifying spam emails
Advantages
- Easier to implement than other methods of ML - training doesn't require high computational power
- Works well when the dataset is linearly separable - when straight line separates the two data classes
- Provides valuable insights - measures how relevant or appropriate an independent/predictor variable is + reveals direction of their relationship (positive or negative)
Sigmoid Curve/Function
- Maps any real-valued input to an output between 0 and 1
- Can use to predict probability
- Boundary/threshold is usually 0.5
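The sigmoid function and the 0.5 threshold can be sketched directly (the input values below are arbitrary examples):

```python
import math

def sigmoid(z):
    """Map any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Turn the predicted probability into a discrete class label."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))      # 0.5: inputs of 0 sit exactly on the usual boundary
print(classify(2.0))   # 1  (sigmoid(2.0) is about 0.88, above the threshold)
print(classify(-2.0))  # 0  (sigmoid(-2.0) is about 0.12, below the threshold)
```

Large positive inputs push the output towards 1 and large negative inputs towards 0, which is what produces the s-shaped curve.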
K-Nearest Neighbour
KNN (k-nearest neighbour) is a supervised learning algorithm used for classification and regression.
Overview
- Functions under the idea that "similar things exist in close proximity"
- It follows instance-based learning (no model is trained beforehand and all training data is stored)
- Non-parametric (makes no assumptions about underlying data distribution)
How It Works
- Store all available examples: keep every instance from the training dataset
- Pick K: the number of neighbouring points the algorithm considers when making a decision; K should be odd for classification to avoid tied votes
- Calculate distances: measure the similarity between the target point and every training point using a distance metric (e.g. Euclidean distance)
- Find the neighbours: select the K data points closest to the target
- Predict: for classification, the new point is assigned the most common class among its K neighbours; for regression, the value is the average of the K neighbours' values
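The steps above can be sketched as a short classifier in plain Python (the two-feature dataset and its "A"/"B" labels are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; distance is Euclidean."""
    # Step 1-3: compute the distance from the query to every stored example.
    distances = sorted((math.dist(x, query), label) for x, label in train)
    # Step 4: take the k closest neighbours.
    top_k = [label for _, label in distances[:k]]
    # Step 5: majority vote decides the class.
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical dataset: two clusters with two class labels.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_classify(train, (2, 2)))  # "A": its 3 nearest neighbours are all class A
```

Note that all the work happens at prediction time: nothing is learned in advance, which is why KNN is described as instance-based and why prediction slows as the stored dataset grows.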
Use Cases
- Healthcare (predict a patient's diagnosis based on past cases)
- Finance (categorise a transaction as fraudulent or not)
- Recommender systems (based on users with similar tastes)
Advantages
- Simple to understand and implement
- No training phase
- Non parametric (makes no assumptions about data distribution)
- Can use with 2+ class labels
Disadvantages
- Computationally expensive
- Prediction slows as the dataset grows, since distances to every stored point must be computed
- Affected by choice of K
- Sensitive to irrelevant features
- No interpretable internal model
Summary
KNN is a supervised machine learning algorithm used for both classification and regression tasks. It is considered an instance-based learning algorithm, because it doesn't learn a model by training, and instead stores the training data to make predictions based on the similarity between data points. When making a prediction, KNN calculates the distance between a new input and all examples in the training dataset using a distance metric, then identifies the 'K' closest neighbours to make a decision. For classification, it assigns the most frequent class among the K neighbours, and for regression it averages the values of those neighbours. KNN demonstrates how automated decision-making can occur by analysing patterns in historical data to make predictions about new inputs.