Syllabus Point
- Describe types of algorithms associated with ML
Including:
- linear regression
- logistic regression
- K-nearest neighbour
Key Terms
- Bias: An intercept or offset added to a model's output. Not every relationship passes through (0,0), so the bias lets the prediction be non-zero when every feature is zero
- Feature: An input variable to an ML model - an example consists of one or more features
- Label: The answer or result portion of an example
- Example: The values of one row of features and possibly a label
- Parameter: The weights and biases that a model learns during training
- Weight: A value that a model multiplies an input value (such as a feature) by
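As a rough illustration of how these terms fit together, the sketch below builds one example from two features and a label, then makes a prediction from weights and a bias. The feature meanings, the numbers, and the `predict` helper are all made up for illustration; a real model would learn its parameters during training.

```python
# One example: a row of features plus its label (all values are illustrative).
features = [3.0, 2.0]   # e.g. two input variables for one row
label = 13.0            # the known answer for this example

# Parameters: in practice these are learned during training.
weights = [2.0, 3.0]    # one weight per feature
bias = 1.0              # offset, so the output need not pass through (0,0)

def predict(features, weights, bias):
    """Multiply each feature by its weight, add them up, then add the bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

prediction = predict(features, weights, bias)
print(prediction)                        # 2*3 + 3*2 + 1 = 13.0
print(predict([0.0, 0.0], weights, bias))  # all-zero features give the bias: 1.0
```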
Bias-Variance Trade-off
The bias-variance trade-off is a concept describing the relationship between a model's complexity and its accuracy: making a model more complex tends to lower its bias but raise its variance.
Overview
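- Bias: measures how far off a model's predictions are, on average, from the true values (this is a type of prediction error, not the intercept defined in Key Terms)
- Variance: measures how much those predictions fluctuate when the model is trained on different training data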
Balance for Good Performance
- Achieving good prediction performance requires balancing bias and variance
- High bias (overly simple models) leads to underfitting and poor accuracy
- High variance (overly complex models) leads to overfitting and poor performance on new data
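One way to see underfitting and overfitting side by side is to compare two deliberately extreme models on the same data. This is only a sketch under assumptions: the data points, the `mean_model` and `memorise_model` helpers, and the use of mean squared error are all hypothetical choices for illustration, not a formal measurement of bias or variance.

```python
# Noisy training samples of an underlying trend y = 2x (noise is made up).
train = [(1, 2.5), (2, 3.5), (3, 6.5), (4, 7.5)]
# Unseen points that lie exactly on the trend.
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8)]

def mean_model(x):
    """Overly simple: ignores x and always predicts the mean training label."""
    return sum(y for _, y in train) / len(train)

def memorise_model(x):
    """Overly complex: copies the label of the single closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

for name, model in [("mean", mean_model), ("memorise", memorise_model)]:
    print(name, "train error:", mse(model, train), "test error:", mse(model, test))
```

The mean model has a large error on both sets (underfitting). The memorising model has zero error on the data it has already seen, noise included, but makes errors on unseen points (overfitting).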
Linear Regression
Linear regression is a supervised learning algorithm used for predicting continuous values by learning from labelled datasets.
Overview
- Assumes there is a linear relationship (output changes at a constant rate as the input changes)
- Used for predicting values that follow a linear relationship (e.g. house prices, marks)
- Models the relationship between a dependent variable (also known as the target variable) and one or more independent variables (also known as features or predictors)
Residuals
- Residual = Actual Y - Predicted Y (the gap between an observed value and the model's prediction for it)
- Small residuals mean the model fits the data closely
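A minimal sketch of fitting a line to one feature and then computing residuals as actual minus predicted. It assumes ordinary least squares as the fitting method (the notes do not name one), and the `fit_line` helper and the hours/marks data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (weight, bias)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    weight = sxy / sxx
    bias = mean_y - weight * mean_x
    return weight, bias

hours = [1, 2, 3, 4, 5]        # feature (independent variable)
marks = [52, 54, 61, 63, 70]   # label (dependent variable)

weight, bias = fit_line(hours, marks)
predicted = [weight * x + bias for x in hours]
residuals = [actual - pred for actual, pred in zip(marks, predicted)]

print(weight, bias)   # 4.5 46.5
print(residuals)      # [1.0, -1.5, 1.0, -1.5, 1.0]
```

The weight can be read directly as "each extra hour is associated with about 4.5 more marks", which is the interpretability the notes list as an advantage.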
Advantages
- Simple to implement, and its output coefficients are easy to interpret
- Good for linear relationships
- Each variable's effect on the outcome can be seen
Disadvantages
- Outliers have a big impact on the fitted line
- Assumes linear relationship and independence between attributes
- Struggles with complex, nonlinear data
Logistic Regression
Logistic regression is a supervised ML algorithm that analyses the relationship between one or more independent variables and an outcome, predicts the probability of that outcome, and uses that probability to classify data into discrete classes.
Overview
- Used in predictive modeling, where the model estimates the mathematical probability of whether an instance belongs to a specific category or not
- Produces an S-shaped (sigmoid) curve that maps any input value to an output between 0 and 1
Use Cases
- Probability of heart attacks
- Probability of enrolling in university
- Identifying spam emails
Advantages
- Easier to implement than other methods of ML - training doesn't require high computational power
- Works well when the dataset is linearly separable - when a straight line separates the two data classes
- Provides valuable insights - measures how relevant an independent/predictor variable is and reveals the direction of its relationship with the outcome (positive or negative)
Sigmoid Curve/Function
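- Takes any real number as input and outputs a number between 0 and 1
- Can be used to predict a probability
- Boundary/threshold is usually 0.5: outputs at or above it are typically assigned to one class, and outputs below it to the other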
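The sigmoid function and the 0.5 threshold can be sketched as follows. The `predict_probability` and `classify` helpers, and the weight and bias values, are hypothetical; in a real logistic regression model those parameters would be learned from labelled data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1 + e)

def predict_probability(x, weight, bias):
    """Logistic regression with one feature: sigmoid of a weighted sum."""
    return sigmoid(weight * x + bias)

def classify(probability, threshold=0.5):
    """Turn a probability into a class label (1 or 0) using the threshold."""
    return 1 if probability >= threshold else 0

# Made-up parameters for illustration only.
weight, bias = 1.5, -6.0

for x in [2, 4, 6]:
    p = predict_probability(x, weight, bias)
    print(x, round(p, 3), classify(p))
```

With these made-up parameters, an input of 4 lands exactly on the 0.5 boundary, smaller inputs fall below it and larger inputs fall above it.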
K-Nearest Neighbour
KNN (k-nearest neighbour) is a supervised learning algorithm used for classification and regression.
Overview
- Functions on the idea that "similar things exist in close proximity"
- It follows instance-based learning (no model is trained beforehand and all training data is stored)
- Non-parametric (makes no assumptions about underlying data distribution)
How It Works
- Stores all available examples: instances from the training dataset
- Picking K: K is the number of neighbouring points the algorithm looks at to make a decision; for two-class classification, K should be odd to avoid tied votes
- Calculating distance: Measure how similar the target point is to each training data point by calculating the distance between them with a distance metric, e.g. Euclidean distance
- Find the neighbours: The K data points with the smallest distance to the target point
- Prediction: Classification - the new point is assigned the most common class among its K neighbours; Regression - the value is predicted by taking the average of its K neighbours' values
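The steps above can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the helper names (`euclidean`, `k_nearest`, `knn_classify`, `knn_regress`) and the data are made up, and it sorts every stored example on each prediction, which is what makes KNN slow on large datasets.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, target, k):
    """Return the k stored (features, label) examples closest to the target."""
    return sorted(train, key=lambda example: euclidean(example[0], target))[:k]

def knn_classify(train, target, k):
    """Classification: the most common class among the k neighbours."""
    labels = [label for _, label in k_nearest(train, target, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, target, k):
    """Regression: the average value of the k neighbours."""
    values = [value for _, value in k_nearest(train, target, k)]
    return sum(values) / len(values)

# "Training" is just storing the labelled examples.
classes = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
           ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
values = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

print(knn_classify(classes, (2, 2), k=3))  # A
print(knn_classify(classes, (5, 4), k=3))  # B
print(knn_regress(values, (2,), k=3))      # (20 + 10 + 30) / 3 = 20.0
```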
Use Cases
- Healthcare (predict a patient's diagnosis based on past cases)
- Finance (categorise a transaction as fraudulent or not)
- Recommender systems (based on users with similar tastes)
Advantages
- Simple to understand and implement
- No training phase
- Non-parametric (makes no assumptions about data distribution)
- Can be used with two or more class labels
Disadvantages
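- Computationally expensive at prediction time (the distance to every stored example must be calculated)
- Prediction becomes slower and memory use grows as data size increases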
- Affected by choice of K
- Sensitive to irrelevant features
- No interpretable internal model
Summary
KNN is a supervised machine learning algorithm used for both classification and regression tasks. It is considered an instance-based learning algorithm, because it doesn't learn a model by training, and instead stores the training data to make predictions based on the similarity between data points. When making a prediction, KNN calculates the distance between a new input and all examples in the training dataset using a distance metric, then identifies the 'K' closest neighbours to make a decision. For classification, it assigns the most frequent class among the K neighbours, and for regression it averages the values of those neighbours. KNN demonstrates how automated decision-making can occur by analysing patterns in historical data to make predictions about new inputs.