What is ML:
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
- Tom Mitchell, 1997
Types of ML Systems
Trained with Human supervision?
Can it learn incrementally on the fly?
Building predictive models OR comparing new data points to known data points? (Model-based vs Instance-based learning)
Based on Human Intervention:
Supervised Learning
The training set fed to the algorithm includes the desired solutions (labels).
Unsupervised Learning
The training data fed to the algorithm does NOT contain labels.
Semi-supervised Learning
Data contains mostly unlabelled instances with very few labelled instances. Most semi-supervised algorithms are a combination of supervised + unsupervised learning.
An ideal example is Google Photos - it recognizes similar faces and groups them (the clustering, or unsupervised, part). We can then label these groups, i.e. tell the algorithm who these people are (the supervised part). Adding these labels helps us search photos later.
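A minimal sketch of this clustering-then-labelling idea, assuming scikit-learn's KMeans; the face embeddings and cluster names below are made-up placeholders, not real Google Photos data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up placeholder: pretend each row is a face embedding extracted from a photo
face_embeddings = np.random.rand(200, 128)

# Unsupervised part: group similar faces without any labels
kmeans = KMeans(n_clusters=5, random_state=42)
cluster_ids = kmeans.fit_predict(face_embeddings)

# Supervised part: a human names each cluster once...
cluster_names = {0: "Alice", 1: "Bob", 2: "Carol", 3: "Dave", 4: "Eve"}

# ...and every photo in a cluster inherits that label, making photos searchable by name
photo_labels = [cluster_names[c] for c in cluster_ids]
```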
Reinforcement Learning
An agent performs an action using a policy in an environment. Based on its actions, it is rewarded with points (or penalties). Based on this feedback, the policy is updated (the learning step). The aim is to find the optimal policy so that the agent maximizes its points over time.
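A toy sketch of the act-reward-update loop, using a two-action bandit with an epsilon-greedy policy. This is an illustrative simplification (no states, unlike full reinforcement learning), and the pay-off probabilities are made up:

```python
import random

# Toy "environment": two actions, action 1 pays off more often (made-up probabilities)
def step(action):
    return 1.0 if random.random() < (0.3 if action == 0 else 0.7) else 0.0

# Policy: estimated value of each action; epsilon-greedy action selection
values, counts, epsilon = [0.0, 0.0], [0, 0], 0.1

for episode in range(1000):
    # Act: explore with probability epsilon, otherwise exploit the current estimates
    action = random.randrange(2) if random.random() < epsilon else max(range(2), key=lambda a: values[a])
    reward = step(action)                      # environment returns a reward
    counts[action] += 1
    # Learning step: nudge the value estimate towards the observed reward
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the action with the higher estimated value is the learned (greedy) choice
```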
Based on whether the system can learn incrementally from a stream of incoming data:
Batch Learning
The system is incapable of learning incrementally. It is trained on the entire dataset at once. When new data is added, the algorithm must be re-trained on the complete data (old + new data). The updated model is then launched into production.
The process of retraining a new model with fresh data can be automated, even though training may take some time.
Online Learning
The system can learn incrementally on-the-fly. We can feed new data sequentially, either individually or in small mini-batches. Such systems are great for situations when data is received as a continuous flow (e.g. stock prices).
The adaptation of the algorithm to new data can be adjusted by tuning the Learning Rate hyper-parameter. If the learning rate is high, the algorithm adapts rapidly and is more sensitive to the new data. Conversely, if the learning rate is low, the algorithm is less sensitive to the new data and takes more time to adapt.
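A minimal sketch of online learning, assuming scikit-learn's SGDRegressor with a constant learning rate; the data stream below is simulated with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# eta0 is the learning rate; a "constant" schedule keeps it fixed across updates
model = SGDRegressor(learning_rate="constant", eta0=0.01)

rng = np.random.default_rng(0)
for _ in range(100):                      # simulate a stream of mini-batches
    X_batch = rng.random((32, 3))
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 32)
    model.partial_fit(X_batch, y_batch)   # incremental update, no full retraining
```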
Based on how the system generalizes - by comparing new data points to known data points, or by building a predictive model:
Instance-based learning
One trivial way of "learning" could be to learn all the instances by heart. Say, for a spam filter, the algorithm learns by heart all the spam emails. It would then classify a new mail as spam if it is identical to any one of the instances it has learnt by heart.
This algorithm can be improved by measuring how similar the new instance is to the learnt instances, instead of checking only for an identical match. This is called Instance-based learning.
More formally, the system learns all the examples by heart, then generalizes to new cases by using a similarity measure.
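A minimal sketch of instance-based learning using scikit-learn's k-nearest neighbours classifier; the spam/ham feature vectors are made-up toy values:

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up feature vectors: e.g. [number of links, count of "free", count of ALL-CAPS words]
X_train = [[5, 3, 4], [4, 2, 5], [0, 0, 1], [1, 0, 0]]
y_train = ["spam", "spam", "ham", "ham"]

# "Learning" here is just storing the instances; prediction compares a new
# email to its k most similar stored neighbours (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[4, 1, 3]]))  # classified by similarity to the known instances
```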
Model-based learning
Another way to build a learning algorithm is to create a model from the instances and then use it to make predictions. This is called model-based learning. One simple example is linear regression - we use the instances to find the best-fitting line and then use it to make predictions for new instances.
Here, we learn the model parameters in a way that gives the best performance. To measure the performance, we typically use a cost function, which measures how badly the model is performing. We could also use a utility function, which measures how good the performance is.
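A minimal sketch of model-based learning with scikit-learn's LinearRegression, using mean squared error as the cost function; the data points are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data: one feature with a roughly linear relationship to the target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

model = LinearRegression()
model.fit(X, y)                     # learn the model parameters (slope and intercept)

# Cost function: mean squared error measures how *bad* the fit is (lower is better)
print(mean_squared_error(y, model.predict(X)))
print(model.predict([[6.0]]))       # use the fitted line to predict a new instance
```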
Main Challenges of Machine Learning
Two things can go wrong - "bad data" or "bad algorithm"
Data warnings:
Insufficient Quantity of Training Data
Non-representative Training Data: If the sample is too small, you will have sampling noise. Even a very large sample can be non-representative if the sampling method is flawed, say, too much data is sampled from one particular part of the population. This is called sampling bias.
Poor-Quality Data: Training data is full of errors, outliers and noise.
Irrelevant Features: The system can only learn well if the training data contains enough relevant features. Remedies include selecting the most useful features (Feature Selection) and combining existing features to produce a more useful one (Feature Extraction); dimensionality reduction can help here (a minimal sketch follows after this list).
"garbage in, garbage out"
Algorithm warnings:
1. Overfitting the Training Data
- Model performs well on the training data but doesn't generalize well.
Regularization: Constraining a model and making it simpler to reduce overfitting.
Consider Linear Regression - it has 2 degrees of freedom, namely, the slope and the y-intercept. If we force the slope to be 0, then the algorithm will essentially have only 1 degree of freedom: we could only move the horizontal line up or down. On the other hand, if we restrict the slope to be a small number, then the model will effectively have somewhere between 1 and 2 degrees of freedom. This model will be more complex than the one with 1 degree of freedom but simpler than the one with 2 degrees of freedom.
The amount of regularization to be applied can be controlled by a Hyper-parameter.
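A minimal sketch of regularization, assuming Ridge regression, where the alpha hyper-parameter controls how strongly the slope is constrained; the data is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.random((20, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.3, 20)    # made-up noisy linear data

plain = LinearRegression().fit(X, y)
# alpha is the regularization hyper-parameter: a larger alpha pushes the slope
# towards 0, i.e. constrains the model and makes it simpler
regularized = Ridge(alpha=10.0).fit(X, y)

print(plain.coef_, regularized.coef_)       # the regularized slope is smaller
```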
2. Underfitting the Training Data
- Occurs when the model is too simple to learn the underlying structure of data.
Testing and Validation
To evaluate how the model would perform in real scenarios, we keep aside a test set, which helps us check whether the model has overfitted the training dataset.
However, what if we want to train multiple models and compare them against each other to select the best one? If we keep tuning our models based on their performance on the test set, we will end up overfitting the model to that particular test set.
To deal with this, we take out another sample set from the training data and keep it aside. This dataset is called the Validation set.
We train multiple models with various hyper-parameters on the reduced training set (we removed a part of it to create the validation set) and select the one that performs best on the validation set. After this, the chosen model is trained on the full training set (including the validation set) to produce a final model. Lastly, this final model is evaluated on the test set to get an estimate of the generalization error - how well the model generalizes.
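A minimal sketch of this train/validation/test workflow, assuming scikit-learn; the dataset and the candidate hyper-parameter values are made up:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)   # made-up data

# Hold out a test set, then carve a validation set out of the training set
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=0)

# Compare candidate hyper-parameters by their score on the validation set
candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 1.0, 100.0)]
best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))

# Retrain the chosen model on the full training set, then estimate generalization error
best.fit(X_train_full, y_train_full)
print(best.score(X_test, y_test))
```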
The problem with the above approach:
How to know the appropriate size of the validation set?
If the validation set is too small, it may not give an accurate performance comparison of the models.
On the other hand, if the validation set is too large, then the reduced training set will be much smaller. This is a problem because we will be selecting the best model after training on a much smaller training set and then re-training it on the full training set. The model re-trained on the full training set may not actually be the best one. In other words, another model might outperform the chosen one if it were trained on the full training set.
Solution: Cross-validation
Use many small validation sets. Each model is evaluated once per validation set, after being trained on the rest of the data. We can then average out the performance of each model across the different validation sets to get an accurate picture.
The drawback: increased training time, proportional to the number of validation sets.
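A minimal sketch of cross-validation, assuming scikit-learn's cross_val_score; the dataset is synthetic and made up:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)    # made-up data

# 5-fold cross-validation: 5 small validation sets; the model is trained on the
# rest of the data each time, so training cost grows with the number of folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())   # average performance across the validation sets
```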
Data Mismatch
What if a trained model is performing poorly on the validation set?
Is it due to overfitting on the training set? OR
Is it due to data mismatch between train & validation set?
How do we know?
We address this problem by carving out yet another dataset from the training set: the train-dev set. We evaluate the model on this train-dev set to know whether we have an overfitting problem or a data mismatch problem.
If the model performs poorly on the train-dev set, it means the model has overfitted the training set. Otherwise, if the model performs well on the train-dev set but poorly on the validation set, it means the validation data differs from the training data (data mismatch).
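A rough sketch of this diagnosis logic with made-up synthetic data (the 0.9 threshold is purely illustrative): the train-dev set is split off the training data, so it shares its distribution, while the validation data here is deliberately shifted to simulate a mismatch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train_full = rng.random((1000, 5))
y_train_full = (X_train_full[:, 0] > 0.5).astype(int)   # made-up training data
X_val = rng.random((200, 5)) + 0.3                       # validation data from a shifted distribution
y_val = (X_val[:, 0] > 0.8).astype(int)

# Hold out a train-dev set that comes from the SAME distribution as the training data
X_train, X_train_dev, y_train, y_train_dev = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
if model.score(X_train_dev, y_train_dev) < 0.9:          # illustrative threshold
    print("poor on train-dev too -> likely overfitting the training set")
elif model.score(X_val, y_val) < 0.9:
    print("fine on train-dev but poor on validation -> likely data mismatch")
```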