Summary of Designing Machine Learning Systems written by Chip Huyen.
Chapter 9. Continual Learning and Test in Production
- continual learning
- to adapt our models to data distribution shifts
- infrastructural problem
- test in production
- the model is retrained to adapt to the changing environment, so evaluating it on a stationary test set isn’t enough
- we also need to test it in production
- monitoring & test in production
- monitoring: passively keeping track of the outputs
- test in production: proactively choosing which model produces outputs so we can evaluate it
- Goal: to understand a model’s performance and figure out when to update it
- Goal of continual learning
- to safely and efficiently automate the update
Continual Learning
- A common misunderstanding of “continual learning”
- that a model updates itself with every incoming sample in production
- problems with per-sample updates
- catastrophic forgetting
- the tendency of a neural network to completely and abruptly forget previously learned information upon learning new information
- makes training more expensive
- In practice, companies update their models in micro-batches (e.g., every 512 or 1,024 samples)
- updated model shouldn’t be deployed until it’s been evaluated
- existing model → champion model
- replica model → challenger model
- Common reasons given for not needing to update models frequently
- don’t have enough traffic
- models don’t decay that fast
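Returning to the champion/challenger evaluation above, here is a minimal sketch of that promotion check. It assumes models expose a `predict` method and that a recent labeled evaluation set is available; `evaluate`, `maybe_promote`, and `min_gain` are illustrative names, not from the book.

```python
def evaluate(model, X_eval, y_eval):
    """Hypothetical metric: fraction of correct predictions (higher is better)."""
    preds = model.predict(X_eval)
    return sum(p == y for p, y in zip(preds, y_eval)) / len(y_eval)

def maybe_promote(champion, challenger, X_eval, y_eval, min_gain=0.0):
    """Return the model that should keep serving traffic."""
    if evaluate(challenger, X_eval, y_eval) >= evaluate(champion, X_eval, y_eval) + min_gain:
        return challenger  # the challenger becomes the new champion
    return champion        # otherwise keep serving the existing model
```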
1. Stateless Retraining Versus Stateful Training
- continual learning isn’t about retraining frequency → it’s about the manner in which the model is retrained
1. Stateless retraining
the model is trained from scratch each time
for example, with data arriving in successive time windows 1, 2, 3, each new version (model v1, model v2, model v3) is trained from scratch
requires a lot more data per update
2. Stateful training
the model continues training on new data
fine-tuning or incremental learning
for example, model v1 is fine-tuned on the data from window 2 to become model v1.1, then on window 3 to become model v1.2
allows you to update the model with much less data
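A rough sketch of both retraining modes, using scikit-learn’s `SGDClassifier` as a stand-in for any model that supports incremental updates (the function names are made up for illustration; the book doesn’t prescribe a specific library):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def stateless_retrain(data_windows):
    """Stateless retraining: fit a fresh model from scratch on all accumulated data."""
    X = np.vstack([X for X, _ in data_windows])
    y = np.concatenate([y for _, y in data_windows])
    model = SGDClassifier()
    model.fit(X, y)
    return model  # model vN, trained on everything seen so far

def stateful_update(model, new_X, new_y, classes):
    """Stateful training: continue training the existing model on the new data only."""
    model.partial_fit(new_X, new_y, classes=classes)
    return model  # model v1.x, fine-tuned on the latest window
```

The same contrast holds for deep learning models: stateless retraining reinitializes the weights, while stateful training loads the latest checkpoint and continues training from it.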
3. Types of model updates
- Model iteration
- A new feature is added to an existing model architecture, or the model architecture itself is changed
- stateful training for model iteration is still largely a research problem (e.g., knowledge transfer, model surgery)
- Data iteration
- model architecture and features remain the same
- but the model is refreshed with new data
- stateful training mostly applies here
2. Why Continual Learning
- Continual Learning
- setting up infrastructure so that you can update your model and deploy these changes as fast as you want
- Use cases
- to combat data distribution shifts, especially when the shifts happen suddenly
- to adapt to rare events
- to overcome the continuous cold start problem
- arises when your model has to make predictions for a new user without any historical data
3. Continual Learning Challenges
1. Fresh data access challenge
2. Evaluation challenge
- The biggest challenge of continual learning
- making sure that each update is good enough to be deployed
- The risk of catastrophic failures amplifies with continual learning
- The more frequently you update your models → the more opportunities there are for updates to fail
- frequent updates also make your models more susceptible to coordinated manipulation and adversarial attacks
- evaluation pipeline
- evaluation takes time → can be another bottleneck for model update frequency
3. Algorithm challenge
4. Four Stages of continual learning
- Stage 1: Manual, stateless retraining
- Stage 2: Automated retraining
- Stage 3: Automated, stateful retraining
- Stage 4: Continual learning
5. How Often to Update Your Model
1. Value of data freshness
- Q) How often should you update a model? → depends on how much the model’s performance would improve with fresher data
- To figure out the gain
train your model on data from different time windows in the past and evaluate it on current data to see how the performance changes
for example, with data from time windows 1–4: train model A on window 1, model B on window 2, model C on window 3, and evaluate all of them on window 4 (the test data)
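A sketch of that experiment, with the training and scoring functions passed in by the caller (their names and signatures are assumptions for illustration):

```python
def value_of_data_freshness(past_windows, test_window, train_model, score):
    """past_windows: list of (X, y) ordered oldest -> newest; test_window: (X, y).
    Trains one model per past window and evaluates them all on the same test window."""
    X_test, y_test = test_window
    results = []
    for i, (X_train, y_train) in enumerate(past_windows):
        model = train_model(X_train, y_train)              # fit from scratch on that window
        results.append((i, score(model, X_test, y_test)))  # same test set for every model
    return results  # how performance decays as the training window gets staler
```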
2. Model iteration versus data iteration
- choose model iteration when
- iterating on data doesn’t give you much performance gain
- → spend your resources on finding a better model
- choose data iteration when
- finding a better model architecture would require 100X the compute for training and give a 1% performance gain
- whereas data iteration requires 1X the compute for the same 1% performance gain
Test in Production
- To sufficiently evaluate models
- use a mixture of offline evaluation and online evaluation
- Offline evaluation
- Good old static test split to evaluate models
- → not sufficient to evaluate a new model
- → backtest
- backtest
- method of testing a predictive model on data from a specific period of time in the past (e.g., the most recent labeled data)
- still not quite sufficient → also need to test in production
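A sketch of a backtest along those lines, assuming time-indexed pandas data and caller-supplied train/score functions; the cutoff and horizon values in the usage comment are arbitrary examples:

```python
import pandas as pd

def backtest(df, cutoff, horizon, train_fn, score_fn):
    """Train on all data before `cutoff`, evaluate on the `horizon` right after it.
    Assumes `df` has a DatetimeIndex; train_fn/score_fn are supplied by the caller."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df.index < cutoff]
    test = df[(df.index >= cutoff) & (df.index < cutoff + pd.Timedelta(horizon))]
    model = train_fn(train)
    return score_fn(model, test)

# e.g., backtest(df, cutoff="2024-06-01", horizon="7D", train_fn=..., score_fn=...)
```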
1. Shadow Deployment
- the safest way to deploy a model
- steps
- Deploy the candidate model in parallel with the existing model
- For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user
- Log the predictions from the new model for analysis purposes
- Replace the existing model with the new model when the new model’s predictions are satisfactory
- But expensive: doubles the inference cost, since every request is predicted twice
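A minimal sketch of the request path under shadow deployment; the `request` object, the models’ `predict` interface, and the logger name are assumptions for illustration:

```python
import logging

logger = logging.getLogger("shadow_deployment")

def handle_request(request, champion, challenger):
    """Both models score the request; only the champion's prediction is served."""
    served = champion.predict(request.features)
    try:
        shadow = challenger.predict(request.features)
        # Logged for later offline comparison against the champion (and labels, if any).
        logger.info("request=%s champion=%s challenger=%s", request.id, served, shadow)
    except Exception:
        logger.exception("challenger failed; the user is unaffected")
    return served  # the user only ever sees the existing model's prediction
```

In practice the challenger call is often made asynchronously so the shadow model cannot add latency to the user-facing path.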
2. A/B Testing
- a way to compare two variants of an object
- testing responses to these two variants → determining which of the two variants is more effective
- steps
- Deploy the candidate model alongside the existing model
- A percentage of traffic is routed to the new model; the rest is routed to the existing model
- Monitor and analyze the predictions and user feedback, if any, from both models to determine whether the difference in the two models’ performance is statistically significant
- A/B testing requires
- a truly randomized experiment: traffic routed to each model must be assigned at random
- a sufficient number of samples to gain enough confidence about the outcome
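A sketch of the routing and analysis steps. The book doesn’t prescribe a specific statistical test; a two-proportion z-test on a binary metric such as click-through is used here as one common choice:

```python
import random
from math import sqrt
from scipy.stats import norm

def route(model_a, model_b, traffic_to_b=0.1):
    """Randomized assignment: a fixed fraction of requests goes to the new model."""
    return model_b if random.random() < traffic_to_b else model_a

def ab_significance(clicks_a, n_a, clicks_b, n_b):
    """Two-sided two-proportion z-test on click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))  # small p-value -> the difference is statistically significant
```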
3. Canary Release
- technique to reduce the risk of introducing a new software version in production
- by slowly rolling out the change to a small subset of users
- before rolling it out to everybody
- steps
- Deploy the candidate model alongside the existing model.
- candidate model is called canary
- A portion of the traffic is routed to the candidate model
- If its performance is satisfactory, increase the traffic to the candidate model; if not, abort the canary and route all traffic back to the existing model
- Stop when either the canary serves all the traffic or the canary is aborted
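A sketch of the canary ramp-up loop; `set_traffic_split` and `canary_metrics_ok` are hypothetical hooks into your serving platform and monitoring, passed in by the caller:

```python
import time

def canary_rollout(set_traffic_split, canary_metrics_ok,
                   steps=(0.01, 0.05, 0.25, 0.5, 1.0), soak_seconds=600):
    """Gradually shift traffic to the canary; abort if its live metrics degrade."""
    for fraction in steps:
        set_traffic_split(canary_fraction=fraction)
        time.sleep(soak_seconds)                    # let the canary serve real traffic for a while
        if not canary_metrics_ok():
            set_traffic_split(canary_fraction=0.0)  # abort: all traffic back to the existing model
            return "aborted"
    return "promoted"                               # the canary now serves all the traffic
```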
4. Interleaving Experiments
- Reliably identifies the best algorithms with a considerably smaller sample size compared to traditional A/B testing
- A/B testing: core metrics (e.g., click-through rate) are compared
- Interleaving: models are compared by measuring user preference for the results they contribute
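The chapter doesn’t spell out the interleaving algorithm itself; below is a sketch of one standard variant, team-draft interleaving, where the two rankers’ results are merged into a single list, each shown item is credited to the ranker that contributed it, and clicks on credited items provide the preference signal:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two ranked lists; `credit` records which ranker contributed each shown item."""
    interleaved, credit = [], {}

    def pick(ranking, team):
        for item in ranking:
            if item not in credit:      # take the team's highest-ranked unshown item
                credit[item] = team
                interleaved.append(item)
                return True
        return False

    while len(interleaved) < k:
        teams = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(teams)           # a random team picks first each round
        progressed = False
        for team, ranking in teams:
            if len(interleaved) < k:
                progressed = pick(ranking, team) or progressed
        if not progressed:              # both rankings exhausted
            break
    return interleaved, credit          # clicks credited to A vs. B measure user preference
```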
5. Bandits
- A/B testing
- randomly route traffic
- stateless
- Bandits
- use each model’s current performance to determine how to route traffic
- stateful
- a lot more data-efficient than A/B testing
- require less data
- reduce opportunity cost as they route traffic to the better model more quickly
- e.g., A/B testing needs ~630,000 samples to reach 95% confidence, whereas a bandit can determine the better model with ~12,000 samples
- a lot more difficult to implement
- bandits require computing and keeping track of each model’s payoff
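A sketch of one simple bandit policy, epsilon-greedy, for routing requests between models; the payoff is assumed to be binary user feedback such as a click. (More sample-efficient policies such as Thompson sampling exist; epsilon-greedy is shown only because it is short.)

```python
import random

class EpsilonGreedyRouter:
    """Stateful routing: traffic drifts toward whichever model earns the better payoff."""

    def __init__(self, n_models, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_models     # times each model has been served
        self.rewards = [0.0] * n_models  # accumulated payoff (e.g., clicks) per model

    def choose(self):
        """Pick which model serves the next request."""
        if random.random() < self.epsilon:                    # explore occasionally
            return random.randrange(len(self.counts))
        means = [r / c if c else float("inf") for r, c in zip(self.rewards, self.counts)]
        return max(range(len(means)), key=means.__getitem__)  # otherwise exploit the best so far

    def record(self, model_idx, reward):
        """Update the tracked payoff once the user feedback (e.g., click=1) arrives."""
        self.counts[model_idx] += 1
        self.rewards[model_idx] += reward
```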