Summary of Designing Machine Learning Systems written by Chip Huyen.
Chapter 7. Model Deployment and Prediction Service
- ML App Logic
- data engineering → feature engineering → model → metrics
- Deploy & Inference
- Deploy: a loose term that generally means making your model run and be accessible
- Inference: the process of generating predictions
- To be deployed:
- model will have to leave the development environment
- model can be deployed to
- a staging environment for testing
- a production environment to be used by end users
1. Machine Learning Deployment Myths
1. Myth 1: You Only Deploy One or Two ML Models at a Time
2. Myth 2: If We Don’t Do Anything, Model Performance Remains the Same
- “software rot” or “bit rot”
- a software program degrades over time even if nothing has been changed.
- ML Systems suffer from data distribution shifts
3. Myth 3: You Won’t Need to Update Your Models as Much
- “How often SHOULD I update my models?” → “How often CAN I update my models?”
- Model’s performance decays over time → want to update model as fast as possible
4. Myth 4: Most ML Engineers Don’t Need to Worry About Scale
- scale
- eg) a system that serves hundreds of queries per second or millions of users a month
2. Batch Prediction versus Online Prediction
- types of predictions
- Batch prediction, which uses only batch features
- Online prediction that uses only batch features (eg. precomputed embeddings)
- Online prediction (streaming prediction) that uses both batch features and streaming features
1. Online Prediction
- when predictions are generated and returned as soon as requests for these predictions arrive
- on-demand prediction, synchronous prediction
2. Batch Prediction
- when predictions are generated periodically or whenever triggered.
- predictions are stored somewhere, such as in memory or in SQL tables → retrieved as needed (see the sketch below)
- asynchronous prediction
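A minimal sketch, not from the book, contrasting the two serving modes; the model function, user IDs, and in-memory store are hypothetical stand-ins for a real model server and a database or cache:

```python
# Minimal sketch: online (synchronous) vs. batch (asynchronous) prediction.
# The model function and in-memory store are hypothetical stand-ins.

def model_predict(features):
    # placeholder for a real model's inference call
    return 0.1 * features["num_views"] + 0.5 * features["avg_rating"]

# --- Online prediction: compute when the request arrives ---
def handle_request_online(features):
    return model_predict(features)

# --- Batch prediction: precompute periodically, store, then look up ---
prediction_store = {}  # stand-in for an SQL table or in-memory cache

def run_batch_job(all_user_features):
    # runs on a schedule (e.g., nightly), not per request
    for user_id, features in all_user_features.items():
        prediction_store[user_id] = model_predict(features)

def handle_request_batch(user_id):
    # request time is just a lookup; None for users the batch job didn't cover
    return prediction_store.get(user_id)
```

The lookup path is why batch prediction has low request-time latency, and also why it has the drawbacks noted below: stored predictions can be stale, and they only exist for inputs the batch job knew about.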
3. From Batch Prediction to Online Prediction
1. Online Prediction
- easy to start
- problem with online prediction:
- model might take too long to generate predictions
- to solve…
- compute predictions in advance → store in a database → fetch them when requests arrive
- → called batch prediction
2. Batch Prediction
- predictions are precomputed → a trick to reduce inference latency
- good when you need to generate a lot of predictions and don’t need the results immediately
- problems with batch prediction:
- less responsive to users’ changing preferences
- need to know in advance which requests to generate predictions for
3. Online prediction becomes default
- As hardware becomes more powerful → Online prediction becomes default
- To overcome the latency challenge of online prediction:
- A (near) real-time pipeline that can work with incoming data:
- extract streaming features → input them into a model → return predictions in near real time
- A model that can generate predictions at a speed acceptable to its end users
4. Unifying Batch Pipeline and Streaming Pipeline
- using sliding-window features (features computed over a recent window of events)
- In training this feature is computed in batch
- Whereas during inference this feature is computed in a streaming pipeline
- stream processors such as Apache Flink help unify the batch and streaming pipelines (see the sketch below)
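The book points to stream processors such as Apache Flink for this; the sketch below is not Flink, just a minimal Python illustration (window size and event values are assumptions) of why the same sliding-window feature needs one shared definition across the batch (training) and streaming (inference) paths:

```python
from collections import deque

WINDOW = 5  # hypothetical sliding window: the last 5 events

# --- Batch path (training): feature computed over historical data ---
def sliding_mean_batch(values):
    """Mean of the WINDOW values preceding each position (0.0 if none)."""
    feats = []
    for i in range(len(values)):
        window = values[max(0, i - WINDOW):i]
        feats.append(sum(window) / len(window) if window else 0.0)
    return feats

# --- Streaming path (inference): same feature, updated per incoming event ---
class SlidingMeanStream:
    def __init__(self):
        self.window = deque(maxlen=WINDOW)

    def current_feature(self):
        # value the model sees for the next prediction
        return sum(self.window) / len(self.window) if self.window else 0.0

    def observe(self, value):
        self.window.append(value)

# If the two paths drift apart (different window length, different handling of
# empty windows), training and serving features diverge -- the skew that a
# unified pipeline is meant to prevent.
```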
5. Model Compression
- If a deployed model takes too long to generate predictions, there are three main approaches:
- make it do inference faster
- → inference optimization
- make the model smaller
- → model compression
- originally, model compression was developed to make models fit on edge devices
- make the hardware it’s deployed on run faster
- model compression
- low-rank factorization
- knowledge distillation
- pruning
- quantization
1. Low-Rank Factorization
- key-idea
- replace high-dimensional tensors with low-dimensional tensors
- compact convolutional filters
- replace over-parameterized (having too many parameters) convolutional filters with compact blocks
- reduces the number of parameters and increases speed
- eg) 3x3 conv → 1x1 conv (see the low-rank sketch below)
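A minimal sketch of the low-rank idea on a single weight matrix, using truncated SVD in NumPy; the layer size and rank are illustrative, not from the book:

```python
import numpy as np

# Hypothetical dense layer weight: 512 x 512 = 262,144 parameters
W = np.random.randn(512, 512)

# Truncated SVD: keep only the top `rank` singular directions
rank = 32
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]   # 512 x 32
B = Vt[:rank, :]             # 32 x 512

params_before = W.size            # 262,144
params_after = A.size + B.size    # 32,768 (8x fewer parameters)

# At inference, x @ W is replaced by two small matmuls: (x @ A) @ B.
# How well this approximates the original depends on how much of W's
# energy sits in its top singular values.
x = np.random.randn(1, 512)
y_low_rank = (x @ A) @ B
```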
2. Knowledge Distillation
- a smaller model (student) is trained to mimic a larger model or an ensemble of models (teacher); see the loss sketch below
- can work regardless of the architectural differences between teacher and student
- disadvantages
- highly dependent on the availability of a teacher network
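A minimal sketch of the classic soft-target distillation loss (softened teacher outputs combined with the usual hard-label loss) in PyTorch; the temperature, mixing weight, and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Student mimics the teacher's softened distribution + hard-label CE."""
    # Soft targets from the teacher (no gradients flow into the teacher)
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy on the ground-truth labels
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Hypothetical usage: batch of 8 examples, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Note that the teacher only supplies targets; this is why the method depends on a teacher network being available in the first place.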
3. Pruning
- in a neural network, pruning can mean one of two things:
- removing entire nodes of the neural network
- changes its architecture and reduces its number of parameters
- finding the parameters least useful to predictions and setting them to zero (0)
- does not change the architecture, only the number of nonzero parameters (see the sketch below)
- sparse architecture
- pruning makes a neural network more sparse
- sparse models require less storage than dense structures
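A minimal sketch of the second kind of pruning (zeroing out the lowest-magnitude weights) using PyTorch's pruning utilities; the layer size and sparsity level are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical layer to prune
layer = nn.Linear(256, 256)

# Magnitude pruning: zero out the 50% of weights with the smallest |value|.
# The architecture is unchanged; only the number of nonzero parameters drops.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")  # ~0.50

# Make the pruning permanent (removes the mask, keeps the zeros)
prune.remove(layer, "weight")
```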
4. Quantization
- most general and commonly used model compression method
- reduce model size by using fewer bits to represent its parameters
- advantage
- reduce memory size
- improves the computational speed
- allows larger batch sizes
- less precision speeds up computation
- disadvantage
- rounding numbers → rounding errors
- small rounding errors → large performance change
- lower-precision training increasingly popular
- fixed-point inference for edge devices (see the quantization sketch below)
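A minimal sketch of post-training dynamic quantization in PyTorch, which stores linear-layer weights as 8-bit integers; the model here is a hypothetical stand-in:

```python
import torch
import torch.nn as nn

# Hypothetical float32 model
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Dynamic quantization: Linear weights stored as int8 (1/4 the size of float32),
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    y_fp32 = model(x)
    y_int8 = quantized(x)

# The outputs differ slightly -- the rounding error the notes above warn about.
print((y_fp32 - y_int8).abs().max())
```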
6. ML on the Cloud and on the Edge
- where will your model’s computation happen?
- ⇒ due to the cost of the cloud, the trend is moving toward the edge