Summary of Designing Machine Learning Systems written by Chip Huyen.
Chapter 3. Data Engineering Fundamentals
To retrieve stored data
→ important to know
- not only how it’s formatted
- but also how it’s structured
Database: data storage engine
- Transactional
- Analytical
Data in Production
- Pipeline
- Data → multiple processes and service
- Raw data → feature engineering service → predictions
- Types
- Historical data in data storage engines
- Streaming data in real-time transports → requires different processing paradigms
1. Data Sources
1. User input data
data explicitly input by users
- Text, image, videos, upload files, etc…
Users may input wrong data → easily malformatted
- Eg) text is too long or too short
Users have little patience
→ expect to get results back immediately
→ user input data tends to require fast processing
2. System-generated data
- data generated by different components of systems
- Types
- Various types of logs
- Eg) memory usage, …
- system outputs
- Eg) model predictions, …
- Property
- less likely to be malformatted
- volume of logs can grow very quickly → causes two problems
- Hard to know where to look because signals are lost in noise
- How to store? → discard old data and keep only recent data
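The "keep only recent data" idea can be sketched with Python's stdlib `RotatingFileHandler` (the file name, size cap, and backup count here are arbitrary assumptions for illustration):

```python
import logging
import logging.handlers
import os
import tempfile

# Cap log growth: rotate at a small size and keep only a few backup files,
# so the oldest log data is discarded automatically.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1024, backupCount=3)  # keep at most 3 old files
logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

for i in range(200):
    logger.info("memory usage sample %d", i)

# Only app.log plus up to 3 rotated files survive, however much was logged.
print(sorted(os.listdir(log_dir)))
```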
3. Internal databases
generated by various services and enterprise applications in a company
Manage inventory, customer relationships, users, etc.
→ can be used by ML Models
4. Third-party data
- first-party data
- Company already collects about users and customers
- second-party data
- collected by another company on their own customers that they make available to you
- Probably have to pay for it
- Third-party data
- companies collect data on the public who aren’t direct customers
- Eg) unique advertiser ID
2. Data Formats
‘How to store data?’ Checklist
- how do I store multimodal data?
- how do I store my data so that it’s cheap and still fast to access?
- how do I store complex models so that they can be loaded and run correctly on different hardware?
Data serialization
- Converting a
- data structure
- Object state
- Into a format that can be
- Stored
- Transmitted
- And reconstructed later
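A minimal round-trip sketch of serialization using Python's stdlib `pickle` (the dict contents are made up for illustration):

```python
import pickle

# Serialize: convert an in-memory data structure into bytes that can be
# stored or transmitted, then reconstruct it later.
state = {"model": "demo", "weights": [0.1, 0.2, 0.3]}

blob = pickle.dumps(state)     # data structure -> bytes
restored = pickle.loads(blob)  # bytes -> data structure

print(restored == state)  # True
```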
| Format | Binary / Text | Human-Readable | Use case |
| --- | --- | --- | --- |
| JSON | Text | O | Everywhere |
| CSV | Text | O | Everywhere |
| Parquet | Binary | X | Hadoop |
| Avro | Binary primary | X | Hadoop |
| Protobuf | Binary primary | X | Google, TFRecord |
| Pickle | Binary | X | Python, Python serialization |
1. JSON (JavaScript Object Notation)
- Human-readable
- Key-value
- Simple
- Capable of handling data of different levels of structuredness
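One JSON document can mix those levels of structuredness, using Python's stdlib `json` (field names here are made up):

```python
import json

# JSON handles different levels of structuredness in one document:
record = {
    "user": "lisa",                       # simple key-value
    "tags": ["ml", "data"],               # array
    "profile": {"age": 43, "pro": True},  # nested object
}

text = json.dumps(record)   # serialize to human-readable text
parsed = json.loads(text)   # reconstruct
print(parsed["profile"]["age"])  # 43
```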
2. Row-Major versus Column-Major
- Row-Major
- CSV
- Use case) Accessing all the examples collected today
- Data is stored and retrieved row by row
- Good for accessing samples
- Better when you have to do a lot of writes
- Column-Major
- Parquet
- Use case) accessing the timestamp of all examples
- Data is stored and retrieved column by column
- Good for accessing features
- Better when you have to do a lot of column-based reads
- NumPy (row-major by default) vs pandas DataFrame (column-major)
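A small sketch of the two layouts, assuming `numpy` and `pandas` are installed (array shapes and column names are arbitrary):

```python
import numpy as np
import pandas as pd

# Row-major (NumPy default, order="C"): elements of the same row are
# contiguous in memory -> fast to access whole samples.
rows = np.zeros((1000, 10), order="C")
sample = rows[0]  # one example: contiguous memory

# Column-major (order="F", the layout pandas DataFrames favor): elements
# of the same column are contiguous -> fast to access whole features.
cols = np.zeros((1000, 10), order="F")

df = pd.DataFrame(rows, columns=[f"f{i}" for i in range(10)])
timestamps = df["f0"]  # one feature across all examples

print(rows.flags["C_CONTIGUOUS"], cols.flags["F_CONTIGUOUS"])  # True True
```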
3. Text versus binary format
- human-readable vs compact
- Eg) “1000000”
- Text: 7 characters → 7 bytes
- Binary: int32 → 32bits or 4bytes
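The same number measured both ways, using Python's stdlib `struct` to pack an int32:

```python
import struct

# "1000000" as text: one byte per character.
text = "1000000"
print(len(text.encode("ascii")))  # 7 bytes

# 1000000 as a binary int32: fixed-size 4 bytes.
binary = struct.pack("<i", 1000000)
print(len(binary))  # 4 bytes
```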
3. Data Models
How data is represented.
1. Relational Model
- data is organized into relations : each relation is a set of tuples
- columns are unordered; tuples (rows) are unordered
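A tiny relational sketch using Python's stdlib `sqlite3` (the table and values are made up; queries name columns explicitly, so row/column order carries no meaning):

```python
import sqlite3

# A relation is a set of tuples; queries select by column name and
# predicate, never by physical position.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("lisa", 43), ("hack", 23)])

rows = conn.execute("SELECT name FROM users WHERE age > 30").fetchall()
print(rows)  # [('lisa',)]
```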
2. NoSQL (Not Only SQL)
- document model
- graph model
3. Structured versus Unstructured data
- structured data
- follows a predefined data model → data schema
- makes it easier to analyze
- data warehouse
- to store data that has been processed into formats ready to be used
- unstructured data
- doesn’t adhere to a schema
- data lake
- to store raw data, often before processing
- eg) rows need not follow a fixed schema:
- lisa,43
- hack,23
- xxx,xxx,xxx ← possible
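The difference can be sketched with a check against a predefined schema (the `follows_schema` helper and the two-field schema are made up for illustration):

```python
import csv
import io

# Structured data follows a predefined schema; unstructured data may not.
schema = ("name", "age")  # hypothetical schema: every row has 2 fields

structured = "lisa,43\nhack,23\n"
unstructured = "lisa,43\nxxx,xxx,xxx\nfree text here\n"

def follows_schema(text):
    # every parsed row must have exactly as many fields as the schema
    return all(len(row) == len(schema)
               for row in csv.reader(io.StringIO(text)))

print(follows_schema(structured))    # True
print(follows_schema(unstructured))  # False
```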
4. Data Storage Engine and Processing
two types of workloads that databases are optimized for
- transactional processing
- analytical processing
1. Transactional and Analytical Processing
Online Transaction Processing (OLTP)
- inserted, updated, deleted
- often involves users, so it has two requirements:
- low latency: processed fast
- high availability: processing system needs to be available any time
Transactional Database
- designed: to process online transactions
- requirements: low latency and high availability
ACID
- Atomicity
- To guarantee that all steps in a transaction are completed successfully as a group.
- eg) User payment fails → do not assign driver
- Consistency
- To guarantee that all transactions coming through must follow predefined rules
- eg) a transaction must be made by a valid user
- Isolation
- To guarantee that two transactions happening at the same time behave as if they were isolated.
- eg) do not book the same driver for two users
- Durability
- To guarantee that once a transaction is committed, it will remain committed even in the case of a system failure.
- eg) if a user ordered a ride and their phone dies, they still want the ride to come
BASE (Basically Available, Soft state, and Eventual consistency)
- don’t necessarily need to be ACID
Transactional databases are often row-major
- not efficient for questions like “the average of something over one month” → Online Analytical Processing (OLAP)
Both terms OLTP & OLAP → becoming outdated
- Separation of OLTP & OLAP was due to limitations of technology
- Storage and processing were tightly coupled
- → store the same data in different databases
- Now: decouple storage and processing (BigQuery, Snowflake, IBM, Teradata)
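The “average over one month” question is an OLAP-style aggregate; a sketch with stdlib `sqlite3` (the `rides` table, its columns, and the values are made up for illustration):

```python
import sqlite3

# OLAP-style question: average ride price in September, computed by
# scanning one column across many rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (month TEXT, price REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [("2023-09", 10.0), ("2023-09", 14.0), ("2023-10", 8.0)])

avg = conn.execute(
    "SELECT AVG(price) FROM rides WHERE month = '2023-09'").fetchone()[0]
print(avg)  # 12.0
```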
2. ETL: Extract, Transform, and Load
- Flow
- Different sources ⇒ Extract → transformed → loaded ⇒ target destination
- Difficult to keep data structured ⇒ ELT (load raw data first, transform later)
- ELT → inefficient to search through data as it grows
- Hybrid solution (Databricks, Snowflake)
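The Extract → Transform → Load flow as a minimal sketch (the records, cleaning rules, and `warehouse` list are made up stand-ins for real sources and destinations):

```python
# Extract: pull raw records from a source (hard-coded here for illustration).
raw = [{"name": "Lisa ", "age": "43"}, {"name": "hack", "age": "23"}]

def transform(record):
    # Transform: clean and convert into the format the destination expects.
    return {"name": record["name"].strip().lower(), "age": int(record["age"])}

# Load: write the transformed records into the target destination.
warehouse = [transform(r) for r in raw]
print(warehouse)  # [{'name': 'lisa', 'age': 43}, {'name': 'hack', 'age': 23}]
```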
5. Models of Dataflow
- dataflow
- data is passed from one process to another
- Three main models of dataflow
- Data passing through databases
- Data passing through services using requests, such as REST and RPC APIs
- Data passing through a real-time transport like Apache Kafka and Amazon Kinesis
1. Data Passing Through Databases
- The easiest way to pass data between two processes
- To pass data from process A to process B
- eg) process A writes data into a database and process B just reads it
- Doesn’t work when
- the two processes run in different companies
- Requires both processes to be able to access the same database
- read/write from a database can be slow
- makes it unsuitable for low-latency applications
2. Data Passing Through Services
- send data directly through a network
- Request-driven
- processes communicate through requests
- → microservice architecture
- styles of request
- REST (Representational State Transfer)
- designed for requests over network
- predominant style for public APIs
- RPC (Remote Procedure Call)
- tries to make a request to a remote network service look the same as calling a function or method in a programming language
- requests between services owned by the same organization
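The RPC idea, that a remote call looks like a local function call, can be sketched with a proxy object (the `RpcProxy` class and `fake_transport` are made up; a real system would use a framework such as gRPC):

```python
# Hedged sketch: RPC makes a remote call look like a local function call.
class RpcProxy:
    def __init__(self, transport):
        self.transport = transport

    def __getattr__(self, method):
        def call(*args):
            # A real RPC framework would serialize the call, send it over
            # the network, and deserialize the response here.
            return self.transport(method, args)
        return call

def fake_transport(method, args):
    # Stand-in for the remote service on the other side of the network.
    handlers = {"add": lambda a, b: a + b}
    return handlers[method](*args)

service = RpcProxy(fake_transport)
print(service.add(2, 3))  # reads like a local call: 5
```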
3. Data Passing Through Real-Time Transport
- request-driven : synchronous
- the target service has to be listening for the request to go through
- Broker
- coordinates data passing among services
- each service only has to communicate with the broker
- use in-memory storage
- Real-time transport: in-memory storage for data passing among services
- Event
- A piece of data broadcast to a real-time transport (a.k.a event-bus)
- Architecture
- Request-driven architecture
- rely more on logic than on data
- Event-driven architecture
- works better for systems that are data-heavy
- Common types of real-time transports
- pubsub (publish-subscribe)
- overview
- any service can publish to different topics in a real-time transport
- any service that subscribes to a topic can read all events in that topic
- retention policy
- data is retained in the real-time transport for a certain period of time before it is
- deleted, or
- moved to permanent storage (e.g., Amazon S3)
- services that produce data don’t care about which services consume it
- Apache Kafka, Amazon Kinesis
- Message Queue
- responsible for getting the message to the right consumer
- Apache RocketMQ, RabbitMQ
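The pubsub ideas above, topics, retention, and producers not knowing consumers, can be sketched with a toy in-memory broker (the `Broker` class is made up for illustration; real systems use Apache Kafka, Amazon Kinesis, etc.):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory pub/sub broker: topics retain events for replay."""

    def __init__(self):
        self.topics = defaultdict(list)       # topic -> retained events
        self.subscribers = defaultdict(list)  # topic -> callbacks

    def publish(self, topic, event):
        # Producers only talk to the broker; they don't know the consumers.
        self.topics[topic].append(event)
        for callback in self.subscribers[topic]:
            callback(event)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)
        # A new subscriber can read all events retained in the topic.
        for event in self.topics[topic]:
            callback(event)

broker = Broker()
broker.publish("predictions", {"user": 1, "score": 0.9})

seen = []
broker.subscribe("predictions", seen.append)        # replays retained event
broker.publish("predictions", {"user": 2, "score": 0.4})
print(len(seen))  # 2
```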