Chapter 3. Data Engineering Fundamentals

Summary of Designing Machine Learning Systems written by Chip Huyen.


To retrieve stored data

→ important to know

  • not only how it’s formatted
  • but also how it’s structured

Database: data storage engine

  • Transactional
  • Analytical

Data in Production

  • Pipeline
    • Data → multiple processes and services
    • Raw data → feature engineering service → predictions
  • Types
    • Historical data in data storage engines
    • Streaming data in real-time transports → requires different processing paradigms

1. Data Sources

1. User input data

  • data explicitly input by users

    • Text, image, videos, upload files, etc…
  • Users can easily input wrong or malformatted data

    • Eg) text is too long or too short
  • Users have little patience

    → expect to get results back immediately

    → user input data tends to require fast processing

2. System-generated data

  • data generated by different components of systems
  • Types
    • Various types of logs
      • Eg) memory usage, …
    • system outputs
      • Eg) model predictions, …
  • Property
    • less likely to be malformatted
    • volume of logs can grow very quickly → causes two problems
      1. Hard to know where to look because signals are lost in the noise
      2. How to store? → discard old data and keep only recent data (see the rotation sketch below)
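
One common answer to problem 2 is log rotation: keep only the most recent log data and discard (or archive) the rest. A minimal sketch with Python's standard logging module; the file name and size limits are arbitrary choices, not from the book.

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("serving")
logger.setLevel(logging.INFO)

# Keep at most 3 files of ~10 MB each; older log data is discarded automatically.
handler = RotatingFileHandler("service.log", maxBytes=10_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("prediction served in 42 ms")   # example system-generated log line
```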

3. Internal databases

  • generated by various services and enterprise applications in a company

  • Manage inventory, customer relationships, users

    → can be used by ML Models

4. Third-party data

  1. first-party data
    • Company already collects about users and customers
  2. second-party data
    • collected by another company on their own customers and made available to you
    • Probably have to pay for it
  3. Third-party data
    • companies collect data on the public who aren’t direct customers
      • Eg) unique advertiser ID

2. Data Formats

‘How to store data?’ checklist

  • how do I store multimodal data?
  • how do I store my data so that it’s cheap and still fast to access?
  • how do I store complex models so that they can be loaded and run correctly on different hardware?

Data serialization

  1. Converting a
    • data structure
    • Object state
  2. Into a format that can be
    1. Stored
    2. Transmitted
  3. And reconstructed later
| Format | Binary/Text | Human-readable | Example use cases |
| --- | --- | --- | --- |
| JSON | Text | Yes | Everywhere |
| CSV | Text | Yes | Everywhere |
| Parquet | Binary | No | Hadoop |
| Avro | Binary primary | No | Hadoop |
| Protobuf | Binary primary | No | Google, TensorFlow (TFRecord) |
| Pickle | Binary | No | Python, PyTorch serialization |

1. JSON (JavaScript Object Notation)

  • Human-readable
  • Key-value
    • Simple
    • Capable of handling data with different levels of structuredness
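
A small illustration of JSON's key-value structure and how one document can mix flat fields, nested objects, and arrays; the record below is made up.

```python
import json

# A hypothetical user record: flat fields, a nested object, and an array
# can all live in one JSON document (different levels of structuredness).
record = {
    "id": 12345,
    "name": "Lisa",
    "address": {"city": "Seoul", "zipcode": "04524"},
    "purchases": [{"item": "book", "price": 9.99}],
}

serialized = json.dumps(record)    # Python object -> human-readable JSON text
restored = json.loads(serialized)  # JSON text -> Python object
assert restored == record
```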

2. Row-Major versus Column-Major

  • Row-Major
    • CSV
    • Use case) Accessing all the examples collected today
    • Data is stored and retrieved row by row
    • Good for accessing samples
    • Better when you have to do a lot of writes
  • Column-Major
    • Parquet
    • Use case) accessing the timestamp of all examples
    • Data is stored and retrieved column by column
    • Good for accessing features
    • Better when you have to do a lot of column-based reads
  • NumPy (row-major by default) vs pandas (DataFrames are organized by column); see the access sketch below
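
A sketch of the two access patterns, assuming pandas and a Parquet engine (pyarrow or fastparquet) are installed; file and column names are arbitrary.

```python
import numpy as np
import pandas as pd

# Row-major (NumPy's default C order): values in the same ROW sit next to
# each other in memory, so reading whole examples/samples is cheap.
arr = np.zeros((1000, 100))              # 1,000 examples x 100 features
one_example = arr[0, :]                  # contiguous read of one row

# pandas DataFrames are organized around columns, so reading a whole
# column (one feature for all examples) is the cheap operation.
df = pd.DataFrame(arr, columns=[f"f{i}" for i in range(100)])
one_feature = df["f0"]                   # e.g. the timestamp of all examples

# CSV is a row-major text format; Parquet is a column-major binary format.
df.to_csv("examples.csv", index=False)
df.to_parquet("examples.parquet")        # needs pyarrow or fastparquet
```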

3. Text versus binary format

  • human-readable vs compact
  • Eg) “1000000”
    • Text: 7 characters → 7 bytes
    • Binary: int32 → 32 bits or 4 bytes
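
The byte counts from the example above, checked in Python (a minimal sketch):

```python
import struct

# Text: the digits of 1,000,000 stored as a UTF-8 string take 7 bytes.
as_text = "1000000".encode("utf-8")
print(len(as_text))                       # 7

# Binary: the same number packed as a little-endian int32 takes 4 bytes.
as_binary = struct.pack("<i", 1_000_000)
print(len(as_binary))                     # 4
```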

3. Data Models

How data is represented.

1. Relational Model

  • data is organized into relations; each relation is a set of tuples (rows)
  • the order of rows and the order of columns in a relation does not matter (both are unordered)
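
A minimal sketch of the relational model using Python's built-in sqlite3; the table, columns, and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A relation ("users") is a set of tuples; every tuple has one value per column.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "lisa", 43), (2, "hack", 23)],
)

# SQL is declarative: we describe WHAT we want, not HOW to retrieve it,
# and the result's row order is not guaranteed unless we ask for one.
for row in conn.execute("SELECT name, age FROM users WHERE age > 30"):
    print(row)                            # ('lisa', 43)
```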

2. NoSQL (Not Only SQL)

  • document model
  • graph model

3. Structured versus Unstructured data

  • structured data
    • follows a predefined data model → data schema
    • makes it easier to analyze
    • data warehouse
      • to store data that has been processed into formats ready to be used
  • unstructured data
    • doesn’t adhere to a schema
    • eg) no schema is enforced, so rows of any shape are accepted:

      lisa,43
      hack,23
      xxx,xxx,xxx  ← possible


4. Data Storage Engines and Processing

two types of workloads that databases are optimized for

  • transactional processing
  • analytical processing

1. Transactional and Analytical Processing

Online Transaction Processing (OLTP)

  • transactions are inserted, updated, and deleted as they are generated
  • often involve users so need 2 requirements:
    • low latency: processed fast
    • high availability: processing system needs to be available any time

Transactional Database

  • Transactional Database

    • designed: to process online transactions
    • requirements: low latency and high availability
  • ACID

    1. Atomicity (a rollback sketch follows this list)
      • To guarantee that all steps in a transaction are completed successfully as a group.
      • eg) User payment fails → do not assign driver
    2. Consistency
      • To guarantee that all transactions coming through must follow predefined rules
      • eg) a transaction must be made by a valid user
    3. Isolation
      • To guarantee that two transactions happening at the same time behave as if they were isolated from each other.
      • eg) Do not book the same driver for two users
    4. Durability
      • To guarantee that once a transaction is committed, it will remain committed even in the case of a system failure.
      • eg) you ordered a ride and your phone dies; you still want the ride to come
  • BASE (Basically Available, Soft state, and Eventual consistency)

    • transactional databases don’t necessarily need to be ACID; BASE is a looser alternative
  • often row-major

    • not efficient for analytical questions like “what is the average of a column over one month” → that is the job of Online Analytical Processing (OLAP)
  • Both terms, OLTP & OLAP, are becoming outdated

    1. The separation of transactional and analytical workloads was due to limitations of older technology; newer databases can handle both
    2. Storage and processing used to be tightly coupled
      • the same data ended up stored in multiple databases
      • the newer paradigm decouples storage and processing (BigQuery, Snowflake, IBM, Teradata)
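
A minimal atomicity sketch with Python's built-in sqlite3: either every step of the transaction is committed or none of them is. The ride/payment tables and the simulated failure are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user TEXT, amount REAL)")
conn.execute("CREATE TABLE rides (user TEXT, driver TEXT)")

def book_ride(user, driver, payment_ok):
    # `with conn` opens a transaction: sqlite3 commits it on success and
    # rolls it back if an exception escapes the block (atomicity).
    with conn:
        conn.execute("INSERT INTO payments VALUES (?, ?)", (user, 12.5))
        if not payment_ok:
            raise RuntimeError("payment failed")
        conn.execute("INSERT INTO rides VALUES (?, ?)", (user, driver))

try:
    book_ride("lisa", "driver_42", payment_ok=False)
except RuntimeError:
    pass

# Because the payment step failed, neither table has any committed rows.
print(conn.execute("SELECT COUNT(*) FROM payments").fetchone())   # (0,)
print(conn.execute("SELECT COUNT(*) FROM rides").fetchone())      # (0,)
```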

2. ETL: Extract, Transform, and Load

  • Flow
    • Different sources ⇒ extract → transform → load ⇒ target destination (a small sketch follows this list)
  • Keeping data structured up front became difficult ⇒ ELT: load raw data first, transform later
  • But ELT makes it inefficient to search through the data
  • Hybrid solutions combine both (Databricks, Snowflake)
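
A toy ETL sketch in pandas; the file name, columns, and transformation rules are invented for illustration, and sqlite3 stands in for the target warehouse.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data out of a source (here, a hypothetical CSV export).
raw = pd.read_csv("raw_users.csv")        # assumed columns: name, age

# Transform: clean and reshape the data into the structure the destination expects.
transformed = (
    raw.dropna(subset=["name", "age"])    # drop malformed rows
       .assign(
           name=lambda df: df["name"].str.lower(),
           age=lambda df: df["age"].astype(int),
       )
)

# Load: write the result into the target destination (a warehouse table).
conn = sqlite3.connect("warehouse.db")
transformed.to_sql("users", conn, if_exists="replace", index=False)
```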

5. Models of Dataflow

  • dataflow
    • data is passed from one process to another
  • Three main models of dataflow
    1. Data passing through databases
    2. Data passing through services using requests, such as those provided by REST and RPC APIs
    3. Data passing through a real-time transport like Apache Kafka and Amazon Kinesis

1. Data Passing Through Databases

  • The easiest way to pass data between two processes
  • To pass data from process A to process B
    • eg) process A write data into database and process B just read it
  • Doesn’t work when
    1. The two processes are run by different companies (no shared database access)
    2. Both processes need to read from and write to the same database
      • database reads/writes can be slow
      • making this unsuitable for applications with low-latency requirements

2. Data Passing Through Services

  • send data directly through a network
  • Request-driven
    • processes communicate through requests
    • → microservice architecture
  • styles of request
    1. REST (Representational State Transfer)
      • designed for requests made over a network
      • predominant style for public APIs
    2. RPC (Remote Procedure Call)
      • tries to make a request to a remote network service look the same as calling a function or method in your programming language
      • requests between services owned by the same organization
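
A minimal request-driven sketch using the requests library; the service URL and payload are hypothetical.

```python
import requests

# The calling service sends data directly to another service over HTTP (REST).
response = requests.post(
    "http://prediction-service.internal/v1/predict",   # hypothetical endpoint
    json={"user_id": 12345, "features": [0.1, 0.7, 0.2]},
    timeout=1.0,    # request-driven passing is synchronous, so bound the wait
)
response.raise_for_status()
prediction = response.json()              # e.g. {"score": 0.83}
```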

3. Data Passing Through Real-Time Transport

  • request-driven: synchronous
    • target service has to listen to the request for the request to go through
  • Broker
    • coordinates data passing among services
    • each service only has to communicate with the broker
    • use in-memory storage
  • Real-time transport: in-memory storage for data passing among services
  • Event
    • A piece of data broadcast to a real-time transport (a.k.a event-bus)
  • Architecture
    • Request-driven architecture
      • rely more on logic than on data
    • Event-driven architecture
      • works better for systems that are data-heavy
  • Common types of real-time transports
    1. pubsub (publish-subscribe)
      • overview
        • any service can publish to different topics in a real-time transport
        • any service that subscribes to a topic can read all events in that topic
      • retention policy
        • data will be retained in the real-time transport for a certain period of time before being
          • deleted
          • moved to permanent storage (e.g., Amazon S3)
        • services that produce data don’t care about which services consume it
      • Apache Kafka, Amazon Kinesis
    2. Message Queue
      • responsible for getting the message to the right consumer
      • Apache RocketMQ, RabbitMQ
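
A minimal pubsub sketch using the kafka-python client; the broker address, topic name, and event payload are assumptions for illustration.

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: any service can publish events to a topic on the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ride_requests", b'{"user_id": 12345, "pickup": "downtown"}')
producer.flush()

# Consumer: any service subscribed to the topic can read its events
# (as long as they are still within the topic's retention period).
consumer = KafkaConsumer(
    "ride_requests",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,     # stop iterating when no new events arrive
)
for event in consumer:
    print(event.value)
```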