Chapter 3. Data Engineering Fundamentals

Summary of Designing Machine Learning Systems written by Chip Huyen.


To retrieve stored data

→ important to know

  • not only how it’s formatted
  • but also how it’s structured

Database: data storage engine

  • Transactional
  • Analytical

Data in Production

  • Pipeline
    • Data → multiple processes and services
    • Raw data → feature engineering service → predictions
  • Types
    • Historical data in data storage engines
    • Streaming data in real-time transports → requires different processing paradigms

1. Data Sources

1. User input data

  • data explicitly input by users

    • Text, image, videos, upload files, etc…
  • Users can easily input wrong or malformatted data

    • Eg) text is too long or too short
  • Users have little patience

    → expect to get results back immediately

    → user input data tends to require fast processing

2. System-generated data

  • data generated by different components of systems
  • Types
    • Various types of logs
      • Eg) memory usage, …
    • system outputs
      • Eg) model predictions, …
  • Property
    • less likely to be malformatted
    • volume of logs can grow very quickly → causes two problems
      1. Hard to know where to look because signals are lost in the noise
      2. How to store? → discard old data and keep only recent data (see the rotation sketch below)
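
One common answer to problem 2 is log rotation: keep only the most recent log data and discard (or archive) the rest. A minimal sketch with Python's standard logging module; the file name and size limits are arbitrary choices, not from the book.

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("serving")
logger.setLevel(logging.INFO)

# Keep at most 3 files of ~10 MB each; older log data is discarded automatically.
handler = RotatingFileHandler("service.log", maxBytes=10_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("prediction served in 42 ms")   # example system-generated log line
```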

3. Internal databases

  • generated by various services and enterprise applications in a company

  • Manage inventory, customer relationships, users

    → can be used by ML Models

4. Third-party data

  1. first-party data
    • Company already collects about users and customers
  2. second-party data
    • collected by another company on their own customers and made available to you
    • Probably have to pay for it
  3. Third-party data
    • companies collect data on the public who aren’t direct customers
      • Eg) unique advertiser ID

2. Data Formats

‘How to store data?’ checklist

  • how do I store multimodal data?
  • how do I store my data so that it’s cheap and still fast to access?
  • how do I store complex models so that they can be loaded and run correctly on different hardware?

Data serialization

  1. Converting a
    • data structure
    • Object state
  2. Into a format that can be
    1. Stored
    2. Transmitted
  3. And reconstructed later
| Format | Binary/Text | Human-readable | Example use cases |
| --- | --- | --- | --- |
| JSON | Text | Yes | Everywhere |
| CSV | Text | Yes | Everywhere |
| Parquet | Binary | No | Hadoop |
| Avro | Binary primary | No | Hadoop |
| Protobuf | Binary primary | No | Google, TensorFlow (TFRecord) |
| Pickle | Binary | No | Python, PyTorch serialization |

1. JSON (JavaScript Object Notation)

  • Human-readable
  • Key-value
    • Simple
    • Capable of handling data with different levels of structuredness
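
A small illustration of JSON's key-value structure and how one document can mix flat fields, nested objects, and arrays; the record below is made up.

```python
import json

# A hypothetical user record: flat fields, a nested object, and an array
# can all live in one JSON document (different levels of structuredness).
record = {
    "id": 12345,
    "name": "Lisa",
    "address": {"city": "Seoul", "zipcode": "04524"},
    "purchases": [{"item": "book", "price": 9.99}],
}

serialized = json.dumps(record)    # Python object -> human-readable JSON text
restored = json.loads(serialized)  # JSON text -> Python object
assert restored == record
```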

2. Row-Major versus Column-Major

  • Row-Major
    • CSV
    • Use case) Accessing all the examples collected today
    • Data is stored and retrieved row by row
    • Good for accessing samples
    • Better when you have to do a lot of writes
  • Column-Major
    • Parquet
    • Use case) accessing the timestamp of all examples
    • Data is stored and retrieved column by column
    • Good for accessing features
    • Better when you have to do a lot of column-based reads
  • NumPy (row-major by default) vs pandas (DataFrames are organized by column); see the access sketch below
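
A sketch of the two access patterns, assuming pandas and a Parquet engine (pyarrow or fastparquet) are installed; file and column names are arbitrary.

```python
import numpy as np
import pandas as pd

# Row-major (NumPy's default C order): values in the same ROW sit next to
# each other in memory, so reading whole examples/samples is cheap.
arr = np.zeros((1000, 100))              # 1,000 examples x 100 features
one_example = arr[0, :]                  # contiguous read of one row

# pandas DataFrames are organized around columns, so reading a whole
# column (one feature for all examples) is the cheap operation.
df = pd.DataFrame(arr, columns=[f"f{i}" for i in range(100)])
one_feature = df["f0"]                   # e.g. the timestamp of all examples

# CSV is a row-major text format; Parquet is a column-major binary format.
df.to_csv("examples.csv", index=False)
df.to_parquet("examples.parquet")        # needs pyarrow or fastparquet
```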

3. Text versus binary format

  • human-readable vs compact
  • Eg) “1000000”
    • Text: 7 characters → 7 bytes
    • Binary: int32 → 32 bits or 4 bytes
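
The byte counts from the example above, checked in Python (a minimal sketch):

```python
import struct

# Text: the digits of 1,000,000 stored as a UTF-8 string take 7 bytes.
as_text = "1000000".encode("utf-8")
print(len(as_text))                       # 7

# Binary: the same number packed as a little-endian int32 takes 4 bytes.
as_binary = struct.pack("<i", 1_000_000)
print(len(as_binary))                     # 4
```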

3. Data Models

How data is represented.

1. Relational Model

  • data is organized into relations; each relation is a set of tuples (rows)
  • the order of rows and the order of columns in a relation does not matter (both are unordered)
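
A minimal sketch of the relational model using Python's built-in sqlite3; the table, columns, and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A relation ("users") is a set of tuples; every tuple has one value per column.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "lisa", 43), (2, "hack", 23)],
)

# SQL is declarative: we describe WHAT we want, not HOW to retrieve it,
# and the result's row order is not guaranteed unless we ask for one.
for row in conn.execute("SELECT name, age FROM users WHERE age > 30"):
    print(row)                            # ('lisa', 43)
```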

2. NoSQL (Not Only SQL)

  • document model
  • graph model

3. Structured versus Unstructured data

  • structured data
    • follows a predefined data model → data schema
    • makes it easier to analyze
    • data warehouse
      • to store data that has been processed into formats ready to be used
  • unstructured data
    • doesn’t adhere to a schema
    • eg) no schema is enforced, so rows of any shape are accepted:

      lisa,43
      hack,23
      xxx,xxx,xxx  ← possible


4. Data Storage Engines and Processing

two types of workloads that databases are optimized for

  • transactional processing
  • analytical processing

1. Transactional and Analytical Processing

Online Transaction Processing (OLTP)

  • transactions are inserted, updated, and deleted as they are generated
  • often involve users so need 2 requirements:
    • low latency: processed fast
    • high availability: processing system needs to be available any time

Transactional Database

  • Transactional Database

    • designed: to process online transactions
    • requirements: low latency and high availability
  • ACID

    1. Atomicity (a rollback sketch follows this list)
      • To guarantee that all steps in a transaction are completed successfully as a group.
      • eg) User payment fails → do not assign driver
    2. Consistency
      • To guarantee that all transactions coming through must follow predefined rules
      • eg) a transaction must be made by a valid user
    3. Isolation
      • To guarantee that two transactions happening at the same time behave as if they were isolated from each other.
      • eg) Do not book the same driver for two users
    4. Durability
      • To guarantee that once a transaction is committed, it will remain committed even in the case of a system failure.
      • eg) you ordered a ride and your phone dies; you still want the ride to come
  • BASE (Basically Available, Soft state, and Eventual consistency)

    • transactional databases don’t necessarily need to be ACID; BASE is a looser alternative
  • often row-major

    • not efficient for analytical questions like “what is the average of a column over one month” → that is the job of Online Analytical Processing (OLAP)
  • Both terms, OLTP & OLAP, are becoming outdated

    1. The separation of transactional and analytical workloads was due to limitations of older technology; newer databases can handle both
    2. Storage and processing used to be tightly coupled
      • the same data ended up stored in multiple databases
      • the newer paradigm decouples storage and processing (BigQuery, Snowflake, IBM, Teradata)
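
A minimal atomicity sketch with Python's built-in sqlite3: either every step of the transaction is committed or none of them is. The ride/payment tables and the simulated failure are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user TEXT, amount REAL)")
conn.execute("CREATE TABLE rides (user TEXT, driver TEXT)")

def book_ride(user, driver, payment_ok):
    # `with conn` opens a transaction: sqlite3 commits it on success and
    # rolls it back if an exception escapes the block (atomicity).
    with conn:
        conn.execute("INSERT INTO payments VALUES (?, ?)", (user, 12.5))
        if not payment_ok:
            raise RuntimeError("payment failed")
        conn.execute("INSERT INTO rides VALUES (?, ?)", (user, driver))

try:
    book_ride("lisa", "driver_42", payment_ok=False)
except RuntimeError:
    pass

# Because the payment step failed, neither table has any committed rows.
print(conn.execute("SELECT COUNT(*) FROM payments").fetchone())   # (0,)
print(conn.execute("SELECT COUNT(*) FROM rides").fetchone())      # (0,)
```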

2. ETL: Extract, Transform, and Load

  • Flow
    • Different sources ⇒ extract → transform → load ⇒ target destination (a small sketch follows this list)
  • Keeping data structured up front became difficult ⇒ ELT: load raw data first, transform later
  • But ELT makes it inefficient to search through the data
  • Hybrid solutions combine both (Databricks, Snowflake)
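
A toy ETL sketch in pandas; the file name, columns, and transformation rules are invented for illustration, and sqlite3 stands in for the target warehouse.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data out of a source (here, a hypothetical CSV export).
raw = pd.read_csv("raw_users.csv")        # assumed columns: name, age

# Transform: clean and reshape the data into the structure the destination expects.
transformed = (
    raw.dropna(subset=["name", "age"])    # drop malformed rows
       .assign(
           name=lambda df: df["name"].str.lower(),
           age=lambda df: df["age"].astype(int),
       )
)

# Load: write the result into the target destination (a warehouse table).
conn = sqlite3.connect("warehouse.db")
transformed.to_sql("users", conn, if_exists="replace", index=False)
```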

5. Models of Dataflow

  • dataflow
    • data is passed from one process to another
  • Three main models of dataflow
    1. Data passing through databases
    2. Data passing through services using requests, such as those provided by REST and RPC APIs
    3. Data passing through a real-time transport like Apache Kafka and Amazon Kinesis

1. Data Passing Through Databases

  • The easiest way to pass data between two processes
  • To pass data from process A to process B
    • eg) process A write data into database and process B just read it
  • Doesn’t work when
    1. The two processes are run by different companies (no shared database access)
    2. Both processes need to read from and write to the same database
      • database reads/writes can be slow
      • making this unsuitable for applications with low-latency requirements

2. Data Passing Through Services

  • send data directly through a network
  • Request-driven
    • processes communicate through requests
    • → microservice architecture
  • styles of request
    1. REST (Representational State Transfer)
      • designed for requests made over a network
      • predominant style for public APIs
    2. RPC (Remote Procedure Call)
      • tries to make a request to a remote network service look the same as calling a function or method in your programming language
      • requests between services owned by the same organization
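
A minimal request-driven sketch using the requests library; the service URL and payload are hypothetical.

```python
import requests

# The calling service sends data directly to another service over HTTP (REST).
response = requests.post(
    "http://prediction-service.internal/v1/predict",   # hypothetical endpoint
    json={"user_id": 12345, "features": [0.1, 0.7, 0.2]},
    timeout=1.0,    # request-driven passing is synchronous, so bound the wait
)
response.raise_for_status()
prediction = response.json()              # e.g. {"score": 0.83}
```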

3. Data Passing Through Real-Time Transport

  • request-driven: synchronous
    • target service has to listen to the request for the request to go through
  • Broker
    • coordinates data passing among services
    • each service only has to communicate with the broker
    • use in-memory storage
  • Real-time transport: in-memory storage for data passing among services
  • Event
    • A piece of data broadcast to a real-time transport (a.k.a event-bus)
  • Architecture
    • Request-driven architecture
      • rely more on logic than on data
    • Event-driven architecture
      • works better for systems that are data-heavy
  • Common types of real-time transports
    1. pubsub (publish-subscribe)
      • overview
        • any service can publish to different topics in a real-time transport
        • any service that subscribes to a topic can read all events in that topic
      • retention policy
        • data will be retained in the real-time transport for a certain period of time before being
          • deleted
          • moved to permanent storage (e.g., Amazon S3)
        • services that produce data don’t care about which services consume it
      • Apache Kafka, Amazon Kinesis
    2. Message Queue
      • responsible for getting the message to the right consumer
      • Apache RocketMQ, RabbitMQ
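
A minimal pubsub sketch using the kafka-python client; the broker address, topic name, and event payload are assumptions for illustration.

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: any service can publish events to a topic on the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ride_requests", b'{"user_id": 12345, "pickup": "downtown"}')
producer.flush()

# Consumer: any service subscribed to the topic can read its events
# (as long as they are still within the topic's retention period).
consumer = KafkaConsumer(
    "ride_requests",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,     # stop iterating when no new events arrive
)
for event in consumer:
    print(event.value)
```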