Designing Data-Intensive Applications

  1. Reliable, Scalable, and Maintainable Applications
    • Intro
      • applications today more data-intensive than compute intensive
      • need to store data (dbs), remember expensive operations (cache), allow users to search (indexes), send a message to another process async (stream processing), periodically crunch a large amount of data (batch)
    • Thinking About Data Systems
      • wide-ranging requirements handled by stitching different tools using application code
      • application code keeps things correct and in-sync
      • 3 important concerns in a system: reliability, scalability, maintainability
    • Reliability
      • roughly means “things continue to work correctly even when things go wrong”
      • hard to eliminate faults completely, so good to design fault-tolerance mechanisms to prevent them from causing failures
      • Hardware Faults
        • issues likely to occur eventually if using a lot of machines
        • add redundancy
        • move towards systems that can tolerate the loss of entire machines
        • also becomes easier to patch one node at a time
      • Software Errors
        • bugs can be dormant and systematic
        • think carefully about assumptions and interactions
        • thorough testing, process isolation, allow process to crash and restart, measuring, monitoring
      • Human Errors
        • design system in way that minimizes errors
        • decouple places where people make mistakes from where they can cause failures
        • test thoroughly at all levels
        • allow quick and easy recovery
        • detailed and clear monitoring
    • Scalability
      • Describing Load
        • load parameters
        • depends on architecture, e.g. requests per second, ratio of reads to writes, number of simultaneously active users, cache hit rate
        • may care about average case or extreme
      • Describing Performance
        • what happens if load increases with same resources
        • how much do resources need to change, given some increased load to keep performance unchanged
        • requires performance numbers
        • think about in distributions and percentiles e.g. p95, p99, p999
        • tail latencies: the 99.9th percentile means 1 in 1,000 requests is slower than that threshold
        • percentiles often used in SLAs and SLOs
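To make the percentile framing concrete, here is a toy Python sketch (mine, not from the book) using the nearest-rank method; the sample response times are invented.

```python
# Toy sketch: latency percentiles via the nearest-rank method.
def percentile(sorted_times, p):
    """Return the value below which roughly p% of the sorted samples fall."""
    if not sorted_times:
        raise ValueError("no samples")
    k = max(0, int(round(p / 100.0 * len(sorted_times))) - 1)  # nearest rank
    return sorted_times[k]

response_times_ms = sorted([12, 15, 18, 21, 24, 30, 45, 60, 250, 900])
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# Note how the slow outliers (250 ms, 900 ms) dominate p95/p99 but barely move the average.
```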
      • Approaches for Coping with Load
        • scaling up / vertical: expensive
        • scaling out / horizontal: shared-nothing architecture
        • elastic systems automatically add computing resources
        • distributing data adds a lot of complexity so may be good to only do it when you need it
        • no one size fits all
        • scaling based on incorrect assumptions wastes effort and can be counterproductive
    • Maintainability
      • Able to adapt and modify
      • Operability: Making Life Easy for Operations
        • make sure things are running smoothly
        • provide visibility into behavior
        • good support for automation and integration
        • avoid dependency on individual machines
        • good docs
        • good default behavior
        • self-healing where appropriate
        • predictable
      • Simplicity: Managing Complexity
        • systems can become big balls of mud
        • risks of introducing bugs and makes maintaining hard and expensive
        • reducing complexity improves maintainability
        • best tool is good abstractions
      • Evolvability: Making Change Easy
        • system likely constantly in flux
        • learn new info, business priority changes, new user requests, new platforms, legal/regulatory requirements
  2. Data Models and Query Languages
    • Intro
      • data models are perhaps the most important part of developing software because they have a profound effect on how the software is written and how we think about the problem
      • layers hide complexity and these abstractions allow different groups of engineers to work together effectively
    • Relational Model Versus Document Model
      • best-known model is SQL
      • relations (tables) and tuples (rows)
      • goal of relational dbs was to hide db implementation details behind a cleaner interface
      • The Birth of NoSQL
        • driving forces: need for scalability (large data or high write throughput), preference for free software, query operations not supported by relational model, restrictiveness of relational schemas
      • The Object-Relational Mismatch
        • OOP and relational model: impedance mismatch
        • in relational model, may need to query multiple tables to get everything you need vs potentially getting everything from 1 place i.e. JSON blob (better locality)
      • Many-to-One and Many-to-Many Relationships
        • use ids vs free text
        • advantages to normalization
        • joins relatively easy in relational model
        • may have to join in application code if joins not supported in db
      • Are Document Databases Repeating History?
        • IBM's IMS database, popular in the 1970s, used a hierarchical model similar to JSON
        • made many-to-many relationships difficult and no joins
        • relational model was proposed to solve limitations
        • in relational model, query optimizer decides which parts of query to execute and what indexes to use
      • Relational Versus Document Databases Today
        • main argument in favor of document model is schema flexibility, performance due to locality and for some apps data structures more similar
        • moving joins to application level increases complexity there
        • for highly interconnected data, document model is awkward
        • document dbs “schema-on-read”
        • schema changes can be seen as cumbersome
        • if the data is heterogeneous and/or you have no control over its structure, schemas may hurt
        • if app usually needs entire document, benefit from locality
        • if it uses a small part, can be wasteful
        • relational dbs getting JSON support
    • Query Languages for Data
      • SQL declarative
      • declarative lends itself to parallel execution
      • MapReduce sort of in between declarative and imperative
    • Graph-Like Data Models
      • If a lot of many-to-many relationships may want to model your data as a graph
      • Property Graphs
        • vertex consists of unique id, set of outgoing edges, set of incoming edges, collection of properties
        • a lot of flexibility and evolvability
      • The Cypher Query Language
        • declarative query language for property graphs created for Neo4j graph db
      • Graph Queries in SQL
        • could put graph in relational db, but may be hard to query
        • need to know joins beforehand
      • Triple-Store and SPARQL
        • triple-store similar to property graph model
        • subject, predicate, object
        • subject is vertex and object is vertex or a primitive datatype e.g. string, number
        • some links to semantic web
        • SPARQL: query language for triple-stores using the RDF data model
      • The Foundation: Datalog
        • Datalog is an older language (a subset of Prolog) used in Datomic and Cascalog
        • can combine and reuse rules
  3. Storage and Retrieval
    • Data Structures That Power Your Database
      • the simplest db can be an append-only text file of comma-separated key-value pairs
      • indexes can allow for faster reads at cost of more expensive writes
      • Hash Indexes
        • simple index could be hashmap where value is a byte offset in the datafile
        • well suited to workloads with many writes per key, provided it is feasible to keep all keys in memory
        • to avoid running out of disk space with an append-only log, break it into segments and perform compaction, keeping only the most recent value for each key
        • can also merge segments
        • must consider: file format, deleting records, crash recovery, partially written records, concurrency
        • advantages of append only include: performance, simpler concurrency and crash recovery, merging old segments, avoids problems of data fragmentation
        • cons of hash table index: must fit in memory and range queries not efficient
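As a rough illustration of the hash-index idea above (an append-only data file plus an in-memory map from key to byte offset, in the spirit of Bitcask), here is a toy Python sketch of my own; it omits segments, compaction, and crash recovery.

```python
import os

class HashIndexedLog:
    """Toy append-only log with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                   # key -> byte offset of the latest record
        open(path, "ab").close()          # make sure the data file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)          # where this record will start
            f.write(f"{key},{value}\n".encode())     # append only; never overwrite
        self.index[key] = offset                     # latest write wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                           # one seek thanks to the index
            line = f.readline().decode().rstrip("\n")
            _, _, value = line.partition(",")
            return value

db = HashIndexedLog("toy_log.csv")
db.set("42", "san francisco")
db.set("42", "new york")      # the old record stays in the file until compaction
print(db.get("42"))           # -> new york
```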
      • SSTables and LSM-Trees
        • sorted-string table: advantages include 1) easy merging segments, 2) don’t need to keep index for all keys (sparse index in memory), 3) compress segments before writing to disk to save space and I/O bandwidth use
        • easier to maintain sorted structure in memory with red-black or AVL trees (insert in any order and read in sorted)
        • in-memory trees sometimes called memtable
        • when reaches some size write to disk as SSTable
        • to serve read requests, check the memtable first, then the most recent on-disk segment, then older segments
        • run compaction in background
        • issue: if db crashes, most recent writes lost
        • can track separate log on disk
        • this algorithm is used by LevelDB and RocksDB, key-value storage engines; similar engines are used in Cassandra and HBase
        • often called an LSM-tree (Log-Structured Merge-Tree)
        • queries for keys not in the db can be expensive; a Bloom filter helps
        • compaction strategies: size-tiered or leveled
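A toy Python sketch (mine, not from the book) of the LSM flow described above: writes go into an in-memory memtable, which is flushed as a sorted segment once it grows too large; reads check the memtable and then segments from newest to oldest. No WAL, sparse index, or compaction here.

```python
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}        # stand-in for a balanced tree (red-black / AVL)
        self.segments = []        # list of sorted [(key, value), ...], oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: write keys out in sorted order as a new SSTable-like segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):       # newest segment first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)         # binary search works: it's sorted
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = ToyLSM()
for k, v in [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("a", 5)]:
    db.put(k, v)
print(db.get("a"))   # -> 5, found in the memtable (written after the flush)
print(db.get("d"))   # -> 3, found in the flushed segment
```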
      • B-Trees
        • most common for indexes
        • break down db into fixed-size blocks or pages (4kb usually) and read/write 1 block at a time
        • each page identified by address and can reference other pages like pointers for disk
        • leaf nodes contain values
        • writes overwrite the page in place; the page's location does not change
        • to make B-Trees resilient to crash, common to have a write-ahead log (WAL)
        • append-only file where every B-Tree modification must be written before it can be applied
        • concurrency handled with latches (lightweight locks)
        • pages can be anywhere, but B-trees often try to lay out sequentially on disk
        • can have pointers to sibling nodes for easier scanning
      • Comparing B-Trees and LSM-Trees
        • LSM typically faster for writes and slower for reads
        • benchmarks however inconclusive and depends on workload
        • LSM trees generally can sustain higher write throughput and generally have lower write amplification
        • LSM trees can be compressed better
        • downsides of LSM: compaction can interfere with ongoing reads and writes
        • at high percentile, response time of queries can be high
        • possible compaction doesn’t keep up with writes (run out of disk space)
        • in B-trees each key exists in exactly one place, which makes it easier to offer strong transactional semantics
      • Other Indexing Structures
        • primary key uniquely identifies a row
        • “value” in index can be actual row or reference to row
        • in latter case, data stored in heap file
        • “clustered index”: stores the row data directly within the index
        • a covering index stores some of a table's columns within the index
        • concatenated index (multi-column)
        • in-memory dbs often used for caching
        • faster not because they avoid reading from disk (a disk-based engine with enough memory caches data too), but because they avoid the overhead of encoding in-memory data structures in a form that can be written to disk
    • Transaction Processing or Analytics?
      • two types: transaction and analytics
      • transactions generally involve low-latency read/write and small number of rows
      • analytics scan over huge records and involve aggregations
      • Data Warehousing
        • a separate database that analysts can query without affecting OLTP operations
        • the process of getting data into the warehouse is known as Extract-Transform-Load (ETL)
        • can be optimized for analytic access patterns
        • vendors generally don’t support both transaction processing and analytic workloads
      • Stars and Snowflakes: Schemas for Analytics
        • star schema or dimensional modeling
        • a fact table represents events, with foreign keys into dimension tables
        • snowflake schemas more normalized
    • Column-Oriented Storage
      • often query accesses only a few columns
      • a row-oriented storage engine still has to load entire rows (all attributes) from disk
      • fact tables could have 100+ attributes
      • column-oriented store by columns
      • each column file contains the rows in the same order
      • Column Compression
        • lends itself well to compression
        • bitmap encoding is one technique (see the sketch at the end of these column-storage notes)
        • often the number of distinct values in a column is small
      • writes often more difficult
      • materialized views persist a query, but need to be updated
      • data cube or OLAP cube: a grid of aggregates along multiple dimensions
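To make the bitmap-encoding bullet under Column Compression concrete, here is a toy Python sketch (mine, not from the book): one bitmap per distinct value of a low-cardinality column, with a predicate answered by a bitwise OR.

```python
# Toy sketch of bitmap-encoding a low-cardinality column.
column = ["red", "green", "red", "blue", "red", "green"]

# one bitmap per distinct value: bit i is 1 if row i has that value
bitmaps = {}
for i, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[i] = 1

# WHERE color IN ('red', 'green')  ->  OR the two bitmaps together
matches = [r | g for r, g in zip(bitmaps["red"], bitmaps["green"])]
print(matches)                                   # [1, 1, 1, 0, 1, 1]
print([i for i, m in enumerate(matches) if m])   # matching row numbers
```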
  4. Encoding and Evolution
    • Formats for Encoding Data
      • programs usually deal with data in two forms: in-memory and on disk (seq of bytes)
      • translation between two is encoding (to byte sequence) and decoding (from byte sequence)
      • Language-Specific Formats
        • some languages provide e.g. java serializable, ruby marshal, python pickle
        • issues: ties you to language, security issues, versioning difficulties, efficiency
        • generally bad idea to use language’s built-in encoding
      • JSON, XML, and Binary Variants
        • JSON/XML human readable
        • issues: ambiguity with numbers, don’t support binary strings, schema support can be complicated
        • binary encodings of json and xml available but not widely adopted
      • Thrift and Protocol Buffers
        • both require a schema for the data, written in the Thrift interface definition language or the Protocol Buffers equivalent
        • comes with code generation tool to produce classes in various languages
        • data stored with type, field tag, length and value
        • can add field and remove optional field, but shouldn’t use same tag number again
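The following Python sketch is purely illustrative of the tag/type/length/value idea; it is not the real Thrift or Protocol Buffers wire format. The field names (userName, favoriteNumber) are from the book's example record and never appear in the encoded bytes, which is why names can change freely while tag numbers must not be reused.

```python
import struct

TYPE_STRING, TYPE_INT64 = 1, 2   # invented type codes for this sketch

def encode_field(tag, type_code, payload: bytes) -> bytes:
    # 1-byte tag, 1-byte type, 2-byte length, then the raw value bytes
    return struct.pack(">BBH", tag, type_code, len(payload)) + payload

record = (
    encode_field(1, TYPE_STRING, "Martin".encode("utf-8"))   # field tag 1: userName
    + encode_field(2, TYPE_INT64, struct.pack(">q", 1337))   # field tag 2: favoriteNumber
)
print(record.hex())   # only tags and values are on the wire, never field names
```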
      • Avro
        • Avro allows the writer's and reader's schemas to differ; used in Hadoop; friendlier to dynamically generated schemas because there are no tag numbers
      • The Merits of Schemas
        • can be more compact than binary JSON variants since omit field names
        • schema acts as form of documentation that is up-to-date
        • can keep database of schemas to check forward and backward compatibility
        • for statically typed languages, enables type checking at compile time
        • provide better guarantees about your data than schema-on-read
    • Modes of Dataflow
      • Dataflow Through Databases
        • may have newer and older versions of application code working on db
        • newer versions may write values for new fields that old code doesn’t know about
        • encoding formats discussed support such preservation of unknown fields, but sometimes need to take care at application level
        • rewriting data into new schema expensive. most dbs allow simple changes like adding null column and fills null when that row read
        • during archival can encode data consistently
      • Dataflow Through Services: REST and RPC
        • client/server most common way of communicating over network
        • web services use REST or SOAP
        • SOAP involves code generation and using local classes and method calls. XML-based
        • RESTful API tends to be simpler
        • RPCs try to make a request look like a function or method call
        • RPCs are less predictable due to the network: calls can time out, a request may get through while the response is lost (idempotence issues), latency varies, and all parameters must be encoded and sent
        • encoding formats provide RPC support
        • newer RPC frameworks return futures (promises) and are more explicit about asynchrony and the possibility of failure
        • RESTful API can be easier to experiment with and provide documentation
        • RPC typically used for services owned by same organization
        • reasonable to assume servers updated first
      • Message-Passing Dataflow
        • pros: message broker can act as a buffer if recipient not available, automatically redeliver, don’t need to know recipient IP/port, allow 1 message to multiple recipient, decouples sender from recipient
        • async / one-way
        • open source: RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka
        • broker ensures message is delivered to one or more consumers/subscribers
        • distributed actor framework integrates message broker and actor programming model
      • Summary
        • transform data to bytes for network or disk
        • service should support rolling upgrades
        • encode data to provide backward and forward compatibility
  5. Replication
    • Intro
      • keep data close to users
      • allow system to keep working even if some parts fail
      • scale out machines that can serve read queries
    • Leaders and Followers
      • leader-based replication most common
      • writes go to leader and sends data change to followers as replication log or change stream
      • followers apply same writes in order
      • Synchronous Versus Asynchronous Replication
        • with fully synchronous replication, the leader waits for confirmation from all followers before reporting success; generally impractical
        • often one follower is synchronous (semi-synchronous), guaranteeing an up-to-date copy on at least two nodes
        • often completely asynchronous: higher write throughput, but data can be lost if the leader dies
      • Setting Up New Followers
        • take a snapshot of the leader's db without taking a lock
        • copy to new node / follower
        • connect to leader and request data changes
        • the snapshot is associated with a position in the replication log, e.g. the log sequence number in PostgreSQL or the binlog coordinates in MySQL
        • wait until catches up
      • Handling Node Outages
        • when follower fails, can check log of data changes and see which transaction last succeeded and request remaining
        • leader failure: determine leader failed, choose new leader, reconfigure clients to point to new leader
        • if async replication used and former leader rejoins, may have conflicting writes. usually discard old leader’s unreplicated writes
        • can have issues coordinating with other data systems, e.g. reusing primary keys
        • can have two leaders / split brain
        • the timeout for declaring the leader dead is a balance: too short causes unnecessary failovers, too long delays recovery
      • Implementation of Replication Logs
        • statement-based: sends sql statements, but replaces non-deterministic functions with values
        • write-ahead log (WAL) shipping: the leader ships the same append-only log the storage engine already writes; very low level (which bytes changed in which disk block), so leader and followers must run the same storage engine software
        • logical (row-based) log replication: decoupled from storage engine and describe writes at granularity of row. can allow leader and followers to run different versions and parseable by other applications
        • trigger-based replication: sometimes want to move to application level. more flexibility at cost of overhead and bug prone
    • Problems with Replication Lag
      • if the workload is mostly reads with few writes, leader-based replication with many followers is attractive, but with asynchronous replication a follower may serve stale data
      • Reading Your Own Writes
        • requires read-after-write consistency
        • can make same queries go to the leader
        • another issue is if user accesses from multiple devices
        • track the time of the user's last update and, for some period afterwards, route that user's reads to the leader
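A minimal sketch (mine, not from the book) of the last bullet: remember when each user last wrote, and for a fixed window afterwards (60 seconds here, an assumed bound on replication lag) route that user's reads to the leader.

```python
import time

REPLICATION_LAG_WINDOW = 60.0   # seconds; assumed upper bound on follower lag
last_write_at = {}              # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica(user_id, leader, followers):
    recently_wrote = time.time() - last_write_at.get(user_id, 0) < REPLICATION_LAG_WINDOW
    return leader if recently_wrote else followers[0]   # any follower will do otherwise

record_write("alice")
print(choose_replica("alice", "leader", ["follower-1"]))   # -> leader (just wrote)
print(choose_replica("bob",   "leader", ["follower-1"]))   # -> follower-1
```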
      • Monotonic Reads
        • same queries can retrieve different results making it seem like going back in time due to lagging replica
        • could make sure user reads from same replica
      • Consistent Prefix Reads
        • ensure proper ordering e.g. answer doesn’t come before question
        • one way is making sure writes that are causally related written to same partition
      • Solutions for Replication Lag
        • pass certain type of queries to leader
        • transactions
    • Multi-Leader Replication
      • Use Cases for Multi-Leader Replication
        • multi datacenters: could add significant latency if all writes need to go to one datacenter
        • can have conflicts, generally should be avoided
        • need to be working while not connected to network
        • can have local db that syncs
      • Handling Write Conflicts
        • avoid by ensuring all writes for a particular row go to same leader (recommended approach)
        • can use last write wins, but prone to data loss
        • can have custom resolution logic at application level
      • Multi-Leader Replication Topologies
        • circular, star, all-to-all
    • Leaderless Replication
      • Writing to the Database When a Node Is Down
        • send the write to all replicas and consider it successful once w of the n replicas confirm (e.g. 2 of 3)
        • reads are also sent to several replicas in parallel; version numbers determine which value is most recent
        • replicas get back in sync via read repair (the client writes the newer value back to outdated replicas) and an anti-entropy background process
        • the quorum condition w + r > n lets the system work around unavailable nodes
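The quorum condition itself is tiny; here it is as a sketch, assuming n replicas, w write acknowledgements and r read responses.

```python
# If w + r > n, every read set must overlap the latest write set in at least one replica.
def quorum_overlaps(n, w, r):
    return w + r > n

print(quorum_overlaps(3, 2, 2))   # True: a common configuration
print(quorum_overlaps(3, 1, 1))   # False: a read may miss the latest write
```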
      • Limitations of Quorum Consistency
        • even with w + r > n there are edge cases that can return stale data
        • sloppy quorums and hinted handoff
        • good for high availability and low latency
        • during a network fault a client may be cut off from the replicas that normally hold its data and fail to reach a quorum
        • a sloppy quorum accepts writes on whatever nodes are reachable, then hands them off to the “home” nodes once the fault heals
        • leaderless implementations generally provide only eventual consistency, and concurrent writes still require explicit conflict handling
      • Detecting Concurrent Writes
        • app developer needs to know a lot about internals
        • last write wins uses timestamps and may silently discard writes that were reported as successful
        • only safe way is ensuring key written once and immutable
        • merge concurrently written values, error-prone in application code
        • version vector: version number per replica per key
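A toy sketch (mine) of comparing two version vectors to decide whether one value happened before the other or they are concurrent siblings; replica names and counters are invented.

```python
def compare(a: dict, b: dict):
    """Compare two version vectors (replica -> counter)."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b (b may overwrite a)"
    if b_le_a:
        return "b happened before a (a may overwrite b)"
    return "concurrent (siblings must be kept and merged)"

print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))   # happened-before
print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 2}))   # concurrent
```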
  6. Partitioning
    • Partitioning and Replication
      • generally used with replication
      • node can contain multiple partitions
      • some nodes leader to some partitions and follower for others
    • Partitioning of Key-Value Data
      • avoid hotspots and skew
      • could assign records to nodes randomly, but then reads have to query all nodes
      • Partitioning by Key Range
        • within each partition keep keys in sorted order, making range scans easy
        • can result in hot spots e.g. query for certain day if timestamp is key
        • need to do something other than just timestamp e.g. prefix with sensor name
      • Partitioning by Hash of Key
        • assign each partition in a range of hashes
        • lose sorting, but can have compromise by having multiple values in primary key, which makes it sortable e.g. (user_id, timestamp), user_id used for hashing and within partition rows sorted by timestamp
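A toy Python sketch (mine, not from the book) of the compound-key idea above: a hash of user_id picks the partition, and rows within a partition stay sorted by (user_id, timestamp). MD5 and the partition count are arbitrary choices for illustration.

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()   # any stable hash works
    return int(digest, 16) % NUM_PARTITIONS

partitions = {i: [] for i in range(NUM_PARTITIONS)}

def insert(user_id, timestamp, event):
    p = partition_for(user_id)                       # first key part picks the partition
    partitions[p].append(((user_id, timestamp), event))
    partitions[p].sort()                             # kept sorted by (user_id, timestamp)

insert("alice", 1700000100, "click")
insert("alice", 1700000000, "login")
print(partitions[partition_for("alice")])            # alice's events, in timestamp order
```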
      • Skewed Workloads and Relieving Hot Spots
        • even with hashing, heavy reads/writes on a single key still create a hot spot
        • it is the application's responsibility to reduce the skew
        • e.g. prefix or suffix the key with a random number, at the cost of extra bookkeeping on reads
    • Partitioning and Secondary Indexes
      • secondary indexes don't necessarily identify a row uniquely
      • Solr and Elasticsearch exist largely because of secondary indexes
      • secondary indexes don't map neatly to partitions
      • Partitioning Secondary Indexes by Document
        • each partition maintains own secondary index
        • e.g. partition 1: color:red -> [191, 306]; partition 2: color:red -> [789]
        • aka local indexes
        • requires querying all partitions (scatter/gather) makes reads expensive (prone to tail latency amplification)
        • vendors recommend partitioning scheme so secondary index queries can be served from single partition
      • Partitioning Secondary Indexes by Term
        • use global index and partition differently from primary key index
        • aka term-partitioned
        • makes reads more efficient, but writes are slower and more complicated because write to single doc can affect multiple partitions of the index
        • secondary indexes also usually updated async
    • Rebalancing Partitions
      • query throughput or data may increase
      • machines may fail
      • rebalancing should happen in background and keep consistent load among nodes
      • should move minimum amount of data
      • Strategies for Rebalancing
        • the hash-mod-N strategy makes rebalancing excessively expensive when adding/removing nodes, because most keys move
        • fixed number of partitions: create many more partitions and assign several to each node
        • if new node is added, can take a few partitions from each node
        • move entire partitions between nodes
        • can even distribute unevenly e.g. stronger hardware
        • dynamic partitioning: key range partitioned dbs create partitions dynamically
        • split or merge partitions and move to new nodes
        • number of partitions adapt to data volume
        • a third option is partitioning proportionally to nodes: a fixed number of partitions per node
      • Operations: Automatic or Manual Rebalancing
        • rebalancing can be unpredictable and expensive
        • good to have human in loop to prevent surprises
    • Request Routing
      • service discovery issue: where to query?
        • allow client to contact any node
        • send all requests through a routing tier (a partition-aware load balancer)
        • require clients be aware of partitioning and assignments
      • many distributed systems rely on a separate coordination service such as ZooKeeper to keep track of cluster metadata
      • each node registers itself with ZooKeeper, and ZooKeeper maintains the mapping of partitions to nodes
      • whenever the mapping changes, ZooKeeper notifies the routing tier
      • Parallel Query Execution
        • NoSQL distributed data stores have simple access patterns based on single key or secondary index
        • some relational db products offer very sophisticated operations and have optimizers that can break into many parallel steps
  7. Transactions
    • The Slippery Concept of a Transaction
      • The Meaning of ACID
        • atomicity, consistency, isolation, durability
        • atomicity: cannot be broken down into smaller parts. operations won’t be partially applied. abortability.
        • consistency: certain invariants must always be true e.g. foreign keys or uniqueness constraints, but in general application defines what data is valid. more a property of the application
        • isolation: concurrency issues. transactions are isolated from each other. serializability. like running serially. in practice rarely used because of performance penalty.
        • durability: will not lose data. written to hard disk or SSD
      • Single-Object and Multi-Object Operations
        • multi-object transactions need to know which operations belong to same transactions e.g. “BEGIN TRANSACTION” and “COMMIT”
        • non-relational dbs may not have way of grouping operations together resulting in partial state
        • atomicity implemented with logs and isolation with locks
        • many distributed datastores abandon multi-object transactions but could be useful for keeping foreign keys, denormalized data, secondary indexes in sync
        • a key feature is that transactions can be safely aborted and retried, but retrying has pitfalls: if the transaction actually succeeded and only the response was lost, retrying performs it twice; if the error is due to overload rather than a transient fault, retrying makes things worse; and side effects outside the db may happen anyway
    • Weak Isolation Levels
      • serializable isolation would, in theory, let you pretend no concurrency is happening
      • dbs generally don't default to serializable isolation because it is too costly, but they do protect against some concurrency issues
      • Read Committed
        • only read committed data (no dirty reads)
        • when writing to database, only overwrite committed (no dirty writes)
        • dirty reads may return data in a partial state or data that is later rolled back
        • to prevent dirty writes, the second write is usually delayed until the first transaction commits or aborts
        • popular isolation level and the default of many databases
        • when modifying a row, a transaction must acquire a row-level lock and hold it until commit or abort
        • only one transaction can hold the lock for any given object
        • requiring read locks generally doesn't work well in practice
        • instead, the db remembers both the old and new value, and returns the old value to other transactions while the write is in progress
      • Snapshot Isolation and Repeatable Read
        • some issues can see data in inconsistent state
        • if backing up db that is still getting written to, may end up with backup with some old and new data
        • large queries may see weird results if certain rows represent different points in time
        • snapshot isolation is solution and allows transactions to read a consistent snapshot of the data
        • readers should never block writers and writers shouldn’t block readers
        • to implement snapshot isolation, db keeps track of different committed versions of object at different points in time
        • known as MVCC (Multi-version concurrency control)
        • transactions are given an always-increasing transaction id, and data is tagged with the id of the transaction that wrote it
        • when a transaction reads from the db, the transaction ids determine which object versions it can see
        • at the start of a transaction, the db notes the list of other in-progress transactions; writes by those transactions are ignored
        • called “serializable” in Oracle and “repeatable read” in MySQL and PostgreSQL
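A hedged sketch (mine) of the snapshot-isolation visibility rule described in these bullets: a version is visible if it was created by a transaction that had committed before the reader started, and not deleted by such a transaction. Real engines track more state than this.

```python
def visible(version, reader_txid, in_progress_at_start, aborted):
    """Decide whether a row version is visible to the reading transaction."""
    created_ok = (
        version["created_by"] <= reader_txid
        and version["created_by"] not in in_progress_at_start
        and version["created_by"] not in aborted
    )
    deleted_by = version.get("deleted_by")
    deleted_ok = (                        # a deletion only "counts" if it is visible too
        deleted_by is None
        or deleted_by > reader_txid
        or deleted_by in in_progress_at_start
        or deleted_by in aborted
    )
    return created_ok and deleted_ok

version = {"created_by": 3, "deleted_by": 7, "value": 500}
# Transaction 5 started before 7 existed, so the deletion by 7 is invisible to it:
print(visible(version, reader_txid=5, in_progress_at_start={4}, aborted=set()))  # True
```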
      • Preventing Lost Updates
        • lost update problem: concurrent read-modify-write cycles can clobber each other, so one update looks as if it never happened
        • e.g. incrementing a counter, making a local change to a complex value, two users editing a wiki page at the same time
        • dbs provide atomic update operations that avoid the read-modify-write cycle, e.g. UPDATE ... SET ... WHERE ...
        • can use explicit locking
        • an alternative is to let the transactions execute and abort one when a lost update is detected
        • compare-and-set: the write only takes effect if the value hasn't changed since it was read (see the sketch below)
        • last write wins (LWW) is the default on many replicated dbs
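A minimal sketch (mine) of compare-and-set used to avoid lost updates on a counter; the lock inside the class just stands in for the atomicity a database would provide.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()    # stands in for the db's atomicity

    def compare_and_set(self, expected, new):
        with self._lock:
            if self.value != expected:
                return False             # someone else got there first
            self.value = new
            return True

counter = Counter()

def increment(counter):
    while True:                          # read-modify-write with retry
        seen = counter.value
        if counter.compare_and_set(seen, seen + 1):
            return

threads = [threading.Thread(target=increment, args=(counter,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)                     # 100: no updates lost
```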
      • Write Skew and Phantoms
        • changes to different objects may result in constraint violation
        • can explicitly lock rows
        • phantom: write in one transaction affects read of another
        • materializing conflicts: explicitly create a table with rows which can represent locks. discouraged. last resort. serializable isolation level is much preferable
    • Serializability
      • serializable isolation usually regarded as strongest isolation level
      • db prevents all possible race conditions
      • Actual Serial Execution
        • RAM is now cheap enough to hold the active dataset, and OLTP transactions are usually short and small
        • use stored procedures instead of issuing individual queries across the network
        • throughput is limited to a single CPU core, but data can be partitioned so each partition has its own transaction-processing thread
        • cross-partition transactions are very slow
        • better if data can fit into memory
      • Two-Phase Locking (2PL)
        • can’t read if getting written
        • can’t write if its been read / not committed
        • protects against a lot of race conditions
        • to read, a transaction acquires the lock in shared mode; it must wait if another transaction holds the lock in exclusive mode
        • db detects deadlocks and aborts one
        • not great performance, reduced concurrency can result in queue of transactions, have unstable latencies
        • deadlocks can be common
        • predicate locks apply to all objects matching a search condition, even objects not yet in the database, but perform poorly
        • index-range locking is an approximation: it locks a superset of the matching rows with much lower overhead
      • Serializable Snapshot Isolation (SSI)
        • 2PL pessimistic, assumes a lot will go wrong
        • instead of blocking, transactions proceed; when one wants to commit, the db checks whether isolation was violated and aborts it if so
        • bad if contention is high, since many transactions abort
        • good if there is enough spare capacity and contention is not too high
        • good if transactions are commutative (order doesn't matter)
        • the db checks that the original “premise” of a transaction hasn't changed (detects reads of stale MVCC object versions and writes that affect prior reads)
        • big advantage is don’t need to block waiting for locks
        • can even work in distributed environment
  8. The Trouble with Distributed Systems
    • Faults and Partial Failures
      • single computer generally does the same thing for deterministic operations
      • in distributed systems, nondeterminism and partial failures
      • Cloud Computing and Supercomputing
        • in cloud set up, if large enough, something always not working, fault handling needs to be part of design
    • Unreliable Networks
      • this book focuses on shared-nothing architectures
      • over network, no guarantee you will get response
      • usual way to deal with is with timeout
      • Network Faults in Practice
        • network failures common, major cause also human error
        • if network faults not tested, arbitrarily bad things can happen
      • Detecting Faults
        • sometimes can get explicit feedback node is down
        • may need to retry at application level (TCP retries too)
      • Timeouts and Unbounded Delays
        • balance and risk between too much and too little timeout threshold
        • can observe expected patterns and set manually or even use systems that monitor and automatically adjust
      • Synchronous Versus Asynchronous Networks
        • e.g. phone networks allocate a circuit with a fixed, guaranteed amount of bandwidth
        • networks use packet switching to optimize for bursty traffic
        • using circuits would be wasteful, TCP more dynamic
        • maximizes utilization and since wire is fixed cost lowers cost per byte sent
    • Unreliable Clocks
      • Monotonic Versus Time-of-Day Clocks
        • time-of-day clocks: synced with NTP (Network Time Protocol), but they may be reset and jump, so they are not suitable for measuring durations
        • monotonic clocks: guaranteed to always move forward; NTP may adjust the rate at which they advance, but cannot make them jump back
      • Clock Synchronization and Accuracy
        • syncing hard
        • can improve but costly
      • Relying on Synchronized Clocks
        • must monitor offsets
        • “most recent” depends on a local time-of-day clock
        • transaction ids hard to use for snapshot isolation in distributed systems
        • Google's Spanner uses timestamps based on clock confidence intervals, but that requires a lot of effort to keep the clock error small
      • Process Pauses
        • a process can be paused at any point (e.g. for GC), so elapsed-time measurements cannot be fully trusted
        • node in distributed system must assume execution can be paused
        • real-time guarantees are possible, but require a large amount of work
        • can also treat GC like outages and use other nodes
    • Knowledge, Truth, and Lies
      • reliability is achievable even when the underlying system provides few guarantees
      • The Truth is Defined by the Majority
        • can decide node is dead by quorum or majority
        • frequently need one of something e.g. lock, leader need to be careful that leader does not cause damage after being demoted
        • fencing: the lock service issues monotonically increasing fencing tokens, and the storage service only accepts writes whose token is at least as high as any it has already seen (sketch below)
        • unwise for a service to assume clients will always be well-behaved
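A toy sketch (mine) of fencing tokens: the storage service remembers the highest token it has accepted and rejects writes carrying an older one, so a paused client holding an expired lease cannot clobber newer writes. The token values are invented.

```python
class Storage:
    def __init__(self):
        self.max_token_seen = -1
        self.data = None

    def write(self, token, value):
        if token < self.max_token_seen:
            raise RuntimeError(f"rejected: stale fencing token {token}")
        self.max_token_seen = token
        self.data = value

storage = Storage()
storage.write(token=34, value="from the current lock holder")
try:
    storage.write(token=33, value="from a paused, demoted holder")
except RuntimeError as e:
    print(e)          # rejected: stale fencing token 33
```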
      • Byzantine Faults
        • nodes acting maliciously
        • generally expensive to guard against, but usually not an issue
        • may want to guard against weak forms of “lying”, e.g. corrupted network packets (checksums) and unsanitized user input
      • System Model and Reality
        • consider system models: synchronous (unrealistic), partially synchronous (realistic), and asynchronous (no timing assumptions; very restrictive for algorithms)
        • failure models: crash-stop fault, crash-recovery, byzantine
        • safety properties should not be violated
  9. Consistency and Consensus
    • Consistency Guarantees
      • replication lag inevitable
      • must be aware of limitations
    • Linearizability
      • basic idea is to make system appear as if there was only one copy
      • people shouldn’t see different things
      • What Makes a System Linearizable?
        • once a client reads a certain value, other clients must read that value for the same key
        • different from “serializability”; linearizability is a recency guarantee
      • Relying on Linearizability
        • in leader election, all nodes must agree
        • ZooKeeper and etcd provide consensus algorithms for this
        • hard uniqueness constraints require linearizability
        • cross-channel timing dependencies: if there are multiple communication channels, linearizability may be needed to avoid races between them
      • Implementing Linearizable Systems
        • quorum reads/writes seem linearizable, but variable network delays can cause race conditions
        • different quorums can return different results
        • synchronous read repair makes this linearizable but is costly
      • The Cost of Linearizability
        • with single-leader replication, linearizable reads must go through the leader
        • this can cause unavailability
        • cost: replicas that are disconnected from the leader cannot process requests
        • CAP: when partitioned, you can really only pick between consistency and availability
        • few systems are linearizable in practice; even CPUs aren't, since each core has its own cache
        • many systems choose not to provide linearizability because of the performance cost, not for fault-tolerance reasons
    • Ordering Guarantees
      • Ordering and Causality
        • helps preserve causality
        • concurrent operations not comparable
        • causality guarantees possible with less performance penalty
        • need to know which operation happened before another (partial order)
        • in order to determine causal ordering, db needs to know which version of data was read by application
      • Sequence Number Ordering
        • without a single leader it is harder to generate sequence numbers
        • Lamport timestamps provide a total order consistent with causality (see the sketch below)
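A minimal Lamport-clock sketch (mine, not from the book): each node keeps a counter, stamps events with (counter, node_id), and fast-forwards its counter whenever it sees a larger timestamp, giving a total order consistent with causality.

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event or send: advance and return (counter, node_id)."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, received):
        """On receiving a timestamp, fast-forward the local counter."""
        self.counter = max(self.counter, received[0])
        return self.tick()

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()          # (1, 'A')
t2 = b.observe(t1)     # (2, 'B') -- causally after t1, and compares greater
print(sorted([t2, t1]))
```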
      • Total Order Broadcast
        • reliable delivery and total ordered delivery must be satisfied
        • if all processes in same order, partitions and replicas are kept consistent
        • using an append-only log: append your write as a message, read the log back, and commit when you see your own message
        • abort if a conflicting message appears first
        • this makes writes linearizable, but does not guarantee linearizable reads
        • you can sequence reads through the log by appending a message and waiting for it to come back
        • or read from a replica that is synchronously updated from the log
        • comes down to a consensus problem
    • Distributed Transactions and Consensus
      • Atomic Commit and Two-Phase Commit (2PC)
        • the coordinator (transaction manager) sends a prepare message to all nodes; only if all reply “yes” does it send commit requests, otherwise it sends abort (sketch below)
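A hedged sketch (mine) of the two-phase commit flow in the bullet above; real 2PC also durably logs each decision so the coordinator can recover, which this toy omits.

```python
def two_phase_commit(participants):
    # phase 1: ask everyone to prepare (promise they can commit)
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # phase 2: broadcast the decision to every participant
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self): return self.can_commit
    def commit(self):  print(f"{self.name}: committed")
    def abort(self):   print(f"{self.name}: aborted")

# one participant votes "no", so everyone aborts
print(two_phase_commit([Participant("db1"), Participant("db2", can_commit=False)]))
```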
      • Distributed Transactions in Practice
        • in-doubt transactions can hold locks for a long time and block other transactions
      • Fault-Tolerant Consensus
        • get majority consensus (quorum) tolerate some failure
        • total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes; it is equivalent to performing repeated rounds of consensus
        • protocols use an epoch number to guarantee that the leader is unique within each epoch
        • if leaders from two different epochs conflict, the leader with the higher epoch number wins
        • a node cannot necessarily trust its own judgment; it must collect votes from a quorum of nodes
        • cons: requires a majority of nodes, assumes a mostly synchronous network, and frequent leader elections are costly
      • Membership and Coordination Services
        • ZooKeeper and etcd are described as distributed key-value stores or coordination/configuration services
        • all data fits in memory and is replicated across the nodes
        • they also provide linearizable atomic operations, total ordering of operations, failure detection, and change notifications
        • allocating work to nodes
        • service discovery
        • membership service
  10. Batch Processing
    • Batch Processing with Unix Tools
    • MapReduce and Distributed Filesystems
    • Beyond MapReduce
  11. Stream Processing
    • Transmitting Event Streams
      • unlike batch processing, in streaming each event is generated once by a producer and potentially processed by multiple subscribers
      • related events usually grouped in topics
      • polling expensive so better for consumers to be notified
      • dbs traditionally haven’t supported notification
      • Messaging Systems
        • allows multiple producers to send messages of same topic and consumers to receive messages
        • if producers send messages too fast can 1) drop message, 2) buffer, 3) apply back pressure
        • durability may require writing to disk
        • whether dropped messages acceptable depends on app
        • some messaging systems use direct messaging from producer to consumer, but require application to handle loss
        • message brokers run as servers / some keep in memory
        • consumers generally asynchronous
        • messages generally deleted when successfully delivered to consumer
        • with multiple consumers, can have load balancing or fan-out
        • clients must acknowledge when a message has finished processing
        • redelivery after a failure can result in messages being processed out of order
      • Partitioned Logs
        • log-based message brokers: append-only sequence of records on disk
        • the log can be partitioned; the broker assigns a monotonically increasing sequence number (offset) to each message within a partition, so messages within a partition are totally ordered
        • does not delete logs after reading
        • if ordering is important and each message is fast to process, log-based approach may be good
        • consumer-managed offsets make consumption more flexible
        • you can replay old messages with different processing code, and recovery is simply resetting the offset (see the sketch below)
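A toy sketch (mine) of a log-based broker as described in these bullets: each partition is an append-only list, the offset is just an index, and a consumer is nothing more than an offset it advances (and can reset to replay).

```python
class PartitionedLog:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # same key -> same partition, so messages for that key stay ordered
        # (a real broker would use a stable hash of the key)
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)       # never deleted on read
        return p, len(self.partitions[p]) - 1    # (partition, offset)

    def read(self, partition, offset):
        return self.partitions[partition][offset:]

log = PartitionedLog()
log.append("user-1", "click")
p, _ = log.append("user-1", "purchase")

consumer_offset = 0
for msg in log.read(p, consumer_offset):
    print(msg)                                    # replayable: just reset the offset
consumer_offset = len(log.partitions[p])          # "acknowledge" by advancing the offset
```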
    • Databases and Streams
      • a write to db can be seen as an action
      • the replication log can be seen as a stream of write events
      • replicas eventually end up in same final state
      • Keeping Systems in Sync
        • no single system satisfies all requirements, so the same data is kept in several different views
        • having application code keep them in sync (dual writes) is error-prone
      • Change Data Capture
        • change data capture extracts the stream of changes written to a database so that derived systems (e.g. a search index or data warehouse) can be kept up to date
        • log consumers can be thought of as derived data systems
        • have to deal with replication lag
        • log compaction can reduce size
      • Event Sourcing
        • from domain-driven design (DDD); built on immutable events, but applied at the application level rather than in the storage engine
      • State, Streams, and Immutability
        • state is result of events that mutate it over time
        • storing change log makes state reproducible
        • the log is the source of truth and the database is a cache of the latest values
        • separate the form in which data is written from the form in which it is read
    • Processing Streams
      • Uses of Stream Processing
        • monitoring
        • complex event processing
        • stream analytics
        • maintaining materialized views
        • search on streams
      • Reasoning About Time
        • event time vs processing time
        • can correct for clock skew by tracking 1) when the event occurred according to the device clock, 2) when it was sent to the server according to the device clock, 3) when it was received according to the server clock
      • Stream Joins
        • stream-stream join: requires choosing a suitable join window and maintaining state
        • stream-table join: enriches. could be good to have local copy and keep updated with change log.
        • table-table
      • Fault Tolerance
        • micro batching and checkpointing
        • to give appearance of exactly-once processing in the presence of faults need to ensure that all outputs and side-effects of processing an event take effect if and only if the processing is successful
        • rebuild by keeping state in remote datastore and replicate it
        • or can keep state local to stream processor
  12. The Future of Data Systems
    • Data Integration
      • appropriate tool depends on circumstances
      • data can be used in several ways, so you often end up having to cobble together several different pieces of software
      • Combining Specialized Tools by Deriving Data
        • as number of representations increase, integration problem becomes harder
        • e.g. search index, analytics, caches, models, etc
        • could have a system of record database
        • if possible to funnel all user input through single system that decides on order, easier to derive other representations of data by processing in same order
        • can achieve similar goal as distributed transactions
        • transaction system provides linearizability (e.g. read own writes)
        • derived data systems are often updated asynchronously, so by default they may not provide the same guarantees
        • limits: all events pass through single leader node, may need to partition
        • if multiple leaders, ordering can become ambiguous
        • clients that maintain state (offline) may see things in different order
        • ordering could matter in some cases
        • could use timestamp, pass metadata, use unique identifier
      • Batch and Stream Processing
        • strong functional flavor
        • async is what makes system based on event logs robust
        • schema evolution limited to simple changes
        • lambda architecture: incoming data recorded by appending immutable events to an always-growing dataset
        • stream processor quickly produces an approximate update and batch processor later produces a correct version
        • having to maintain both can be significant effort
        • efforts being made to enable both batch and stream
    • Unbundling Databases
      • unix low-level abstraction / dbs high level
      • I would interpret the NoSQL movement as a way to apply a Unix-esque approach of low-level abstraction to the domain of OLTP data storage
      • Composing Data Storage Technologies
        • parallels between features built into dbs and the derived data systems people are building with batch/stream processors
        • keeping an index up to date
        • dataflow across an organization starts to look like one huge database
        • batch/stream processors are like elaborate implementations of triggers
        • two speculative avenues for building cohesive systems: 1) federated dbs (unifying reads), 2) unbundled dbs (unifying writes)
        • believe for data crossing boundaries between different technologies an asynchronous event log with idempotent writes is more robust and practical
        • not replacing databases in current form
        • goal is allow combination of several different databases in order to achieve good performance
      • Designing Applications Around Dataflow
        • database inside-out approach
        • overlap with dataflow languages (Oz, Juttle), functional reactive programming (Elm), and logic programming (Bloom)
        • application code as derivation function
        • separation of application code and state
        • dataflow: interplay between state changes and application code
        • application code responds to state changes
        • ordering matters and fault tolerance is key
        • similar to microservices, but services subscribe to streams of updates instead of making synchronous network requests
      • Observing Derived State
        • write path/read path
        • write path done eagerly
        • caches, indexes, materialized views shift this boundary
        • offline first world: think of on-device state as cache of state on server
        • can push state changes to clients with server-sent events or WebSockets
        • React, Flux, and Redux manage client state by subscribing to a stream of events
        • assumption of stateless clients and request/response interactions is deeply ingrained in our databases
        • reads are events too
    • Aiming for Correctness
      • The End-to-End Argument for Databases
        • application needs to take end-to-end measures to suppress duplication
        • the author believes transactions alone are not enough; we need to find the right new abstraction
      • Enforcing Constraints
        • uniqueness constraints require consensus
        • log-based messaging would guarantee order
        • handle multi-partition requests by giving each request a unique id and processing it in partitioned stages
      • Timeliness and Integrity
        • integrity more important than timeliness
        • in many business contexts, acceptable to temporarily violate constraints
        • e.g. bank transfers take time; violations can be fixed after the fact with a refund
        • measure cost of apology and see if requires timeliness
        • synchronous coordination can be introduced where needed
      • Trust, but Verify
        • software / hardware can have issues
        • run audits and check data
        • don’t blindly trust
        • event-based systems are reproducible
    • Doing the Right Thing
      • Predictive Analytics
        • can have direct impact on people’s lives
        • people could suffer severe consequences
        • ML systems can encode bias and discrimination, which may be hard to identify
        • recourse may be impossible
        • dangerous to have blind belief in the supremacy of data for decision making
        • need to figure out ways to make algorithms accountable and transparent
        • feedback loops and echo chambers
      • Privacy and Tracking
        • surveillance
        • can reveal intrusive things
        • consent and freedom: users have little knowledge
        • terms set by service
        • social cost to not using
        • surveillance becomes inescapable
        • data is asset and power
        • data is the pollution problem of the information age
        • need to self-regulate data collection and allow individuals to maintain privacy
        • should not retain forever