Building Microservices

  1. Microservices
    • Intro
      • Emerged in context of domain driven design, CI/CD, on-demand virtualization, infrastructure automation, small autonomous teams, systems at scale
      • more flexibility to respond faster is goal
    • What are microservices?
      • small autonomous services that work together
      • service boundaries based on business boundaries
      • shouldn’t be too big (~2 weeks to rewrite)
      • separate entities, ideally each on its own machine
      • communication via network calls, hide internal representations, allow loose coupling
      • should be able to make changes, deploy without changing anything else
    • Key Benefits
      • tech heterogeneity: different tech for different needs e.g. language or data persistence. reduce risk of new tech and absorb new tech quicker.
      • resilience: system can keep working and handle failed services, but need to know how to handle machine and network failures
      • scaling: can scale just parts that need it
      • ease of deployment: can ship or rollback quickly. slow releases means changes build up increasing risk.
      • organizational alignment: small teams on small code base are more productive
      • composability: reuse. these days can interact with web, native, mobile web, devices, etc.
      • optimizing for replaceability: big legacy hard to change and risky
    • What About Service-Oriented Architecture?
      • microservices is specific approach to SOA
      • microservices:SOA::XP/scrum:Agile
      • SOA doesn’t offer ways to ensure services don’t become too coupled or how to split up something big
    • Other Decompositional Techniques
      • libraries: same language / platform. ability to deploy changes in isolation is still reduced, and the library can become a coupling point.
      • modules: some allow life cycle management, but module authors must deliver proper module isolation and can be big source of complexity.
      • having process boundary enforces clean hygiene
    • No Silver Bullet
      • still hard and not always appropriate
      • must get better at deployment, testing and monitoring
  2. The Evolutionary Architect
    • Inaccurate Comparisons
      • architects need to have technical vision for how to deliver system to customers
      • scope can vary by size/company
      • can have huge impact but people often get wrong
      • industry is young and don’t know what good looks like
      • not quite engineering, no hard certifications, changing environment / requirements
      • “architect” terminology caused a lot of harm propagating image of planner who expects others to carry vision out
    • An Evolutionary Vision for the Architect
      • requirements, tools, tech shift rapidly
      • need to react and adapt to users
      • need to create framework in which right system can emerge and continue to grow
      • like city planner, set zones and let people make specific decisions
      • can’t foresee everything so shouldn’t over-specify everything
      • Ecosystem includes users, developers, and other workers
      • should step in on highly specific implementation details only in limited cases
    • Zoning
      • zones are like service boundaries or coarse-grained groups inside services
      • more important to be concerned about what happens between zones than within
      • getting things wrong here leads to all sorts of problems that are hard to correct
      • some constraints include high cost to maintain knowledge/expertise for many technologies plus hiring
      • there are some benefits of tooling/expertise around specific tech
      • can have a lot of issues around integration e.g. REST, protobuf, Java RMI
      • coding architect: need to understand impact of decision and what normal looks like. should be routine.
    • A Principled Approach
      • how to make decisions
      • strategic goals: company goals. tech must be aligned
      • principles: rules that align what you are doing to strategic goals. fewer than 10 is good.
      • practices:
        • how to ensure principles are carried out.
        • practical guide that any developer should be able to understand
        • can change often and very technical
      • can combine, but for larger orgs may want distinction for different types of teams
    • The Required Standard
      • need to think about how much variance allowed
      • think about a “good” service and what are common characteristics of well-behaved services?
      • might not want too much divergence here
      • need to balance optimizing autonomy and losing sight of bigger picture
      • system wide health monitoring is essential
      • should standardize how this and logging is done
      • keep 1 or 2 interfaces and know how you will use and define
      • architectural safety:
        • one bad service shouldn’t ruin everything
        • each may have own circuit breaker
        • each needs to play by rules consistently otherwise may have more vulnerable system
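The circuit breaker idea from the architectural safety notes can be sketched roughly as below. This is a minimal illustration, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a sick service.
                raise CircuitOpenError("downstream call skipped")
            # Cool-down elapsed: allow a trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Each service would wrap its own downstream calls, e.g. `breaker.call(fetch_stock, sku)`, so one bad dependency fails fast rather than tying up callers.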
    • Governance Through Code
      • exemplars: real world examples that are good. hard to go wrong if imitating
      • tailored service templates: out of box set of libraries that provide health checking, serving http, exposing metrics and team/company specific context
      • might subtly constrain language choices
      • try not to make centralized, teams should have joint responsibility
      • may hurt morale/productivity by enforcing a framework
      • should be about ease of use for developers
      • can result in source of coupling b/w services
    • Technical Debt
      • sometimes cut corners to ship, but the debt has a cost
      • sometimes due to changing vision
      • need to understand balance and have view of debt
      • some orgs let teams decide how to track and pay down debt, other places maintain a log and review regularly
    • Exception Handling
      • keep track of when you make decisions that deviate from principles and practices
      • if happens enough may want to revisit rules
      • some places more strict
      • microservices about autonomy, if company places a lot of restrictions, microservices may not be for you
    • Governance and Leading from the Center
      • ensure what we are building matches vision
      • do with governance group
      • should predominantly consist of people who are executing the work being governed
      • group as whole responsible for governance
      • usually architect follows decision of group unless very dangerous/risky, but can undermine position and make team feel like they don’t have a say
    • Building a Team
      • not just tech decisions, but people and helping them grow
      • have them involved in shaping and implementing vision
      • microservices provides way for people to step up and accept more responsibility
    • Conclusion
      • vision, empathy, collaboration, adaptability, autonomy, governance
      • constant balancing
      • worst is to be rigid
  3. How to Model Services
    • Introducing MusicCorp
      • helps to have an example
      • want to grow but make changes easily
    • What Makes a Good Service?
      • loose coupling & high cohesion
      • loose coupling:
        • change in one service doesn’t affect another
        • tight integration style can lead to coupling
        • services should know little about others
        • limit types of calls
      • high cohesion:
        • related behaviors sit together
        • make behavioral changes in one place
        • making changes in multiple places slower and riskier
    • The Bounded Context
      • in domain driven design model real world domains
      • there are multiple bounded contexts and there are things that should and shouldn’t be shared
      • each bounded context has explicit interface where it decides what models to share
      • bounded context: specific responsibility enforced by explicit boundaries
      • communicate using models
      • musiccorp has warehouse, reception desk, finance, ordering
      • shared and hidden models:
        • some items only relevant to one bounded context
        • will be shared items, but can have internal and external representations based on what needs to be shared
        • models with same name may have different meanings in different contexts
      • modules and services:
        • thinking about what should and shouldn’t be shared helps avoid tight coupling
        • bounded contexts lend themselves well to being compositional boundaries
        • using modules good place to start
      • premature decomposition:
        • very costly to be wrong with services, wait for things to stabilize
        • easier to have something existing and decompose
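The shared and hidden models point above can be illustrated with a small sketch. The warehouse context keeps a richer internal model and exposes only what other contexts need. The class and field names here are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class StockItem:
    """Internal warehouse model; shelf_location never leaves the context."""
    sku: str
    quantity: int
    shelf_location: str


@dataclass(frozen=True)
class StockLevel:
    """External representation shared with other contexts such as finance."""
    sku: str
    quantity: int


def to_shared(item: StockItem) -> StockLevel:
    # Only the fields other contexts need cross the boundary.
    return StockLevel(sku=item.sku, quantity=item.quantity)
```

Keeping the conversion in one function makes the explicit interface of the bounded context easy to find and review.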
    • Business Capabilities
      • for each bounded context, think about what capabilities it provides rest of domain
      • first, think about what does this context do
      • second, what data does it need?
      • these capabilities become key operations exposed over the wire
    • Turtles All the Way Down
      • can subdivide contexts or break out contexts into higher level contexts
      • easier to test/stub nested bounded contexts
    • Communication in Terms of Business Concepts
      • often want changes that business wants
      • if services decomposed along bounded contexts changes likely isolated
      • the same terms and ideas shared between parts of organization should be reflected in your interfaces
    • The Technical Boundary
      • can slice horizontally by tech e.g. frontend, data access over db
      • not always wrong
      • may make sense to achieve certain performance
  4. Integration
    • Looking for the Ideal Integration Technology
      • Avoid breaking changes: consumers shouldn’t also need to make change
      • keep APIs tech agnostic
      • simple for consumers to use: can allow total freedom or provide libraries
      • hide internal implementation details
    • Interfacing with Customers
      • more complex than CRUD app, can kick off multiple processes
    • The Shared Database
      • most common form of integration
      • consumers bound to db internal implementation details, so changes can break them
      • consumer tied to specific tech
      • update operation might be spread between consumers / lose cohesion
      • easier to share data but not behavior
    • Synchronous Versus Asynchronous
      • important because guides implementation details
      • sync easier to reason about
      • async better for long-running or low latency but more involved
      • req/res is sync, but can make async with callbacks
      • event-based: client emits an event and expects subscribers to know what to do
      • client doesn’t have to know about subscribers / highly decoupled
    • Orchestration Versus Choreography
      • orchestration: central brain can take on too much authority and become the central point where logic lives
      • choreography: services subscribe and act accordingly. more decoupled, but need to build in monitoring and tracking
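A rough sketch of choreography, assuming an in-process `EventBus` as a stand-in for a real message broker. The event name and the two subscribers are made up for illustration.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a message broker (an assumption for this sketch)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()
actions = []

# Each service reacts on its own; the emitter does not know who is listening.
bus.subscribe("customer_created", lambda e: actions.append(("loyalty", e["id"])))
bus.subscribe("customer_created", lambda e: actions.append(("email", e["id"])))

bus.publish("customer_created", {"id": 42})
```

Adding a third reaction means adding a subscriber, not changing the emitter. The trade-off noted above still applies: nothing in this code tracks whether every subscriber finished, so that monitoring has to be built separately.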
    • Remote Procedure Calls
      • make remote call look like local call
      • some implementation details tied to specific network protocol
      • easy to use
      • can have tech coupling e.g. Java RMI, but thrift and protobuf have a lot of lang support
      • local calls are not like remote calls: they have different costs and reliability profiles that users should be aware of
      • brittleness: clients and servers may have to be updated in lock-step. in practice objects used in binary serialization thought of as expand-only types
    • REST
      • concept of resources
      • doesn’t talk about underlying protocol but mostly used with HTTP
      • large ecosystem and tools for caching proxies, load balancing, monitoring and security
      • hypermedia as the engine of application state:
        • piece of content has links
        • client performs interactions via links
        • greatly abstracted from underlying details
        • can be chatty, but good to start here and optimize later
      • json, xml, else?
        • json most popular
        • xml may have better link control and can also extract certain parts with xpath
      • too much convenience:
        • some frameworks take db representation and deserialize into in-process objects and directly expose
        • results in high coupling
        • one strategy is delaying data persistence until interface has stabilized / store in local disk
        • too easy to let db storage influence how we send models over the wire
        • defers work, but for new service boundaries acceptable trade-off
      • downsides to REST over HTTP:
        • cannot easily generate client stubs, which may result in RPC over http or shared client libraries
        • some frameworks don’t support all verbs
        • may not be as lean as binary protocols
        • might not be great for low latency
        • web sockets might be better for streaming
        • consuming payloads might be more work than some RPC implementations with advanced serialization
        • generally is a sensible default though
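The hypermedia notes above can be sketched as a client that looks up links by relation name rather than hard-coding URIs. The resource shape, field names, and paths below are assumptions for illustration only.

```python
# Hypothetical resource representation; the field names are assumptions.
album = {
    "title": "Give Blood",
    "links": [
        {"rel": "artist", "href": "/artist/theBrakes"},
        {"rel": "purchase", "href": "/instantpurchase/1234"},
    ],
}


def find_link(resource, rel):
    """Return the href for a link relation, or None if the server omits it."""
    for link in resource.get("links", []):
        if link.get("rel") == rel:
            return link["href"]
    return None
```

Because the client only knows the relation name `purchase`, the server is free to change the underlying URI without breaking it.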
    • Implementing Asynchronous Event-Based Collaboration
      • message brokers like RabbitMQ
      • enterprise solutions try to pack in other things; keep the middleware dumb
      • can use http, but not good at latency, may have to keep track of own polling schedule
      • leads to complexity, should specify max retry and have good monitoring and be able to trace requests
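A minimal sketch of the max-retry idea for a message consumer. The `dead_letters` list for parking failed messages is an assumption added for the example, not something stated in the notes.

```python
def process_with_retries(message, handler, max_retries=3, dead_letters=None):
    """Try handler up to max_retries times, then park the message."""
    last_error = None
    for _ in range(max_retries):
        try:
            return handler(message)
        except Exception as exc:  # a real consumer would catch narrower errors
            last_error = exc
    # Retries exhausted: keep the message somewhere it can be inspected,
    # instead of retrying forever and possibly crashing every worker.
    if dead_letters is not None:
        dead_letters.append((message, repr(last_error)))
    return None
```

A real consumer would also log each failed attempt with a request identifier so that the message can be traced.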
    • Services as State Machines
      • microservice should own all logic associated with behavior in this context
      • avoid dumb services that only CRUD
      • shouldn’t let decisions about domain leak out of right service
      • have one place to manage state collisions or attach behaviors that correspond to changes
    • Reactive Extensions
      • compose results of multiple calls and run operations on them
      • observe outcome and react when something changes
      • abstract out details of how calls are made
      • good for handling concurrent calls to downstream services
    • DRY and the Perils of Code Reuse in a Microservice World
      • shared libraries can lead to too much coupling
      • may not be able to independently upgrade client and server
      • things like logging are ok / invisible to outside world
      • don’t violate DRY within a service, but be more relaxed across services
      • too much coupling worse than code duplication
      • if you have client libraries, have separation, maybe separate people do API vs library
      • keep logic out of client library
    • Access by Reference
      • how to pass around info for domain entities
      • if in memory, can be out of date
      • can include a reference i.e. URI
      • may cause too much load / can indicate freshness by including a timestamp
      • no hard rule, but be wary about freshness
    • Versioning
      • defer as long as possible by picking right integration tech and decoupling clients (e.g. tolerant reader)
      • catch breaking changes early with consumer-driven contracts
      • semantic versioning / MAJOR.MINOR.PATCH
      • co-exist different end points:
        • gives consumers time to migrate
        • expand and contract
        • can include version in headers or uri
      • multiple concurrent service versions:
        • usually not great idea
        • need to handle directing consumers and may need to manage/fix two services
        • may only want to do if short-lived
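The tolerant reader mentioned above can be sketched as a consumer that picks out only the field it needs, wherever it sits in the payload. The payload shapes and the `email` field are hypothetical.

```python
import json


def read_customer_email(payload: str) -> str:
    """Pull out only the field this consumer needs and ignore the rest."""
    doc = json.loads(payload)

    # Search for the field wherever it sits, so restructuring or new
    # fields in the response do not break this consumer.
    def find(node, key):
        if isinstance(node, dict):
            if key in node:
                return node[key]
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            return None
        for child in children:
            found = find(child, key)
            if found is not None:
                return found
        return None

    email = find(doc, "email")
    if email is None:
        raise KeyError("email")
    return email
```

This reader keeps working if the producer adds fields or nests `email` under a new parent. Removing or renaming the field is still a breaking change.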
    • User Interface
      • historically have been thick, moved to thin, and now thick again
      • in recent history, UI has become compositional layer that weaves together various capabilities
      • constraints vary by how people interact e.g. mobile must factor in physical usage, battery, network bandwidth, no right click
      • API composition: UI can directly interact with services, but can be very chatty and little ability to tailor responses by device
      • UI fragment composition: different services compose their widget. issues with consistency / UX. some capabilities may not fit as a widget. might not work with native.
      • backends for front ends:
        • have api gateway to handle, but don’t want one layer
        • better to have service for each frontend e.g. mobile, customer website
        • think as part of UI
        • careful it doesn’t take on logic it shouldn’t
        • business logic should reside in services
        • only contain behavior specific to UI
      • hybrid: some solutions work better for different things
      • don’t put too much behavior into intermediate layers
    • Integrating with Third-Party Software
      • rational to use off-the-shelf software / buy when tool isn’t special
      • lose control to vendor / should be part of tool selection process
      • customization may be hard or expensive
      • might have communication protocols that involve reaching into datastore
      • one solution: for CMS wrap (facade) into service and keep scope of what it does to a minimum and make easy to access
      • strangler pattern:
        • intercept calls to legacy system and decide route to new or legacy
        • replace functionality gradually
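The strangler routing decision can be sketched as a small function that sits in front of the legacy system. The path prefixes and backend names are hypothetical.

```python
MIGRATED_PREFIXES = ("/catalog", "/orders")  # hypothetical migrated paths


def route(path: str) -> str:
    """Decide which backend should serve a request path."""
    for prefix in MIGRATED_PREFIXES:
        # Match whole path segments so "/catalogue" is not caught by "/catalog".
        if path == prefix or path.startswith(prefix + "/"):
            return "new-service"
    return "legacy"
```

Functionality moves over gradually by adding prefixes to the tuple. Anything not listed keeps going to the legacy system.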
  5. Splitting the Monolith
    • It’s All About The Seams
      • in a monolith changes can impact rest of system and will have to redeploy entire system
      • seam: portion of code that can be treated in isolation
    • Breaking Apart MusicCorp
      • identify high level bounded contexts
      • first create packages representing them
      • packages should interact and have dependencies similar to the real-world groups they represent
      • work incrementally!
    • The Reasons to Split the Monolith
      • start with where you will get most benefit
      • areas we expect lots of changes?
      • team organization
      • do some functionalities require more security e.g. finance?
      • would some service benefit from certain type of tech?
    • Tangled Dependencies
      • find least depended on seam
      • view services as a DAG
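One way to find the least depended on seam is to count incoming dependencies in the package graph. The graph below is a made-up example, not MusicCorp's actual structure.

```python
from collections import Counter

# Hypothetical package dependencies: each key depends on the listed values.
depends_on = {
    "catalog": ["warehouse"],
    "finance": ["customer", "warehouse"],
    "customer": [],
    "warehouse": [],
    "recommender": ["catalog", "customer"],
}


def least_depended_on(graph):
    """Return seams ordered by how many other seams depend on them."""
    incoming = Counter({name: 0 for name in graph})
    for deps in graph.values():
        incoming.update(deps)
    # Ties are broken alphabetically to keep the order stable.
    return sorted(graph, key=lambda name: (incoming[name], name))
```

Seams at the front of the list have nothing depending on them, so pulling them out first disturbs the least code.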
    • The Database
      • common area of integration
    • Getting to Grips with the Problem
      • see which parts interact with db
      • usually via repository layer
      • split up repository to different parts corresponding to bounded contexts
    • Example: Breaking Foreign Key Relationships
      • prevent services from directly accessing rows in a table
      • set up API from owning service
      • additional db calls, but may still be performant
      • lose foreign key and consistency checks
      • domain or people defining system should have ideas on how to represent cases with violations
    • Example: Shared Static Data
      • e.g. list of countries
      • A) could duplicate in tables of services
      • B) config file or keep in code
      • C) separate service if sufficiently complex
      • generally config file is simplest
    • Example: Shared Data
      • services have to reach into same table
      • can mean there is a domain concept that isn’t explicitly modeled
    • Example: Shared Tables
      • may not be normalized or can break it up
    • Refactoring Databases
      • first split schema
      • then split service
      • review and assess things first with new schema broken out
      • make easier to revert
    • Transactional Boundaries
      • lose this safety
      • try again later: can have a queue
      • abort entire operation: need a compensation transaction. can be hard to reason about. what if it fails?
      • distributed transactions: transaction manager gets consensus and each system needs to OK to proceed / can be blocking
      • If you really need transactional safety, think about creating a concept to represent the transaction itself and keep logic here
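The abort-with-compensation option can be sketched as follows. The function name and the idea of passing (action, compensate) pairs are assumptions for the example.

```python
def run_with_compensation(steps):
    """Run (action, compensate) pairs; undo completed steps if one fails."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo in reverse order. A compensation can itself fail, in which
        # case a real system needs retries or manual cleanup.
        for compensate in reversed(completed):
            compensate()
        return False
    return True
```

The sketch deliberately leaves the "what if it fails?" question open: an exception inside a compensation simply propagates to the caller.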
    • Reporting
      • Need to figure out how to make architecture work with existing processes
    • The Reporting Database
      • have a read replica, but can’t structure in different way and comes in one format
    • Data Retrieval via Service Calls
      • ok if simple, but not good for large volumes
      • can hurt load
    • Data Pumps
      • each service pushes data to reporting db
      • teams owning service should implement pump and version/deploy together
      • schedule to run regularly
      • can aggregate via materialized view
    • Event Data Pump
      • subscribe to events and pump data to reporting database
      • less coupling / bind to emitted event
      • fresher data and fewer inserts
      • separate group can manage this easier
      • all required info must be broadcast and may not scale well
    • Backup Data Pump
      • use backup data as source
      • still have coupling to dest. reporting schema
    • Toward Real Time
      • is reporting data all going to one place?
      • there are dashboards, alerting, financial reports, user analytics
      • different tolerances for accuracy and timeliness
      • things moving to more generic eventing systems
    • Cost of Change
      • promote incremental change
      • think about impact / you will make mistakes
      • make mistakes where impact will be lowest
      • cost of change and mistakes lowest at white board
      • think about how things interact. are there circular references? overly chatty apps?
      • class responsibility collaboration (CRC) cards: names, responsibilities, collaborators
      • when working with design, can help see how things hang together
    • Understanding Root Causes
      • growing a service is okay, but should split before that becomes too expensive
      • hard to know where to start, but also challenges with running and deploying services
  6. Deployment
    • A Brief Introduction to Continuous Integration
      • keep everyone in sync and make sure code properly integrated
      • use consistent artifact for testing and deploying
      • automate creation of artifact and version control
      • Are you really doing it?
        • do you check into main line once a day?
        • suite of tests to validate changes?
        • is fixing a broken build #1 priority?
    • Mapping Continuous Integration to Microservices
      • monolithic build: requires lock step release
      • prefer one CI build per service with one service repo resulting in one artifact
    • Build Pipelines and Continuous Delivery
      • multiple stages to build pipeline
      • compile/fast tests, slow tests, UAT, performance testing, production
      • production readiness of every check-in
      • usually want one microservice per build, but major exception is if new team/project
      • can keep services larger until understanding of domain stabilizes
      • if unable to get stable service boundaries while breaking things out, merge back into a monolithic service
    • Platform-specific Artifacts
      • e.g. ruby gems, java jar/war, python eggs
      • may not be enough for some e.g. ruby/python and still need process manager inside Apache/Nginx
      • automation can hide differences in deploy mechanisms
    • Operating System Artifacts
      • Red Hat/CentOS have RPMs, Ubuntu has deb, Windows has MSI
      • can be hard to build
      • potential overhead of multiple different OS
    • Custom Images
      • can take a long time to provision instances from scratch
      • bake in common dependencies to virtual machine image
      • building can take long time and size can be large
      • building for different platforms can be a challenge e.g. VMWare, AWS AMI, Vagrant, Rackspace
      • can also bake service into image
      • immutable services: avoid config drift and make config changes go through build pipeline
    • Environments
      • as you progress i.e. laptop -> build server -> UAT -> prod
      • want to ensure environments are more and more production-like to catch problems sooner
      • constant balance between cost and fast feedback
    • Service Configuration
      • ideally small otherwise run into problems occurring only in certain environments (very painful)
      • use one single artifact and manage configs separately
      • can also have dedicated system if dealing with large number of microservices
    • Service-to-Host Mapping
      • because of virtualization use term “host” over machine as general unit of isolation
      • can put multiple services on one host, but makes monitoring, deploying, scaling harder
      • recommend single service per host
        • easier to monitor / remediate
        • reduce single point of failure
        • easier to scale single service
        • focused security concerns
        • use alternate deployment techniques
        • cost may include more servers and hosts to manage
        • PAAS can be good if works, but might not be customizable
    • Automation
      • automation is essential
      • give devs ability to self-service and provision services
      • picking tech that enables is highly important which starts with tools used to manage hosts
    • From Physical to Virtual
      • traditional virtualization: independent machines, but overhead / cost of splitting
      • vagrant: deployment platform usually used for dev/test. can tax average machine and may not be able to run entire system.
      • linux container:
        • don’t need hypervisor, containers share same kernel (where process tree lives)
        • faster and more lightweight than VM
        • still need to route to outside world
        • also processes can interact with other containers
        • if you want isolation, VM may be better
      • docker:
        • hides underlying technology
        • fast/lightweight
        • deploy and install docker apps
        • still need scheduling layer e.g. Kubernetes
    • A Deployment Interface
      • should have uniform interface
      • like using single parameterized command-line call with name, version, environment e.g. “deploy artifact=catalog environment=local version=local”
      • version is usually local, latest, or specific
      • Fabric (python library) useful for mapping command line calls to functions / can pair with Boto (AWS client library) to automate AWS environment
      • can use yaml files for environment definitions
      • a lot of upfront work but essential for managing deployment complexity
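The single parameterized deploy call can be sketched as a small argument parser. This is plain Python rather than Fabric, and the function name and validation rules are assumptions.

```python
def parse_deploy_args(argv):
    """Parse key=value arguments into a deployment request."""
    allowed = {"artifact", "environment", "version"}
    request = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in allowed or not value:
            raise ValueError(f"bad argument: {arg!r}")
        request[key] = value
    missing = allowed - request.keys()
    if missing:
        raise ValueError(f"missing: {sorted(missing)}")
    return request
```

For the example command in the notes, `parse_deploy_args("artifact=catalog environment=local version=local".split())` yields a dict with the three keys, which a deploy script could then dispatch on.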
  7. Testing
    • Types of Tests
      • tech facing: unit and property help developers
      • business facing: acceptance and exploratory
    • Test Scope
      • test pyramid top to bottom: UI, service, unit
      • unit tests:
        • simple functions or methods
        • not launching any services and limiting use of external files or network connections
        • small individual scope, fast, helps refactor
      • service tests:
        • test services directly
        • in monolithic app might be collection of classes
        • only test service itself and stub external collaborators
      • end-to-end tests:
        • entire system
        • high test scope and confidence
        • too many of these can be slow and make builds hard
    • Implementing Service Tests
      • stubs return canned responses
      • mocks make sure calls were made
      • Mountebank: tool for stub/mocks
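The stub versus mock distinction can be shown with Python's standard `unittest.mock`. The `award_points` function and the `points_bank` collaborator are hypothetical names for the example.

```python
from unittest.mock import Mock


def award_points(customer_id, points_bank):
    """Code under test: asks a collaborator for a balance, then adds points."""
    balance = points_bank.get_balance(customer_id)
    points_bank.set_balance(customer_id, balance + 10)
    return balance + 10


points_bank = Mock()
# Stub behaviour: a canned response, no matter how often it is called.
points_bank.get_balance.return_value = 15000

new_balance = award_points("cust-1", points_bank)

# Mock behaviour: verify the expected call was actually made.
points_bank.set_balance.assert_called_once_with("cust-1", 15010)
```

The stubbed `get_balance` does not care whether it is called zero or a hundred times. The mock assertion on `set_balance` fails the test if the call is missing or has different arguments.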
    • Those Tricky End-to-End Tests
      • easy to cover same ground
      • slow and might be hard to diagnose issues
    • Flaky and Brittle Tests
      • can have failures not due to app and make test untrustworthy
      • hard to determine who writes these tests
      • don’t want another team to own completely
      • treat like shared code base
    • Test Journeys, Not Stories
      • instead of every functionality, focus on small number of core journeys
      • any functionality not covered here, use service tests
      • should have low double digits even for complex systems
    • Consumer-Driven Tests to the Rescue
      • consumer defines expected behavior
      • producer checks that incoming API call receives expected behavior
      • run only against single producer in isolation
      • sit at same level as service tests
      • pact is open source tool that helps do this
      • requires good communication between consumer and producer teams
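A rough sketch of a consumer-driven contract check, written in plain Python rather than with Pact. The contract format, field names, and the stand-in producer are all assumptions.

```python
# Expectation written by the consumer team (field names are hypothetical).
contract = {
    "request": {"method": "GET", "path": "/customer/1"},
    "response_fields": {"id": int, "email": str},
}


def producer_handle(method, path):
    """Stand-in for the producer under test, run in isolation."""
    if method == "GET" and path == "/customer/1":
        return {"id": 1, "email": "sam@example.com", "tier": "gold"}
    return None


def verify(contract, handler):
    """Return a list of contract violations; empty means the producer complies."""
    req = contract["request"]
    body = handler(req["method"], req["path"])
    if body is None:
        return ["no response"]
    problems = []
    for field, expected_type in contract["response_fields"].items():
        if not isinstance(body.get(field), expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    return problems
```

The check passes even though the producer returns an extra `tier` field, because the consumer only states what it relies on. Running `verify` in the producer's build catches a breaking change before it reaches the consumer.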
    • So Should You Use End-to-End Tests?
      • could potentially get rid of, but can be good training wheels
      • also depends on team’s appetite to learn in production
    • Testing After Production
      • can always add more tests to catch errors but diminishing returns
      • can’t reduce chance of failure to zero
      • separate deployment from release
      • blue/green: deploy and run smoke tests
      • canary: trickle traffic / coexisting versions / sometimes copy production load
      • sometimes more beneficial to get better at remediation of a release than adding more automated functional tests
      • many orgs don’t spend effort on better monitoring or failure recovery
      • in certain cases, e.g. trying to validate business idea, may not need tests
    • Cross-Functional Testing
      • acceptable latency, number of users supported, security
      • falls under property testing
      • may want to track at service level and decide acceptable thresholds
      • performance tests:
        • more network calls
        • tracking down sources of latency is important
        • run in prod-like environment
        • takes time maybe run a subset everyday
        • make sure to actually look at the numbers
  8. Monitoring
    • Intro
      • monitor the small things and use aggregation to see bigger picture
    • Single Service, Single Server
      • host: CPU, memory
      • server: logs
      • application: response time or errors
    • Single Service, Multiple Servers
      • monitor same things
      • can use ssh multiplexer to check multiple at same time
      • can use load balancer to track response times
    • Multiple Services, Multiple Servers
      • harder to do
      • need collection and central aggregation
    • Logs, Logs, and Yet More Logs…
      • use centralized subsystem to make available centrally e.g. logstash
      • kibana is tool for viewing and querying logs
    • Metric Tracking Across Multiple Services
      • look at metrics long-enough time frame to know “normal”
      • should be able to look at aggregations at different levels and drill down
      • graphite is tool that handles some of this and compacts older metrics by reducing their resolution
      • can be helpful for capacity planning
    • Service Metrics
      • try to expose a lot of metrics
      • 80% of software features never used
      • metrics can help inform what is actually being used
      • can be hard to know what will be useful in future so err on side of exposing everything
    • Synthetic Monitoring
      • fake events or synthetic transactions
      • often better indicator that something is working than low level metrics
      • can use end-to-end tests but be careful not to trigger things by accident
    • Correlation IDs
      • hard to track especially if things async
      • generate GUID and pass along to subsequent calls
      • put in soon, otherwise may not have it when you need it
      • could have in thin shared client wrapper, but keep very thin
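Generating and passing along a correlation ID can be sketched with two small helpers. The header name `X-Correlation-ID` and the function names are assumptions, not a standard.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # header name is an assumption


def ensure_correlation_id(incoming_headers):
    """Reuse the caller's ID if present, otherwise start a new one."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def outgoing_headers(correlation_id, extra=None):
    """Headers to attach to every downstream call made for this request."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers
```

The first service in a chain creates the GUID, and every later service reuses it. Writing the same ID into each log line lets an aggregated log search reconstruct the whole call chain.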
    • The Cascade
      • services might look healthy but can’t talk
      • important to monitor integration points between services
    • Standardization
      • balance between narrow decisions vs standardization
      • monitoring is one area where standardization is incredibly important
      • must be able to view in holistic way
      • write logs in std format
      • have metrics in one place
      • standardize metric names too
    • Consider the Audience
      • what do they need to know right now in order to react?
      • what might people need later?
      • how do people like consuming data?
      • create big displays and make accessible
    • The Future
      • metrics siloed in many orgs/companies
      • can think of metrics as events
      • tying user and application logs can help real-time analysis
      • many orgs moving away from specialized tool chains to more generic event routing systems
  9. Security
    • Authentication and Authorization
      • authentication: confirm principal is who they say they are
      • authorization: allowed actions
      • ideally not every service will need to handle separately
      • common single sign-on implementations:
        • SSO e.g. SAML or OpenID Connect
        • principal trying to access resource directed to authenticate with identity provider
        • username/pw or 2-factor
        • identity provider can be internal (LDAP) or external
      • single sign on gateway:
        • use gateway to act as proxy
        • downstream services can get info about principal in headers
        • harder to reason about service in isolation
        • potentially single point of failure
        • careful gateway layer doesn’t take on too much responsibility / results in coupling
      • fine-grained authorization:
        • can put people in groups
        • let microservices make decisions based on them
        • careful to not embed too much logic into group e.g. CALL_CENTER_50_DOLLAR_REFUND
        • model around how your organization works
    • Service-to-Service Authentication and Authorization
      • allow everything inside perimeter:
        • trust implicitly
        • if attacker penetrates network, little protection from man-in-the-middle attack
      • HTTP(S) Basic Authentication:
        • username/pw in header, but visible if HTTP
        • HTTPS can mitigate but need to manage certificates for different machines
        • traffic sent via SSL cannot be cached by reverse proxies
        • can have load balancer terminate SSL traffic and cache behind it
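A short sketch of why Basic Authentication needs HTTPS: the credentials are only base64-encoded, not encrypted, so anyone who can read the header can reverse it. The function names are illustrative.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an Authorization header for HTTP Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """Recover the credentials; this is all an eavesdropper on plain HTTP has to do."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic credential")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password
```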
      • SAML or OpenID Connect:
        • still need to route over HTTPS
        • need accounts for clients
        • keep scope narrow
        • need to securely store credentials
      • client certificates:
        • more onerous than server-side certificates
        • only want to use when especially concerned about sensitivity of data being sent
      • HMAC over HTTP:
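        • hash-based message authentication code
        • request body is hashed together with shared secret key and hash is sent along with request, so server can detect any modifications
        • traffic can be cached more easily than with HTTPS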
        • downside: 1) both client and server need shared secret 2) pattern not standard / can be hard to get right 3) only ensures no changes, still visible
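A minimal sketch of request signing with an HMAC, using Python's standard `hmac` module. SHA-256 and a hex-encoded signature are assumptions here; client and server would have to agree on both.

```python
import hashlib
import hmac


def sign(body: bytes, shared_secret: bytes) -> str:
    """Client side: hash the request body together with the shared secret."""
    return hmac.new(shared_secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, shared_secret: bytes, signature: str) -> bool:
    """Server side: recompute the hash and compare in constant time."""
    return hmac.compare_digest(sign(body, shared_secret), signature)
```

The body itself still travels in the clear. The signature only shows that it was not altered and that the sender knows the secret.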
      • API Keys:
        • use to identify and set restrictions
        • common to use public and private key-pair
      • The Deputy Problem:
        • malicious party can try to trick and make calls they shouldn’t be able to
        • no simple answer
        • 1) implicit trust
        • 2) verify identity of caller
        • 3) asking caller to provide credentials of original principal
    • Securing Data at Rest
      • data is liability
      • should be encrypted
      • go with what is well known / DO NOT TRY TO IMPLEMENT YOURSELF
      • It’s all about keys:
        • one solution is to use separate security appliance to encrypt and decrypt
        • life cycle management of keys can be vital operation
      • pick your targets: can limit encryption to sensitive data
      • decrypt on demand: encrypt when first seen, decrypt only when needed, and ensure decrypted data is not stored anywhere
      • encrypt backups
    • Defense in Depth
      • firewalls: restrict traffic/access by port/IP
      • logging: lets you see if something bad happened. make sure not to leak important info
      • intrusion detection: monitor network/hosts and report problems
      • network segregation:
        • VPC
        • control which hosts can see each other
        • routing traffic through gateways to proxy access results in multiple perimeters
      • OS:
        • patch software regularly and automatically
    • Be Frugal
      • don’t store what you don’t need
      • can’t be stolen or demanded from you
    • The Human Element
      • revoke credentials when someone leaves
      • social engineering protection
      • what damage can be done by a disgruntled employee?
    • The Golden Rule
      • Don’t write your own crypto!
    • Baking Security In
      • educate developers
      • familiarize with OWASP Top ten list, OWASP security testing framework
      • automated tools can probe for vulnerabilities
      • can integrate into CI Builds
    • External Verification
      • external assessments
      • penetration testing
  10. Conway’s Law and System Design
    • Evidence
      • loosely/tightly coupled orgs:
        • tightly coupled orgs e.g. commercial product firms produced less modular software than loosely coupled orgs e.g. distributed open source communities
      • for windows vista, microsoft found metrics associated with org structure proved to be most statistically relevant measures of software quality
    • Netflix and Amazon
      • amazon saw benefit of team owning lifecycle of system. 2 pizza team and AWS allowed self-sufficiency
      • Netflix wanted small independent teams to optimize speed of change
    • Adapting to Communication Pathways
      • single teams with single service can make changes quicker due to low cost of communication
      • geographically dispersed teams might have harder time resulting in hard to maintain code
      • single team owning many services results in tight integration
    • Service Ownership
      • own requirement gathering, building, deploying, maintaining
      • increased autonomy and speed of delivery
      • incentives to make easy to deploy
    • Drivers for Shared Services
      • too hard to split: consider merging teams
      • feature teams: cut across technical layers
      • delivery bottleneck: can wait, add people to team needing help, or break into new service
    • Internal Open Source
      • if you can’t avoid sharing service
      • have core committers that ensure quality and encourage good behavior
      • need to balance their time
      • hard to do if project not mature / may not know what good looks like
      • tooling: version control, discussions, code review process
    • Bounded Contexts and Team Structures
      • good to have teams aligned by bounded contexts
      • easier to grasp domain
      • services within bounded context more likely to communicate making system design and release coordination easier
    • The Orphaned Service?
      • should still have owner if services separated by bounded contexts
      • risk of polyglot approach is teams may not know tech stack of old service making change harder
    • Conway’s Law in Reverse
      • have seen anecdotally system design influencing org structure
    • People
      • developers must be more aware e.g. calls across network boundaries, implications of failure
      • more accountability
      • staff needs time to adjust
      • possible you need different types of people
  11. Microservices at Scale
    • Failure Is Everywhere
      • at scale failure is certain
      • spend less time trying to stop inevitable and more time dealing with things gracefully
      • thinking this way helps you make different trade-offs
    • How Much Is Too Much?
      • know what failures you can tolerate per service
      • don’t need to go overboard on everything
      • define general cross functional requirements and override for particular use cases
      • response / latency: how long should operation take? e.g. 90th percentile response time of 2 seconds when handling 200 concurrent requests
      • availability: 24/7? can it be down?
      • durability of data:
        • how much data loss is acceptable?
        • how long to keep?
        • different based on data e.g. logs vs financial transactions
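A sketch of how a latency requirement such as "90th percentile response time of 2 seconds" could be checked against measured samples. It uses the nearest-rank percentile method, which is one common definition among several, and the function names are illustrative.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples, pct=90, limit_seconds=2.0):
    """True when the chosen percentile of response times is within the limit."""
    return percentile(samples, pct) <= limit_seconds
```

The samples would come from a load test run at the stated concurrency (200 concurrent requests in the example above).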
    • Degrading Functionality
      • need to know how to handle system failures
      • shouldn’t block everything else
      • often not technical decision and requires business context
    • Architectural Safety Measures
      • slow responses often worse than fast failures which are easier to detect
      • look into timeout configs, connection pools, circuit breakers
    • The Antifragile Organization
      • netflix actively incites failures to ensure robust systems
      • timeouts
        • balance between slowing whole system and killing req/res that would have worked
        • pick default, log, look into, and adjust
      • circuit breakers:
        • after certain number of request failures
        • can be timeout or 5XX code
        • automatically stop sending / fail fast
        • pick sensible defaults and change for specific cases
        • options while circuit is open: 1) queue and try later, 2) fail and propagate error, 3) degrade functionality subtly
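A minimal circuit breaker sketch following the notes above: count failures, open the circuit after a threshold, fail fast while it is open, and allow a trial call after a reset period. The class name, defaults, and injectable `clock` parameter are assumptions made for illustration. Production libraries handle concurrency and more states.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a downstream service that is presumed unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            # Reset period has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Any exception (e.g. a timeout or a mapped 5XX) counts as a failure.
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can then pick one of the options listed above: queue the work, propagate the error, or degrade functionality.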
      • bulkheads:
        • way to isolate failures
        • connection pool for each downstream connection
        • mandate circuit breakers for all synchronous downstream calls
        • most important pattern, ensures resources aren’t constrained in first place
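A sketch of the bulkhead idea, with a bounded semaphore standing in for a separate connection pool per downstream service. The class name and the slot counts are illustrative assumptions.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots reserved for one downstream service are in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of waiting, so a slow downstream
        # cannot tie up every thread in the calling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this downstream")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service (hypothetical service names).
bulkheads = {"inventory": Bulkhead(5), "payments": Bulkhead(5)}
```

With separate limits, exhausting the `inventory` slots leaves calls to `payments` unaffected.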
      • isolation:
        • less coupling and reliance on other services’ health
        • requires less coordination between service owners
    • Idempotency
      • outcome doesn’t change after first application
      • repeat operation w/o adverse impact
      • more important that underlying business operations are idempotent vs entire state of system
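A sketch of an idempotent business operation: crediting points carries an operation ID, so a repeated call does not credit twice. The class and the ID format are hypothetical.

```python
class PointsLedger:
    def __init__(self):
        self.balance = 0
        self._applied = set()  # operation IDs already processed

    def credit(self, operation_id, points):
        """Apply the credit once; repeats with the same ID leave the balance unchanged."""
        if operation_id not in self._applied:
            self._applied.add(operation_id)
            self.balance += points
        return self.balance
```

The call can still be logged or counted each time it is received. Only the business outcome has to stay the same.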
    • Scaling
      • helps with 1) failure and 2) reduce latency and handle more load
      • go bigger:
        • vertical scaling
        • can be more expensive and software might not be written in way to take advantage of CPU cores
        • doesn’t help resiliency
      • splitting workload:
        • have services in independent hosts
        • split critical from non-critical and manage load more effectively
      • spreading your risk:
        • host can be virtual / put on more than one physical machine
        • spread across more than a single rack in data center
        • distribute across more than one data center
        • know SLA’s and if that works for you
        • AWS has availability zones within regions
        • diversify appropriately
      • load balancing:
        • hardware and software based
        • can distribute calls and remove unhealthy instances and add back when healthy
        • SSL termination
        • safer if using VLAN (virtual local area network) or VPC
        • treat config like anything else and version control
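A sketch of the core load balancer behaviour described above: distribute calls round-robin, skip instances marked unhealthy, and put them back in rotation once they recover. Health checking itself is left out, and the class is illustrative rather than modelled on any particular product.

```python
import itertools


class RoundRobinBalancer:
    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._cycle = itertools.cycle(self.instances)

    def mark_unhealthy(self, instance):
        self.healthy.discard(instance)

    def mark_healthy(self, instance):
        if instance in self.instances:
            self.healthy.add(instance)

    def next_instance(self):
        """Return the next healthy instance, trying each instance at most once."""
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")
```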
      • worker-based systems:
        • workers work on same shared backlog of work
        • good for batch or async jobs
        • helps manage peaky loads
        • improve throughput and resiliency and make sure nothing lost
        • individual workers don’t have to be reliable but system containing work needs to be
        • can handle with persistent message broker or system like zookeeper
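A sketch of workers draining a shared backlog. An in-memory `queue.Queue` stands in for the reliable store of work here. That is an assumption for illustration only, since a real system would need a persistent broker that can redeliver work a crashed worker never finished.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=3):
    """Process every job using several workers that pull from one shared backlog."""
    backlog = queue.Queue()
    for job in jobs:
        backlog.put(job)

    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = backlog.get_nowait()
            except queue.Empty:
                return  # backlog drained, worker exits
            result = handler(job)
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done
```

Adding workers raises throughput without changing the producers, which is what makes this shape useful for peaky loads.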
      • starting again:
        • jeff dean: “design for ~10x growth, but plan to rewrite before ~100x”
        • building for scale at beginning can be disastrous because don’t know what we want to build
        • need to be able to experiment rapidly and understand capabilities we need
        • needing to change is sign of success
    • Scaling Databases
      • difference between available and durable
      • scaling for reads:
        • read replicas and writes go to primary node
        • some stale data / eventual consistency
        • suggest looking into caching first
      • scaling for writes:
        • sharding / makes queries more difficult
        • query all nodes or have alternative read store
        • usually handled by async mechanisms using cached results
        • recent tech makes adding new node in live system do-able, but test thoroughly
        • often teams change db tech when needing to scale writes
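A sketch of sharding writes by hashing the key, and of why queries get harder: a lookup by key touches one shard, but any other query has to visit them all. Dictionaries stand in for database nodes, and the names are illustrative.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # Lookup by key goes to exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def find(self, predicate):
        # Any other query must scan every shard.
        return [v for shard in self.shards for v in shard.values() if predicate(v)]
```

With this simple modulo scheme, changing the shard count remaps most keys. That is one reason adding a node to a live system needs thorough testing.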
      • shared database infrastructure:
        • usually don’t want multiple services under different schemas within one database
        • single point of failure
        • consider risks if doing so and make sure db resilient as possible
      • CQRS:
        • command-query responsibility segregation
        • commands represent changes in state
        • can process as events / storing a list of commands (event sourcing)
        • can better handle scale
        • can be hard to get right
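A small sketch of CQRS with event sourcing: the command side validates and records events, and the query side builds its answer by replaying them. The account example and all names are assumptions for illustration. Real systems usually keep a separately stored read model rather than replaying on every query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:
    account: str
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    account: str
    amount: int


class CommandHandler:
    """Command side: validates requests and appends events, never answers queries."""

    def __init__(self):
        self.events = []

    def deposit(self, account, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.events.append(Deposited(account, amount))

    def withdraw(self, account, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.events.append(Withdrawn(account, amount))


def balances(events):
    """Query side: derive a read model by replaying the stored events."""
    totals = {}
    for event in events:
        delta = event.amount if isinstance(event, Deposited) else -event.amount
        totals[event.account] = totals.get(event.account, 0) + delta
    return totals
```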
    • Caching
      • client-side, proxy, and server-side:
        • client can help reduce network calls / state invalidation can be tricky
        • proxy e.g. squid/varnish. cache any http traffic
        • server-side might be easier to reason about and can be good if multiple types of clients
        • usually mix of all
        • need to know what load to handle and how fresh data needs to be
      • caching in http:
        • cache-control directives say how long client should keep
        • Expires header gives date and time after which resource is considered stale
        • helps prevent requests in first place
        • ETags let you know if resource changed
        • careful not to give conflicting info
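A sketch of the server side of HTTP caching with `Cache-Control` and ETags: send both headers, and answer a matching `If-None-Match` with `304 Not Modified` and no body. Deriving the ETag from a hash of the body is one common choice, assumed here for illustration.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match=None, max_age=300):
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": f"max-age={max_age}"}
    if if_none_match == etag:
        return 304, headers, b""  # client's copy is still current
    return 200, headers, body
```

Within `max_age` seconds the client need not ask at all. After that, the conditional request saves resending an unchanged resource.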
      • caching for writes:
        • can be good if flushed to downstream source
        • buffer and potentially batch writes
        • can queue up writes if downstream unavailable
      • caching for resilience:
        • client can show stale data if downstream not available
        • reverse proxy can send stale data
        • may be better than returning nothing
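A sketch of caching for resilience: remember the last good response and serve it, flagged as stale, when the downstream call fails. The class name and the `(value, is_stale)` return shape are illustrative assumptions.

```python
class StaleOnErrorCache:
    def __init__(self, fetch):
        self._fetch = fetch    # function that calls the downstream service
        self._last_good = {}   # key -> most recent successful value

    def get(self, key):
        """Return (value, is_stale); fall back to the last good value on failure."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], True
            raise  # nothing cached, so the failure has to surface
        self._last_good[key] = value
        return value, False
```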
      • hiding the origin:
        • for highly cacheable data, prevent requests from going to origin
        • origin populates cache in background
        • if cache miss fail original request
        • fail fast to avoid taking up resources to prevent cascading failure downstream
      • cache poisoning:
        • if you put “Expires: Never” can be stuck in caches you can’t reach e.g. CDN, ISPs, browsers
        • need to understand full path of data that is cached from source to destination to appreciate complexity and what can go wrong
    • Autoscaling
      • can scale by well-known trends
      • usually better to have some extra capacity
      • having good suite of load tests is almost essential
      • use to test autoscaling rules
      • may want mix of predictive and reactive scaling
      • suggest using autoscaling for failure conditions first while collecting data
    • CAP Theorem
      • consistency, availability, partition tolerance
      • partition tolerance a given so pick consistency or availability
      • sacrificing consistency: can have stale data. replication not instantaneous
      • sacrificing availability: can’t respond until all other nodes agree and original node didn’t change. very hard to do. do not implement on your own.
      • AP or CP?: AP scales more easily. need to know context and discuss trade-offs
      • Not all or nothing even within a service
      • real world:
        • even a perfectly consistent system can be off because of things happening in real world (lost an order)
        • AP systems end up being right in many situations
    • Service Discovery
      • where to find
      • DNS:
        • associate name with IP
        • can have it resolve to host or load balancer
        • can have diff env e.g. <service_name>-<env>.musiccorp.com
        • DNS well understood/used format. issue is services for managing DNS not designed for highly disposable hosts
        • DNS entries have TTL (time to live) meaning clients can hold on to old IPs (can avoid by pointing DNS entry at LB)
        • DNS round robin can be problematic because client hidden from underlying host (what if sick instance?)
    • Dynamic Service Registries
      • zookeeper:
        • general use case: config management, synchronizing data between services, leader election, message queues, naming service
        • runs a number of nodes (at least 3)
        • replicates safely between nodes and keeps data consistent in failure
        • can watch for changes
        • may still need to build on top of it, but distributed coordination works which is hard to get right
      • consul:
        • exposes http interface for service discovery
        • provides DNS server out of the box (serves SRV records)
        • can do health checks, but more often use as source of info and pull into monitoring or alerting systems
        • RESTful HTTP interface easy to integrate
        • new but Hashicorp good track record
      • Eureka:
        • provides basic load balancing, REST-based endpoint, and java client
        • at netflix all clients use eureka
      • roll your own:
        • e.g. add tags to AWS and use query API
      • make it easy for humans to access info too
    • Documenting Services
      • documentation can be out of date
      • swagger:
        • generate nice UI that lets you view documentation and interact with API via browser
      • HAL and HAL browser:
        • HAL is standard for describing hypermedia controls that API exposes
        • must be using hypermedia or retrofit
        • a lot of client libraries
    • The Self-Describing System
      • early days of SOA had standards like UDDI, but many approaches heavy-weight
      • Martin Fowler discussed concept of human registry or a place humans can record info i.e. simple wiki.
      • getting picture of system is important
      • access a lot of info through monitoring
      • more powerful than simple wiki
      • create custom dashboards
      • can start with simple web server or wiki, but pull in info over time
      • make this info readily available as it’s a key tool in managing emerging complexity
    • Summary
      • read “Release It!” by Michael Nygard
      • essential reading
  12. Bringing it All Together
    • Principles of Microservices
      • model around business concepts:
        • more stable than tech boundaries
        • better reflect changes in business
      • adopt culture of automation:
        • adds complexity more moving parts
        • front load effort and tooling
        • automated testing, uniform deploy cli, CI/CD
        • custom images can speed up deployment
        • immutable servers easier to reason about
      • hide internal implementation details:
        • so services can change independently
        • hide databases
        • use data pumps or event data pumps to consolidate for reporting
        • use tech agnostic APIs, consider REST
      • decentralize all the things:
        • constantly delegate decision making and control
        • embrace self-service
        • another team shouldn’t be deploying or testing
        • use internal open source where appropriate
        • conway’s law: align teams to organization
        • avoid enterprise service bus or orchestration
        • prefer choreography / dumb middleware and smart endpoints
        • keep associated logic and data within service boundaries
      • independently deployable:
        • can coexist versioned endpoints
        • let consumers change at own pace
        • avoid tightly bound client/server stub generation
        • consider blue/green or canary release to separate deployment from release
        • consumer-driven contracts can catch breaking changes before they happen
      • isolate failure:
        • need to plan for it else can have catastrophic cascading failure
        • don’t treat remote calls like local
        • set appropriate timeouts
        • understand when and how to use bulkheads / circuit breakers to limit fallout of failing components
        • understand customer facing impact of one part misbehaving
        • implications of sacrificing availability or consistency
      • highly observable:
        • need a joined up view of what is happening
        • semantic monitoring by using synthetic transactions
        • aggregate logs and stats but be able to drill down
        • use correlation IDs to trace calls
    • When Shouldn’t You Use Microservices?
      • if you don’t understand domain hard to find proper bounded contexts
      • in greenfield development consider starting with monolith
      • need tooling and practices to do it well
      • start gradually so you understand org’s appetite/ability to change
    • Parting Words
      • more options / more decisions
      • you won’t get all decisions right (guaranteed)
      • make decisions in small scope
      • embrace concept of evolutionary architecture
      • avoid big-bang rewrites in favor of series of changes
      • learning to change and evolve system is most important lesson
      • change is inevitable so embrace it