- Microservices
- Intro
- Emerged in context of domain driven design, CI/CD, on-demand virtualization, infrastructure automation, small autonomous teams, systems at scale
- goal is more flexibility to respond to change faster
- What are microservices?
- small autonomous services that work together
- service boundaries based on business boundaries
- shouldn’t be too big (~2 weeks to rewrite)
- separate entities, each ideally on its own machine
- communication via network calls, hide internal representations, allow loose coupling
- should be able to make changes, deploy without changing anything else
- Key Benefits
- tech heterogeneity: different tech for different needs e.g. language or data persistence. reduce risk of new tech and absorb new tech quicker.
- resilience: system can keep working and handle failed services, but need to know how to handle machine and network failures
- scaling: can scale just parts that need it
- ease of deployment: can ship or rollback quickly. slow releases means changes build up increasing risk.
- organizational alignment: small teams on small code base are more productive
- composability: reuse. these days can interact with web, native, mobile web, devices, etc.
- optimizing for replaceability: big legacy hard to change and risky
- What About Service-Oriented Architecture?
- microservices is specific approach to SOA
- microservices:SOA::XP/scrum:Agile
- SOA doesn’t offer ways to ensure services don’t become too coupled or how to split up something big
- Other Decompositional Techniques
- libraries: same language / platform. ability to deploy changes in isolation is reduced, and shared libraries can become a coupling point.
- modules: some allow life cycle management, but module authors must deliver proper module isolation and can be big source of complexity.
- having process boundary enforces clean hygiene
- No Silver Bullet
- still hard and not always appropriate
- must get better at deployment, testing and monitoring
- The Evolutionary Architect
- Inaccurate Comparisons
- architects need to have technical vision for how to deliver system to customers
- scope can vary by size/company
- can have huge impact but people often get wrong
- industry is young and don’t know what good looks like
- not quite engineering, no hard certifications, changing environment / requirements
- “architect” terminology caused a lot of harm propagating image of planner who expects others to carry vision out
- An Evolutionary Vision for the Architect
- requirement, tools, tech shift rapidly
- need to react and adapt to users
- need to create framework in which right system can emerge and continue to grow
- like city planner, set zones and let people make specific decisions
- can’t foresee everything so shouldn’t over-specify everything
- Ecosystem includes users, developers, and other workers
- should step in on highly specific implementation details only in limited cases
- Zoning
- zones are like service boundaries or coarse-grained groups inside services
- more important to be concerned about what happens between zones than within
- getting things wrong here leads to all sorts of problems that are hard to correct
- some constraints include high cost to maintain knowledge/expertise for many technologies plus hiring
- there are some benefits of tooling/expertise around specific tech
- can have a lot of issues around integration e.g. REST, protobuf, Java RMI
- coding architect: need to understand impact of decisions and what normal looks like. coding alongside teams should be routine.
- A Principled Approach
- how to make decisions
- strategic goals: company goals. tech must be aligned
- principles: rules that align what you are doing to strategic goals. fewer than 10 is good.
- practices:
- how to ensure principles are carried out.
- practical guide that any developer should be able to understand
- can change often and very technical
- can combine, but for larger orgs may want distinction for different types of teams
- The Required Standard
- need to think about how much variance allowed
- think about a “good” service and what are common characteristics of well-behaved services?
- might not want too much divergence here
- need to balance optimizing autonomy and losing sight of bigger picture
- system wide health monitoring is essential
- should standardize how monitoring and logging are done
- pick 1 or 2 interface technologies and define how they will be used
- architectural safety:
- one bad service shouldn’t ruin everything
- each may have own circuit breaker
- each needs to play by rules consistently otherwise may have more vulnerable system
- Governance Through Code
- exemplars: real world examples that are good. hard to go wrong if imitating
- tailored service templates: out of box set of libraries that provide health checking, serving http, exposing metrics and team/company specific context
- might subtly constrain language choices
- try not to make centralized, teams should have joint responsibility
- may hurt morale/productivity by enforcing a framework
- should be about ease of use for developers
- can become a source of coupling between services (see the sketch below)
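- a minimal sketch in Python of what a tailored service template might provide out of the box (assuming Flask; the `/health` and `/metrics` endpoints are illustrative, not prescribed):

```python
# Hypothetical starting point a service template could give every new service:
# HTTP serving, a health check, and basic metrics exposure.
from flask import Flask, jsonify

def create_service(name):
    app = Flask(name)
    request_count = {"total": 0}

    @app.before_request
    def count_request():
        request_count["total"] += 1

    @app.route("/health")
    def health():
        # teams extend this with checks for their own dependencies
        return jsonify(status="UP", service=name)

    @app.route("/metrics")
    def metrics():
        return jsonify(requests_total=request_count["total"])

    return app

if __name__ == "__main__":
    create_service("catalog").run(port=8080)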
- Technical Debt
- sometimes cut corners but cost to debt
- sometimes due to changing vision
- need to understand balance and have view of debt
- some orgs let teams decide how to pay it down, other places maintain a debt log and review it regularly
- Exception Handling
- keep track of when you make decisions that deviate from principles and practices
- if happens enough may want to revisit rules
- some places more strict
- microservices about autonomy, if company places a lot of restrictions, microservices may not be for you
- Governance and Leading from the Center
- ensure what we are building matches vision
- do with governance group
- should predominantly consist of people who are executing the work being governed
- group as whole responsible for governance
- usually the architect goes with the group's decision unless it is very dangerous/risky; overruling too often undermines the group and makes the team feel like they don't have a say
- Building a Team
- not just tech decisions, but people and helping them grow
- have them involved in shaping and implementing vision
- microservices provides way for people to step up and accept more responsibility
- Conclusion
- vision, empathy, collaboration, adaptability, autonomy, governance
- constant balancing
- worst is to be rigid
- How to Model Services
- Introducing MusicCorp
- helps to have an example
- want to grow but make changes easily
- What Makes a Good Service?
- loose coupling & high cohesion
- loose coupling:
- change in one service doesn’t affect another
- tight integration style can lead to coupling
- services should know little about others
- limit types of calls
- high cohesion:
- related behaviors sit together
- make behavioral changes in one place
- making changes in multiple places slower and riskier
- The Bounded Context
- in domain driven design model real world domains
- there are multiple bounded contexts and there are things that should and shouldn’t be shared
- each bounded context has explicit interface where it decides what models to share
- bounded context: specific responsibility enforced by explicit boundaries
- communicate using models
- musiccorp has warehouse, reception desk, finance, ordering
- shared and hidden models:
- some items only relevant to one bounded context
- will be shared items, but can have internal and external representations based on what needs to be shared
- models with same name may have different meanings in different contexts
- modules and services:
- failing to think about what should and shouldn't be shared leads to tight coupling
- bounded contexts lend themselves well to being compositional boundaries
- using modules good place to start
- premature decomposition:
- very costly to be wrong with services, wait for things to stabilize
- easier to have something existing and decompose
- Business Capabilities
- for each bounded context, think about what capabilities it provides rest of domain
- first, think about what does this context do
- second, what data does it need?
- these capabilities become key operations exposed over the wire
- Turtles All the Way Down
- bounded contexts can contain nested contexts, or the nested ones can be broken out as top-level contexts
- easier to test/stub nested bounded contexts
- Communication in Terms of Business Concepts
- often want changes that business wants
- if services decomposed along bounded contexts changes likely isolated
- the same terms and ideas shared between parts of organization should be reflected in your interfaces
- The Technical Boundary
- can slice horizontally by tech e.g. frontend, data access over db
- not always wrong
- may make sense to achieve certain performance
- Integration
- Looking for the Ideal Integration Technology
- avoid breaking changes: consumers shouldn't need to change as well
- keep APIs technology agnostic
- simple for consumers to use: can allow total freedom or provide libraries
- hide internal implementation details
- Interfacing with Customers
- more complex than CRUD app, can kick off multiple processes
- The Shared Database
- most common form of integration
- bound to the db's internal implementation details; schema changes can break consumers
- consumer tied to specific tech
- update operation might be spread between consumers / lose cohesion
- easier to share data but not behavior
- Synchronous Versus Asynchronous
- important because guides implementation details
- sync easier to reason about
- async better for long-running or low latency but more involved
- req/res is sync, but can make async with callbacks
- event-based: client emits an event and expects subscribers to know what to do
- client doesn’t have to know about subscribers / highly decoupled
- Orchestration Versus Choreography
- orchestration: central brain can have too much authority and be central point
- choreography: services subscribe and act accordingly. more decoupled, but need to build in monitoring and tracking
- Remote Procedure Calls
- make remote call look like local call
- some implementation details tied to specific network protocol
- easy to use
- can have tech coupling e.g. Java RMI, but thrift and protobuf have a lot of lang support
- remote calls are not like local calls; they have different cost and reliability profiles consumers should be aware of
- brittleness: clients and servers may have to be updated in lock-step. in practice objects used in binary serialization thought of as expand-only types
- REST
- concept of resources
- doesn’t talk about underlying protocol but mostly used with HTTP
- large ecosystem and tools for caching proxies, load balancing, monitoring and security
- hypermedia as the engine of application state:
- a piece of content carries links
- client performs interactions via links
- greatly abstracted from underlying details
- can be chatty, but good to start here and optimize later
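- a small sketch of a client navigating by link relations rather than hard-coded URIs (assuming the `requests` library; the album resource and its "links" shape are made up for illustration):

```python
# Hypothetical hypermedia client: it only knows link relations, not URI layouts.
import requests

def find_link(resource, rel):
    return next(l["href"] for l in resource["links"] if l["rel"] == rel)

album = requests.get("http://catalog.musiccorp.com/albums/123").json()
# the client follows the "artist" relation; the server is free to move the URI
artist = requests.get(find_link(album, "artist")).json()
print(artist["name"])
```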
- json, xml, else?
- json most popular
- xml may have better link control and can also extract certain parts with xpath
- too much convenience:
- some frameworks take db representation and deserialize into in-process objects and directly expose
- results in high coupling
- one strategy is delaying data persistence until interface has stabilized / store in local disk
- too easy to let db storage influence how we send models over the wire
- defers work, but for new service boundaries acceptable trade-off
- downsides to REST over HTTP:
- cannot easily generate client stubs, which may push people toward RPC over HTTP or shared client libraries
- some frameworks don’t support all verbs
- may not be as lean as binary protocols
- might not be great for low latency
- web sockets might be better for streaming
- consuming payloads might be more work than some RPC implementations with advanced serialization
- generally is a sensible default though
- Implementing Asynchronous Event-Based Collaboration
- message brokers like RabbitMQ (sketch below)
- enterprise solutions try to pack in extra smarts; keep the middleware dumb and the endpoints smart
- can use HTTP to propagate events, but not good for low latency and consumers may have to manage their own polling schedule
- leads to complexity, should specify max retry and have good monitoring and be able to trace requests
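- a minimal sketch of event-based collaboration over RabbitMQ using the pika client; the exchange, event, and service names are assumptions for illustration:

```python
# Publisher emits a domain event; subscribers decide for themselves what to do.
import json
import pika

def publish_customer_created(customer_id):
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="customer-events", exchange_type="fanout")
    channel.basic_publish(exchange="customer-events", routing_key="",
                          body=json.dumps({"event": "CustomerCreated",
                                           "customer_id": customer_id}))
    conn.close()

def run_loyalty_subscriber():
    # e.g. the loyalty service reacts without the publisher knowing it exists
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="customer-events", exchange_type="fanout")
    queue = channel.queue_declare(queue="", exclusive=True).method.queue
    channel.queue_bind(exchange="customer-events", queue=queue)
    channel.basic_consume(queue=queue, auto_ack=True,
                          on_message_callback=lambda ch, method, props, body: print(body))
    channel.start_consuming()
```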
- Services as State Machines
- microservice should own all logic associated with behavior in this context
- avoid dumb services that only CRUD
- shouldn’t let decisions about domain leak out of right service
- have one place to manage state collisions or attach behaviors that correspond to changes
- Reactive Extensions
- compose results of multiple calls and run operations on them
- observe outcome and react when something changes
- abstract out details of how calls are made
- good for handling concurrent calls to downstream services
- DRY and the Perils of Code Reuse in a Microservice World
- shared libraries can lead to too much coupling
- may not be able to independently upgrade client and server
- things like logging is ok / invisible to outside world
- don’t violate DRY within a service, but be more relaxed across services
- too much coupling worse than code duplication
- if you have client libraries, have separation, maybe separate people do API vs library
- keep logic out of client library
- Access by Reference
- how to pass around info for domain entities
- if in memory, can be out of date
- can include a reference i.e. URI
- may cause too much load; include a timestamp with the reference so consumers know how fresh the data was
- no hard rule, but be wary about freshness
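- a small sketch of passing a reference plus a retrieval timestamp instead of embedding a (possibly stale) entity; the event shape and URIs are hypothetical:

```python
# The event carries a URI and a timestamp; consumers re-fetch if they need
# fresher state than the snapshot's age allows.
import datetime
import requests

event = {
    "event": "OrderShipped",
    "customer_uri": "http://customer.musiccorp.com/customers/42",
    "customer_as_of": datetime.datetime.utcnow().isoformat() + "Z",
}

def current_email(evt):
    # consumer decides whether the referenced data is fresh enough, or re-fetches
    return requests.get(evt["customer_uri"]).json()["email"]
```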
- Versioning
- defer as long as possible by picking right integration tech and decoupling clients (e.g. tolerant reader)
- catch breaking changes early with consumer-driven contracts
- semantic versioning / MAJOR.MINOR.PATCH
- co-exist different end points:
- gives consumers time to migrate
- expand and contract
- can include version in headers or uri
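- a minimal sketch of coexisting versioned endpoints during expand/contract (assuming Flask; routes and payload shapes are made up, with the old version delegating to the new representation):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def load_customer(customer_id):
    return {"id": customer_id, "given_name": "Jo", "family_name": "Bloggs"}

@app.route("/v2/customers/<int:customer_id>")
def get_customer_v2(customer_id):
    return jsonify(load_customer(customer_id))

@app.route("/v1/customers/<int:customer_id>")
def get_customer_v1(customer_id):
    # old consumers keep the old field names until they migrate
    c = load_customer(customer_id)
    return jsonify({"id": c["id"], "name": f"{c['given_name']} {c['family_name']}"})
```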
- multiple concurrent service versions:
- usually not great idea
- need to handle directing consumers and may need to manage/fix two services
- may only want to do if short-lived
- User Interface
- historically have been thick, moved to thin, and now thick again
- in recent history, UI has become compositional layer that weaves together various capabilities
- constraints vary by how people interact e.g. mobile must factor in physical usage, battery, network bandwidth, no right click
- API composition: UI can directly interact with services, but can be very chatty and little ability to tailor responses by device
- UI fragment composition: different services compose their widget. issues with consistency / UX. some capabilities may not fit as a widget. might not work with native.
- backends for front ends:
- could have a single api gateway handle this, but don't want one giant aggregating layer
- better to have service for each frontend e.g. mobile, customer website
- think as part of UI
- careful it doesn’t take on logic it shouldn’t
- business logic should reside in services
- only contain behavior specific to UI
- hybrid: some solutions work better for different things
- don’t put too much behavior into intermediate layers
- Integrating with Third-Party Software
- rational to use off-the-shelf software / buy when tool isn’t special
- lose control to vendor / should be part of tool selection process
- customization may be hard or expensive
- might have communication protocols that involve reaching into datastore
- one solution: for CMS wrap (facade) into service and keep scope of what it does to a minimum and make easy to access
- strangler pattern:
- intercept calls to legacy system and decide route to new or legacy
- replace functionality gradually
- Splitting the Monolith
- It’s All About The Seams
- in a monolith changes can impact rest of system and will have to redeploy entire system
- seam: portion of code that can be treated in isolation
- Breaking Apart MusicCorp
- identify high level bounded contexts
- first create packages representing them
- packages should interact and have dependencies similar to the real organizational groups
- work incrementally!
- The Reasons to Split the Monolith
- start with where you will get most benefit
- areas we expect lots of changes?
- team organization
- do some functionalities require more security e.g. finance?
- would some service benefit from certain type of tech?
- Tangled Dependencies
- find least depended on seam
- view services as a DAG
- The Database
- common area of integration
- Getting to Grips with the Problem
- see which parts interact with db
- usually via repository layer
- split up repository to different parts corresponding to bounded contexts
- Example: Breaking Foreign Key Relationships
- prevent services from directly accessing rows in a table
- set up API from owning service
- additional db calls, but may still be performant
- lose foreign key and consistency checks
- the people defining the system/domain should decide how to represent cases where consistency is violated
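- a small sketch of replacing a foreign-key join with an API call to the owning service; the URL and field names are illustrative only:

```python
# Rather than joining directly to the catalog's table, keep only the catalog
# item ID and ask the catalog service for details when needed.
import requests

CATALOG_URL = "http://catalog.musiccorp.com"

def ledger_entry_with_title(entry):
    # extra network call instead of a foreign-key join; consistency handling
    # now lives in application code rather than the database
    item = requests.get(f"{CATALOG_URL}/items/{entry['catalog_item_id']}").json()
    return {**entry, "title": item.get("title", "unknown item")}
```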
- Example: Shared Static Data
- e.g. list of countries
- A) could duplicate in tables of services
- B) config file or keep in code
- C) separate service if sufficiently complex
- generally config file is simplest
- Example: Shared Data
- services have to reach into same table
- can mean there is a domain concept that isn’t explicitly modeled
- Example: Shared Tables
- may not be normalized or can break it up
- Refactoring Databases
- first split schema
- then split service
- review and assess things first with new schema broken out
- make easier to revert
- Transactional Boundaries
- splitting the database means losing this safety
- try again later: can have a queue
- abort entire operation: need a compensation transaction. can be hard to reason about. what if it fails?
- distributed transactions: transaction manager gets consensus and each system needs to OK to proceed / can be blocking
- If you really need transactional safety, think about creating a concept to represent the transaction itself and keep logic here
- Reporting
- Need to figure out how to make architecture work with existing processes
- The Reporting Database
- have a read replica, but the data can't be restructured for reporting needs and comes in one format/schema
- Data Retrieval via Service Calls
- ok if simple, but not good for large volumes
- can hurt load
- Data Pumps
- each service pushes data to reporting db
- teams owning service should implement pump and version/deploy together
- schedule to run regularly
- can aggregate via materialized view
- Event Data Pump
- subscribe to events and pump data to reporting database
- less coupling / bind to emitted event
- fresher data, and data is inserted as events arrive rather than in bulk
- separate group can manage this easier
- all required info must be broadcast and may not scale well
- Backup Data Pump
- use backup data as source
- still have coupling to the destination reporting schema
- Toward Real Time
- is reporting data all going to one place?
- there are dashboards, alerting, financial reports, user analytics
- different tolerances for accuracy and timeliness
- things moving to more generic eventing systems
- Cost of Change
- promote incremental change
- think about impact / you will make mistakes
- make mistakes where impact will be lowest
- cost of change and mistakes lowest at white board
- think about how things interact. are there circular references? overly chatty apps?
- class responsibility collaboration (CRC) cards: names, responsibilities, collaborators
- when working with design, can help see how things hang together
- Understanding Root Causes
- growing a service is okay, but should split before that becomes too expensive
- hard to know where to start, but also challenges with running and deploying services
- Deployment
- A Brief Introduction to Continuous Integration
- keep everyone in sync and make sure code properly integrated
- use consistent artifact for testing and deploying
- automate creation of artifact and version control
- Are you really doing it?
- do you check into main line once a day?
- suite of tests to validate changes?
- is fixing a broken build #1 priority?
- Mapping Continuous Integration to Microservices
- monolithic build: requires lock step release
- prefer one CI build per service with one service repo resulting in one artifact
- Build Pipelines and Continuous Delivery
- multiple stages to build pipeline
- compile/fast tests, slow tests, UAT, performance testing, production
- production readiness of every check-in
- usually want one microservice per build, but major exception is if new team/project
- can keep services larger until understanding of domain stabilizes
- if you can't get stable service boundaries while breaking them out, merge back into a more monolithic build
- Platform-specific Artifacts
- e.g. ruby gems, java jar/war, python eggs
- may not be enough for some e.g. ruby/python still need a process manager running inside apache/nginx
- automation can hide differences in deploy mechanisms
- Operating System Artifacts
- redhat/centOS has RPMs, Ubuntu has deb, windows has MSI
- can be hard to build
- potential overhead of multiple different OS
- Custom Images
- can take a long time to provision instances from scratch
- bake in common dependencies to virtual machine image
- building can take long time and size can be large
- building for different platforms can be a challenge e.g. VMWare, AWS AMI, Vagrant, Rackspace
- can also bake service into image
- immutable servers: avoid config drift and make config changes go through build pipeline
- Environments
- as you progress i.e. laptop -> build server -> UAT -> prod
- want to ensure environments are more and more production-like to catch problems sooner
- constant balance between cost and fast feedback
- Service Configuration
- ideally small otherwise run into problems occurring only in certain environments (very painful)
- use one single artifact and manage configs separately
- can also have dedicated system if dealing with large number of microservices
- Service-to-Host Mapping
- because of virtualization use term “host” over machine as general unit of isolation
- can put multiple services on one host, but makes monitoring, deploying, scaling harder
- recommend single service per host
- easier to monitor / remediate
- reduce single point of failure
- easier to scale single service
- focused security concerns
- use alternate deployment techniques
- cost may include more servers and hosts to manage
- PaaS can be good when it works, but might not be customizable enough
- Automation
- automation is essential
- give devs ability to self-service and provision services
- picking tech that enables automation is highly important, starting with the tools used to manage hosts
- From Physical to Virtual
- traditional virtualization: independent machines, but overhead / cost of splitting
- vagrant: deployment platform usually used for dev/test. can tax average machine and may not be able to run entire system.
- linux container:
- don’t need hypervisor, containers share same kernel (where process tree lives)
- faster and more lightweight than VM
- still need to route to outside world
- also processes can interact with other containers
- if you want isolation, VM may be better
- docker:
- hides underlying technology
- fast/lightweight
- deploy and install docker apps
- still need scheduling layer e.g. Kubernetes
- A Deployment Interface
- should have uniform interface
- like using single parameterized command-line call with name, version, environment e.g. “deploy artifact=catalog environment=local version=local”
- version is usually local, latest, or specific
- Fabric (python library) useful for mapping command line calls to functions / can pair with Boto to automate AWS environments
- can use yaml files for environment definitions
- a lot of upfront work but essential for managing deployment complexity
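- a sketch of a uniform, parameterized deploy entry point (done here with argparse rather than Fabric; the deploy step itself is stubbed and the flag names are illustrative):

```python
# Uniform deployment interface, roughly:
#   deploy --artifact catalog --environment local --version local
import argparse

def deploy(artifact, environment, version):
    print(f"deploying {artifact} version {version} to {environment}")
    # resolve 'local'/'latest'/specific version, read an environment
    # definition (e.g. a YAML file), provision hosts, install the artifact...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="deploy a service artifact")
    parser.add_argument("--artifact", required=True)
    parser.add_argument("--environment", default="local")
    parser.add_argument("--version", default="local")
    args = parser.parse_args()
    deploy(args.artifact, args.environment, args.version)
```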
- Testing
- Types of Tests
- tech facing: unit and property help developers
- business facing: acceptance and exploratory
- Test Scope
- test pyramid top to bottom: UI, service, unit
- unit tests:
- simple functions or methods
- not launching any services and limiting use of external files or network connections
- small individual scope, fast, helps refactor
- service tests:
- test services directly
- in monolithic app might be collection of classes
- only test service itself and stub external collaborators
- end-to-end tests:
- entire system
- high test scope and confidence
- too many of these can be slow and make builds hard
- Implementing Service Tests
- stubs return canned responses
- mocks make sure calls were made
- Mountebank: tool for stubs/mocks
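- a quick sketch of stub vs. mock using Python's unittest.mock; the points-bank collaborator is hypothetical:

```python
from unittest.mock import Mock

def award_points(points_bank, customer_id, amount):
    points_bank.credit(customer_id, amount)
    return points_bank.balance(customer_id)

# stub: returns a canned response, we only care about the value it hands back
stub_bank = Mock()
stub_bank.balance.return_value = 150
assert award_points(stub_bank, 42, 50) == 150

# mock: we verify the expected call was actually made
mock_bank = Mock()
award_points(mock_bank, 42, 50)
mock_bank.credit.assert_called_once_with(42, 50)
```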
- Those Tricky End-to-End Tests
- easy to cover same ground
- slow and might be hard to diagnose issues
- Flaky and Brittle Tests
- can have failures not due to app and make test untrustworthy
- hard to determine who writes these tests
- don’t want another team to own completely
- treat like shared code base
- Test Journeys, Not Stories
- instead of every functionality, focus on small number of core journeys
- any functionality not covered here, use service tests
- journeys should number in the low double digits even for complex systems
- Consumer-Driven Tests to the Rescue
- consumer defines expected behavior
- producer checks that incoming API call receives expected behavior
- run only against single producer in isolation
- sit at same level as service tests
- pact is open source tool that helps do this
- requires good communication between consumer and producer teams
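- a hand-rolled illustration of the consumer-driven contract idea (Pact automates this properly; the contract shape, endpoint, and port here are made up):

```python
# The consumer team records what it relies on; the producer's build replays
# that expectation against the real service running in isolation.
import requests

# written by the consumer team (e.g. the web shop)
CUSTOMER_CONTRACT = {
    "request": {"method": "GET", "path": "/customers/42"},
    "response_must_include": ["id", "email", "status"],
}

def test_producer_honours_contract(base_url="http://localhost:8080"):
    contract = CUSTOMER_CONTRACT
    resp = requests.get(base_url + contract["request"]["path"])
    assert resp.status_code == 200
    body = resp.json()
    for field in contract["response_must_include"]:
        assert field in body, f"breaking change: missing field {field}"
```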
- So Should You Use End-to-End Tests?
- could potentially get rid of, but can be good training wheels
- also depends on the team's appetite to learn in production
- Testing After Production
- can always add more tests to catch errors but diminishing returns
- can’t reduce chance of failure to zero
- separate deployment from release
- blue/green: deploy and run smoke tests
- canary: trickle traffic / coexisting versions / sometimes copy production load
- sometimes more beneficial to get better at remediation of a release than adding more automated functional tests
- many orgs don’t spend effort on better monitoring or failure recovery
- in certain cases, e.g. trying to validate business idea, may not need tests
- Cross-Functional Testing
- acceptable latency, number of users supported, security
- falls under property testing
- may want to track at service level and decide acceptable thresholds
- performance tests:
- more network calls
- tracking down sources of latency is important
- run in prod-like environment
- takes time maybe run a subset everyday
- make sure to actually look at the numbers
- Monitoring
- Intro
- monitor the small things and use aggregation to see bigger picture
- Single Service, Single Server
- host: CPU, memory
- server: logs
- application: response time or errors
- Single Service, Multiple Servers
- monitor same things
- can use ssh multiplexor to check multiple at same time
- can use load balancer to track response times
- Multiple Services, Multiple Servers
- harder to do
- need collection and central aggregation
- Logs, Logs, and Yet More Logs…
- use centralized subsystem to make available centrally e.g. logstash
- kibana is tool for viewing and querying logs
- Metric Tracking Across Multiple Services
- look at metrics long-enough time frame to know “normal”
- should be able to look at aggregations at different levels and drill down
- graphite is a tool that handles some of this and compacts older metrics (resolution decreases with age)
- can be helpful for capacity planning
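- a tiny sketch of pushing a metric to Graphite's plaintext listener (port 2003 is the conventional default; the metric path and host are made up, and a real setup would usually go through a client library or statsd):

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003):
    # Graphite's plaintext protocol: "<metric.path> <value> <unix_timestamp>\n"
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("utf-8"))

send_metric("catalog.api.response_time_ms", 87)
```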
- Service Metrics
- try to expose a lot of metrics
- 80% of software features never used
- metrics can help inform what is actually being used
- can be hard to know what will be useful in future so err on side of exposing everything
- Synthetic Monitoring
- fake events or synthetic transactions
- often better indicator that something is working than low level metrics
- can use end-to-end tests but be careful not to trigger things on accident
- Correlation IDs
- hard to track especially if things async
- generate GUID and pass along to subsequent calls
- put in soon, otherwise may not have it when you need it
- could put this in a thin shared client wrapper, but keep it very thin
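- a small sketch of minting a correlation ID at the edge and propagating it on downstream calls; the `X-Correlation-Id` header name is a common convention rather than a standard, and the downstream URL is made up:

```python
import uuid
import requests

def handle_inbound_request(headers):
    # reuse an incoming ID if present, otherwise mint one as early as possible
    correlation_id = headers.get("X-Correlation-Id", str(uuid.uuid4()))
    print(f"correlation_id={correlation_id} handling request")  # include it in every log line
    call_downstream("http://loyalty.musiccorp.com/points/42", correlation_id)
    return correlation_id

def call_downstream(url, correlation_id):
    return requests.get(url, headers={"X-Correlation-Id": correlation_id})
```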
- The Cascade
- services might look healthy but can’t talk
- important to monitor integration points between services
- Standardization
- balance between narrow decisions vs standardization
- monitoring is one area where standardization is incredibly important
- must be able to view in holistic way
- write logs in std format
- have metrics in one place
- standardize metric names too
- Consider the Audience
- what do they need to know right now in order to react?
- what might people need later?
- how do people like consuming data?
- create big displays and make accessible
- The Future
- metrics siloed in many orgs/companies
- can think of metrics as events
- tying user and application logs can help real-time analysis
- many orgs moving away from specialized tool chains to more generic event routing systems
- Security
- Authentication and Authorization
- authentication: confirm principal is who they say they are
- authorization: allowed actions
- ideally not every service will need to handle separately
- common single sign-on implementations:
- SSO e.g. SAML or OpenID
- principal trying to access resource directed to authenticate with identity provider
- username/pw or 2-factor
- identity provider can be internal (LDAP) or external
- single sign on gateway:
- use gateway to act as proxy
- downstream services can get info about principal in headers
- harder to reason about service in isolation
- potentially single source of failure
- careful gateway layer doesn’t take on too much responsibility / results in coupling
- fine-grained authorization:
- can put people in groups
- let microservices make decisions based on them
- careful to not embed too much logic into group e.g. CALL_CENTER_50_DOLLAR_REFUND
- model around how your organization works
- Service-to-Service Authentication and Authorization
- allow everything inside perimeter:
- trust implicitly
- if an attacker penetrates the perimeter there is little protection; also vulnerable to man-in-the-middle attacks
- HTTP(S) Basic Authentication:
- username/pw in header, but visible if HTTP
- HTTPS can mitigate but need to manage certificates for different machines
- traffic sent via SSL cannot be cached by reverse proxies
- can have load balancer terminate SSL traffic and cache behind it
- SAML or OpenID Connect:
- still need to route over HTTPS
- need accounts for clients
- keep scope narrow
- need to securely store credentials
- client certificates:
- more onerous than server-side certificates
- only want to use when especially concerned about sensitivity of data being sent
- HMAC over HTTP:
- hash-based message authentication code
- body and a private key are hashed together; the server recomputes the hash to detect any modifications
- traffic can then be sent over plain HTTP since tampering is detectable
- downside: 1) both client and server need shared secret 2) pattern not standard / can be hard to get right 3) only ensures no changes, still visible
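- a minimal HMAC sketch using Python's hmac/hashlib; the header handling and key management are illustrative only (real keys would never be hard-coded):

```python
import hashlib
import hmac

SHARED_SECRET = b"do-not-hardcode-me"  # both client and server hold this

def sign(body: bytes) -> str:
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature_header: str) -> bool:
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(sign(body), signature_header)

body = b'{"amount": 10.0}'
assert verify(body, sign(body))
assert not verify(b'{"amount": 9999.0}', sign(body))  # tampered body is rejected
```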
- API Keys:
- use to identify and set restrictions
- common to use public and private key-pair
- The Confused Deputy Problem:
- a malicious party can trick an intermediary service into making calls it shouldn't be able to
- no simple answer
- 1) implicit trust
- 2) verify identity of caller
- 3) asking caller to provide credentials of original principal
- Securing Data at Rest
- data is liability
- should be encrypted
- go with what is well known / DO NOT TRY TO IMPLEMENT YOURSELF
- It’s all about keys:
- one solution is to use a separate security appliance to encrypt and decrypt
- life cycle management of keys can be vital operation
- pick your targets: can limit encryption to sensitive data
- decrypt on demand: encrypt data when first seen, decrypt only on demand, and ensure decrypted data isn't stored anywhere
- encrypt backups
- Defense in Depth
- firewalls: restrict traffic/access by port/IP
- logging: lets you see if something bad happened. make sure not to leak important info
- intrusion detection: monitor network/hosts and report problems
- network segregation:
- VPC
- which can see each other
- routing traffic through gateways to proxy access creates multiple perimeters
- OS:
- patch software regularly and automatically
- Be Frugal
- don’t store what you don’t need
- can’t be stolen or demanded from you
- The Human Element
- revoke credentials when someone leaves
- social engineering protection
- what damage can be done by a disgruntled employee?
- The Golden Rule
- Don’t write your own crypto!
- Baking Security In
- educate developers
- familiarize with OWASP Top ten list, OWASP security testing framework
- automated tools can probe for vulnerabilities
- can integrate into CI Builds
- External Verification
- external assessments
- penetration testing
- Conway’s Law and System Design
- Evidence
- loosely/tightly coupled orgs:
- commercial product firms produced less modular software than loosely coupled orgs e.g. distributed open source software
- for windows vista, microsoft found metrics associated with org structure proved to be most statistically relevant
- Netflix and Amazon
- amazon saw benefit of team owning lifecycle of system. 2 pizza team and AWS allowed self-sufficiency
- Netflix wanted small independent teams to optimize speed of change
- Adapting to Communication Pathways
- single teams with single service can make changes quicker due to low cost of communication
- geographically dispersed teams might have harder time resulting in hard to maintain code
- single team owning many services results in tight integration
- Service Ownership
- own requirement gathering, building, deploying, maintaining
- increased autonomy and speed of delivery
- incentives to make easy to deploy
- Drivers for Shared Services
- too hard to split: consider merging teams
- feature teams: cut across technical layers
- delivery bottleneck: can wait, add people to team needing help, or break into new service
- Internal Open Source
- if you can’t avoid sharing service
- have core committers that ensure quality and encourage good behavior
- need to balance their time
- hard to do if project not mature / may not know what good looks like
- tooling: version control, discussions, code review process
- Bounded Contexts and Team Structures
- good to have teams aligned by bounded contexts
- easier to grasp domain
- services within bounded context more likely to communicate making system design and release coordination easier
- The Orphaned Service?
- should still have owner if services separated by bounded contexts
- risk of polyglot approach is teams may not know tech stack of old service making change harder
- Conway’s Law in Reverse
- have seen anecdotally system design influencing org structure
- People
- developers must be more aware e.g. calls across network boundaries, implications of failure
- more accountability
- staff needs time to adjust
- possible you need different types of people
- Microservices at Scale
- Failure Is Everywhere
- at scale failure is certain
- spend less time trying to stop inevitable and more time dealing with things gracefully
- thinking this way helps you make different trade-offs
- How Much Is Too Much?
- know what failures you can tolerate per service
- don’t need to go overboard on everything
- define general cross functional requirements and override for particular use cases
- response / latency: how long should operation take? e.g. 90th percentile response time of 2 seconds when handling 200 concurrent requests
- availability: 24/7? can it be down?
- durability of data:
- how much data loss is acceptable?
- how long to keep?
- different based on data e.g. logs vs financial transactions
- Degrading Functionality
- need to know how to handle system failures
- shouldn’t block everything else
- often not technical decision and requires business context
- Architectural Safety Measures
- slow responses often worse than fast failures which are easier to detect
- look into timeout configs, connection pools, circuit breakers
- The Antifragile Organization
- netflix actively incites failures to ensure robust systems
- timeouts:
- balance between slowing whole system and killing req/res that would have worked
- pick default, log, look into, and adjust
- circuit breakers:
- after certain number of request failures
- can be timeout or 5XX code
- automatically stop sending / fail fast
- pick sensible defaults and change for specific cases
- options after 1) queue and try later, 2) fail and propagate error 3) degrade functionality subtly
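- a minimal circuit breaker sketch in Python, roughly matching the behavior described above (thresholds and cool-down values are arbitrary assumptions):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of hammering a sick downstream service."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: let one attempt through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```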
- bulkheads:
- way to isolate failures
- connection pool for each downstream connection
- mandate circuit breakers for all synchronous downstream calls
- most important pattern, ensures resources aren’t constrained in first place
- isolation:
- less coupling and reliance on other services’ health
- requires less coordination between service owners
- Idempotency
- outcome doesn’t change after first application
- repeat operation w/o adverse impact
- more important that underlying business operations are idempotent vs entire state of system
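- a toy sketch of an idempotent receiver: each business operation carries an ID, so redelivering or retrying it has no additional effect (in-memory stores used purely for illustration):

```python
applied_operations = set()
balances = {}

def award_points(operation_id, customer_id, points):
    if operation_id in applied_operations:       # already seen: do nothing
        return balances[customer_id]
    balances[customer_id] = balances.get(customer_id, 0) + points
    applied_operations.add(operation_id)
    return balances[customer_id]

assert award_points("op-123", 42, 100) == 100
assert award_points("op-123", 42, 100) == 100    # safe to retry/redeliver
```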
- Scaling
- helps with 1) failure and 2) reduce latency and handle more load
- go bigger:
- vertical scaling
- can be more expensive and software might not be written in way to take advantage of CPU cores
- doesn’t help resiliency
- splitting workload:
- have services in independent hosts
- split critical from non-critical and manage load more effectively
- spreading your risk:
- host can be virtual / put on more than one physical machine
- more than in a single rack in data center
- distribute across more than one data center
- know SLA’s and if that works for you
- AWS has availability zones within regions
- diversify appropriately
- load balancing:
- hardware and software based
- can distribute calls and remove unhealthy instances and add back when healthy
- SSL termination
- safer if using VLAN (virtual local area network) or VPC
- treat config like anything else and version control
- worker-based systems:
- workers work on same shared backlog of work
- good for batch or async jobs
- helps manage peaky loads
- improve throughput and resiliency and make sure nothing lost
- individual workers don’t have to be reliable but system containing work needs to be
- can handle with persisted message broker system like zookeeper
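- a tiny sketch of the worker-based shape: several workers pulling from one shared backlog (queue.Queue here only illustrates the pattern; a real system would use a persistent broker so queued work survives failures):

```python
import queue
import threading

backlog = queue.Queue()

def worker(worker_id):
    while True:
        job = backlog.get()
        if job is None:                 # sentinel: shut this worker down
            break
        print(f"worker {worker_id} resizing image {job}")
        backlog.task_done()

for i in range(3):                      # scale throughput by adding workers
    threading.Thread(target=worker, args=(i,), daemon=True).start()

for image_id in range(10):
    backlog.put(image_id)
backlog.join()
```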
- starting again:
- jeff dean: “design for 10x growth, expect rewrite before 100x”
- building for scale at beginning can be disastrous because don’t know what we want to build
- need to be able to experiment rapidly and understand capabilities we need
- needing to change is sign of success
- Scaling Databases
- difference between available and durable
- scaling for reads:
- read replicas and writes go to primary node
- some stale data / eventual consistency
- suggest looking into caching first
- scaling for writes:
- sharding / makes queries more difficult
- query all nodes or have alternative read store
- usually handled by async mechanisms using cached results
- recent tech makes adding a new node in a live system do-able, but test thoroughly
- often teams change db tech when needing to scale writes
- shared database infrastructure:
- usually don’t want multiple services under different schemas within one database
- single point of failure
- consider risks if doing so and make sure db resilient as possible
- CQRS:
- command-query responsibility segregation
- commands represent changes in state
- can process as events / storing a list of commands (event sourcing)
- can better handle scale
- can be hard to get right
- Caching
- client-side, proxy, and server-side:
- client-side caching can help reduce network calls / invalidating stale data can be tricky
- proxy e.g. squid/varnish. cache any http traffic
- server-side might be easier to reason about and can be good if multiple types of clients
- usually mix of all
- need to know what load to handle and how fresh data needs to be
- caching in http:
- cache-control directives say how long client should keep
- expires headers gives date
- helps prevent requests in first place
- ETags let you know if resource changed
- careful not to give conflicting info
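- a small sketch of serving Cache-Control and ETag headers and honouring conditional requests with 304 Not Modified (assuming Flask; the resource and max-age value are made up):

```python
import hashlib
from flask import Flask, request, make_response

app = Flask(__name__)
ALBUM = b'{"id": 123, "title": "Greatest Hits"}'

@app.route("/albums/123")
def get_album():
    etag = hashlib.sha1(ALBUM).hexdigest()
    if request.headers.get("If-None-Match") == etag:
        return make_response("", 304)               # client's copy is still valid
    resp = make_response(ALBUM)
    resp.headers["Content-Type"] = "application/json"
    resp.headers["Cache-Control"] = "max-age=300"   # keep for 5 minutes
    resp.headers["ETag"] = etag
    return resp
```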
- caching for writes:
- can be good if flushed to downstream source
- buffer and potentially batch writes
- can queue up writes if downstream unavailable
- caching for resilience:
- client can show stale data if downstream not available
- reverse proxy can send stale data
- may be better than returning nothing
- hiding the origin:
- for highly cacheable data, prevent requests from going to origin
- origin populates cache in background
- if cache miss fail original request
- fail fast to avoid taking up resources to prevent cascading failure downstream
- cache poisoning:
- if you put “Expires: Never” can be stuck in caches you can’t reach e.g. CDN, ISPs, browsers
- need to understand the full path of data that is cached, from source to destination, to appreciate complexity and what can go wrong
- Autoscaling
- can scale by well-known trends
- usually better to have some extra capacity
- having a good suite of load tests is almost essential
- use to test autoscaling rules
- may want mix of predictive and reactive scaling
- suggest using autoscaling for failure conditions first while collecting data
- CAP Theorem
- consistency, availability, partition tolerance
- partition tolerance a given so pick consistency or availability
- sacrificing consistency: can have stale data. replication not instantaneous
- sacrificing availability: can’t respond until ensure all other agree and original node didn’t change. very hard to do. do not implement on your own.
- AP or CP?: AP scales more easily. need to know context and discuss trade-offs
- Not all or nothing even within a service
- real world:
- even a perfectly consistent system can be off because of things happening in real world (lost an order)
- AP systems end up being right in many situations
- Service Discovery
- where to find
- DNS:
- associate name with IP
- can have it resolve to host or load balancer
- can have diff env e.g. <service_name>-<env>.musiccorp.com
- DNS well understood/used format. issue is services for managing DNS not designed for highly disposable hosts
- DNS entries have a TTL (time to live), so clients can hold on to old IPs; pointing DNS at a load balancer avoids this
- DNS round robin can be problematic because client hidden from underlying host (what if sick instance?)
- Dynamic Service Registries
- zookeeper:
- general use case: config management, synchronizing data between services, leader election, message queues, naming service
- runs a number of nodes (at least 3)
- replicates safely between nodes and keep consistent in failure
- can watch for changes
- may still need to build on top of it, but distributed coordination works which is hard to get right
- consul:
- exposes http interface for service discovery
- provides DNS server out of the box (serves SRV records)
- can do health checks, but more often use as source of info and pull into monitoring or alerting systems
- RESTful HTTP interface easy to integrate
- newer, but HashiCorp has a good track record
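- a small sketch of looking up healthy service instances through Consul's HTTP health endpoint (the service name and agent address are assumptions for illustration):

```python
import requests

def discover(service_name, consul="http://localhost:8500"):
    resp = requests.get(f"{consul}/v1/health/service/{service_name}",
                        params={"passing": "true"})
    # return (address, port) pairs for instances passing their health checks
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in resp.json()]

print(discover("catalog"))
```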
- Eureka:
- provides load balancing, a REST-based endpoint, and a java client
- at netflix all clients use eureka
- roll your own:
- e.g. add tags to AWS and use query API
- make it easy for humans to access info too
- Documenting Services
- documentation can be out of date
- swagger:
- generate nice UI that lets you view documentation and interact with API via browser
- HAL and HAL browser:
- a standard for embedding hypermedia controls
- must be using hypermedia or retrofit
- a lot of client libraries
- The Self-Describing System
- early days of SOA had standards UDDI, but many approaches heavy-weight
- Martin Fowler discussed concept of human registry or a place humans can record info i.e. simple wiki.
- getting picture of system is important
- access a lot of info through monitoring
- more powerful than simple wiki
- create custom dashboards
- can start with simple web server or wiki, but pull in info over time
- make this info readily available as its a key tool in managing emerging complexity
- Summary
- read “Release It!” by Michael Nygard
- essential reading
- Bringing it All Together
- Principles of Microservices
- model around business concepts:
- more stable than tech boundaries
- better reflect changes in business
- adopt culture of automation:
- adds complexity more moving parts
- front load effort and tooling
- automated testing, uniform deploy cli, CI/CD
- custom images can speed up deployment
- immutable servers easier to reason about
- hide internal implementation details:
- so services can change independently
- hide databases
- use data pumps or event data pumps to consolidate for reporting
- use tech agnostic APIS, consider REST
- decentralize all the things:
- constantly delegate decision making and control
- embrace self-service
- another team shouldn’t be deploying or testing
- use internal open source where appropriate
- conway’s law: align teams to organization
- avoid enterprise service bus or orchestration
- prefer choreography / dumb middleware and smart endpoints
- keep associated logic and data within service boundaries
- independently deployable:
- can coexist versioned endpoints
- let consumers change at own pace
- avoid tightly bound client/server stub generation
- consider blue/green or canary release to separate deployment from release
- consumer-driven contracts can catch breaking changes before they happen
- isolate failure:
- need to plan for it else can have catastrophic cascading failure
- don’t treat remote calls like local
- set appropriate timeouts
- understand when and how to use bulkheads/ circuit breakers to limit fallout of failing components
- understand customer facing impact of one part misbehaving
- implications of sacrificing availability or consistency
- highly observable:
- need a joined up view of what is happening
- semantic monitoring by using synthetic transactions
- aggregate logs and stats but be able to drill down
- use correlation IDs to trace calls
- When Shouldn’t You Use Microservices?
- if you don’t understand domain hard to find proper bounded contexts
- in greenfield development consider starting with monolith
- need tooling and practices to do it well
- start gradually so you understand org’s appetite/ability to change
- Parting Words
- more options / more decisions
- you won’t get all decisions right (guaranteed)
- make decisions in small scope
- embrace concept of evolutionary architecture
- avoid big-bang rewrites in favor of series of changes
- learning to change and evolve system is most important lesson
- change is inevitable so embrace it