- Microservices
- Intro
- Emerged in context of domain driven design, CI/CD, on-demand virtualization, infrastructure automation, small autonomous teams, systems at scale
- goal is more flexibility to respond to change faster
- What are microservices?
- small autonomous services that work together
- service boundaries based on business boundaries
- shouldn’t be too big (~2 weeks to rewrite)
- separate entities, each ideally on its own machine
- communication via network calls, hide internal representations, allow loose coupling
- should be able to make changes, deploy without changing anything else
- Key Benefits
- tech heterogeneity: different tech for different needs e.g. language or data persistence. reduce risk of new tech and absorb new tech quicker.
- resilience: system can keep working and handle failed services, but need to know how to handle machine and network failures
- scaling: can scale just parts that need it
- ease of deployment: can ship or rollback quickly. slow releases means changes build up increasing risk.
- organizational alignment: small teams on small code base are more productive
- composability: reuse. these days can interact with web, native, mobile web, devices, etc.
- optimizing for replaceability: big legacy hard to change and risky
- What About Service-Oriented Architecture?
- microservices is specific approach to SOA
- microservices:SOA::XP/scrum:Agile
- SOA doesn’t offer ways to ensure services don’t become too coupled or how to split up something big
- Other Decompositional Techniques
- libraries: same language / platform. ability to deploy changes in isolation is reduced, and shared libraries can become a coupling point.
- modules: some allow life cycle management, but module authors must deliver proper module isolation and can be big source of complexity.
- having process boundary enforces clean hygiene
- No Silver Bullet
- still hard and not always appropriate
- must get better at deployment, testing and monitoring
- The Evolutionary Architect
- Inaccurate Comparisons
- architects need to have technical vision for how to deliver system to customers
- scope can vary by size/company
- can have huge impact but people often get wrong
- industry is young and don’t know what good looks like
- not quite engineering, no hard certifications, changing environment / requirements
- “architect” terminology caused a lot of harm propagating image of planner who expects others to carry vision out
- An Evolutionary Vision for the Architect
- requirement, tools, tech shift rapidly
- need to react and adapt to users
- need to create framework in which right system can emerge and continue to grow
- like city planner, set zones and let people make specific decisions
- can’t foresee everything so shouldn’t over-specify everything
- Ecosystem includes users, developers, and other workers
- should step in on highly specific implementation details only in limited cases
- Zoning
- zones are like service boundaries or coarse-grained groups inside services
- more important to be concerned about what happens between zones than within
- getting things wrong here leads to all sorts of problems that are hard to correct
- some constraints include high cost to maintain knowledge/expertise for many technologies plus hiring
- there are some benefits of tooling/expertise around specific tech
- can have a lot of issues around integration e.g. REST, protobuf, Java RMI
- coding architect: need to understand impact of decisions and what normal looks like. coding alongside teams should be routine.
- A Principled Approach
- how to make decisions
- strategic goals: company goals. tech must be aligned
- principles: rules that align what you are doing to strategic goals. fewer than 10 is good.
- practices:
- how to ensure principles are carried out.
- practical guide that any developer should be able to understand
- can change often and very technical
- can combine, but for larger orgs may want distinction for different types of teams
- The Required Standard
- need to think about how much variance allowed
- think about a “good” service and what are common characteristics of well-behaved services?
- might not want too much divergence here
- need to balance optimizing autonomy and losing sight of bigger picture
- system wide health monitoring is essential
- should standardize how monitoring and logging are done
- pick 1 or 2 interface technologies and define how they will be used
- architectural safety:
- one bad service shouldn’t ruin everything
- each may have own circuit breaker
- each needs to play by rules consistently otherwise may have more vulnerable system
- Governance Through Code
- exemplars: real world examples that are good. hard to go wrong if imitating
- tailored service templates: out of box set of libraries that provide health checking, serving http, exposing metrics and team/company specific context
- might subtly constrain language choices
- try not to make centralized, teams should have joint responsibility
- may hurt morale/productivity by enforcing a framework
- should be about ease of use for developers
- can become a source of coupling between services (see the sketch below)
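- a minimal sketch in Python of what a tailored service template might provide out of the box (assuming Flask; the `/health` and `/metrics` endpoints are illustrative, not prescribed):

```python
# Hypothetical starting point a service template could give every new service:
# HTTP serving, a health check, and basic metrics exposure.
from flask import Flask, jsonify

def create_service(name):
    app = Flask(name)
    request_count = {"total": 0}

    @app.before_request
    def count_request():
        request_count["total"] += 1

    @app.route("/health")
    def health():
        # teams extend this with checks for their own dependencies
        return jsonify(status="UP", service=name)

    @app.route("/metrics")
    def metrics():
        return jsonify(requests_total=request_count["total"])

    return app

if __name__ == "__main__":
    create_service("catalog").run(port=8080)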
- Technical Debt
- sometimes cut corners but cost to debt
- sometimes due to changing vision
- need to understand balance and have view of debt
- some orgs let teams decide how to pay it down, other places maintain a debt log and review it regularly
- Exception Handling
- keep track of when you make decisions that deviate from principles and practices
- if happens enough may want to revisit rules
- some places more strict
- microservices about autonomy, if company places a lot of restrictions, microservices may not be for you
- Governance and Leading from the Center
- ensure what we are building matches vision
- do with governance group
- should predominantly consist of people who are executing the work being governed
- group as whole responsible for governance
- usually the architect goes with the group's decision unless it is very dangerous/risky; overruling too often undermines the group and makes the team feel like they don't have a say
- Building a Team
- not just tech decisions, but people and helping them grow
- have them involved in shaping and implementing vision
- microservices provides way for people to step up and accept more responsibility
- Conclusion
- vision, empathy, collaboration, adaptability, autonomy, governance
- constant balancing
- worst is to be rigid
- How to Model Services
- Introducing MusicCorp
- helps to have an example
- want to grow but make changes easily
- What Makes a Good Service?
- loose coupling & high cohesion
- loose coupling:
- change in one service doesn’t affect another
- tight integration style can lead to coupling
- services should know little about others
- limit types of calls
- high cohesion:
- related behaviors sit together
- make behavioral changes in one place
- making changes in multiple places slower and riskier
- The Bounded Context
- in domain driven design model real world domains
- there are multiple bounded contexts and there are things that should and shouldn’t be shared
- each bounded context has explicit interface where it decides what models to share
- bounded context: specific responsibility enforced by explicit boundaries
- communicate using models
- musiccorp has warehouse, reception desk, finance, ordering
- shared and hidden models:
- some items only relevant to one bounded context
- will be shared items, but can have internal and external representations based on what needs to be shared
- models with same name may have different meanings in different contexts
- modules and services:
- failing to think about what should and shouldn't be shared leads to tight coupling
- bounded contexts lend themselves well to being compositional boundaries
- using modules good place to start
- premature decomposition:
- very costly to be wrong with services, wait for things to stabilize
- easier to have something existing and decompose
- Business Capabilities
- for each bounded context, think about what capabilities it provides rest of domain
- first, think about what does this context do
- second, what data does it need?
- these capabilities become key operations exposed over the wire
- Turtles All the Way Down
- bounded contexts can contain nested contexts, or the nested ones can be broken out as top-level contexts
- easier to test/stub nested bounded contexts
- Communication in Terms of Business Concepts
- often want changes that business wants
- if services decomposed along bounded contexts changes likely isolated
- the same terms and ideas shared between parts of organization should be reflected in your interfaces
- The Technical Boundary
- can slice horizontally by tech e.g. frontend, data access over db
- not always wrong
- may make sense to achieve certain performance
- Integration
- Looking for the Ideal Integration Technology
- avoid breaking changes: consumers shouldn't need to change as well
- keep APIs technology agnostic
- simple for consumers to use: can allow total freedom or provide libraries
- hide internal implementation details
- Interfacing with Customers
- more complex than CRUD app, can kick off multiple processes
- The Shared Database
- most common form of integration
- bound to the db's internal implementation details; schema changes can break consumers
- consumer tied to specific tech
- update operation might be spread between consumers / lose cohesion
- easier to share data but not behavior
- Synchronous Versus Asynchronous
- important because guides implementation details
- sync easier to reason about
- async better for long-running or low latency but more involved
- req/res is sync, but can make async with callbacks
- event-based: client emits an event and expects subscribers to know what to do
- client doesn’t have to know about subscribers / highly decoupled
- Orchestration Versus Choreography
- orchestration: central brain can have too much authority and be central point
- choreography: services subscribe and act accordingly. more decoupled, but need to build in monitoring and tracking
- Remote Procedure Calls
- make remote call look like local call
- some implementation details tied to specific network protocol
- easy to use
- can have tech coupling e.g. Java RMI, but thrift and protobuf have a lot of lang support
- remote calls are not like local calls; they have different cost and reliability profiles consumers should be aware of
- brittleness: clients and servers may have to be updated in lock-step. in practice objects used in binary serialization thought of as expand-only types
- REST
- concept of resources
- doesn’t talk about underlying protocol but mostly used with HTTP
- large ecosystem and tools for caching proxies, load balancing, monitoring and security
- hypermedia as the engine of application state:
- a piece of content carries links
- client performs interactions via links
- greatly abstracted from underlying details
- can be chatty, but good to start here and optimize later
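- a small sketch of a client navigating by link relations rather than hard-coded URIs (assuming the `requests` library; the album resource and its "links" shape are made up for illustration):

```python
# Hypothetical hypermedia client: it only knows link relations, not URI layouts.
import requests

def find_link(resource, rel):
    return next(l["href"] for l in resource["links"] if l["rel"] == rel)

album = requests.get("http://catalog.musiccorp.com/albums/123").json()
# the client follows the "artist" relation; the server is free to move the URI
artist = requests.get(find_link(album, "artist")).json()
print(artist["name"])
```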
- json, xml, else?
- json most popular
- xml may have better link control and can also extract certain parts with xpath
- too much convenience:
- some frameworks take db representation and deserialize into in-process objects and directly expose
- results in high coupling
- one strategy is delaying data persistence until interface has stabilized / store in local disk
- too easy to let db storage influence how we send models over the wire
- defers work, but for new service boundaries acceptable trade-off
- downsides to REST over HTTP:
- cannot easily generate client stubs, which may push people toward RPC over HTTP or shared client libraries
- some frameworks don’t support all verbs
- may not be as lean as binary protocols
- might not be great for low latency
- web sockets might be better for streaming
- consuming payloads might be more work than some RPC implementations with advanced serialization
- generally is a sensible default though
- Implementing Asynchronous Event-Based Collaboration
- message brokers like RabbitMQ (sketch below)
- enterprise solutions try to pack in extra smarts; keep the middleware dumb and the endpoints smart
- can use HTTP to propagate events, but not good for low latency and consumers may have to manage their own polling schedule
- leads to complexity, should specify max retry and have good monitoring and be able to trace requests
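- a minimal sketch of event-based collaboration over RabbitMQ using the pika client; the exchange, event, and service names are assumptions for illustration:

```python
# Publisher emits a domain event; subscribers decide for themselves what to do.
import json
import pika

def publish_customer_created(customer_id):
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="customer-events", exchange_type="fanout")
    channel.basic_publish(exchange="customer-events", routing_key="",
                          body=json.dumps({"event": "CustomerCreated",
                                           "customer_id": customer_id}))
    conn.close()

def run_loyalty_subscriber():
    # e.g. the loyalty service reacts without the publisher knowing it exists
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.exchange_declare(exchange="customer-events", exchange_type="fanout")
    queue = channel.queue_declare(queue="", exclusive=True).method.queue
    channel.queue_bind(exchange="customer-events", queue=queue)
    channel.basic_consume(queue=queue, auto_ack=True,
                          on_message_callback=lambda ch, method, props, body: print(body))
    channel.start_consuming()
```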
- Services as State Machines
- microservice should own all logic associated with behavior in this context
- avoid dumb services that only CRUD
- shouldn’t let decisions about domain leak out of right service
- have one place to manage state collisions or attach behaviors that correspond to changes
- Reactive Extensions
- compose results of multiple calls and run operations on them
- observe outcome and react when something changes
- abstract out details of how calls are made
- good for handling concurrent calls to downstream services
- DRY and the Perils of Code Reuse in a Microservice World
- shared libraries can lead to too much coupling
- may not be able to independently upgrade client and server
- things like logging is ok / invisible to outside world
- don’t violate DRY within a service, but be more relaxed across services
- too much coupling worse than code duplication
- if you have client libraries, have separation, maybe separate people do API vs library
- keep logic out of client library
- Access by Reference
- how to pass around info for domain entities
- if in memory, can be out of date
- can include a reference i.e. URI
- may cause too much load; include a timestamp with the reference so consumers know how fresh the data was
- no hard rule, but be wary about freshness
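- a small sketch of passing a reference plus a retrieval timestamp instead of embedding a (possibly stale) entity; the event shape and URIs are hypothetical:

```python
# The event carries a URI and a timestamp; consumers re-fetch if they need
# fresher state than the snapshot's age allows.
import datetime
import requests

event = {
    "event": "OrderShipped",
    "customer_uri": "http://customer.musiccorp.com/customers/42",
    "customer_as_of": datetime.datetime.utcnow().isoformat() + "Z",
}

def current_email(evt):
    # consumer decides whether the referenced data is fresh enough, or re-fetches
    return requests.get(evt["customer_uri"]).json()["email"]
```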
- Versioning
- defer as long as possible by picking right integration tech and decoupling clients (e.g. tolerant reader)
- catch breaking changes early with consumer-driven contracts
- semantic versioning / MAJOR.MINOR.PATCH
- co-exist different end points:
- gives consumers time to migrate
- expand and contract
- can include version in headers or uri
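- a minimal sketch of coexisting versioned endpoints during expand/contract (assuming Flask; routes and payload shapes are made up, with the old version delegating to the new representation):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def load_customer(customer_id):
    return {"id": customer_id, "given_name": "Jo", "family_name": "Bloggs"}

@app.route("/v2/customers/<int:customer_id>")
def get_customer_v2(customer_id):
    return jsonify(load_customer(customer_id))

@app.route("/v1/customers/<int:customer_id>")
def get_customer_v1(customer_id):
    # old consumers keep the old field names until they migrate
    c = load_customer(customer_id)
    return jsonify({"id": c["id"], "name": f"{c['given_name']} {c['family_name']}"})
```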
- multiple concurrent service versions:
- usually not great idea
- need to handle directing consumers and may need to manage/fix two services
- may only want to do if short-lived
- User Interface
- historically have been thick, moved to thin, and now thick again
- in recent history, UI has become compositional layer that weaves together various capabilities
- constraints vary by how people interact e.g. mobile must factor in physical usage, battery, network bandwidth, no right click
- API composition: UI can directly interact with services, but can be very chatty and little ability to tailor responses by device
- UI fragment composition: different services compose their widget. issues with consistency / UX. some capabilities may not fit as a widget. might not work with native.
- backends for front ends:
- could have a single api gateway handle this, but don't want one giant aggregating layer
- better to have service for each frontend e.g. mobile, customer website
- think as part of UI
- careful it doesn’t take on logic it shouldn’t
- business logic should reside in services
- only contain behavior specific to UI
- hybrid: some solutions work better for different things
- don’t put too much behavior into intermediate layers
- Integrating with Third-Party Software
- rational to use off-the-shelf software / buy when tool isn’t special
- lose control to vendor / should be part of tool selection process
- customization may be hard or expensive
- might have communication protocols that involve reaching into datastore
- one solution: for CMS wrap (facade) into service and keep scope of what it does to a minimum and make easy to access
- strangler pattern:
- intercept calls to legacy system and decide route to new or legacy
- replace functionality gradually
- Splitting the Monolith
- It’s All About The Seams
- in a monolith changes can impact rest of system and will have to redeploy entire system
- seam: portion of code that can be treated in isolation
- Breaking Apart MusicCorp
- identify high level bounded contexts
- first create packages representing them
- packages should interact and have dependencies similar to the real organizational groups
- work incrementally!
- The Reasons to Split the Monolith
- start with where you will get most benefit
- areas we expect lots of changes?
- team organization
- do some functionalities require more security e.g. finance?
- would some service benefit from certain type of tech?
- Tangled Dependencies
- find least depended on seam
- view services as a DAG
- The Database
- common area of integration
- Getting to Grips with the Problem
- see which parts interact with db
- usually via repository layer
- split up repository to different parts corresponding to bounded contexts
- Example: Breaking Foreign Key Relationships
- prevent services from directly accessing rows in a table
- set up API from owning service
- additional db calls, but may still be performant
- lose foreign key and consistency checks
- the people defining the system/domain should decide how to represent cases where consistency is violated
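- a small sketch of replacing a foreign-key join with an API call to the owning service; the URL and field names are illustrative only:

```python
# Rather than joining directly to the catalog's table, keep only the catalog
# item ID and ask the catalog service for details when needed.
import requests

CATALOG_URL = "http://catalog.musiccorp.com"

def ledger_entry_with_title(entry):
    # extra network call instead of a foreign-key join; consistency handling
    # now lives in application code rather than the database
    item = requests.get(f"{CATALOG_URL}/items/{entry['catalog_item_id']}").json()
    return {**entry, "title": item.get("title", "unknown item")}
```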
- Example: Shared Static Data
- e.g. list of countries
- A) could duplicate in tables of services
- B) config file or keep in code
- C) separate service if sufficiently complex
- generally config file is simplest
- Example: Shared Data
- services have to reach into same table
- can mean there is a domain concept that isn’t explicitly modeled
- Example: Shared Tables
- may not be normalized or can break it up
- Refactoring Databases
- first split schema
- then split service
- review and assess things first with new schema broken out
- make easier to revert
- Transactional Boundaries
- splitting the database means losing this safety
- try again later: can have a queue
- abort entire operation: need a compensation transaction. can be hard to reason about. what if it fails?
- distributed transactions: transaction manager gets consensus and each system needs to OK to proceed / can be blocking
- If you really need transactional safety, think about creating a concept to represent the transaction itself and keep logic here
- Reporting
- Need to figure out how to make architecture work with existing processes
- The Reporting Database
- have a read replica, but the data can't be restructured for reporting needs and comes in one format/schema
- Data Retrieval via Service Calls
- ok if simple, but not good for large volumes
- can hurt load
- Data Pumps
- each service pushes data to reporting db
- teams owning service should implement pump and version/deploy together
- schedule to run regularly
- can aggregate via materialized view
- Event Data Pump
- subscribe to events and pump data to reporting database
- less coupling / bind to emitted event
- fresher data, and data is inserted as events arrive rather than in bulk
- separate group can manage this easier
- all required info must be broadcast and may not scale well
- Backup Data Pump
- use backup data as source
- still have coupling to the destination reporting schema
- Toward Real Time
- is reporting data all going to one place?
- there are dashboards, alerting, financial reports, user analytics
- different tolerances for accuracy and timeliness
- things moving to more generic eventing systems
- Cost of Change
- promote incremental change
- think about impact / you will make mistakes
- make mistakes where impact will be lowest
- cost of change and mistakes lowest at white board
- think about how things interact. are there circular references? overly chatty apps?
- class responsibility collaboration (CRC) cards: names, responsibilities, collaborators
- when working with design, can help see how things hang together
- Understanding Root Causes
- growing a service is okay, but should split before that becomes too expensive
- hard to know where to start, but also challenges with running and deploying services
- Deployment
- A Brief Introduction to Continuous Integration
- keep everyone in sync and make sure code properly integrated
- use consistent artifact for testing and deploying
- automate creation of artifact and version control
- Are you really doing it?
- do you check into main line once a day?
- suite of tests to validate changes?
- is fixing a broken build #1 priority?
- Mapping Continuous Integration to Microservices
- monolithic build: requires lock step release
- prefer one CI build per service with one service repo resulting in one artifact
- Build Pipelines and Continuous Delivery
- multiple stages to build pipeline
- compile/fast tests, slow tests, UAT, performance testing, production
- production readiness of every check-in
- usually want one microservice per build, but major exception is if new team/project
- can keep services larger until understanding of domain stabilizes
- if you can't get stable service boundaries while breaking them out, merge back into a more monolithic build
- Platform-specific Artifacts
- e.g. ruby gems, java jar/war, python eggs
- may not be enough for some e.g. ruby/python still need a process manager running inside apache/nginx
- automation can hide differences in deploy mechanisms
- Operating System Artifacts
- redhat/centOS has RPMs, Ubuntu has deb, windows has MSI
- can be hard to build
- potential overhead of multiple different OS
- Custom Images
- can take a long time to provision instances from scratch
- bake in common dependencies to virtual machine image
- building can take long time and size can be large
- building for different platforms can be a challenge e.g. VMWare, AWS AMI, Vagrant, Rackspace
- can also bake service into image
- immutable servers: avoid config drift and make config changes go through build pipeline
- Environments
- as you progress i.e. laptop -> build server -> UAT -> prod
- want to ensure environments are more and more production-like to catch problems sooner
- constant balance between cost and fast feedback
- Service Configuration
- ideally small otherwise run into problems occurring only in certain environments (very painful)
- use one single artifact and manage configs separately
- can also have dedicated system if dealing with large number of microservices
- Service-to-Host Mapping
- because of virtualization use term “host” over machine as general unit of isolation
- can put multiple services on one host, but makes monitoring, deploying, scaling harder
- recommend single service per host
- easier to monitor / remediate
- reduce single point of failure
- easier to scale single service
- focused security concerns
- use alternate deployment techniques
- cost may include more servers and hosts to manage
- PaaS can be good when it works, but might not be customizable enough
- Automation
- automation is essential
- give devs ability to self-service and provision services
- picking tech that enables automation is highly important, starting with the tools used to manage hosts
- From Physical to Virtual
- traditional virtualization: independent machines, but overhead / cost of splitting
- vagrant: deployment platform usually used for dev/test. can tax average machine and may not be able to run entire system.
- linux container:
- don’t need hypervisor, containers share same kernel (where process tree lives)
- faster and more lightweight than VM
- still need to route to outside world
- also processes can interact with other containers
- if you want isolation, VM may be better
- docker:
- hides underlying technology
- fast/lightweight
- deploy and install docker apps
- still need scheduling layer e.g. Kubernetes
- A Deployment Interface
- should have uniform interface
- like using single parameterized command-line call with name, version, environment e.g. “deploy artifact=catalog environment=local version=local”
- version is usually local, latest, or specific
- Fabric (python library) useful for mapping command line calls to functions / can pair with Boto to automate AWS environments
- can use yaml files for environment definitions
- a lot of upfront work but essential for managing deployment complexity
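- a sketch of a uniform, parameterized deploy entry point (done here with argparse rather than Fabric; the deploy step itself is stubbed and the flag names are illustrative):

```python
# Uniform deployment interface, roughly:
#   deploy --artifact catalog --environment local --version local
import argparse

def deploy(artifact, environment, version):
    print(f"deploying {artifact} version {version} to {environment}")
    # resolve 'local'/'latest'/specific version, read an environment
    # definition (e.g. a YAML file), provision hosts, install the artifact...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="deploy a service artifact")
    parser.add_argument("--artifact", required=True)
    parser.add_argument("--environment", default="local")
    parser.add_argument("--version", default="local")
    args = parser.parse_args()
    deploy(args.artifact, args.environment, args.version)
```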
- Testing
- Types of Tests
- tech facing: unit and property help developers
- business facing: acceptance and exploratory
- Test Scope
- test pyramid top to bottom: UI, service, unit
- unit tests:
- simple functions or methods
- not launching any services and limiting use of external files or network connections
- small individual scope, fast, helps refactor
- service tests:
- test services directly
- in monolithic app might be collection of classes
- only test service itself and stub external collaborators
- end-to-end tests:
- entire system
- high test scope and confidence
- too many of these can be slow and make builds hard
- Implementing Service Tests
- stubs return canned responses
- mocks make sure calls were made
- Mountebank: tool for stubs/mocks
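- a quick sketch of stub vs. mock using Python's unittest.mock; the points-bank collaborator is hypothetical:

```python
from unittest.mock import Mock

def award_points(points_bank, customer_id, amount):
    points_bank.credit(customer_id, amount)
    return points_bank.balance(customer_id)

# stub: returns a canned response, we only care about the value it hands back
stub_bank = Mock()
stub_bank.balance.return_value = 150
assert award_points(stub_bank, 42, 50) == 150

# mock: we verify the expected call was actually made
mock_bank = Mock()
award_points(mock_bank, 42, 50)
mock_bank.credit.assert_called_once_with(42, 50)
```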
- Those Tricky End-to-End Tests
- easy to cover same ground
- slow and might be hard to diagnose issues
- Flaky and Brittle Tests
- can have failures not due to app and make test untrustworthy
- hard to determine who writes these tests
- don’t want another team to own completely
- treat like shared code base
- Test Journeys, Not Stories
- instead of every functionality, focus on small number of core journeys
- any functionality not covered here, use service tests
- journeys should number in the low double digits even for complex systems
- Consumer-Driven Tests to the Rescue
- consumer defines expected behavior
- producer checks that incoming API call receives expected behavior
- run only against single producer in isolation
- sit at same level as service tests
- pact is open source tool that helps do this
- requires good communication between consumer and producer teams
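- a hand-rolled illustration of the consumer-driven contract idea (Pact automates this properly; the contract shape, endpoint, and port here are made up):

```python
# The consumer team records what it relies on; the producer's build replays
# that expectation against the real service running in isolation.
import requests

# written by the consumer team (e.g. the web shop)
CUSTOMER_CONTRACT = {
    "request": {"method": "GET", "path": "/customers/42"},
    "response_must_include": ["id", "email", "status"],
}

def test_producer_honours_contract(base_url="http://localhost:8080"):
    contract = CUSTOMER_CONTRACT
    resp = requests.get(base_url + contract["request"]["path"])
    assert resp.status_code == 200
    body = resp.json()
    for field in contract["response_must_include"]:
        assert field in body, f"breaking change: missing field {field}"
```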
- So Should You Use End-to-End Tests?
- could potentially get rid of, but can be good training wheels
- also depends on the team's appetite to learn in production
- Testing After Production
- can always add more tests to catch errors but diminishing returns
- can’t reduce chance of failure to zero
- separate deployment from release
- blue/green: deploy and run smoke tests
- canary: trickle traffic / coexisting versions / sometimes copy production load
- sometimes more beneficial to get better at remediation of a release than adding more automated functional tests
- many orgs don’t spend effort on better monitoring or failure recovery
- in certain cases, e.g. trying to validate business idea, may not need tests
- Cross-Functional Testing
- acceptable latency, number of users supported, security
- falls under property testing
- may want to track at service level and decide acceptable thresholds
- performance tests:
- more network calls
- tracking down sources of latency is important
- run in prod-like environment
- takes time maybe run a subset everyday
- make sure to actually look at the numbers
- Monitoring
- Intro
- monitor the small things and use aggregation to see bigger picture
- Single Service, Single Server
- host: CPU, memory
- server: logs
- application: response time or errors
- Single Service, Multiple Servers
- monitor same things
- can use ssh multiplexor to check multiple at same time
- can use load balancer to track response times
- Multiple Services, Multiple Servers
- harder to do
- need collection and central aggregation
- Logs, Logs, and Yet More Logs…
- use centralized subsystem to make available centrally e.g. logstash
- kibana is tool for viewing and querying logs
- Metric Tracking Across Multiple Services
- look at metrics long-enough time frame to know “normal”
- should be able to look at aggregations at different levels and drill down
- graphite is a tool that handles some of this and compacts older metrics (resolution decreases with age)
- can be helpful for capacity planning
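- a tiny sketch of pushing a metric to Graphite's plaintext listener (port 2003 is the conventional default; the metric path and host are made up, and a real setup would usually go through a client library or statsd):

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003):
    # Graphite's plaintext protocol: "<metric.path> <value> <unix_timestamp>\n"
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("utf-8"))

send_metric("catalog.api.response_time_ms", 87)
```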
- Service Metrics
- try to expose a lot of metrics
- 80% of software features never used
- metrics can help inform what is actually being used
- can be hard to know what will be useful in future so err on side of exposing everything
- Synthetic Monitoring
- fake events or synthetic transactions
- often better indicator that something is working than low level metrics
- can use end-to-end tests but be careful not to trigger things on accident
- Correlation IDs
- hard to track especially if things async
- generate GUID and pass along to subsequent calls
- put in soon, otherwise may not have it when you need it
- could put this in a thin shared client wrapper, but keep it very thin
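- a small sketch of minting a correlation ID at the edge and propagating it on downstream calls; the `X-Correlation-Id` header name is a common convention rather than a standard, and the downstream URL is made up:

```python
import uuid
import requests

def handle_inbound_request(headers):
    # reuse an incoming ID if present, otherwise mint one as early as possible
    correlation_id = headers.get("X-Correlation-Id", str(uuid.uuid4()))
    print(f"correlation_id={correlation_id} handling request")  # include it in every log line
    call_downstream("http://loyalty.musiccorp.com/points/42", correlation_id)
    return correlation_id

def call_downstream(url, correlation_id):
    return requests.get(url, headers={"X-Correlation-Id": correlation_id})
```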
- The Cascade
- services might look healthy but can’t talk
- important to monitor integration points between services
- Standardization
- balance between narrow decisions vs standardization
- monitoring is one area where standardization is incredibly important
- must be able to view in holistic way
- write logs in std format
- have metrics in one place
- standardize metric names too
- Consider the Audience
- what do they need to know right now in order to react?
- what might people need later?
- how do people like consuming data?
- create big displays and make accessible
- The Future
- metrics siloed in many orgs/companies
- can think of metrics as events
- tying user and application logs can help real-time analysis
- many orgs moving away from specialized tool chains to more generic event routing systems
- Security
- Authentication and Authorization
- authentication: confirm principal is who they say they are
- authorization: allowed actions
- ideally not every service will need to handle separately
- common single sign-on implementations:
- SSO e.g. SAML or OpenID
- principal trying to access resource directed to authenticate with identity provider
- username/pw or 2-factor
- identity provider can be internal (LDAP) or external
- single sign on gateway:
- use gateway to act as proxy
- downstream services can get info about principal in headers
- harder to reason about service in isolation
- potentially single source of failure
- careful gateway layer doesn’t take on too much responsibility / results in coupling
- fine-grained authorization:
- can put people in groups
- let microservices make decisions based on them
- careful to not embed too much logic into group e.g. CALL_CENTER_50_DOLLAR_REFUND
- model around how your organization works
- Service-to-Service Authentication and Authorization
- allow everything inside perimeter:
- trust implicitly
- if an attacker penetrates the perimeter there is little protection; also vulnerable to man-in-the-middle attacks
- HTTP(S) Basic Authentication:
- username/pw in header, but visible if HTTP
- HTTPS can mitigate but need to manage certificates for different machines
- traffic sent via SSL cannot be cached by reverse proxies
- can have load balancer terminate SSL traffic and cache behind it
- SAML or OpenID Connect:
- still need to route over HTTPS
- need accounts for clients
- keep scope narrow
- need to securely store credentials
- client certificates:
- more onerous than server-side certificates
- only want to use when especially concerned about sensitivity of data being sent
- HMAC over HTTP:
- hash-based message authentication code
- body and a private key are hashed together; the server recomputes the hash to detect any modifications
- traffic can then be sent over plain HTTP since tampering is detectable
- downside: 1) both client and server need shared secret 2) pattern not standard / can be hard to get right 3) only ensures no changes, still visible
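- a minimal HMAC sketch using Python's hmac/hashlib; the header handling and key management are illustrative only (real keys would never be hard-coded):

```python
import hashlib
import hmac

SHARED_SECRET = b"do-not-hardcode-me"  # both client and server hold this

def sign(body: bytes) -> str:
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature_header: str) -> bool:
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(sign(body), signature_header)

body = b'{"amount": 10.0}'
assert verify(body, sign(body))
assert not verify(b'{"amount": 9999.0}', sign(body))  # tampered body is rejected
```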
- API Keys:
- use to identify and set restrictions
- common to use public and private key-pair
- The Confused Deputy Problem:
- a malicious party can trick an intermediary service into making calls it shouldn't be able to
- no simple answer
- 1) implicit trust
- 2) verify identity of caller
- 3) asking caller to provide credentials of original principal
- Securing Data at Rest
- data is liability
- should be encrypted
- go with what is well known / DO NOT TRY TO IMPLEMENT YOURSELF
- It’s all about keys:
- one solution is to use a separate security appliance to encrypt and decrypt
- life cycle management of keys can be vital operation
- pick your targets: can limit encryption to sensitive data
- decrypt on demand: encrypt data when first seen, decrypt only on demand, and ensure decrypted data isn't stored anywhere
- encrypt backups
- Defense in Depth
- firewalls: restrict traffic/access by port/IP
- logging: lets you see if something bad happened. make sure not to leak important info
- intrusion detection: monitor network/hosts and report problems
- network segregation:
- VPC
- which can see each other
- routing traffic through gateways to proxy access creates multiple perimeters
- OS:
- patch software regularly and automatically
- Be Frugal
- don’t store what you don’t need
- can’t be stolen or demanded from you
- The Human Element
- revoke credentials when someone leaves
- social engineering protection
- what damage can be done by a disgruntled employee?
- The Golden Rule
- Don’t write your own crypto!
- Baking Security In
- educate developers
- familiarize with OWASP Top ten list, OWASP security testing framework
- automated tools can probe for vulnerabilities
- can integrate into CI Builds
- External Verification
- external assessments
- penetration testing
- Conway’s Law and System Design
- Evidence
- loosely/tightly coupled orgs:
- commercial product firms produced less modular software than loosely coupled orgs e.g. distributed open source software
- for windows vista, microsoft found metrics associated with org structure proved to be most statistically relevant
- Netflix and Amazon
- amazon saw benefit of team owning lifecycle of system. 2 pizza team and AWS allowed self-sufficiency
- Netflix wanted small independent teams to optimize speed of change
- Adapting to Communication Pathways
- single teams with single service can make changes quicker due to low cost of communication
- geographically dispersed teams might have harder time resulting in hard to maintain code
- single team owning many services results in tight integration
- Service Ownership
- own requirement gathering, building, deploying, maintaining
- increased autonomy and speed of delivery
- incentives to make easy to deploy
- Drivers for Shared Services
- too hard to split: consider merging teams
- feature teams: cut across technical layers
- delivery bottleneck: can wait, add people to team needing help, or break into new service
- Internal Open Source
- if you can’t avoid sharing service
- have core committers that ensure quality and encourage good behavior
- need to balance their time
- hard to do if project not mature / may not know what good looks like
- tooling: version control, discussions, code review process
- Bounded Contexts and Team Structures
- good to have teams aligned by bounded contexts
- easier to grasp domain
- services within bounded context more likely to communicate making system design and release coordination easier
- The Orphaned Service?
- should still have owner if services separated by bounded contexts
- risk of polyglot approach is teams may not know tech stack of old service making change harder
- Conway’s Law in Reverse
- have seen anecdotally system design influencing org structure
- People
- developers must be more aware e.g. calls across network boundaries, implications of failure
- more accountability
- staff needs time to adjust
- possible you need different types of people
- Microservices at Scale
- Failure Is Everywhere
- at scale failure is certain
- spend less time trying to stop inevitable and more time dealing with things gracefully
- thinking this way helps you make different trade-offs
- How Much Is Too Much?
- know what failures you can tolerate per service
- don’t need to go overboard on everything
- define general cross functional requirements and override for particular use cases
- response / latency: how long should operation take? e.g. 90th percentile response time of 2 seconds when handling 200 concurrent requests
- availability: 24/7? can it be down?
- durability of data:
- how much data loss is acceptable?
- how long to keep?
- different based on data e.g. logs vs financial transactions
- Degrading Functionality
- need to know how to handle system failures
- shouldn’t block everything else
- often not technical decision and requires business context
- Architectural Safety Measures
- slow responses often worse than fast failures which are easier to detect
- look into timeout configs, connection pools, circuit breakers
- The Antifragile Organization
- netflix actively incites failures to ensure robust systems
- timeouts:
- balance between slowing whole system and killing req/res that would have worked
- pick default, log, look into, and adjust
- circuit breakers:
- after certain number of request failures
- can be timeout or 5XX code
- automatically stop sending / fail fast
- pick sensible defaults and change for specific cases
- options after 1) queue and try later, 2) fail and propagate error 3) degrade functionality subtly
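- a minimal circuit breaker sketch in Python, roughly matching the behavior described above (thresholds and cool-down values are arbitrary assumptions):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of hammering a sick downstream service."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: let one attempt through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```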
- bulkheads:
- way to isolate failures
- connection pool for each downstream connection
- mandate circuit breakers for all synchronous downstream calls
- most important pattern, ensures resources aren’t constrained in first place
- isolation:
- less coupling and reliance on other services’ health
- requires less coordination between service owners
- Idempotency
- outcome doesn’t change after first application
- repeat operation w/o adverse impact
- more important that underlying business operations are idempotent vs entire state of system
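- a toy sketch of an idempotent receiver: each business operation carries an ID, so redelivering or retrying it has no additional effect (in-memory stores used purely for illustration):

```python
applied_operations = set()
balances = {}

def award_points(operation_id, customer_id, points):
    if operation_id in applied_operations:       # already seen: do nothing
        return balances[customer_id]
    balances[customer_id] = balances.get(customer_id, 0) + points
    applied_operations.add(operation_id)
    return balances[customer_id]

assert award_points("op-123", 42, 100) == 100
assert award_points("op-123", 42, 100) == 100    # safe to retry/redeliver
```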
- Scaling
- helps with 1) failure and 2) reduce latency and handle more load
- go bigger:
- vertical scaling
- can be more expensive and software might not be written in way to take advantage of CPU cores
- doesn’t help resiliency
- splitting workload:
- have services in independent hosts
- split critical from non-critical and manage load more effectively
- spreading your risk:
- host can be virtual / put on more than one physical machine
- more than in a single rack in data center
- distribute across more than one data center
- know SLA’s and if that works for you
- AWS has availability zones within regions
- diversify appropriately
- load balancing:
- hardware and software based
- can distribute calls and remove unhealthy instances and add back when healthy
- SSL termination
- safer if using VLAN (virtual local area network) or VPC
- treat config like anything else and version control
- worker-based systems:
- workers work on same shared backlog of work
- good for batch or async jobs
- helps manage peaky loads
- improve throughput and resiliency and make sure nothing lost
- individual workers don’t have to be reliable but system containing work needs to be
- can handle with persisted message broker system like zookeeper
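- a tiny sketch of the worker-based shape: several workers pulling from one shared backlog (queue.Queue here only illustrates the pattern; a real system would use a persistent broker so queued work survives failures):

```python
import queue
import threading

backlog = queue.Queue()

def worker(worker_id):
    while True:
        job = backlog.get()
        if job is None:                 # sentinel: shut this worker down
            break
        print(f"worker {worker_id} resizing image {job}")
        backlog.task_done()

for i in range(3):                      # scale throughput by adding workers
    threading.Thread(target=worker, args=(i,), daemon=True).start()

for image_id in range(10):
    backlog.put(image_id)
backlog.join()
```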
- starting again:
- jeff dean: “design for 10x growth, expect rewrite before 100x”
- building for scale at beginning can be disastrous because don’t know what we want to build
- need to be able to experiment rapidly and understand capabilities we need
- needing to change is sign of success
- Scaling Databases
- difference between available and durable
- scaling for reads:
- read replicas and writes go to primary node
- some stale data / eventual consistency
- suggest looking into caching first
- scaling for writes:
- sharding / makes queries more difficult
- query all nodes or have alternative read store
- usually handled by async mechanisms using cached results
- recent tech makes adding a new node in a live system do-able, but test thoroughly
- often teams change db tech when needing to scale writes
- shared database infrastructure:
- usually don’t want multiple services under different schemas within one database
- single point of failure
- consider risks if doing so and make sure db resilient as possible
- CQRS:
- command-query responsibility segregation
- commands represent changes in state
- can process as events / storing a list of commands (event sourcing)
- can better handle scale
- can be hard to get right
- Caching
- client-side, proxy, and server-side:
- client-side caching can help reduce network calls / invalidating stale data can be tricky
- proxy e.g. squid/varnish. cache any http traffic
- server-side might be easier to reason about and can be good if multiple types of clients
- usually mix of all
- need to know what load to handle and how fresh data needs to be
- caching in http:
- cache-control directives say how long client should keep
- expires headers gives date
- helps prevent requests in first place
- ETags let you know if resource changed
- careful not to give conflicting info
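- a small sketch of serving Cache-Control and ETag headers and honouring conditional requests with 304 Not Modified (assuming Flask; the resource and max-age value are made up):

```python
import hashlib
from flask import Flask, request, make_response

app = Flask(__name__)
ALBUM = b'{"id": 123, "title": "Greatest Hits"}'

@app.route("/albums/123")
def get_album():
    etag = hashlib.sha1(ALBUM).hexdigest()
    if request.headers.get("If-None-Match") == etag:
        return make_response("", 304)               # client's copy is still valid
    resp = make_response(ALBUM)
    resp.headers["Content-Type"] = "application/json"
    resp.headers["Cache-Control"] = "max-age=300"   # keep for 5 minutes
    resp.headers["ETag"] = etag
    return resp
```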
- caching for writes:
- can be good if flushed to downstream source
- buffer and potentially batch writes
- can queue up writes if downstream unavailable
- caching for resilience:
- client can show stale data if downstream not available
- reverse proxy can send stale data
- may be better than returning nothing
- hiding the origin:
- for highly cacheable data, prevent requests from going to origin
- origin populates cache in background
- if cache miss fail original request
- fail fast to avoid taking up resources to prevent cascading failure downstream
- cache poisoning:
- if you put “Expires: Never” can be stuck in caches you can’t reach e.g. CDN, ISPs, browsers
- need to understand the full path of data that is cached, from source to destination, to appreciate complexity and what can go wrong
- Autoscaling
- can scale by well-known trends
- usually better to have some extra capacity
- having a good suite of load tests is almost essential
- use to test autoscaling rules
- may want mix of predictive and reactive scaling
- suggest using autoscaling for failure conditions first while collecting data
- CAP Theorem
- consistency, availability, partition tolerance
- partition tolerance a given so pick consistency or availability
- sacrificing consistency: can have stale data. replication not instantaneous
- sacrificing availability: can’t respond until ensure all other agree and original node didn’t change. very hard to do. do not implement on your own.
- AP or CP?: AP scales more easily. need to know context and discuss trade-offs
- Not all or nothing even within a service
- real world:
- even a perfectly consistent system can be off because of things happening in real world (lost an order)
- AP systems end up being right in many situations
- Service Discovery
- where to find
- DNS:
- associate name with IP
- can have it resolve to host or load balancer
- can have diff env e.g. <service_name>-<env>.musiccorp.com
- DNS well understood/used format. issue is services for managing DNS not designed for highly disposable hosts
- DNS entries have a TTL (time to live), so clients can hold on to old IPs; pointing DNS at a load balancer avoids this
- DNS round robin can be problematic because client hidden from underlying host (what if sick instance?)
- Dynamic Service Registries
- zookeeper:
- general use case: config management, synchronizing data between services, leader election, message queues, naming service
- runs a number of nodes (at least 3)
- replicates safely between nodes and keep consistent in failure
- can watch for changes
- may still need to build on top of it, but distributed coordination works which is hard to get right
- consul:
- exposes http interface for service discovery
- provides DNS server out of the box (serves SRV records)
- can do health checks, but more often use as source of info and pull into monitoring or alerting systems
- RESTful HTTP interface easy to integrate
- newer, but HashiCorp has a good track record
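- a small sketch of looking up healthy service instances through Consul's HTTP health endpoint (the service name and agent address are assumptions for illustration):

```python
import requests

def discover(service_name, consul="http://localhost:8500"):
    resp = requests.get(f"{consul}/v1/health/service/{service_name}",
                        params={"passing": "true"})
    # return (address, port) pairs for instances passing their health checks
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in resp.json()]

print(discover("catalog"))
```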
- Eureka:
- provides load balancing, a REST-based endpoint, and a java client
- at netflix all clients use eureka
- roll your own:
- e.g. add tags to AWS and use query API
- make it easy for humans to access info too
- Documenting Services
- documentation can be out of date
- swagger:
- generate nice UI that lets you view documentation and interact with API via browser
- HAL and HAL browser:
- a standard for embedding hypermedia controls
- must be using hypermedia or retrofit
- a lot of client libraries
- The Self-Describing System
- early days of SOA had standards UDDI, but many approaches heavy-weight
- Martin Fowler discussed concept of human registry or a place humans can record info i.e. simple wiki.
- getting picture of system is important
- access a lot of info through monitoring
- more powerful than simple wiki
- create custom dashboards
- can start with simple web server or wiki, but pull in info over time
- make this info readily available as its a key tool in managing emerging complexity
- Summary
- read “Release It!” by Michael Nygard
- essential reading
- Bringing it All Together
- Principles of Microservices
- model around business concepts:
- more stable than tech boundaries
- better reflect changes in business
- adopt culture of automation:
- adds complexity more moving parts
- front load effort and tooling
- automated testing, uniform deploy cli, CI/CD
- custom images can speed up deployment
- immutable servers easier to reason about
- hide internal implementation details:
- so services can change independently
- hide databases
- use data pumps or event data pumps to consolidate for reporting
- use tech agnostic APIS, consider REST
- decentralize all the things:
- constantly delegate decision making and control
- embrace self-service
- another team shouldn’t be deploying or testing
- use internal open source where appropriate
- conway’s law: align teams to organization
- avoid enterprise service bus or orchestration
- prefer choreography / dumb middleware and smart endpoints
- keep associated logic and data within service boundaries
- independently deployable:
- can coexist versioned endpoints
- let consumers change at own pace
- avoid tightly bound client/server stub generation
- consider blue/green or canary release to separate deployment from release
- consumer-driven contracts can catch breaking changes before they happen
- isolate failure:
- need to plan for it else can have catastrophic cascading failure
- don’t treat remote calls like local
- set appropriate timeouts
- understand when and how to use bulkheads/ circuit breakers to limit fallout of failing components
- understand customer facing impact of one part misbehaving
- implications of sacrificing availability or consistency
- highly observable:
- need a joined up view of what is happening
- semantic monitoring by using synthetic transactions
- aggregate logs and stats but be able to drill down
- use correlation IDs to trace calls
- When Shouldn’t You Use Microservices?
- if you don’t understand domain hard to find proper bounded contexts
- in greenfield development consider starting with monolith
- need tooling and practices to do it well
- start gradually so you understand org’s appetite/ability to change
- Parting Words
- more options / more decisions
- you won’t get all decisions right (guaranteed)
- make decisions in small scope
- embrace concept of evolutionary architecture
- avoid big-bang rewrites in favor of series of changes
- learning to change and evolve system is most important lesson
- change is inevitable so embrace it