Software Architecture and Decision-Making Book notes


Book Core concepts

The book explains the principles and concepts that its author believes a senior architect must understand deeply, and discusses how to employ those principles to manage uncertainty.

Books for leadership

  • The Hard Thing About Hard Things by Ben Horowitz

  • Trillion Dollar Coach by Eric Schmidt et al.

  • Team of Teams: New Rules of Engagement for a Complex World by Stanley McChrystal

  • Good Strategy, Bad Strategy by Richard Rumelt

Three layers of architecture of The Open Group Architecture Framework (TOGAF)

Two prominent approaches to system architecture:

  1. Waterfall -> identify the system’s requirements in full detail beforehand, then start building

  2. Agile -> work iteratively, collaborating with users to refine requirements and construct a system that genuinely benefits them

Understanding Systems, Design, and Architecture

When writing a cloud app, we have two choices: we can choose a single cloud and take advantage of its unique strengths for our application, or we can make the application portable across several cloud providers.

How to Design a System

Five questions and seven principles to understand the context of the system we are building

  1. When is the best time to market?

    • If the feature must ship urgently because of market pressure, we can keep the design simple and fast to build, and be ready to rewrite it later.
  2. What is the skill level of the team?

  3. What is our system’s performance sensitivity?

  4. When can we rewrite the system?

  5. What are the hard problems?

Seven principles

  1. Drive everything from the user’s journey.

  2. Use an iterative thin slice strategy

    • Unless you have a specific reason, always start with simple architectural choices. Measure the system, find the bottlenecks, and improve the system later.
  3. On each iteration, add the most value for the least effort to support more users

  4. Make decisions and absorb the risks

  5. Design deeply things that are hard to change but implement them slowly

  6. Eliminate the unknowns and learn from the evidence by working on hard problems early and in parallel

  7. Understand the trade-offs between cohesion and flexibility in the software architecture

Mental Models for Understanding and Explaining System Performance

Eight mental models that help us think about and understand performance

  1. Cost of Switching to the Kernel Mode from the User Mode

    • Every time an application enters kernel mode, a context switch occurs, which adds nonessential costs to the system, such as the time to save the stack and to reset the cache. To improve performance, we need to reduce the number of system calls.
  2. Operations Hierarchy

| Operation | Time/Speed (Example) | Description |
| --- | --- | --- |
| L1 Cache Reference | 0.5 ns | Accessing data from Level 1 cache in the CPU. |
| L2 Cache Reference | 7 ns | Accessing data from Level 2 cache in the CPU. |
| L3 Cache Reference | 20 ns | Accessing data from Level 3 cache in the CPU. |
| Main Memory (RAM) Access | 100 ns | Retrieving data from the system's RAM. |
| SSD Storage Access | 1,000,000 ns (1 ms) | Accessing data from Solid State Drive (SSD). |
| HDD Storage Access | 10,000,000 ns (10 ms) | Accessing data from Hard Disk Drive (HDD). |
| Network Packet Round Trip (US to India) | 150,000,000 ns (150 ms) | Sending a packet from the US to India and receiving the acknowledgment. |
| CPU Processing (Instruction) | 0.2 ns per instruction | Executing a single CPU instruction. |
| GPU Processing (Parallel) | 50 ns per parallel task | Executing parallel tasks on a Graphics Processing Unit (GPU). |
| PCIe Data Transfer (Device to CPU) | 2,000 ns (2 us) | Transferring data between a peripheral device and the CPU via PCIe. |
| Context Switch (Kernel) | 1,000 ns (1 us) | Switching between different processes in the kernel. |
| RAM-to-Cache Transfer | 20 ns | Copying data from RAM to cache in the CPU. |
| Database Query (Local) | 10,000,000 ns (10 ms) | Executing a database query on a local server. |
| Database Query (Remote) | 50,000,000 ns (50 ms) | Executing a database query on a remote server. |
| System Call (Linux) | 1,000 ns (1 us) | Initiating a system call in a Linux environment. |
| I/O Operation (Disk Write) | 10,000 ns (10 us) | Writing data to a disk. |
| I/O Operation (Network Send) | 1,000,000 ns (1 ms) | Sending data over a network. |
| I/O Operation (Network Receive) | 1,000,000 ns (1 ms) | Receiving data over a network. |
  3. Context Switching Overhead

    • Switching between processes adds an overhead of about 5–7 microseconds.
  4. Amdahl’s Law

    • Amdahl’s law predicts the speedup of a task when it is scaled to run on multiple processors. It states that the maximum speedup is limited by the serial fraction of the task, because that part cannot be spread across processors.

    • The classic illustration: it takes one woman nine months to make one baby, but “nine women can’t make a baby in one month.”

    • For example, if three threads cooperate on a single task that takes 2 seconds, much of that time may go to managing the three threads and sharing resources between them.

    • Parallel processes are efficient when they are independent.

  5. Universal Scalability Law

    • The Universal Scalability Law (USL) says that the actual speedup is even worse than Amdahl’s law predicts, because of shared variables. USL adds a new parameter, coherency, which is the overhead added by communication between multiple processes, threads, or nodes.
  6. Latency and Utilization Trade-offs

    • Most resources can be used by only a single thread at a given time, which forces threads to wait and take turns.
  7. Designing for Throughput with the Maximal Useful Utilization (MUU) Model

  8. Adding Latency Limits
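
To make the two scaling laws above concrete, here is a minimal Python sketch of their usual textbook formulas. The function and parameter names are my own, and the sample values (90% parallel work, a coherency cost of 0.001) are illustrative assumptions, not figures from the book.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law.

    `parallel_fraction` is the share of the task (0..1) that can run in
    parallel; the rest is serial and does not shrink with more workers.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def usl_speedup(workers: int, contention: float, coherency: float) -> float:
    """Relative capacity predicted by the Universal Scalability Law.

    `contention` models serialized (queued) work; `coherency` models the
    cost of keeping workers consistent with each other.
    """
    n = workers
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


# Illustrative values only: 90% parallel work, small coherency penalty.
for n in (1, 2, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2), round(usl_speedup(n, 0.1, 0.001), 2))
```

With the coherency term set to zero, the USL formula reduces to Amdahl's law (contention plays the role of the serial fraction). Any positive coherency cost pushes the curve below Amdahl's prediction, and eventually adding workers makes things slower.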

Optimization Techniques

To optimize, we need to decide where the bottlenecks are. Bottlenecks usually come in one of three forms:

  • One of the resources (e.g., CPU, I/O, or memory) is the bottleneck.

  • Thread models are causing critical resources to be idle.

  • Resources are wasted on nonessential tasks (e.g., context switches, GC).

CPU Optimization Techniques
  • Optimize Individual Tasks

  • Optimize Memory

  • Maximize CPU Utilization

I/O Optimization Techniques
  • Avoid I/O (use a cache)

  • Buffering

  • Send Early, Receive Late, Don’t Ask but Tell

  • Prefetching

  • Append-Only Processing (Kafka)
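
As a small illustration of the "avoid I/O (use a cache)" idea from the list above, the sketch below puts Python's standard `functools.lru_cache` in front of a slow read. `read_from_storage` is a hypothetical stand-in for a disk or network call, and the counter exists only to show how many reads actually reach it.

```python
from functools import lru_cache

# Counts how often the (simulated) slow I/O path is actually taken.
io_calls = 0


def read_from_storage(key: str) -> str:
    """Hypothetical stand-in for a slow disk or network read."""
    global io_calls
    io_calls += 1
    return f"value-for-{key}"


@lru_cache(maxsize=1024)
def get_value(key: str) -> str:
    """Serve repeated reads from memory; only cache misses reach storage."""
    return read_from_storage(key)


for key in ["a", "b", "a", "a", "b"]:
    get_value(key)

print(io_calls)  # 2: only the first read of "a" and of "b" touched storage
```

A real cache also needs an invalidation or expiry policy, which this sketch leaves out.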

Memory Optimization Techniques

  • Too Many Cache Misses

Latency Optimization Techniques

  • Do Work in Parallel

  • Reduce I/O
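
A rough sketch of "do work in parallel" for latency, assuming the calls are independent and I/O-bound. `fetch` and the source names are hypothetical; the sleep simulates a network wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(source: str) -> str:
    """Hypothetical stand-in for an independent, I/O-bound call."""
    time.sleep(0.05)  # simulated network wait
    return f"{source}: ok"


sources = ["profile", "orders", "recommendations"]

# Run one after another, the waits would add up; in a pool they overlap.
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    results = list(pool.map(fetch, sources))  # results keep the input order

print(results)
```

This only helps when the calls do not depend on each other, which matches the note on Amdahl's law earlier in these notes.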

CPU Utilization is Wrong

  • CPU utilization is a metric that measures the percentage of time the CPU spends executing non-idle tasks. It doesn't necessarily mean the CPU is busy with computation; it's more about the time the CPU is not running the idle thread.

  • The idle thread is a special task that runs when the CPU has no other tasks to perform. The operating system kernel tracks CPU utilization during context switches.

  • So it is tempting to read a high %CPU as meaning the processing unit is the bottleneck, but that is often wrong: the CPU is counted as busy even when it is capable of doing more work and is actually stalled, waiting on memory, I/O, or something else.

  • Source: https://opensource.com/article/18/4/cpu-utilization-wrong

The USE Method

The Utilization, Saturation, and Errors (USE) Method is a methodology for analyzing the performance of any system.

For every resource, check utilization, saturation, and errors.

  • resource: all physical server functional components (CPUs, disks, busses, ...)

  • utilization: the average time that the resource was busy servicing work

  • saturation: the degree to which the resource has extra work which it can't service, often queued

  • errors: the count of error events
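
The checklist above can be sketched as a small function that applies the three USE checks to each resource. The `ResourceSample` type, the 80% utilization threshold, and the sample numbers are all assumptions for illustration; in practice the values would come from your monitoring tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0..1.0
    saturation: int     # queued work the resource could not service yet
    errors: int         # count of error events


def use_findings(sample: ResourceSample, max_utilization: float = 0.8) -> List[str]:
    """Return the USE checks this resource fails (threshold is illustrative)."""
    findings = []
    if sample.utilization > max_utilization:
        findings.append("high utilization")
    if sample.saturation > 0:
        findings.append("saturated")
    if sample.errors > 0:
        findings.append("errors")
    return findings


# Made-up sample data for three resources.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=4, errors=0),
    ResourceSample("disk", utilization=0.30, saturation=0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0, errors=0),
]
report = {s.name: use_findings(s) for s in samples}
print(report)
```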

Understanding User Experience (UX)

A few concepts and principles that help us design a good UX:

  • Understand the Users

  • Do as Little as Possible

  • Good Products Do Not Need a Manual: Its Use Is Self-Evident

  • Think in Terms of Information Exchange

    • Users come to our system to get something done. The faster they can find what they need to do and do it, the happier they will be. If we can provide that UX without asking them anything, that’s even better.
  • Make Simple Things Simple

  • Design UX Before Implementation

Note: having UX expertise on the team is a must

Macro Architecture

Macro architecture is about how the system is split into services.

The macro architectural building blocks are:

  • Data Management (databases)

  • Routers and Messaging (API gateway, load balancer, message broker)

  • Executors (the actual servers)

  • Security

  • Communication (distributed hash tables, gossip architectures, tree of responsibility patterns)

Macro Architecture: Coordination

Drive flow from the client

The client makes all the API calls to the services itself and drives the flow.

Another service

Have a separate service that coordinates with the other services and returns the result.

Implement Choreography

An event-driven system in which each participant in the process listens to different events and carries out its individual part. Each action generates asynchronous events, which trigger participants downstream.
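
A minimal sketch of choreography, using an in-process event bus. The event names and handlers are hypothetical. For simplicity the bus delivers events synchronously, whereas a real system would publish them asynchronously through a message broker.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Minimal in-process event bus; a stand-in for a real message broker.
subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
log: List[str] = []


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    subscribers[event_type].append(handler)


def publish(event_type: str, payload: dict) -> None:
    log.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)


# Each participant reacts to one event and emits the next; nobody orchestrates.
def take_payment(order: dict) -> None:
    publish("payment_taken", order)


def ship_order(order: dict) -> None:
    publish("order_shipped", order)


subscribe("order_placed", take_payment)
subscribe("payment_taken", ship_order)

publish("order_placed", {"order_id": 42})
print(log)  # ['order_placed', 'payment_taken', 'order_shipped']
```

No component knows the whole flow. That is the main difference from the coordinating-service approach, and it is also why the overall process is harder to trace.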

Macro Architecture: Preserving Consistency of State

Two-phase commit protocol

Approaches to going beyond transactions

  1. Redefining the Problem to Require Lesser Guarantees

    • Figure out a way to resolve complex situations.

    • Provide a button for users to forcefully refresh the page if they can tell that it is outdated

  2. Using Compensations

    • Starbucks Does Not Use Two-Phase Commit

    • The key idea is that if an action fails, you can compensate.

    • Use compensations only if the following three conditions are satisfied:

      1. Each individual operation can be verified.

      2. The operation is idempotent. If we repeat the operation with the same data, no additional side effects occur.

      3. We can handle the failure or take compensative actions
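
The compensation idea can be sketched as follows: run the steps in order, and if one fails, undo the steps that already completed, in reverse order. The step names and the `run_with_compensation` helper are hypothetical. This is a sketch of the general pattern, not the book's implementation.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    history: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            history.append(f"did {name}")
            done.append(step)
        except Exception:
            history.append(f"failed {name}")
            for undo_name, _, compensate in reversed(done):
                compensate()
                history.append(f"compensated {undo_name}")
            break
    return history


def fail() -> None:
    raise RuntimeError("payment declined")


def noop() -> None:
    pass


history = run_with_compensation([
    ("reserve stock", noop, noop),
    ("charge card", fail, noop),
    ("ship order", noop, noop),
])
print(history)
```

In a real system the compensations themselves should be idempotent, so that they can be retried safely if they fail partway.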

Macro Architecture: Handling High Availability and Scale

Before scaling the system out to N nodes, run a proof of concept (POC) to measure how much load a single node can handle, and tune it to improve performance where possible.

There are four tactics (techniques) to keep communication low, and they help us scale.

  1. Share Nothing

  2. Distribution

  3. Caching

  4. Async Processing

Macro Architecture: Microservices Considerations

Microservices let us split the structure into many loosely connected parts that can be developed, released, improved, and managed independently.

Decisions to make before adopting microservices

  1. Handling Shared Database(s)

    • Each microservice should have its own database, and two microservices must not share data via the same database. This rule removes a common cause that leads to tight coupling between services

    • If services must share a database, make sure only one service performs updates, to avoid conflicting or duplicated data.

    • Use transactions if both services need to update the data.

  2. Securing Microservices

  3. Coordinating Microservices

  4. Avoiding Dependency Hell

    • Deploying one service must not break other services. Where there is a dependency, rely on backward compatibility or forward compatibility.

    • Backward Compatibility: if we update the API to v2 so that it accepts query params, make sure the API still works without query params, the way it did before.

    • Forward Compatibility: if v2 is not working, fall back to v1 (this is only a temporary solution).

    • Avoid dependency hell by following the robustness principle: be conservative in what you do, be liberal in what you accept from others.
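
A small sketch of the backward-compatibility point, assuming a hypothetical `list_users` endpoint whose v2 adds an optional `role` query parameter. The data and names are made up for illustration.

```python
from typing import Dict, List, Optional

# Made-up data for the hypothetical endpoint.
USERS = [
    {"name": "Ana", "role": "admin"},
    {"name": "Ben", "role": "viewer"},
]


def list_users(params: Optional[Dict[str, str]] = None) -> List[str]:
    """v2 of a hypothetical endpoint: accepts an optional `role` filter.

    Calls without parameters behave exactly as v1 did, and unknown
    parameters are ignored instead of rejected (liberal in what we accept).
    """
    params = params or {}
    role = params.get("role")
    return [u["name"] for u in USERS if role is None or u["role"] == role]


print(list_users())                    # old clients: no params, same result as v1
print(list_users({"role": "admin"}))   # new clients: filtered result
```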

Server Architectures

Some guidelines for writing efficient and simple services:

  • Do not reinvent the wheel

  • Use pools to reuse complex objects (to do: look at how Mongoose manages its pool of connections)

  • Make service operations idempotent whenever possible
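
To illustrate the pooling guideline, here is a minimal object pool built on the standard `queue.Queue`. `Connection` is a hypothetical stand-in for any object that is expensive to create. This is a generic sketch and does not describe how Mongoose or any particular driver implements its pool.

```python
import queue
from contextlib import contextmanager


class Connection:
    """Hypothetical stand-in for an object that is expensive to create."""
    created = 0

    def __init__(self) -> None:
        Connection.created += 1


class Pool:
    def __init__(self, size: int) -> None:
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(Connection())

    @contextmanager
    def acquire(self):
        conn = self._free.get()      # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._free.put(conn)     # always hand the connection back


pool = Pool(size=2)
for _ in range(10):
    with pool.acquire() as conn:
        pass  # use the connection here

print(Connection.created)  # 2: ten uses, but only two objects were ever built
```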

Service Architecture

  • Thread-per-Request Architecture

  • Event-Driven (Nonblocking) Architecture

  • Staged Event-Driven Architecture (SEDA)

Classification of applications based on their resource usage

  • CPU-bound applications (CPU >> Memory and no I/O)

  • Memory-bound applications (CPU + Bound Memory and no I/O)

  • Balanced applications (CPU + Memory + I/O)

  • I/O-bound applications (I/O > CPU)

Building Stable Systems

Scenarios that affect stability:

  • Unexpected workloads

  • Resource and dependency failures, plus service-level agreement violations

  • Software bugs in both our own code and borrowed code

Handling Unexpected Load

  • Understand the load and set up a system that can keep up with it most of the time. This is called capacity planning.

  • Autoscaling

  • Admission Control: a set of policies, procedures, and checks that regulate which incoming requests are admitted into the system. Example: if the server is already overloaded, return an error instead of accepting a new connection.

  • Noncritical Functionality: turn off noncritical functionality under heavy load

  • Disaster recovery (have a plan for what to do when X goes down)
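
A minimal sketch of admission control, assuming the policy is simply a cap on the number of requests in flight. The class name and the limit are hypothetical. The sketch is single-threaded, so a real server would need a lock or semaphore around the counter.

```python
class AdmissionController:
    """Reject new work once `max_in_flight` requests are being processed."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            return False  # caller should answer with an error such as HTTP 503
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


controller = AdmissionController(max_in_flight=2)
decisions = [controller.try_admit() for _ in range(3)]  # third one is rejected
controller.release()                                    # one request finishes
decisions.append(controller.try_admit())                # capacity is free again
print(decisions)  # [True, True, False, True]
```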

Handling Human Changes

  • Blue-green deployment (run two systems, one with the new version and one with the old version; if the new one fails, redirect traffic back to the old one)

  • Canary deployments (send a small percentage of users to the new system and gradually increase the load on it)
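
One common way to pick the "small percentage of users" for a canary is to hash the user ID into a stable bucket. The sketch below assumes that approach; the `route` function and the version labels are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Send a stable `canary_percent` of users to the new version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99 for each user
    return "new" if bucket < canary_percent else "old"


users = [f"user-{i}" for i in range(1000)]
on_canary = sum(route(u, 10) == "new" for u in users)
print(on_canary)  # roughly 10% of the users
```

Because each user's bucket is fixed, raising the percentage only adds users to the new version. Nobody who already saw the new system gets bounced back to the old one.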

Handle Unknown Errors

  • Observability (an Application Performance Monitoring (APM) tool, OpenTelemetry)

Building and Evolving the Systems

  • Get the Basics Right

  • Understand the Design Process

  • Conway's law: organizations design systems that mirror their own communication structure, so work with that structure rather than fighting it

  • Make Decisions and Absorb Risks

  • Create checklists of questions that are useful for different situations and use them.

  • Communicating the Design

  • Growth hacking funnel (find where we are stuck: is it on the user side, or in development, for example because we are not shipping enough features?)

References

  1. https://fs.blog/mental-models/

  2. http://highscalability.com/blog/2012/5/16/big-list-of-20-common-bottlenecks.html

  3. https://opensource.com/article/18/4/cpu-utilization-wrong

  4. https://en.wikipedia.org/wiki/Queueing_theory

  5. https://www.brendangregg.com/usemethod.html

  6. https://medium.freecodecamp.org/what-makes-apache-kafka-so-fast-a8d4f94ab145

  7. An Analysis of Web Servers Architectures Performances on Commodity Multicore, https://hal.inria.fr/hal-00674475/document

  8. https://ocw.mit.edu/courses/6-851-advanced-data-structures-spring-2012/video_galleries/lecture-videos/

  9. https://www.thoughtworks.com/insights/blog/scaling-microservices-event-stream

  10. Scalability! But at What COST?, https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

  11. http://martinfowler.com/articles/microservices.html

  12. http://martinfowler.com/articles/microservice-trade-offs.html

  13. http://martinfowler.com/bliki/MicroservicePremium.html

  14. https://cramonblog.wordpress.com/2014/02/25/micro-services-its-not-only-the-size-that-matters-its-also-how-you-use-them-part-1/

  15. https://www.infoq.com/news/2015/06/taming-dependency-hell/

  16. https://news.ycombinator.com/item?id=9705098

  17. https://stackoverflow.com/questions/3570610/what-is-seda-staged-event-driven-architecture

  18. https://berb.github.io/diploma-thesis/

  19. https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html

  20. https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html

  21. https://martinfowler.com/bliki/CQRS.html

  22. https://blog.thinkst.com/2022/08/always-be-hacking.html
