Note: I've written a series of weblog entries on O'ReillyNet about this:
- Make decisions based on knowledge, not guesswork
- Delay optimization/pessimization decisions
- Limit the (user) code impact of changing your mind later
- Don't make it hard for others to make different decisions than you
- Know what you need to optimize for
- Total run time
- Response time
- Latency
- Concurrency
- Throughput
- CPU utilization
- Memory usage
- Limited passes over data
- Real resource limitation
- Artificial API limitation
- Know what the optimization goal is
- Within hard limit
- Usually within soft limit
- As small / large as possible
- Scalable (across data sets, across hardware, etc.)
- When in doubt, optimize for scalability, but still try not to suck on the low end
- Know when the result is good enough, and can return to implementing features or working on other code
- Profile and benchmark with real (or statistically reasonable) data
- P&B many different scenarios, with different data set / client / task / etc. sizes, compositions, shapes, ...
- P&B on many different platforms, with many different configs
- Design runs so that profiling / benchmarking overhead does not seriously skew results
- Perform many runs, and statistically merge the results
- Clear out background noise
- Browsers
- Mail checkers
- Music players
- rsync daemons
- Other users
- Make sure you are benchmarking what you think you are
- For repeatability, may avoid sources of randomness by using hardcoded values and disabling AI code, DB access, etc.
- But don't be surprised that the system now runs much faster, as it is no longer doing nearly as much work
- Determine the minimum possible work, and benchmark it as a baseline; for example, a log file analyzer must at a minimum read the entire logfile at least once, splitting it into records (and possibly fields) for analysis
- If there is a known bottleneck detection process (such as exists for GPUs), use it
- Determine whether slowness is localized or pervasive
- Understand how all relevant supporting systems work (and why)
- Interpreter / compiler
- Virtual machine
- Application frameworks
- Libraries
- OS subsystems (FS, gfx, sound, networking, scheduler, ...)
- Hardware components and busses
- (Co-)processor architectures
- Know the major performance gotchas in above
- Filesystems that can't handle huge directories
- Schedulers that don't do high concurrency or real time
- Busses that have high latency or asymmetric bandwidth
- VMs with slow, leaky, or resource-hogging garbage collection
- Understand computing trends that may affect performance
- Understand trend correlations that will affect scalability
- CPU speed growing much faster than memory latency is falling
- GPU throughput increasing faster than CPU throughput
- Cores per PU is growing faster than throughput per core
- Deeply grok the memory heirarchy, and what it means for performance
- Atomic widths: register, cache line, VM page, disk cluster
- Memory manager chunk and pool sizes
- Bandwidth and latency differences
- Read/write asymmetries
- Coherency protocols
- Cache relevancy algorithms
- Random access v. sequential asymmetries
- Build a mental performance model of your computing environment
- Cost of integer / float / double / string / etc. ops
- Cost of function / method / library / system call and return
- Cost of exception handling
- Inlining behavior
- GC and memory allocation behavior
- Benchmark many, many facets of mental performance model
- Use introspection and disassembly tools to discover reasons for discrepencies from expectation
- Constantly practice "back of the envelope" calculation
- Maintain a large toolbox, so that you can use the right tool for each (sub-)problem
- Understand Big-O notation, and its limitations
- For very large data sets, the big O wins
- For very small data sets, the constant often wins
- Average and worst-case run time are often different
- Algorithmic attacks: inputs that force worst-case behavior
- Recognize that rules of thumb are only guides, and only profiling and benchmarking can show reality
- Recognize that optimization goals have complex interrelationships; for example:
- Improved concurrency could improve reponse time (work units take less time, so more can be done in unit time),
- Or just the opposite (finer-grained locking, causing more lock contention per work unit),
- Or have no effect at all on response time (using more CPUs, each of which remains at the same speed)
- Profile first, then benchmark, then optimize; lather, rinse, repeat
- Optimize the problem space first, then the algorithm, then (and only then) micro-optimize if the system is still not fast enough
- Cater to the strengths of your environment
- Change to (or link to) a different environment if your current environment is a poor match to the problem space
- Design flexible interfaces, with swappable implementations
- Design to make best use of new technologies, while not sucking on old ones
- Make it easy to find and fix bottlenecks
- Avoid general slowness (see below)
- Encapsulate both the whole and the parts (allow optimization of individual parts or replacement of entire design)
- Don't sprinkle slowness everywhere
- Multiple layers of thin wrappers that don't optimize away
- Always using fine-grained APIs instead of course-grained where possible
- Allowing poor, inefficient coding practices to pervade code base
- Minimize performance effects of logging, debugging, and assertions
- Linear core algorithm + quadratic assertion code == OOPS
- Logging should not be heavy
- Consider "smart comments" or other normally-free tools
- Include an install time or (often better) run time self- optimization phase (of data formats, microoptimizations, algorithmic choice, etc.)
- Don't ignore multithreading / multiprocessing
- Don't assume clever tricks will beat fast brute force, but don't assume the inverse either
- Don't optimize away important features
- PRNG strength (and of course, never memoize a PRNG!)
- Correctness under concurrency
- Resilience to bad inputs
- Error recovery abilities
- Don't force users to call an API inefficiently
- Provide both fine-grained and course-grained interfaces
- Provide aggregate-at-a-time calls
- Provide optimized forms of common algorithms applied to API
- Provide meta-description APIs (OpenGL tesselators/evaluators)
- If API is stateful:
- Allow large subsets of state to be set or read at once
- Provide push / pop state stacks
- Allow explicit control of performance tradeoffs
- Delay consistency checks to end of update sequence
- Turn off GC or trigger immediate GC
- Switch variants of algorithm / backing store / caching strategy / etc.
- Explicit command caching (OpenGL display lists)
- Explicit data caching (OpenGL texture objects)
- Perform extra optimization now to improve performance later
- Curried closures as an optimization of setup
- Use fast paths, and expose them (optionally) to callers where appropriate
- It's OK for course-grained / aggregate interface implementations to just do the same fine-grained calls at first, because then they are encapsulated, and if they become bottlenecks, they can be optimized
- Provide tracing tools, if appropriate
- Provide special interfaces to display internal optimization choices, if these are very complex and user error may cause optimization to fail or produce a suboptimal result (e.g. SQL EXPLAIN)
- Make use of performance-enhancing features provided by the API
- Make use of performance tradeoff controls as needed
- Prefer higher-level interfaces over lower-level ones, all things being equal
- Sometimes asking for more work than strictly necessary is actually faster, because it allows a fast path to be used
- Push work into hardware-abstraction APIs, such as OpenGL, even if brute force CPU is currently faster; periphery hardware tends to scale faster, and your code will suddenly scale with no recoding
- Call overhead itself can be brutal; ordered by increasing overhead:
- Subroutines
- Methods
- Library functions
- System calls
- Avoid unnecessary API calls
- Batch multiple single-item calls into an aggregate call
- Rearrange API calls to minimize number of setup / breakdown operations are needed on objects or state
- Cache values read from API, instead of constantly rereading
- If data will be handed to the API over and over (and caching the data in the API is infeasible), pretransform data sets to match the format most preferred by the API
- Some APIs claim all data formats are equal, but in reality implementations tend to have fast paths for the most common formats
- Beware that client code contortions done to make use of special interface features have not overwhelmed performance gains from using those special features
- If there is a standard way to see how an API will process a request (e.g. SQL EXPLAIN), use it to make sure your requests are being handled efficiently
- If the API provides a trace tool (e.g. *nix strace), use it; make sure your code does not perform more calls than you expect
- Convert to algorithms with better worst-case behavior or friendlier computation and memory access patterns
- Sorting: merge sort, radix sort
- Searching: hashing, wide search trees
- Fast-path loop exits
- Keep most likely situations on fast path
- Make sure code to recognize and follow fast path does not net slow the system down
- Too much work to identify fast path case
- Fast pathing an uncommon case
- Separate verify and update / operate stages
- Use identities to make simpler expressions
- Convert dispatch if chains to table driven dispatch
- Optimize boolean/number mixtures
- Convert numeric if chains to algebraic expressions
- Use booleans that numify as the numbers they numify to
- Take advantage of -1 == all ones == ready made AND mask
- If chained tests are really necessary, put common force-yes or force-no conditions first
- Use sentinel values in data structures to simplify algorithmic termination conditions
- Memoize pure functions
- Cache cacheable results
- Use compiler optimization hints
- Inlining hints
- Type annotations
- Allocation hints
- Use types appropriate to the problem, rather than "long double everywhere"
- Avoid non-trivial type coersion
- Use packed data structures when appropriate, but beware of overhead involved in packing and unpacking
- Don't make the GC / mem allocator work harder than necessary
- Reduce temporary variable churn
- Be aware of hidden temps
- Allocate and release large hand-managed pools (though this should be benchmarked; some memory managers are very pool-aware already)
- Try to convert special cases to normal cases (even noop normal cases, if that's what makes them special)
- If you can convert a calculation (almost) entirely to a data lookup, it's worth giving it a try; in some environments, such as GPU coding, data lookups can be vastly faster than calculation, and may even include free extra work, such as linear interpolation
- Move constant run time calculations to compile time
- Allow checking, testing, and debug harness stuff to be turned completely off (in a way that results in zero run time cost), or as much as is safe