- The switchover to 64 bit processors is snowballing; soon only handheld devices will still be 32 bits (and not even all handhelds)
- Multimedia extensions are pretty well established with 128 bit SIMD registers, and 256 bit SIMD registers are starting to be seen, especially on specialized hardware and GPUs
- ISAs generally specify an equal number of integer and FP registers, 8-32 of each in the 32 bit world and 16-128 of each in the 64 bit world
- Fewer than 32 registers per file usually results in enough register contention to cause noticeable performance penalties in most applications
- Newer ISAs tend to specify more registers, but legacy ISAs tend to stick around for a very long time
- There is still no resolution to the big-/little-endian debate and frankly probably won't be
- Upper hierarchy levels (closer to the CPU) gain speed faster than lower ones; over time, this results in more levels being added to try to even out the (geometric?) spacing between the performance of the various levels
- The size of upper hierarchy levels grows very slowly, while lower levels grow very quickly. Sometimes middle levels grow, and sometimes they just multiply.
- Cache line lengths grow slowly, probably because the size of fundamental data types and structures that get used all at once tends to grow slowly
- Conversely, even the smallest cache lines hold 2-4 words of data and larger ones contain considerably more, so algorithms that have a long basic stride (e.g. reading only the first word of many 16-word structures) will waste considerable memory bandwidth
- A slow processing core can make up for this deficiency by having a very fast memory subsystem, especially one with large upper levels, as long as the application relies more on large data sets than heavy computation per data element
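To put a rough number on the long-stride case above, the following C++ sketch estimates what fraction of the fetched bytes an access pattern actually uses. It is a back-of-the-envelope model, not a measurement: it assumes each element starts on a cache-line boundary whenever the stride is at least one line, and it ignores hardware prefetching. The name `useful_fraction` and its parameters are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Estimate the fraction of fetched bytes that a strided read actually uses.
//   stride_bytes: distance between the starts of consecutive elements
//   used_bytes:   bytes read from the start of each element
//   line_bytes:   cache line size
// Assumption: elements are line-aligned when stride_bytes >= line_bytes.
double useful_fraction(std::size_t stride_bytes,
                       std::size_t used_bytes,
                       std::size_t line_bytes) {
    assert(stride_bytes > 0 && line_bytes > 0);
    assert(used_bytes > 0 && used_bytes <= stride_bytes);

    std::size_t fetched;
    if (stride_bytes < line_bytes) {
        // Elements share lines, so a sequential walk ends up fetching
        // every byte of every element.
        fetched = stride_bytes;
    } else {
        // Each element pulls in enough whole lines to cover what it reads.
        std::size_t lines = (used_bytes + line_bytes - 1) / line_bytes;
        fetched = lines * line_bytes;
    }
    return static_cast<double>(used_bytes) / static_cast<double>(fetched);
}
```

With 4-byte words, reading only the first word of each 16-word (64-byte) structure gives `useful_fraction(64, 4, 64)`, i.e. 1/16 of the fetched bandwidth is useful with 64-byte lines, and 1/4 with 16-byte lines.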
- After 2005-2006, most high-performing machines will have multiple CPU cores, which may be on the same die or multiple dies
- The total count of cores per system is expected to rise rapidly; it is not yet clear whether this trend will level off, or maximum parallelism will eventually occur (every pixel on the display gets a dedicated pipeline or CPUs support many thousands of threads, or the like)
- Also, an increasing number of cores will support more than 1 thread "simultaneously"
- Threadcount per core will rise slowly, as beyond a certain point this becomes inefficient compared to simply adding more cores
- Multi-core CPUs must have smaller cores in order to fit in the same die area; they normally achieve this by returning to the days of in-order, scalar or very small superscalar core designs
- Certain multi-core designs take advantage of identical cores to offset the decrease in yield from larger dies, by simply binning down dies with a bad core or literally slicing off non-working cores.
- In-order designs can be great for inline SIMD code, but tend to be poor at unoptimized integer/branch code, unless their pipelines are very short
- NUMA architecture becoming increasingly common
- Steadily increasing in generality and programmability
- While becoming more CPU-like in programming model, have very different set of performance metrics and optimization choices
- Steadily increasing bag of optimization tricks; many of these should always be enabled by software regardless of actual graphics core because the techniques will help on new cores, and do no harm on old cores, thus leading to free scalability
- Throughput increase per year much faster than CPU throughput improvement
- Clock rates relatively low (GPUs tend to favor IPC over MHz)
- Latency and asynchrony very high
- Direct3D allows graphics drivers to feed the GPU up to 3 frames ahead of current display
- OpenGL asynchrony appears unbounded save by sync APIs and resource usage
- Current drivers often 1+ frames ahead already
- GPU pipeline hundreds or thousands of cycles long, at relatively low clock rate
- At their fastest when everything can be done:
- In SIMD mode
- Without readback (especially readback to main CPU)
- With most data local to graphics memory
- With work balanced between pipeline stages
- Maximum and optimum linear resolutions increase slowly
- Maximum and optimum color resolutions increase VERY slowly
- Handheld devices tend to follow the numerical trends of desktop cousins, but sophisticated usage comes quicker due to industry body of experience
- With advent of PCI Express, more graphics vendors supporting virtual graphics memory and similar concepts (some even allow swapping all the way to disk!)
- Microsoft is requiring that card vendors support full OS control of graphics and graphics-accessible memory regions for Longhorn
- Speed and bandwidth of local graphics RAM are more important than size of local RAM past a certain minimum point, which depends on particular variation on virtual memory concept used, and to an apparently lesser extent game engine and game content
- Many systems include sound cards or motherboard sound chips
- They're all very different
- Several APIs available to cover these differences
- Expectations for sound are rising steadily
- Most engines support (most of) 2.0, 2.1, 3.1, 4.1, and 5.1
- Some engines support 7.1
- Modern engines support a large set of physically-based sound filters and modulators
- Modern engines support many sound samples per basic effect, so that people don't notice the exact same sound each time
- Rumors are surfacing of physics coprocessors in near future
- Unreal 3 and future Bioware games will support the PhysX PPU (through the NovodeX API, which falls back to software)
- Valve Software keeps a set of PC gaming system survey statistics through their Steam service at http://www.steampowered.com/status/survey.html
- Splits
- nVidia 52%, ATI 42%
- Intel 52%, AMD 48%
- Some rough ranges from Valve's charts, as of 2005-08-24
- dead = <~5% have equal or worse
- dying = <~20% have equal or worse
- normal = most systems in this range
- rising = <~20% have equal or better
- edge = <~5% have equal or better
General Stats for Gaming PCs
| Stat | Dead | Dying | Normal | Rising | Edge |
| Internet speed | ISDN | 256K | 768K-2M | 10M | |
| System RAM | 128M | 256M | 256M-1G | 1-1.5G | >1.5G |
| CPU count | | | 1 | | >=2 |
| CPU Hz, Intel | <1.2G | <2.0G | 2.0-3.0G | >3.0G | >=3.3G |
| CPU Hz, AMD | <1.2G | <1.7G | 1.7-2.1G | >2.1G | >2.3G |
| GPU Class | <DX7 | DX8 | DX8.1-9+ | DX9b/c | |
| Free HD | <1G | <10G | 10-80G | >80G | >150G |
| Total HD | 30G | <60G | 60-190G | >190G | |
| Optical Drive | | CD | DVD | | |
| Horizontal res | <640 | <800 | 800-1024 | >1024 | >1280 |
| Color depth | | | 16-32 | | |
Note: The last two rows above are from an older survey; they no longer appear to be captured.
Video Card Classes for Gaming PCs
| Class | ATI Models | nVidia Models | % |
| Pre-DX7 | <LE,IGP320-340,<7200 | <GeForce 256 | 1.13 |
| DX7 (Fixed T&L) | LE/SDR/DDR,7200-7500 | 256,2,4MX | 11.55 |
| DX8 (PS1.3, VS1.1) | | 3,4Ti=4200-4800 | 5.99 |
| DX8.1 (PS1.4, VS1.1) | 8500-9250 | | 6.87 |
| DX9 (PS2.0, VS2.0) | 9500-X600 | | 26.88 |
| DX9 (PS2.0+, VS2.0+) | | FX=5200-5950 | 18.38 |
| DX9 (PS2.0b, VS2.0) | >=X700 | | 5.87 |
| DX9.0C (PS3.0, VS3.0) | | 6200-7800 | 14.46 |
| Unknown | | | 1.50 |
| Intel 845 (DX7,GL1.2) | | | 1.35 |
| Intel 915G (DX9,GL1.4) | | | .46 |
| SiS 661FX-M760 (DX7) | | | .40 |
| Intel 852/855 (DX7,GL1.3) | | | .38 |
| SiS 650/651/M650/740 (DX7) | | | .36 |
| S3 ProSavageDDR (DX7?) | | | .31 |
| Intel 810 (T&L, stencil) | | | .24 |
| Other | | | 3.91 |
- Typical layouts:
- usually 2 or more register files, split by data type
- often 2 L1 caches, one each instructions and data
- B=bytes, C=Cycles
- (64)=most common in range for current designs
Memory Hierarchy Layouts (per CPU)
| | REG | L1 | L2 | L3 | Main | HD |
| Hit lat., C | 1 | 2-4 | 8-30 | 20-40 | 50-300 | huge |
| Read BW, B/C | max | 8 | 2-4 | ? | 1-3 | 1/50 |
| B/line | 4-32 | 4-(64) | 4-128(64) | ? | 4-32(16) | 512 |
| Total size | <=1K | 8-128K(16-64K) | 64K-2M(1M) | (0)-256M | 256M-16G(1G) | 40+G |
- Expected per-chip trend for CPUs:
- Now: 1-2 cores, each supporting 1-2 threads
- Near: 2-8 cores, each supporting 1-4 threads
- Mid: 4-32 cores, each supporting 1-8 threads
- Long: ???
- Various platforms for 2005-2006 (cores per chip * threads per core):
- AMD64: 2 * 1
- Pentium D/XE: 2 * 1, 2 * 2
- Merom: 2 * 1, 4 * 1
- UltraSPARC: 1 * 2, 2 * 1
- SPARC Niagara: 8 * 4
- SPARC Rock: 16 * 4?
- POWER 5: 2 * 2
- PowerPC: 2 * 1?
- Xenon (PPC, XBox 360): 3 * 2
- Cell (PPC, PS3): 1 * 2 + 7 * 1 (8 SPE - 1 expected bad)
- Cell and Xenon proving to be 1/3 to 1/10 speed per clock for AI code
- Maximum linear resolution:
- Handhelds: 128x128 to 800x480
- Mini-laptops: 640x480 to 1024x768
- Desktops: 1024x768 to 1920x1200
- HD Consoles: 852x480, 1280x720, 1920x1080
- Workstations: 1600x1200 to huge multi-screen
- Maximum color resolution:
- Oldest consumer 3D accelerators, 15 or 16 bits (5-5-5 or 5-6-5)
- Medium old consumer devices, 24 bits (8-8-8)
- Newer consumer devices have floating point color and destination alpha, for 48, 64, 96, or 128 bits (16-16-16, 16-16-16-16, 32-32-32, 32-32-32-32)
- Code 32-/64-bit and big-/little-endian clean
- Try to keep algorithms within as small a minimum register footprint as possible
- Separate computation kernels from surrounding code, and allow them to be easily converted to hand-coded versions for various ISAs
- Take advantage of SIMD registers and APIs that use them
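As one way to read the "32-/64-bit and big-/little-endian clean" point above, the sketch below serializes a 32-bit value through explicit byte shifts and fixed-width types rather than casting pointers or relying on the size of `int` or `long`. The helper names `load_le32` and `store_le32` are made up for this example; choosing little-endian as the on-disk order is also just an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Read a 32-bit little-endian value from a byte buffer.
// Works the same on big- and little-endian hosts, and on 32- and 64-bit
// builds, because it never reinterprets memory as a wider type.
std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Write a 32-bit value to a byte buffer in little-endian order.
void store_le32(unsigned char* p, std::uint32_t v) {
    p[0] = static_cast<unsigned char>(v & 0xFFu);
    p[1] = static_cast<unsigned char>((v >> 8) & 0xFFu);
    p[2] = static_cast<unsigned char>((v >> 16) & 0xFFu);
    p[3] = static_cast<unsigned char>((v >> 24) & 0xFFu);
}
```

Small helpers like these are also natural candidates for the "separate computation kernels" advice: a per-ISA version could replace them with a byte-swap instruction without touching the callers.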
- Don't treat memory as equal, flat, and fast
- Design so that app's memory usage mirrors system's hierarchy:
- Small set of register variables
- Data structures and algorithms that are cache-line friendly
- Hierarchy of tightly-packed/frequently-used data through loosely-packed/rarely-used data
- Expect both poor bandwidth and poor latency for data at lower levels in hierarchy, and design to hide this
- Aggressively pursue multiprocessing and multithreading
- If near-free optimisations would let "nearby" threads get locality speedups, apply them
- To maintain efficiency for older devices, handhelds, and multithreaded cores (which are fairly inefficient) make sure multiprocessing has as little overhead as possible when used with fewer cores
- Beware of lock contention and data motion bottlenecks
- Convert many types of "optimized brute force" algorithms that try to branch around a few percent of work per element to true unbranched brute force, which is friendlier to in-order non-superscalar small cores
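The last point above can be illustrated with a small C++ sketch: the same conditional sum written once with a branch around the skipped work, and once as "true" brute force where every element does identical work. This is only a toy; the function names are invented here, and a given compiler may turn either form into the other, so any benefit would have to be measured on the target core.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// "Optimized brute force": branches around the add for failing elements.
std::int64_t sum_above_branchy(const std::int32_t* v, std::size_t n,
                               std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > threshold) {
            sum += v[i];
        }
    }
    return sum;
}

// Unbranched brute force: every element is multiplied by 0 or 1 and added,
// so the loop body has no data-dependent control flow in the source.
std::int64_t sum_above_unbranched(const std::int32_t* v, std::size_t n,
                                  std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::int64_t keep = static_cast<std::int64_t>(v[i] > threshold);
        sum += keep * v[i];
    }
    return sum;
}
```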
- Allow, and be efficient with, a wide range of linear resolutions
- Handle different aspect ratios cleanly
- Create LOD and multi-aspect assets
- Support manual and automatic LOD and aspect ratio alteration
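A minimal sketch of one way to handle differing aspect ratios: compute the largest centred rectangle of a desired aspect ratio inside whatever window or display is available, leaving bars on the unused sides. `Viewport` and `fit_viewport` are hypothetical names for this example, and integer truncation is used where a real engine might round differently.

```cpp
#include <cassert>

struct Viewport {
    int x;
    int y;
    int width;
    int height;
};

// Largest centred rectangle with aspect ratio aspect_w:aspect_h that fits
// inside a window of window_w x window_h pixels.
Viewport fit_viewport(int window_w, int window_h, int aspect_w, int aspect_h) {
    Viewport vp;
    // Compare window_w/window_h with aspect_w/aspect_h by cross-multiplying,
    // which avoids floating point.
    long long lhs = static_cast<long long>(window_w) * aspect_h;
    long long rhs = static_cast<long long>(window_h) * aspect_w;
    if (lhs > rhs) {
        // Window is wider than the content: use full height, bars at the sides.
        vp.height = window_h;
        vp.width = static_cast<int>(
            static_cast<long long>(window_h) * aspect_w / aspect_h);
    } else {
        // Window is narrower (or equal): use full width, bars top and bottom.
        vp.width = window_w;
        vp.height = static_cast<int>(
            static_cast<long long>(window_w) * aspect_h / aspect_w);
    }
    vp.x = (window_w - vp.width) / 2;
    vp.y = (window_h - vp.height) / 2;
    return vp;
}
```

For example, 16:9 content on a 1280x1024 desktop gets a 1280x720 viewport offset 152 pixels from the top, while 4:3 content on a 1920x1080 display gets a 1440x1080 viewport offset 240 pixels from the left.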
- Assume 24 bit color standard
- Degrade well to 15/16
- Use floating point color if available
- Use destination alpha if available
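Degrading from 24-bit colour to the 15/16-bit layouts listed earlier (5-5-5 and 5-6-5) can be sketched as simple bit packing. The version below just truncates the low bits of each channel; dithering, which would hide banding better, is not shown. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Pack 8-8-8 RGB into 5-6-5 by dropping the low bits of each channel.
std::uint16_t pack_rgb565(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint16_t>(
        ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Pack 8-8-8 RGB into 5-5-5 (top bit left clear).
std::uint16_t pack_rgb555(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint16_t>(
        ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
```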
- Use high-level APIs pervasively
- Throw as much work to the coprocessor as possible
- Avoid readback, or hide latency as much as possible
- Some readback algorithms (especially optimization tricks) are still effective if previous frame's data is used instead of current frame
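The "use the previous frame's data" idea above amounts to a small ring of pending results: each frame issues a new request and consumes the one issued some frames earlier, so the CPU never waits on the GPU. The sketch below models only that bookkeeping on the CPU side. `DelayedReadback` is a hypothetical class, not part of Direct3D or OpenGL; in real code the stored values would be handles to API query objects or readback buffers.

```cpp
#include <cassert>
#include <cstddef>

// Ring of results that are consumed Delay frames after they are issued.
// T must be default-constructible and copyable.
template <typename T, std::size_t Delay>
class DelayedReadback {
public:
    // 'initial' is returned for the first Delay frames, before any real
    // result is old enough to be read.
    explicit DelayedReadback(const T& initial) : frame_(0) {
        static_assert(Delay > 0, "Delay must be at least one frame");
        for (std::size_t i = 0; i < Delay; ++i) {
            slots_[i] = initial;
        }
    }

    // Call once per frame: records the result issued this frame and returns
    // the result that was issued Delay frames ago.
    T exchange(const T& issued_this_frame) {
        std::size_t slot = frame_ % Delay;
        T ready = slots_[slot];
        slots_[slot] = issued_this_frame;
        ++frame_;
        return ready;
    }

private:
    T slots_[Delay];
    std::size_t frame_;
};
```

With `DelayedReadback<int, 2>`, the value passed in on frame N comes back out on frame N+2, which matches drivers that already run one or more frames ahead.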
- Use most portable API that will do the job
- Support full range of speaker counts and arrangements
- Support most free codecs
- Handle full sound effect processing
- Create deep effect asset library
- Data sets increasing rapidly, both in count of objects, and in detail per object
- This applies pretty much across the board, from texture resolution to actors visible on screen, to linear size of levels, and so on
- Increasing trend toward level-less games that load and unload needed data on the fly
- In games where story makes sense, more companies are using published authors -- some even keep them permanently on staff, as conversion from pure linear to game non-linear takes experience not to be wasted
- Unreal 3 engine uses base models of 8-14 million triangles, which are then reduced to 100K(?)
- Bioware's RPG story formula:
- Intro -- <1% of game, like Half Life opening movie; do little or nothing, just soak in the world/mood
- Prelude -- ~5%, tell player who they are, introduce play mechanics and character motivation
- The Linear Start -- ~10%, ease player into game, fill in more game mechanics and features, give clear short term goal
- The Wide Open World -- 70%, clear but non-immediate goals, search and explore for answers
- The Linear Finale -- 15%, main goal completed, start "drop of doom" to the end game
- Bioware's RPG character types:
- Information giver
- Quest giver
- Storekeeper
- Ambient character (just hanging out, fun to talk to)
- One-liner (says something funny)
- Plot advancer (tells you where to look when stuck)
- Villain's henchmen
- Comedian
- Villain
- Ignoramus (just annoys the player, but it also pulls them in)