- The switchover to 64 bit processors is snowballing; soon only handheld and smaller devices will still be 32 bits (and not even all handhelds)
- Multimedia extensions are pretty well established with 128 bit SIMD registers, and 256 bit SIMD registers are starting to be seen, especially on specialized hardware and GPUs; Intel/AMD CPUs will support 256 bit SIMD registers once AVX processors arrive in 2011
- ISAs generally specify an equal number of integer and FP registers, 8-32 of each in the 32 bit world and 16-128 of each in the 64 bit world
- Fewer than 32 registers per file usually causes enough register contention to produce noticeable performance penalties in most applications
- Newer ISAs tend to specify more registers, but legacy ISAs tend to stick around for a very long time
- There is still no resolution to the big-/little-endian debate and frankly probably won't be
- Upper hierarchy levels (closer to CPU) speed up faster than lower ones; over time, this results in more levels being added to try to even out the (geometric?) spacing between the performance of the various levels
- The size of upper hierarchy levels grows very slowly, while lower levels grow very quickly. Sometimes middle levels grow, and sometimes they just multiply.
- Cache line lengths grow slowly, probably because the size of fundamental data types and structures that get used all at once tends to grow slowly
- Conversely, even the smallest cache lines hold 2-4 words of data and larger ones contain considerably more, so algorithms that have a long basic stride (e.g. reading only the first word of many 16-word structures) will waste considerable memory bandwidth
- A slow processing core can make up for this deficiency by having a very fast memory subsystem, especially one with large upper levels, as long as the application relies more on large data sets than heavy computation per data element
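To put a rough number on the long-stride case above, the sketch below estimates what fraction of the bytes pulled through the cache is actually used. The function name `useful_fraction` and the simple model behind it (aligned elements, at most one line touched per element, no prefetching) are assumptions made for illustration, not measurements of any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Rough estimate of the fraction of fetched bytes that are actually used when
 * a loop reads `used_bytes` from the start of each element and consecutive
 * elements are `stride_bytes` apart.
 * Assumption: the used bytes of an element never straddle two cache lines,
 * and hardware prefetching is ignored. */
double useful_fraction(size_t line_bytes, size_t stride_bytes, size_t used_bytes)
{
    assert(line_bytes > 0 && stride_bytes > 0);
    assert(used_bytes <= stride_bytes && used_bytes <= line_bytes);

    /* With a stride shorter than a line, each line is fetched once and shared
     * by several elements, so the cost per element is the stride.
     * With a longer stride, each element costs one whole line. */
    size_t fetched_per_element =
        stride_bytes < line_bytes ? stride_bytes : line_bytes;

    return (double)used_bytes / (double)fetched_per_element;
}
```

With 4-byte words, reading only the first word of 16-word (64-byte) structures through 64-byte lines is `useful_fraction(64, 64, 4)`: under this model only 1/16 of the fetched bandwidth does useful work.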
- Most high-performing machines have multiple CPU cores, which may be on the same die or multiple dies
- The total count of cores per system is expected to rise rapidly; it is not yet clear whether this trend will level off, or maximum parallelism will eventually occur (every pixel on the display gets a dedicated pipeline or CPUs support many thousands of threads, or the like)
- Also, an increasing number of cores will support more than 1 thread "simultaneously"
- Threadcount per core will rise slowly, as beyond a certain point this becomes inefficient compared to simply adding more cores
- Multi-core CPUs must have smaller cores in order to fit in the same die area; they often achieve this by returning to the days of in-order, scalar or very small superscalar core designs
- Certain multi-core designs take advantage of identical cores to offset the decrease in yield from larger dies, by simply binning down dies with a bad core or literally slicing off non-working cores.
- In-order designs can be great for inline SIMD code, but tend to be poor at unoptimized integer/branch code, unless their pipelines are very short
- NUMA architecture becoming increasingly common
- Steadily increasing in generality and programmability
- While becoming more CPU-like in programming model, have very different set of performance metrics and optimization choices
- Steadily increasing bag of optimization tricks; many of these should always be enabled by software regardless of actual graphics core because the techniques will help on new cores, and do no harm on old cores, thus leading to free scalability
- Throughput increase per year much faster than CPU throughput improvement
- Clock rates relatively low (GPUs tend to favor work per clock over raw MHz)
- Latency and asynchrony very high
- Direct3D allows graphics drivers to feed the GPU up to 3 frames ahead of current display
- OpenGL asynchrony appears unbounded save by sync APIs and resource usage
- Current drivers often 1+ frames ahead already
- GPU pipeline hundreds or thousands of cycles long, at relatively low clock rate
- At their fastest when everything can be done:
- In SIMD mode
- Without readback (especially readback to main CPU)
- With most data local to graphics memory
- With work balanced between pipeline stages
- Maximum and optimum linear resolutions increase slowly
- Maximum and optimum color resolutions increase VERY slowly
- Handheld devices tend to follow the numerical trends of desktop cousins, but sophisticated usage comes quicker due to industry body of experience
- With advent of PCI Express, more graphics vendors supporting virtual graphics memory and similar concepts (some even allow swapping all the way to disk!)
- Microsoft is requiring that card vendors support full OS control of graphics and graphics-accessible memory regions for Vista
- Speed and bandwidth of local graphics RAM are more important than size of local RAM past a certain minimum point, which depends on particular variation on virtual memory concept used, and to an apparently lesser extent game engine and game content
- Most systems include sound cards or motherboard sound chips
- They're all very different
- Several APIs are available to cover these differences
- Expectations for sound are rising steadily
- Most engines support (most of) 2.0, 2.1, 3.1, 4.1, and 5.1
- Some engines support 7.1
- Modern engines support a large set of physically-based sound filters and modulators
- Modern engines support many sound samples per basic effect, so that people don't notice the exact same sound each time
- A few games supported the discrete PhysX PPU (through the NovodeX API, which could fall back to software)
- Unfortunately the PPU as a discrete device was a market failure; the PhysX device was badly bottlenecked and suffered from a chicken-and-egg market problem
- Nowadays physics processing is performed using either GPU resources (if available) or CPU threads (if no GPU is available)
- Several major physics libraries exist, including at least two open source libraries; no physics library has overwhelming market share yet
- Valve Software gathers a set of PC gaming system survey statistics through their Steam service, collated at http://www.steampowered.com/status/survey.html
- Below data culled on 2006-12-12 (from survey started 2006-11-15)
- Around 500,000 systems surveyed
- Splits
- nVidia 54.4%, ATI 39.0%, Other 6.6%
- AMD 50.5%, Intel 49.5%
- Some rough ranges from Valve's charts:
- dead = <~ 5% have equal or worse
- dying = <~20% have equal or worse
- normal = most systems in this range
- rising = <~20% have equal or better
- edge = <~ 5% have equal or better
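The rough ranges above can be written down as a small classifier. This is only a sketch: the names `StatRange` and `classify_stat` are made up for illustration, and the approximate 5% and 20% cutoffs are treated as hard limits, which the list itself does not claim.

```c
#include <assert.h>

typedef enum {
    STAT_DEAD,
    STAT_DYING,
    STAT_NORMAL,
    STAT_RISING,
    STAT_EDGE
} StatRange;

/* Both arguments are percentages (0-100) of surveyed systems.
 * Assumption: the "<~5%" and "<~20%" limits from the list are applied as
 * exact cutoffs, and the low-end checks take priority over the high-end ones. */
StatRange classify_stat(double pct_equal_or_worse, double pct_equal_or_better)
{
    assert(pct_equal_or_worse >= 0.0 && pct_equal_or_worse <= 100.0);
    assert(pct_equal_or_better >= 0.0 && pct_equal_or_better <= 100.0);

    if (pct_equal_or_worse < 5.0)   return STAT_DEAD;
    if (pct_equal_or_worse < 20.0)  return STAT_DYING;
    if (pct_equal_or_better < 5.0)  return STAT_EDGE;
    if (pct_equal_or_better < 20.0) return STAT_RISING;
    return STAT_NORMAL;
}
```

For example, a GPU class that only 0.22% of systems match or beat falls in the "edge" range.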
General Stats for Gaming PCs
| Stat | Dead | Dying | Normal | Rising | Edge |
| Internet speed | ISDN | 256K | 768K-2M | 10M | |
| System RAM | <256M | | 256M-<1G | 1G | 2G |
| CPU count | | | 1 | 2 | >2 |
| CPU Hz, Intel | <1.5G | <2.0G | 2.0-3.3G | 3.3G | 3.7G |
| CPU Hz, AMD | <1.3G | <1.7G | 1.7-2.2G | 2.2G | 2.7G |
| Free HD | <1G | <10G | 10-120G | 120G | 210G |
| Total HD | <30G | <70G | 70-230G | 240G | |
| Optical drive | CD | | DVD | | |
| GPU class | <DX7 | DX8.1 | DX9-9.0C | | DX10 |
| GPU VRAM | 32M | 96M | 128-256M | 512M | >512M |
| GPU bus | | AGP4x | AGP8x/PCIe | | |
| GPU count | | | 1 | | 2 |
| Color depth | 16 | | 32 | | |
| Display aspect | | | 4:3 | 16:9 | Multi |
| Display refresh | | | 60-75 | 85 | 100 |
| Horizontal res | 800 | | 1024-1280 | 1440 | 1680 |
- In Valve Source engine, some video cards are handled out of class:
- Intel 915G is treated as DX7
- nVidia FX is treated as DX8
- Some cards that used to be tracked are now in 'Other':
- All pre-DX7 cards
- ATI DX7 before 7000
- nVidia DX7 before GeForce 2
- Intel 810 (T&L, no stencil)
- S3 ProSavageDDR (DX7?)
- SiS 661FX-M760 (DX7)
- SiS 650/651/M650/740 (DX7)
Video Card Classes for Gaming PCs
| Class | SM | PS | VS | GS | ATI Models | nVidia Models | % |
| Other | | | | | | | 4.79 |
| DX7 | - | | | | LE/SDR/DDR,7000-7500 | 256,2,4MX,4Go | 8-10 |
| DX8 | 1 | 1.3 | 1.1 | | | 3,4Ti=4200-4800 | 2.36 |
| DX8.1 | 1 | 1.4 | 1.1 | | 8500-9250 | | 5.80 |
| DX9 | 2 | 2.0 | 2.0 | | 9500-X600 | | 19.71 |
| DX9 | 2 | 2.0+ | 2.0+ | | | FX=5200-5950 | 11.76 |
| DX9 | 2 | 2.0b | 2.0 | | X700-X850 | | 7.46 |
| DX9.0C | 3 | 3.0 | 3.0 | | X1300-X1950 | 6100-7950 | 39.39 |
| DX10 | 4 | 4 | 4 | 4 | | 8800 | .22 |
| Intel 845: DX7, GL1.2 | | | | | | | .81 |
| Intel 852/855: DX7, GL1.3 | | | | | | | .28 |
| Intel 915G: DX9, GL1.4 | | | | | | | .65 |
- Windows Vista requirements:
Windows Vista Hardware Requirements
| Stat | "Capable" | "Premium Ready" |
| CPU MHz | 800 | 1000 |
| RAM MB | 512 | 1024 |
| GPU class | DX9 | PS 2.0 |
| GPU WDDM | | Yes |
| GPU VRAM MB | | 128 |
| Horizontal res | 800 | 800 |
| Color depth | 16 | 32 |
| Total HD GB | 20 | 40 |
| Free HD GB | 15 | 15 |
| Optical drive | CD-ROM | DVD-ROM |
| Audio out | | Yes |
| Internet | | Yes |
- For comparison, Windows XP required a 300 MHz CPU and 128 MB RAM
- Typical layouts:
- usually 2 or more register files, split by data type
- often 2 L1 caches, one each instructions and data
- B=bytes, C=Cycles
- (64)=most common in range for current designs
Memory Hierarchy Layouts (per core or chip)
| | REG | L1 | L2 | L3 | Main | HD |
| Hit lat., C | 1 | 2-4 | 8-30 | 20-40 | 50-300 | huge |
| Read BW, B/C | max | 8 | 2-4 | ? | 1-3 | 1/50 |
| B/line | 4-32 | 4-(64) | 4-128(64) | ? | 4-32(16) | 512 |
| Total size | <=1K | 8-128K(16-64K) | 64K-8M(1-4M) | (0)-256M | 256M-16G(1G) | 40+G |
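One way to read the latency row of this table is as input to an average-access-time estimate. The sketch below is an assumption-laden illustration: the hit rates are hypothetical (the table does not give any), the function name `average_access_cycles` is invented here, and the model simply charges each miss the full cost of the next level down.

```c
#include <assert.h>
#include <stddef.h>

/* Average cycles per access for a chain of memory levels, ordered from the
 * fastest (index 0) to the slowest (index levels - 1).
 * hit_latency[i] is the hit latency of level i in cycles.
 * hit_rate[i] is the fraction of accesses reaching level i that hit there.
 * Assumption: the last level always hits, so its hit rate is ignored. */
double average_access_cycles(const double *hit_latency,
                             const double *hit_rate,
                             size_t levels)
{
    assert(levels > 0);

    /* Start from the slowest level and fold upward. */
    double cycles = hit_latency[levels - 1];
    for (size_t i = levels - 1; i-- > 0; ) {
        assert(hit_rate[i] >= 0.0 && hit_rate[i] <= 1.0);
        cycles = hit_latency[i] + (1.0 - hit_rate[i]) * cycles;
    }
    return cycles;
}
```

With latencies of 3, 15, and 200 cycles for L1, L2, and main memory (all inside the ranges in the table) and assumed hit rates of 90% and 80% for L1 and L2, the estimate is 3 + 0.1 * (15 + 0.2 * 200) = 8.5 cycles per access.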
- Expected per-chip trend for CPUs:
- Now: 1-16 cores, each supporting 1-8 threads
- Mid: 4-256 cores, each supporting 1-8 threads, possibly asymmetric
- Long: dozens to thousands of possibly asymmetric cores
- Various platforms for 2005-2006 (cores per chip * threads per core):
- AMD64: 2 * 1
- Pentium D/XE: 2 * 1, 2 * 2
- Core: 1 * 1, 2 * 1
- Core 2: 1 * 1, 2 * 1, 4 * 1
- UltraSPARC: 1 * 2, 2 * 1
- SPARC Niagara: 8 * 4
- SPARC Rock: 16 * 4?
- POWER 5: 2 * 2
- PowerPC: 2 * 1?
- Xenon (PPC, XBox 360): 3 * 2
- Cell (PPC, PS3): 1 * 2 + 7 * 1 (8 SPE - 1 expected bad)
- Cell and Xenon proving to be 1/3 to 1/10 speed per clock for AI code
- Maximum linear resolution:
- Handhelds: 120x120 to 800x480
- Mini laptops: 640x480 to 1200x900
- Full laptops: 1024x768 to 1920x1200
- Desktops: 1024x768 to 2560x1600
- HD Consoles: 852x480, 1280x720, 1920x1080
- Workstations: 1600x1200 to huge multi-screen
- Maximum color resolution:
- Mobile devices: 15, 16, or 24 bits
- Oldest consumer 3D accelerators: 15 or 16 bits (5-5-5 or 5-6-5)
- Medium old consumer devices: 24 bits (8-8-8)
- Newer consumer devices have floating point color and destination alpha: 48, 64, 96, or 128 bits (16-16-16, 16-16-16-16, 32-32-32, 32-32-32-32)
- Code 32-/64-bit and big-/little-endian clean
- Try to keep algorithms within as small a minimum register footprint as possible
- Separate computation kernels from surrounding code, and allow them to be easily converted to hand-coded versions for various ISAs
- Take advantage of SIMD registers and APIs that use them
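As an illustration of the "32-/64-bit and big-/little-endian clean" point, the sketch below reads and writes a 32-bit value one byte at a time using fixed-width types, so the result does not depend on the host's byte order or on the size of `int` or `long`. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value stored least-significant byte first. */
uint32_t read_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit value stored most-significant byte first. */
uint32_t read_u32_be(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}

/* Write a 32-bit value least-significant byte first. */
void write_u32_le(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}
```

Because these helpers never cast a byte pointer to a wider type, they also avoid alignment problems on ISAs that fault on unaligned loads.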
- Don't treat memory as equal, flat, and fast
- Design so that app's memory usage mirrors system's hierarchy:
- Small set of register variables
- Data structures and algorithms that are cache-line friendly
- Hierarchy of tightly-packed/frequently-used data through loosely-packed/rarely-used data
- Expect both poor bandwidth and poor latency for data at lower levels in hierarchy, and design to hide this
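One common way to get the tightly-packed/loosely-packed split described above is to separate "hot" fields from "cold" ones. The sketch below assumes a hypothetical game entity; all type and field names are invented for illustration, and the sizes in the comments assume 4-byte `float` and 64-byte cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hot data: touched every frame. 16 bytes on typical ABIs, so four entries
 * share one 64-byte cache line. */
typedef struct {
    float    x, y, z;
    uint32_t flags;          /* bit 0 set = entity is active */
} EntityHot;

/* Cold data: touched rarely (debugging, save games), kept out of the hot
 * array so it does not dilute the cache lines the inner loop reads. */
typedef struct {
    char     name[48];
    uint64_t creation_time;
    uint64_t owner_id;
} EntityCold;

typedef struct {
    EntityHot  *hot;
    EntityCold *cold;        /* cold[i] describes the same entity as hot[i] */
    size_t      count;
} EntitySet;

/* Walks only the packed hot array; the cold array is never touched here. */
size_t count_active(const EntitySet *set)
{
    assert(set != NULL);
    size_t active = 0;
    for (size_t i = 0; i < set->count; ++i) {
        active += (size_t)(set->hot[i].flags & 1u);
    }
    return active;
}
```

If hot and cold fields lived in one 80-byte structure, the same loop would use only 4 bytes of `flags` out of each element it drags through the cache.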
- Aggressively pursue multiprocessing and multithreading
- If near-free optimisations let "nearby" threads get locality speedups, use them
- To maintain efficiency for older devices, handhelds, and multithreaded cores (which are fairly inefficient), make sure multiprocessing has as little overhead as possible when used with fewer cores
- Beware of lock contention and data motion bottlenecks
- Make use of lock-free algorithms where possible
- Convert many types of "optimized brute force" algorithms that try to branch around a few percent of work per element to true unbranched brute force, which is friendlier to in-order non-superscalar small cores
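As a small illustration of the last point, the sketch below shows a loop that branches around a few elements and its unbranched counterpart. The function names are hypothetical, and whether the unbranched form is actually faster depends on the core and the compiler, so treat it as a starting point for measurement rather than a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>

/* "Optimized brute force": skips the multiply-add when the weight is zero.
 * The data-dependent branch can stall an in-order core. */
float weighted_sum_branchy(const float *value, const float *weight, size_t n)
{
    assert(value != NULL && weight != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (weight[i] != 0.0f) {
            sum += value[i] * weight[i];
        }
    }
    return sum;
}

/* True brute force: always does the multiply-add. Zero weights contribute
 * zero, so the result matches for finite inputs (an infinite or NaN value
 * paired with a zero weight would differ). */
float weighted_sum_unbranched(const float *value, const float *weight, size_t n)
{
    assert(value != NULL && weight != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += value[i] * weight[i];
    }
    return sum;
}
```

The unbranched loop is also the shape that SIMD code and vectorizing compilers handle most easily.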
- Allow, and be efficient with, a wide range of linear resolutions
- Handle different aspect ratios cleanly
- Handle multiple monitors cleanly
- Create LOD and multi-aspect assets
- Support manual and automatic LOD and aspect ratio alteration
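Handling aspect ratios cleanly often comes down to fitting content of one shape into a display of another without stretching. The sketch below is one minimal way to do that with letterbox or pillarbox bars; the `Viewport` type and `fit_viewport` function are assumptions for illustration, not part of any particular graphics API.

```c
#include <assert.h>

typedef struct {
    int x, y;            /* top-left corner inside the display */
    int width, height;   /* size of the area the content is drawn into */
} Viewport;

/* Largest centered rectangle with the content's aspect ratio that fits in
 * the display. Ratios are compared by cross-multiplying to stay in integers. */
Viewport fit_viewport(int content_w, int content_h, int display_w, int display_h)
{
    assert(content_w > 0 && content_h > 0);
    assert(display_w > 0 && display_h > 0);

    Viewport vp;
    if ((long long)display_w * content_h > (long long)display_h * content_w) {
        /* Display is wider than the content: use full height, bars at the sides. */
        vp.height = display_h;
        vp.width  = (int)((long long)display_h * content_w / content_h);
    } else {
        /* Display is narrower or equal: use full width, bars at top and bottom. */
        vp.width  = display_w;
        vp.height = (int)((long long)display_w * content_h / content_w);
    }
    vp.x = (display_w - vp.width) / 2;
    vp.y = (display_h - vp.height) / 2;
    return vp;
}
```

For example, 4:3 content authored at 1024x768 shown on a 1920x1080 display lands in a 1440x1080 area with 240-pixel bars on each side.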
- Assume 24 bit color standard
- Degrade well to 15/16
- Use floating point color if available
- Use destination alpha if available
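Degrading from 24-bit color to 15/16 bits can be as simple as dropping the low bits of each channel. The sketch below packs 8-8-8 color into the 5-6-5 and 5-5-5 layouts mentioned earlier; the function names are hypothetical, and plain truncation is used here as the simplest option (dithering would reduce banding but is not shown).

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit channels into 16-bit 5-6-5 (red in the top bits).
 * Channels are truncated, not rounded or dithered. */
uint16_t pack_rgb565(unsigned r, unsigned g, unsigned b)
{
    assert(r <= 255u && g <= 255u && b <= 255u);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Pack 8-bit channels into 15-bit 5-5-5 (top bit of the 16-bit word unused). */
uint16_t pack_rgb555(unsigned r, unsigned g, unsigned b)
{
    assert(r <= 255u && g <= 255u && b <= 255u);
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
```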
- Use high-level APIs pervasively
- Throw as much work to the coprocessor as possible
- Avoid readback, or hide latency as much as possible
- Some readback algorithms (especially optimization tricks) are still effective if previous frame's data is used instead of current frame
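The previous-frame trick can be sketched without any graphics API: results are stored per frame as they arrive, and the CPU only ever asks for the result of the frame before the one it is building, falling back to a safe default if that result is missing. Everything here (`DelayedResult`, the slot count, the idea that the value is a visible-pixel count from some GPU query) is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESULT_SLOTS 2

typedef struct {
    uint32_t visible_pixels[RESULT_SLOTS];
    uint64_t frame_id[RESULT_SLOTS];   /* which frame each slot belongs to */
    bool     valid[RESULT_SLOTS];
} DelayedResult;

/* Called when the (hypothetical) GPU query issued during `frame` completes. */
void delayed_result_store(DelayedResult *r, uint64_t frame, uint32_t pixels)
{
    assert(r != NULL);
    size_t slot = (size_t)(frame % RESULT_SLOTS);
    r->visible_pixels[slot] = pixels;
    r->frame_id[slot] = frame;
    r->valid[slot] = true;
}

/* Called while building `frame`. Returns the result from frame - 1 without
 * waiting on the GPU, or `fallback` if that result never arrived. */
uint32_t delayed_result_fetch(const DelayedResult *r, uint64_t frame,
                              uint32_t fallback)
{
    assert(r != NULL);
    if (frame == 0) {
        return fallback;
    }
    size_t slot = (size_t)((frame - 1) % RESULT_SLOTS);
    if (r->valid[slot] && r->frame_id[slot] == frame - 1) {
        return r->visible_pixels[slot];
    }
    return fallback;
}
```

A conservative fallback (for instance "assume visible") keeps the optimization from causing visible errors when a result is late.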
- Use most portable API that will do the job
- Support full range of speaker counts and arrangements
- Support most free codecs
- Handle full sound effect processing
- Create deep effect asset library
- Data sets increasing rapidly, both in count of objects, and in detail per object
- This applies pretty much across the board, from texture resolution to actors visible on screen, to linear size of levels, and so on
- Increasing trend toward level-less games that load and unload needed data on the fly
- Beginning of trend to fight increased data size with greater percentage of procedurally-generated or procedurally-amplified content
- Conversely, beginning of trend to support "unique texturing", using virtual texturing to allow every pixel to have completely unique texture across entire level or world
- In games where story makes sense, more companies are using published authors -- some even keep them permanently on staff, as conversion from pure linear to game non-linear takes experience not to be wasted
- Unreal 3 engine uses base models of 8-14 million triangles, which are then reduced to 100K(?)
- XBox 360 titles fill DVD9 (8.5 GB) disks
- PS3 launch titles nearly fill single layer Blu-Ray disks (25 GB)
- Future PS3 titles are expected to fill dual layer Blu-Ray (50 GB)
- Many such large titles need to cache several GB to HD in order to reduce monstrous optical disk load times
- Bioware's RPG story formula:
- Intro -- <1% of game, like Half Life opening movie; do little or nothing, just soak in the world/mood
- Prelude -- ~5%, tell player who they are, introduce play mechanics and character motivation
- The Linear Start -- ~10%, ease player into game, fill in more game mechanics and features, give clear short term goal
- The Wide Open World -- 70%, clear but non-immediate goals, search and explore for answers
- The Linear Finale -- 15%, main goal completed, start "drop of doom" to the end game
- Bioware's RPG character types:
- Information giver
- Quest giver
- Storekeeper
- Ambient character (just hanging out, fun to talk to)
- One-liner (says something funny)
- Plot advancer (tells you where to look when stuck)
- Villain's henchmen
- Comedian
- Villain
- Ignoramus (just annoys the player, but it also pulls them in)