The following use cases help define how the main engine must be modularized, and some of the functionality that it will need to provide.
Comic book format, sketch quality only. Artists need to be able to add visual elements not part of the scene itself, which describe the visual language of film. Need to match storyboard frames with lines in the script, and see overviews of entire scenes. Desirable to ease conversion to Animatic. Uses a small portion of the budget of a production, but may take a fair percentage of the time if many iterations are required before main production begins.
Storyboard with visual language elements converted to actual animation, and synchronized to preliminary soundtrack. Various simple 2D and 3D animation techniques must be easy to specify and synchronize. In 2D, often uses storyboard frames or cutouts from frames; in 3D, uses extremely simplified standins for environment, objects, and actors.
Final output form is digital film, at medium to very high resolution with a fixed (and low) frame rate, for later display on a huge screen. Utter realism is expected. Extremely complex environments are commonplace. Artist-guided physical simulation is necessary to handle complex materials and interactions without astronomical artistic workload. Repeatability must be guaranteed. Offline rendering is acceptable (but no more than "overnight"), and significant pre-render optimization work may pay off. Artists desire interactive rendering as close to final as possible, to improve the artistic feedback loop. Content is known in advance, and heavy tweaking to fit artistic and directorial vision is expected. Budgets range from ten million to several hundred million dollars.
The final output form will be digital video, at low to medium resolution with a fixed (medium) frame rate, for later display in a home theater environment. Requirements are generally similar to, but simpler than, those of Render to Film. Overnight rendering is often considered too slow, except possibly for final render. Budgets are generally considerably smaller than film budgets, perhaps 1% as large.
Special effects and graphical elements are added on the fly to a slightly delayed broadcast, with resolution and frame rate as per Render to Video. Often used for sports and news reporting. Rendering must work in real time. Artists require tools that allow them to select and customize chosen elements in just a few seconds. Optimization based on properties of elements in toolbox is encouraged, but on-the-fly optimization is limited by allowable turnaround time and realtime rendering requirements. Budgets likely even smaller than standard video rendering.
Similar to Render to Video, but scripted and rendered entirely within a Computer Game engine. Currently limited to independent films, with attendant extreme budget limits.
Many rendering styles, including cel-rendered, sketched, anime-influenced, hyper-realistic, old film, and so on. Pre-rendered intros and cut scenes can be considered a special case of Render to Video, in which video file size is at a premium and the video is likely to be scaled at odd ratios to fit the player's chosen resolution. Game engine intros and cut scenes can instead be considered Machinima. In either case, these precreated scenes are limited to using only a fraction of the budget of the computer game as a whole, but have the advantage of being amenable to considerable optimization, as well as using much heavier resources to create than can be applied on the user's system in real time.
Resolution and frame rate independence are expected, as well as the ability to gracefully degrade and improve quality along many axes to fit the desires of the user and the available system resources. Different genres tend to have vastly different optimization profiles. Strategy games require the highest available resolution while tolerating low frame rates, preferring instead to spend system resources on AI, pathfinding, and so on. Conversely, shooter games depend on very high frame rates to improve responsiveness, and often spend more non-rendering resources on physics simulation than on AI.
Budgets were long relatively small, and still sometimes are in true independent shops, but are now ballooning past Render to Video into Render to Film territory.
Unlike all previous use cases -- except, in a limited way, Stream for Broadcast -- computer games must deal with interactivity. As a consequence, there is a constantly varying line between creator- and user-determined activity. In some games, the entire setting may be changed by the user, prohibiting certain valuable techniques that depend on knowing the content in advance.
It is valuable to move as much control of the virtual world as possible outside the renderer, preferably into one or more dedicated modules. Aside from all of the usual benefits of a modular architecture, this makes it much easier to maintain a consistent shared reality in multiuser systems. The stronger this modularization, the more exact the consistency and "fairness" of the system.
Ideally, the rendering engine would be purely a viewer client for a "world server", which controls every aspect of the virtual world. Unfortunately, the bandwidth, memory, and processing power necessary to make this pure design workable are currently out of reach of the average consumer (and according to various improvement curves, unlikely to ever reach the consumer). While this design can be used for high budget and completely non-interactive purposes, such as final film rendering and scientific simulation, a different approach must be used for interactive situations.
One possible method for mitigating the downsides of the pure architecture is to allow the server to speak to clients in a sort of shorthand; the server merely supplies parameters that clients can extrapolate to determine the current world state. This can be seen as an extremely specialized compression method, compressing all of the information that the server knows and client needs into a vastly shorter format that the client then "decompresses" for use by the renderer.
Nearly all multiuser games do this, to varying degrees. For example, first person shooter games usually operate within a few premade settings, and the server can merely tell the client at the beginning of a match the name (or even just number!) of the current setting; the client is then expected to load all relevant setting data from this single detail. During play, the server merely relates position, orientation, velocity, and attributes for each moving object, along with miscellaneous other bits of shared state, for each simulation step.
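As a minimal sketch of this kind of shorthand, the following assumes the server sends only a position, a velocity, and a timestamp per object, and that the client fills in intermediate positions by linear extrapolation. The `ObjectUpdate` type and `extrapolate` function are hypothetical names chosen for illustration; orientation and other attributes are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectUpdate:
    """One entry in a hypothetical per-step server message."""
    object_id: int
    position: tuple[float, float, float]
    velocity: tuple[float, float, float]
    timestamp: float  # server time of this update, in seconds


def extrapolate(update: ObjectUpdate, now: float) -> tuple[float, float, float]:
    """Estimate the object's position at time `now`.

    Assumes constant velocity since the last update (simple linear
    extrapolation); a real client would likely also correct for error
    when the next update arrives.
    """
    dt = now - update.timestamp
    return tuple(p + v * dt for p, v in zip(update.position, update.velocity))


# The client received this update at server time 10.0 and needs a
# position to render half a second later.
update = ObjectUpdate(object_id=7, position=(0.0, 1.0, 0.0),
                      velocity=(2.0, 0.0, -1.0), timestamp=10.0)
estimated = extrapolate(update, now=10.5)
print(estimated)
```

The server's message stays small because the client reconstructs everything between updates from a handful of parameters, which is the "decompression" step described above.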
Unfortunately, to further hide bandwidth and latency issues that would severely affect interactivity, the server must provide more information about the world state than the client should normally know, such as the exact state of currently hidden enemies. Cheaters use this information to give themselves uncanny perception and targeting accuracy.
Another issue is the question of what state must be shared at all: is it actually important that two clients see mist swirl in exactly the same way at the same moment? (The answer to this may depend on whether the swirling mist interacts with the rest of the world; if the mist is mere decoration, consistency is probably not necessary. On the other hand, if sudden swirling in previously calm mist is a warning of an approaching enemy, consistency would be much more important. This is even more true if the swirling actually reflects the position and movements of the enemy.)
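Where consistency of such an effect does matter, one inexpensive possibility is for the server to share only a seed and let every client generate the same values locally. The sketch below is an assumption for illustration, not a method described in this document: `mist_offsets` is a hypothetical function, and it relies on Python's seeded `random.Random` producing the same sequence on each client, which in practice would also require the clients to run compatible implementations.

```python
import random


def mist_offsets(seed: int, step: int, count: int = 4) -> list[float]:
    """Return pseudo-random swirl offsets for one simulation step.

    The generator is seeded from the shared seed and the step number,
    so any client can compute the offsets for any step independently.
    """
    rng = random.Random(seed * 1_000_003 + step)
    return [rng.uniform(-1.0, 1.0) for _ in range(count)]


# Two clients that were given the same seed agree on the mist at step 120
# without the server sending any per-step mist data.
client_a = mist_offsets(seed=42, step=120)
client_b = mist_offsets(seed=42, step=120)
print(client_a == client_b)
```

If the mist is mere decoration, each client could instead pick its own seed and the server would not need to send anything at all.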
The following are some thoughts on various attributes of a hybrid client-server architecture intended to be most things to most people.
Several methods for transmitting static content and details of the game setting might make sense:
When a client is not connected to the server for the entire lifetime of the simulation, a newly connected client must be updated with deltas against the static content (or against the most recent shared state) to get back into sync with the server before play can begin. Depending on the importance of the data, and whether it can materially affect gameplay, these deltas may be batch downloaded before the connection is allowed to complete, or streamed to the client during the first few minutes of play.
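The delta idea can be sketched as follows, under the assumption that world state is representable as a flat mapping from keys to values. The `make_delta` and `apply_delta` functions and the example keys are hypothetical; a real system would likely need nested state, versioning, and a compact wire format.

```python
def make_delta(baseline: dict, current: dict) -> dict:
    """Describe how `current` differs from `baseline` (server side)."""
    changed = {key: value for key, value in current.items()
               if key not in baseline or baseline[key] != value}
    removed = [key for key in baseline if key not in current]
    return {"changed": changed, "removed": removed}


def apply_delta(baseline: dict, delta: dict) -> dict:
    """Rebuild the current state from a baseline and a delta (client side)."""
    state = dict(baseline)
    state.update(delta["changed"])
    for key in delta["removed"]:
        state.pop(key, None)
    return state


# Static content that the client already has on disk.
static_content = {"door_17": "closed", "bridge_2": "intact", "crate_9": "present"}
# What the world looks like on the server when the client connects.
server_state = {"door_17": "open", "bridge_2": "intact", "flag_1": "captured"}

delta = make_delta(static_content, server_state)
client_state = apply_delta(static_content, delta)
print(delta)
```

Only the entries that differ from the baseline cross the network; unchanged entries such as `bridge_2` are never sent.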
Dynamic data can be broken into several different classes:
It's very useful if extrapolated and procedurally generated data has certain properties: