Asynchronous compute is a hardware performance feature of the Polaris architecture that games and applications can use through DirectX® 12 and Vulkan™, including in virtual reality. Asynchronous compute extracts “free” performance from Radeon™ RX 400 Series GPUs by breaking up complex game workloads into smaller tasks that can run in parallel.
How Asynchronous Compute Works
Important in-game effects like shadowing, lighting, artificial intelligence, physics and lens effects often require multiple stages of computation before determining what is rendered onto the screen by a GPU’s graphics hardware.
In the past, these steps had to happen sequentially. Step by step, the graphics card would follow the API’s process of rendering something from start to finish, and any delay in an early stage would send a ripple of delays through future stages. These delays in the pipeline are called “bubbles,” and they represent a brief moment in time when some hardware in the GPU is paused to wait for instructions.
Pipeline bubbles happen all the time on every graphics card. No game can perfectly utilize all the performance or hardware a GPU has to offer, and no game can consistently avoid creating bubbles when the user abruptly decides to do something different in the game world.
What sets the Radeon™ RX 400 Series apart from its competitors, however, is the Polaris architecture’s ability to pull in useful compute work from the game engine to fill these bubbles. For example, if a bubble appears while the GPU is rendering complex lighting, a Radeon™ RX 400 Series GPU can fill the gap by computing the behavior of AI instead. Radeon™ RX 400 Series graphics cards don’t need to follow the step-by-step process of the past, or of competing cards, and can run graphics and compute work concurrently to keep things moving.
Filling these bubbles improves GPU utilization, input latency, efficiency and performance for the user by minimizing or eliminating the ripple of delays that could stall other graphics cards. Only Radeon™ graphics with the Graphics Core Next and Polaris architectures currently support this crucial capability in DirectX® 12, Vulkan™ and VR, and the performance gains can be significant.
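To make the idea of filling bubbles concrete, here is a minimal Python sketch of a toy model. It is not how the GPU schedules work internally: the timeline string, the slot counts and the function names are all hypothetical, and the model assumes any idle slot can accept any pending compute work.

```python
def serial_slots(graphics_timeline: str, compute_slots: int) -> int:
    """Slots needed when compute work only starts after graphics finishes."""
    return len(graphics_timeline) + compute_slots


def async_slots(graphics_timeline: str, compute_slots: int) -> int:
    """Slots needed when idle slots ('.') are filled with compute work first."""
    bubbles = graphics_timeline.count(".")
    leftover = max(0, compute_slots - bubbles)  # compute work that found no bubble
    return len(graphics_timeline) + leftover


def utilization(graphics_timeline: str, compute_slots: int, total_slots: int) -> float:
    """Fraction of slots in which the hardware did useful work."""
    busy = graphics_timeline.count("G") + compute_slots
    return busy / total_slots


# Hypothetical frame: 'G' = graphics hardware busy, '.' = pipeline bubble.
timeline = "GG.GG..GGG.G"
compute_slots = 6  # hypothetical amount of independent compute work (e.g. AI)

serial = serial_slots(timeline, compute_slots)
overlapped = async_slots(timeline, compute_slots)
print("serial:", serial, "utilization:", round(utilization(timeline, compute_slots, serial), 2))
print("async: ", overlapped, "utilization:", round(utilization(timeline, compute_slots, overlapped), 2))
```

With these made-up numbers, four of the six compute slots land in bubbles, so the frame shrinks from 18 slots to 14 and no slot is left idle. Real workloads will rarely pack this neatly.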
Evolving Asynchronous Compute with the Quick Response Queue
Today’s graphics engines provide many opportunities to take advantage of asynchronous compute, but some tasks can still struggle to reach peak benefit because they don’t explicitly know when the graphics card has started executing new work. In these cases, we can resolve the problem by guaranteeing to the game engine that a task will start and end in a certain amount of time.
In order to meet this requirement, these time-critical tasks must be given higher priority access to processing resources than other tasks. One way to accomplish this is through the use of preemption, which works by temporarily suspending all other GPU tasks until the high-priority task can be completed. However, preemption often causes costly time delays as the old tasks are wound down before the new one is started; this can potentially manifest as undesirable stuttering in games.
Instead, the Polaris architecture supports another method for handling time-critical tasks called the Quick Response Queue (QRQ). Tasks submitted to this queue get preferential access to GPU resources while still running asynchronously, so they can overlap with other workloads. The game developer can even control how, when, and how much of the GPU is being used by a QRQ task through a hardware component of the Polaris architecture called the Hardware Scheduler (HWS).
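The difference between the two approaches can be sketched with a toy tick-based model in Python. This illustrates the scheduling idea only, not the Hardware Scheduler's actual algorithm: the unit counts, work sizes, `urgent_cap` and `wind_down_ticks` are hypothetical parameters chosen for the example.

```python
import math


def quick_response(units, normal_items, urgent_items, arrival, urgent_cap):
    """Toy dispatcher: once the urgent task has arrived it gets first pick of
    up to `urgent_cap` units per tick; normal work keeps the remaining units
    and is never stopped. Each work item occupies one unit for one tick.
    Returns (tick when urgent work is done, tick when all work is done)."""
    if units < 1 or urgent_cap < 1:
        raise ValueError("units and urgent_cap must be at least 1")
    tick = 0
    urgent_done = None
    while normal_items > 0 or urgent_items > 0:
        free = units
        if tick >= arrival and urgent_items > 0:
            take = min(urgent_items, urgent_cap, free)
            urgent_items -= take
            free -= take
            if urgent_items == 0:
                urgent_done = tick + 1
        normal_items -= min(normal_items, free)
        tick += 1
    return urgent_done, tick


def preempt(units, normal_items, urgent_items, arrival, wind_down_ticks):
    """Toy preemption: at `arrival` normal work is suspended, `wind_down_ticks`
    are spent draining it, the urgent task runs alone, then normal work resumes.
    Assumes normal work is still running when the urgent task arrives."""
    remaining = normal_items - min(normal_items, arrival * units)
    urgent_done = arrival + wind_down_ticks + math.ceil(urgent_items / units)
    all_done = urgent_done + math.ceil(remaining / units)
    return urgent_done, all_done


# Hypothetical workload: 4 units, 40 normal items, 6 urgent items arriving at tick 2.
qrq = quick_response(units=4, normal_items=40, urgent_items=6, arrival=2, urgent_cap=2)
pre = preempt(units=4, normal_items=40, urgent_items=6, arrival=2, wind_down_ticks=1)
print("quick response:", qrq)
print("preemption:    ", pre)
```

In this example the urgent task finishes at the same tick either way, but the preemption run ends one tick later overall because the wind-down tick does no useful work. The `urgent_cap` parameter stands in for the developer's control over how much of the GPU a QRQ task may use.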
Asynchronous Compute for VR with Asynchronous Time Warp
Virtual reality rendering provides a great use case for the Quick Response Queue. For example, the production release of the Oculus Rift VR headset implements a technique known as Asynchronous Time Warp (ATW) to reduce latency and prevent image judder caused by dropped frames.
In VR, dropped frames—perceived by the user as stuttering—can occur when a frame takes too long to render and misses a refresh of the head-mounted display. The effect is jarring and destroys the sense of presence that is essential to VR. While there are a variety of ways to address this problem (including application tuning, reducing image quality, or upgrading to a more powerful graphics card), these can be costly in time or money to developers and users. Instead, Oculus’ ATW solution is designed to be automatic and transparent to users as well as to developers.
ATW works by adjusting the size, shape and perspective of the last frame that has finished rendering, correcting for any head movement that takes place after that rendering work was initiated. This warping operation is executed on the GPU using compute, and can be scheduled concurrently with other work on the VR Ready Premium Radeon™ RX 480 GPU. Scheduling this operation every frame ensures that there is always an updated image available to display, even if it is only a warped version of a previously displayed frame.
Execution of the ATW task must be timed carefully in order to be useful. Ideally, the warp should happen as late as possible in the life of one frame, allowing just enough time for the warp to complete before the next display update. If the warp happens too early, then additional head movement can occur before the display refresh, causing a noticeable lag. If it happens too late, then the warp may miss the refresh and allow visible juddering to occur.
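The timing trade-off can be worked through with a few lines of Python. This is a sketch under stated assumptions, not Oculus' implementation: the 90 Hz refresh rate, the 1.0 ms warp duration and the 0.5 ms safety margin are illustrative values not given in the text, and the model assumes the head pose is sampled at the moment the warp starts.

```python
def frame_period_ms(refresh_hz: float) -> float:
    """Time between two display refreshes, in milliseconds."""
    return 1000.0 / refresh_hz


def latest_warp_start_ms(next_vsync_ms: float, warp_ms: float, margin_ms: float) -> float:
    """Latest start time that still leaves the warp time to finish, plus a margin."""
    return next_vsync_ms - warp_ms - margin_ms


def warp_outcome(start_ms: float, next_vsync_ms: float, warp_ms: float):
    """Return ('miss', None) if the warp finishes after the refresh, otherwise
    ('hit', age), where age is how old the sampled head pose is at the refresh."""
    if start_ms + warp_ms > next_vsync_ms:
        return ("miss", None)
    return ("hit", next_vsync_ms - start_ms)


# Illustrative numbers (assumptions, see lead-in); times are relative to the
# previous refresh, so the next refresh falls one frame period later.
period = frame_period_ms(90.0)
latest = latest_warp_start_ms(period, warp_ms=1.0, margin_ms=0.5)

print("frame period (ms):", round(period, 2))
print("latest safe start (ms):", round(latest, 2))
print("start at 5.0 ms:", warp_outcome(5.0, period, 1.0))    # early: hit, but stale pose
print("start at 9.5 ms:", warp_outcome(9.5, period, 1.0))    # late enough: hit, fresh pose
print("start at 10.5 ms:", warp_outcome(10.5, period, 1.0))  # too late: misses the refresh
```

Starting early always hits the refresh but displays an older head pose, while starting after the latest safe time risks missing the refresh entirely. The margin is a tuning knob: the more reliably the warp completes on time, the smaller it can be.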
This time sensitivity is where the QRQ of the Radeon™ RX 480 GPU comes into play. Putting the ATW shader on the Quick Response Queue gives it priority access to the GPU’s compute units, making it far more likely to complete before the next refresh, even when it is submitted late in a frame’s life. And since it doesn’t need to preempt other graphics tasks already in flight, it allows the GPU to start working on the next frame quickly.
This is just one example of how providing more precise control over when individual tasks execute on GPUs can open the door to entirely new ways of exploiting the massive computational power they offer. We are already experimenting with other latency-sensitive applications that can take advantage of this, such as high fidelity positional audio rendering of virtual environments on the GPU. We’re also looking at providing more scheduling controls for asynchronous compute tasks in the future.