Dev:2.8/Source/Viewport/Optimization

Viewport Optimization

Discussed from the 22nd to the 26th of May 2017 with Clément Foucault, Sergey Sharybin and Dalai Felinto at the Blender Institute.

Introduction

  • The ultimate goal of the 2.8 Viewport project is to get the maximum playback performance for animators, with realistic lighting and reflections.
  • The bare minimum expected is a viewport that is as fast and responsive for animation playback as Blender 2.7x.

Blender 2.7x

Slow parts, equally responsible:

  • Depsgraph
  • Drawing (CPU? GPU?)

Blender 2.8

Slow parts as perceived by users:

  • Resize editors
  • Switch edit modes
  • Switch engines
  • Init time (Intel GPUs)
  • Shading tweaks/animation
  • Skinned mesh playback
  • Hair playback

Slow parts as per implementation areas:

  • Batch creation
  • Shader compilation
  • Probes generation
  • Framebuffer textures re-generation
  • Shadow update tagging
  • Frustum (tiled) light culling
  • Draw manager cache discarding

Batch creation

  • Batch creation can be threaded at the object level.
  • We are over-cleaning the batch cache at the moment. This can be improved with tagging.
  • Investigate looping over the mesh data only once per cache, regardless of whether we need normals, positions, or vertex flags.
  • Investigate whether writing directly into the buffer improves on the current wrapper, which calculates an offset and calls memcpy for each element.
    • The compiler can then perform the copy with a size known at compile time.
    • In many cases this avoids an intermediate copy step.
  • Edit mode creates vertex and loop normals even when they are not used.
  • Share VBOs between batches where possible (already done, but it could be done in more places).
  • The new hair struct will be GPU friendly.
    • The existing overhead of creating batches is only required because VBOs need data blocks laid out differently from how we store mesh data.
    • That is also true for hair; however, the new hair system will take this into account when designing its data struct.
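The two buffer-filling styles under investigation can be sketched as follows. This is only an illustration: SrcVert, GpuVert, set_attrib, fill_with_wrapper and fill_direct are hypothetical names, not Blender’s actual structs or API. The wrapper style computes an offset and calls memcpy with a runtime size per attribute, while the direct style walks the mesh data once and uses copies whose size is known at compile time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU-side vertex, standing in for the mesh data. */
typedef struct SrcVert {
  float co[3];
  float no[3];
  unsigned char flag;
} SrcVert;

/* Hypothetical interleaved GPU layout (assumed for illustration). */
typedef struct GpuVert {
  float pos[3];
  float nor[3];
  unsigned int flag;
} GpuVert;

/* Wrapper style: offset and copy size are only known at run time. */
static void set_attrib(void *buf, size_t stride, size_t offset, size_t index,
                       const void *data, size_t size)
{
  memcpy((char *)buf + index * stride + offset, data, size);
}

void fill_with_wrapper(GpuVert *dst, const SrcVert *src, size_t len)
{
  for (size_t i = 0; i < len; i++) {
    unsigned int flag = src[i].flag;
    set_attrib(dst, sizeof(GpuVert), offsetof(GpuVert, pos), i, src[i].co, sizeof(float[3]));
    set_attrib(dst, sizeof(GpuVert), offsetof(GpuVert, nor), i, src[i].no, sizeof(float[3]));
    set_attrib(dst, sizeof(GpuVert), offsetof(GpuVert, flag), i, &flag, sizeof(flag));
  }
}

/* Direct style: a single pass, with fixed-size copies the compiler can inline. */
void fill_direct(GpuVert *dst, const SrcVert *src, size_t len)
{
  for (size_t i = 0; i < len; i++) {
    memcpy(dst[i].pos, src[i].co, sizeof(dst[i].pos));
    memcpy(dst[i].nor, src[i].no, sizeof(dst[i].nor));
    dst[i].flag = src[i].flag;
  }
}
```

Both functions produce the same buffer contents; whether the direct version is actually faster is exactly what needs to be profiled.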

Things we can’t have at all:

  • No partial VBO update.
  • No hardware “skinning”.

Last resort:

  • Cache copy-on-write objects over time for fast re-playback.
  • This shouldn’t get in the way of other optimizations.
  • That also means the viewport population of draw passes won’t be optimized.

Batch creation refactor

It can be threaded at the object level, so each object is processed in parallel.

It’s still not clear where the bottleneck is: whether it is iterating over the meshes multiple times, unaligned memory access in batch creation routines, ...

So this needs to be profiled before moving on with a better design (and shared meshes are to be considered here as well).

All meshes:

  • mesh->batch_cache_flag = 0;

For each engine, all objects:

  • ((Mesh *)ob->data)->batch_cache_flag |= get_engine_flag(engine, scene_layer, ob);

All meshes [potentially threaded]:

  • Clear the batches and VBOs that are no longer required, and create the missing ones.
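The three steps above can be sketched as a small C program. The names here are assumptions made for illustration: the BATCH_* bits, the batch_cache_built field, the EngineFlagFn callback (a simplified stand-in for get_engine_flag, without the engine and scene_layer arguments) and the two example engine callbacks are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical request bits, one per kind of batch data. */
enum {
  BATCH_POSITIONS = 1 << 0,
  BATCH_NORMALS = 1 << 1,
  BATCH_VERT_FLAGS = 1 << 2,
};

typedef struct Mesh {
  unsigned int batch_cache_flag;  /* What the engines request this frame. */
  unsigned int batch_cache_built; /* What currently exists (assumed field). */
} Mesh;

typedef struct Object {
  Mesh *data;
} Object;

/* Simplified stand-in for get_engine_flag(engine, scene_layer, ob). */
typedef unsigned int (*EngineFlagFn)(const Object *ob);

unsigned int example_solid_engine_flag(const Object *ob)
{
  (void)ob;
  return BATCH_POSITIONS | BATCH_NORMALS;
}

unsigned int example_overlay_engine_flag(const Object *ob)
{
  (void)ob;
  return BATCH_POSITIONS | BATCH_VERT_FLAGS;
}

/* Steps 1 and 2: reset every mesh, then accumulate the requests of all engines. */
void batch_cache_tag(Mesh *meshes, size_t mesh_len, Object *objects, size_t ob_len,
                     const EngineFlagFn *engines, size_t engine_len)
{
  for (size_t m = 0; m < mesh_len; m++) {
    meshes[m].batch_cache_flag = 0;
  }
  for (size_t e = 0; e < engine_len; e++) {
    for (size_t o = 0; o < ob_len; o++) {
      objects[o].data->batch_cache_flag |= engines[e](&objects[o]);
    }
  }
}

/* Step 3 (one call per mesh, so potentially one task per mesh).
 * Returns the bits that had to be created. Actual VBO work is left out. */
unsigned int batch_cache_sync(Mesh *me)
{
  unsigned int to_create = me->batch_cache_flag & ~me->batch_cache_built;
  /* Bits that are built but no longer requested are dropped here. */
  me->batch_cache_built = me->batch_cache_flag;
  return to_create;
}
```

Because the flags live on the mesh, objects sharing a mesh contribute to the same request mask, and batch_cache_sync touches only one mesh at a time.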

Shader compilation

  • Use a separate context to compile builtin and “mode” shaders as a job on Blender init, before drawing anything (not only the viewport, anything at all).
  • Opening a file won’t compile all the possible GPU shaders for all the possible engines.
    • The first time you switch to the Eevee engine, its shaders will be compiled.
    • Use a checkerboard fallback shader for uncompiled materials, with UVs when present.
  • Always have a non-optimized shader (only uniforms, no consts).
  • Compile the optimized shader on “shader changes”, taking into account whether values are animated (animated values are uniforms, others are consts).
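The uniform/const split could look roughly like this when generating the GLSL declarations for a material. ShaderParam and shader_param_declare are hypothetical names, and real parameters would not all be floats; this is a sketch of the selection rule only.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical material parameter (float only, for brevity). */
typedef struct ShaderParam {
  const char *name;
  float value;
  bool animated;
} ShaderParam;

/* Writes one GLSL declaration for `param` into `out`.
 * Non-optimized shaders use uniforms only; optimized shaders bake
 * non-animated values in as consts. Returns the snprintf() result. */
int shader_param_declare(char *out, size_t out_size, const ShaderParam *param, bool optimized)
{
  if (!optimized || param->animated) {
    return snprintf(out, out_size, "uniform float %s;\n", param->name);
  }
  return snprintf(out, out_size, "const float %s = %f;\n", param->name, param->value);
}
```

With this rule, tweaking a non-animated value invalidates only the optimized variant, and the uniform-only variant stays usable while the optimized one recompiles.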

Probes Generation

  • Probes can exclude objects (dynamic objects, manually tagged as such).
  • Probes can be updated from a depsgraph callback, with a dependency on all the objects they capture.
  • Probes shouldn’t update on playback.
  • Planar reflections are not probes.
  • Planar reflections need to be updated every frame.
  • The world probe is to be used as a fallback for un-generated probes.
    • Probe drawing should be non-blocking. There is no threaded drawing in OpenGL, but we can draw one probe (or several) per frame.
    • Newly generated probes should then replace the previously generated probe, or its fallback if it’s the first run.
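A minimal sketch of the per-frame probe budget and the world probe fallback, assuming a hypothetical Probe struct with two flags. The actual rendering is left out; only the scheduling logic is shown.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe state. */
typedef struct Probe {
  bool needs_update; /* Tagged from a depsgraph callback. */
  bool is_ready;     /* Has been generated at least once. */
} Probe;

/* Sentinel meaning "sample the world probe instead". */
#define WORLD_PROBE ((size_t)-1)

/* Renders at most `budget` tagged probes this frame, so drawing never stalls.
 * Returns how many probes were rendered. */
size_t probes_update(Probe *probes, size_t len, size_t budget)
{
  size_t rendered = 0;
  for (size_t i = 0; i < len && rendered < budget; i++) {
    if (probes[i].needs_update) {
      /* The probe render itself would happen here. */
      probes[i].needs_update = false;
      probes[i].is_ready = true;
      rendered++;
    }
  }
  return rendered;
}

/* A probe that was never generated falls back to the world probe.
 * A probe waiting for a re-render keeps using its previous result. */
size_t probe_sample_index(const Probe *probes, size_t i)
{
  return probes[i].is_ready ? i : WORLD_PROBE;
}
```

A budget of one matches the "one probe per frame" idea; a larger budget trades frame time for faster convergence.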

Framebuffer textures regeneration

The main issue now is that Eevee is storing the cubemaps (and the shadows) in the viewport, and has to re-calculate them upon resize.

The probes and shadows storage will move from viewport to SceneLayer. This will make performance much better on editor size changes.

That said, planar reflections still depend on the viewport. Once we have a first implementation, this should be revisited.

If we still need to optimize this, we can do:

  • At resize init, create FBOs as large as the window. When done resizing, re-create the FBOs at viewport size.
  • The GPU_framebuffer API should be able to handle over-allocation (screen space UV coordinates mismatch), similar to non-power-of-two support.
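The UV mismatch caused by over-allocation comes down to a scale factor between the region in use and the allocated size. The sketch below assumes a hypothetical Fbo struct that tracks both sizes; it is not the GPU_framebuffer API.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for an over-allocated framebuffer. */
typedef struct Fbo {
  int alloc_w, alloc_h; /* Size of the textures actually allocated. */
  int used_w, used_h;   /* Size of the region currently drawn to. */
} Fbo;

/* Grows the allocation only when the request does not fit.
 * Returns true when the textures would have to be re-created. */
bool fbo_ensure_size(Fbo *fbo, int w, int h)
{
  bool realloc_needed = (w > fbo->alloc_w) || (h > fbo->alloc_h);
  if (realloc_needed) {
    fbo->alloc_w = (w > fbo->alloc_w) ? w : fbo->alloc_w;
    fbo->alloc_h = (h > fbo->alloc_h) ? h : fbo->alloc_h;
  }
  fbo->used_w = w;
  fbo->used_h = h;
  return realloc_needed;
}

/* Factor to multiply screen space UVs by so they address only the used region. */
void fbo_uv_scale(const Fbo *fbo, float r_scale[2])
{
  r_scale[0] = (float)fbo->used_w / (float)fbo->alloc_w;
  r_scale[1] = (float)fbo->used_h / (float)fbo->alloc_h;
}
```

While the editor is being resized inside a window-sized allocation, fbo_ensure_size never reports a reallocation and only the scale changes.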

Shadow Update Tagging

Rendering shadow maps is costly. We should regenerate them only when needed.

  • When the Depsgraph is evaluating objects, it will flag the objects as updated.
  • Then Eevee will just compare against the previous shadow casters that were inside the shadow-casting lamp’s influence.
    • If there is a change (object added to / removed from the lamp’s shadow casters) or if the lamp has moved, then the shadows for this lamp will have to be updated.
    • Update the lamp’s shadow caster list.

We can also go further and only render lamps whose shadows are inside the view frustum.
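The comparison against the previous caster list can be sketched as below. The Lamp struct, MAX_CASTERS, and the convention that caster ids arrive sorted are assumptions for illustration. Treating a caster flagged as updated by the depsgraph as a reason to refresh is also an assumption read into the notes above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CASTERS 8 /* Arbitrary cap for the sketch. */

/* Hypothetical per-lamp shadow state. */
typedef struct Lamp {
  bool moved;
  int casters[MAX_CASTERS]; /* Sorted object ids from the previous frame. */
  size_t caster_len;
} Lamp;

/* `cur` holds the sorted ids of the objects inside the lamp's influence now,
 * `cur_updated[i]` tells whether the depsgraph flagged object i as updated.
 * Returns true when this lamp's shadow map has to be re-rendered. */
bool lamp_shadow_needs_update(Lamp *lamp, const int *cur, const bool *cur_updated, size_t cur_len)
{
  bool dirty = lamp->moved || (cur_len != lamp->caster_len);
  for (size_t i = 0; i < cur_len && !dirty; i++) {
    dirty = cur_updated[i] || (cur[i] != lamp->casters[i]);
  }
  /* Update the lamp's shadow caster list for the next comparison.
   * Lists longer than MAX_CASTERS are truncated and so always compare dirty. */
  lamp->caster_len = (cur_len < MAX_CASTERS) ? cur_len : MAX_CASTERS;
  memcpy(lamp->casters, cur, lamp->caster_len * sizeof(int));
  lamp->moved = false;
  return dirty;
}
```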

Frustum (Tiled) Light Culling

This technique allows a really high number of lights to be spread through a scene without a severe performance penalty.

The idea is to divide the view into small frustums and store which lights affect each mini frustum. This way we don’t iterate over every lamp and compute its influence for every pixel.

This technique is commonly implemented with compute shaders, but since we can’t rely on them on all hardware, we need a fallback method.

We can still run this calculation at the start of every frame update in a fragment shader and store the result in a texture. There will be a limit to the number of lights per tile, but this limit can be tweaked (increasing the maximum number of lights will decrease performance). This is not a hard limit; it just means that additional lights will not get the optimization.

Bandwidth can be an issue with this approach. Because we are going to load the culling result from a texture, we need to take care that texture fetch latency does not kill performance. If we go for a compute shader approach, we could use an SSBO instead of a texture (only for code quality; performance would be much the same).
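To make the per-tile light list concrete, here is a CPU reference sketch in C. It is deliberately simplified: lights are treated as circles already projected to screen space (a real implementation tests against 3D mini frustums and runs on the GPU), and TILE_SIZE, MAX_LIGHTS_PER_TILE, ScreenLight and Tile are hypothetical names. Lights that do not fit in a tile are only counted; how they are shaded without the optimization is left out.

```c
#define TILE_SIZE 16          /* Pixels per tile side (assumed). */
#define MAX_LIGHTS_PER_TILE 4 /* Tweakable limit per tile. */

/* Hypothetical light bounds, already projected to pixels. */
typedef struct ScreenLight {
  float x, y, radius;
} ScreenLight;

typedef struct Tile {
  int light_len;
  int lights[MAX_LIGHTS_PER_TILE]; /* Indices into the light array. */
  int overflow;                    /* Lights that did not fit in the list. */
} Tile;

static float clampf(float v, float lo, float hi)
{
  return (v < lo) ? lo : ((v > hi) ? hi : v);
}

/* `tiles` is a row-major grid of tiles_x * tiles_y entries. */
void cull_lights(Tile *tiles, int tiles_x, int tiles_y, const ScreenLight *lights, int light_len)
{
  for (int t = 0; t < tiles_x * tiles_y; t++) {
    Tile *tile = &tiles[t];
    float min_x = (float)((t % tiles_x) * TILE_SIZE);
    float min_y = (float)((t / tiles_x) * TILE_SIZE);
    tile->light_len = 0;
    tile->overflow = 0;
    for (int l = 0; l < light_len; l++) {
      /* Circle vs. rectangle test: distance from the light center to the
       * closest point of the tile. */
      float cx = clampf(lights[l].x, min_x, min_x + TILE_SIZE);
      float cy = clampf(lights[l].y, min_y, min_y + TILE_SIZE);
      float dx = lights[l].x - cx;
      float dy = lights[l].y - cy;
      if (dx * dx + dy * dy > lights[l].radius * lights[l].radius) {
        continue;
      }
      if (tile->light_len < MAX_LIGHTS_PER_TILE) {
        tile->lights[tile->light_len++] = l;
      }
      else {
        tile->overflow++;
      }
    }
  }
}
```

In the fragment shader fallback, the Tile array would correspond to the texture that stores the culling result, which is where the fetch latency concern comes from.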

Draw Manager Cache Discarding

At the time of writing (May 2017), the Draw Manager still recreates its cache every single frame. This introduces a cost of a few precious milliseconds per frame that may make the difference between a laggy and a smooth viewport. We need to refresh this cache on any user interaction (selecting an object, changing values...) but not when navigating or playing back animation.

To do this we need to split the calculation of the runtime view data (stored per object) from the cache construction (which references this data). Updating the runtime data should be done when evaluating the object inside the depsgraph. This is mainly an issue for non-mesh objects and bones.
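The intended invalidation rule can be sketched as a dirty flag on the cache. DrawCache, ViewEvent and both functions are hypothetical; the point is only that navigation and playback leave the cache valid while user edits discard it.

```c
#include <stdbool.h>

/* Hypothetical draw manager cache state. */
typedef struct DrawCache {
  bool valid;
  int rebuild_count; /* Only here to make rebuilds observable. */
} DrawCache;

typedef enum ViewEvent {
  EVENT_NAVIGATE,
  EVENT_PLAYBACK,
  EVENT_SELECT,
  EVENT_VALUE_CHANGE,
} ViewEvent;

/* Only user interactions discard the cache. Playback relies on the
 * depsgraph having already updated the per-object runtime data. */
void draw_cache_notify(DrawCache *cache, ViewEvent event)
{
  if (event == EVENT_SELECT || event == EVENT_VALUE_CHANGE) {
    cache->valid = false;
  }
}

/* Called once per frame before drawing: rebuilds only when discarded. */
void draw_cache_ensure(DrawCache *cache)
{
  if (!cache->valid) {
    /* Cache construction (referencing the runtime data) would go here. */
    cache->rebuild_count++;
    cache->valid = true;
  }
}
```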