Motion estimation is specific to the encoder. It's always
the most complicated part of the system, and can absorb huge
system resources, so short-cuts have to be found.
Dirac adopts a three-stage approach. In the first
stage, motion vectors are found for every block and
each reference frame to pixel accuracy using hierarchical
motion estimation. In the second stage, these vectors
are refined to sub-pixel accuracy. In the third stage,
we do mode decision, which chooses which predictor to use,
and how to aggregate motion vectors by grouping blocks
with similar motion together.
Motion estimation is most accurate when all three
picture components (luma and both chroma) are involved, but this is more expensive in
terms of computation as well as more complicated. Dirac
only uses the luma (Y) component.
Hierarchical motion estimation
Hierarchical ME speeds things up by repeatedly
downconverting both the current
and the reference frame by a factor of two in both
dimensions, and doing motion estimation on smaller pictures.
At each stage of the hierarchy, vectors from lower levels
(smaller versions of the picture)
are used as a guide for searching at higher levels. This
dramatically reduces the size of searches for large motions.
Dirac has four levels of downconversion. The block size remains
constant (and the blocks
will still overlap at all resolutions), so each level
has only a quarter as many blocks as the next higher resolution.
Each block therefore corresponds to 4 blocks at the next higher
resolution, and provides a guide motion vector to each of those
4 blocks.
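The pyramid and the parent-to-child relationship described above can be sketched as follows. This is a minimal illustration, not the Dirac implementation: the 2x2 averaging filter, the function names and the simple index mapping (which ignores block overlap) are all assumptions made for brevity.

```python
def downconvert(picture):
    """Halve width and height by averaging each 2x2 group of pixels.

    The box filter is an assumption; a real encoder would normally use a
    better low-pass filter before subsampling.
    """
    height, width = len(picture) // 2, len(picture[0]) // 2
    return [
        [
            (picture[2 * y][2 * x] + picture[2 * y][2 * x + 1]
             + picture[2 * y + 1][2 * x] + picture[2 * y + 1][2 * x + 1] + 2) // 4
            for x in range(width)
        ]
        for y in range(height)
    ]


def build_pyramid(picture, levels=4):
    """Return [full resolution, 1/2, 1/4, ...] after `levels` downconversions."""
    pyramid = [picture]
    for _ in range(levels):
        pyramid.append(downconvert(pyramid[-1]))
    return pyramid


def guide_for_child(parent_vector):
    """Scale a vector for use at the next higher resolution (twice the size)."""
    vx, vy = parent_vector
    return (2 * vx, 2 * vy)


def children_of(block_x, block_y):
    """The 4 blocks at the next higher resolution guided by one block.

    Hypothetical index mapping that ignores block overlap.
    """
    return [(2 * block_x + dx, 2 * block_y + dy) for dy in (0, 1) for dx in (0, 1)]
```

Because the block size is constant while the picture doubles in each dimension, a vector found at one level has to be doubled before it can guide the search at the next.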
At each resolution, block matching proceeds by
searching in a small range around the guide vector for the best
match using the RDO metric (which is described below).
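A small-range search around a guide vector might look like the sketch below. Two assumptions are worth stating: the cost here is a plain sum of absolute differences (SAD) standing in for the RDO metric, and candidates that point outside the reference are simply skipped, whereas a real encoder would typically extend the picture edges. The function names and default parameters are illustrative only.

```python
def sad(current, reference, bx, by, vx, vy, size):
    """Sum of absolute differences for one block displaced by (vx, vy).

    Returns None when the displaced block falls outside the reference.
    """
    height, width = len(reference), len(reference[0])
    if not (0 <= bx + vx and bx + vx + size <= width
            and 0 <= by + vy and by + vy + size <= height):
        return None
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(current[by + y][bx + x]
                         - reference[by + vy + y][bx + vx + x])
    return total


def search_around(current, reference, bx, by, guide, search_range=2, size=4):
    """Try every vector within +/-search_range of the guide; keep the cheapest."""
    gx, gy = guide
    best_vector, best_cost = None, None
    for vy in range(gy - search_range, gy + search_range + 1):
        for vx in range(gx - search_range, gx + search_range + 1):
            cost = sad(current, reference, bx, by, vx, vy, size)
            if cost is not None and (best_cost is None or cost < best_cost):
                best_vector, best_cost = (vx, vy), cost
    return best_vector, best_cost
```

The number of candidates grows with the square of `search_range`, which is why a good guide vector matters so much.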
Search strategies in hierarchical ME
The hierarchical approach dramatically reduces the
computational effort involved in motion estimation for an
equivalent search range. However it risks missing small
motions and it might not make good decisions when there are
a variety of motions near to each other.
To mitigate this, the codec always uses the zero vector (0,0) as
another guide vector - this allows it to track slow- as well
as fast-moving objects. Finally, the motion vectors
already found in neighbouring blocks can also be used
as guide vectors, if they have not already been tried.
Since each layer has twice the horizontal and vertical
resolution of the one below it, it would appear to make sense to
just search in an area +/-1 pixel of the guide vectors. In
fact, the search ranges are always larger than this, because so narrow
a search could leave the motion estimator trapped in a local minimum.
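The combination of guide vectors described in this section can be sketched as below. The names `candidate_guides` and `search_all_guides`, the `cost` callable and the default range are hypothetical; the sketch only shows the idea of collecting the scaled parent vector, the zero vector and any untried neighbour vectors, then searching a small area around each.

```python
def candidate_guides(parent_vector, neighbour_vectors):
    """Scaled parent vector, then (0, 0), then neighbours not already listed."""
    px, py = parent_vector
    candidates = []
    for vector in [(2 * px, 2 * py), (0, 0), *neighbour_vectors]:
        if vector not in candidates:
            candidates.append(vector)
    return candidates


def search_all_guides(guides, cost, search_range=2):
    """Search +/-search_range around each guide, evaluating each vector once.

    `cost` is any callable taking (vx, vy) and returning a number.
    """
    tried = set()
    best = None
    for gx, gy in guides:
        for vy in range(gy - search_range, gy + search_range + 1):
            for vx in range(gx - search_range, gx + search_range + 1):
                if (vx, vy) in tried:
                    continue
                tried.add((vx, vy))
                value = cost(vx, vy)
                if best is None or value < best[1]:
                    best = ((vx, vy), value)
    return best
```

With a range larger than +/-1, the areas around different guides often overlap, so remembering which vectors have been tried avoids repeated cost evaluations.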
Sub-pixel refinement and upconversion
Dirac supports variable levels of motion vector accuracy. The
current software hard-wires the accuracy at 1/4 pixel,
but 1/8 pixel is possible with the current software and even higher
resolutions could be defined. The MV precision is signalled with
each frame.
Sub-pixel refinement also operates hierarchically. Once
pixel-accurate motion vectors have been determined, each block
will have an associated motion vector
(V0,W0) where V0 and
W0 are multiples of 4 (for quarter-pel accuracy) or
8 (for eighth-pel accuracy). 1/2-pel accurate vectors are found by
finding the best match out of (V0,W0) and
its 8 neighbours at half-pixel offsets, here in eighth-pel units: (V0+4,W0+4),
(V0,W0+4),
(V0-4,W0+4),
(V0+4,W0),
(V0-4,W0),
(V0+4,W0-4),
(V0,W0-4),
(V0-4,W0-4). This in turn produces a new
best vector (V1,W1), which provides a
guide for 1/4-pel refinement, and so on until the desired accuracy is reached. The process is
illustrated in the figure below.
Figure: sub-pixel motion-vector refinement
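The refinement loop can be sketched as follows, with vectors held in sub-pixel units (8 units per pixel for eighth-pel accuracy, 4 for quarter-pel). The function name and the `cost` callable are assumptions for illustration; in a real encoder the cost would be the block-matching metric evaluated on the upconverted reference.

```python
def refine_subpixel(start, cost, units_per_pixel=8):
    """Refine a pixel-accurate vector by successively halving the step.

    `start` is in sub-pixel units, so both parts are multiples of
    `units_per_pixel`. `cost` is a callable taking (v, w).
    """
    best, best_cost = start, cost(*start)
    step = units_per_pixel // 2          # first pass is 1/2-pel
    while step >= 1:
        centre = best
        # Compare the centre with its 8 neighbours at the current step.
        for dw in (-step, 0, step):
            for dv in (-step, 0, step):
                candidate = (centre[0] + dv, centre[1] + dw)
                value = cost(*candidate)
                if value < best_cost:
                    best, best_cost = candidate, value
        step //= 2                       # then 1/4-pel, 1/8-pel, ...
    return best, best_cost
```

Each pass evaluates at most 9 candidates, so eighth-pel accuracy costs three small passes rather than an exhaustive search over every eighth-pel position around the starting vector.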
The sub-pixel matching process is complicated slightly
since the reference is only upconverted by a factor of 2 in
each dimension, not 8, and so more accurate vectors
require frame component values to be calculated on the fly by linear
interpolation. For this reason, the 1/2-pel interpolation filter has
a bit of pass-band boost to counteract the sag introduced by doing
linear interpolation. It was designed to produce the lowest
interpolation error across all the phases. The taps are (scaled to 5 bits):