The key to making good decisions in compression is being able to trade off the
number of bits used to encode some part of the signal being compressed against
the error produced by using that number of bits. There is no point striving
hard to compress one feature of the signal if the degradation this produces is
much more significant than the degradation that would result from compressing
some other feature with fewer bits. In other words, one wishes to distribute
the bit rate so as to get the least possible distortion overall. So how can
this be done?
Rate-distortion optimisation can be described in terms of
Lagrangian multipliers. It can also be described by the Principle of Equal
Slopes, which states that the coding parameters should be
selected so that the rate of change of distortion with respect
to bit rate is the same for all parts of the system.
To see why this is so, consider two independent components
of a signal. They might be different blocks in a video frame,
or different subbands in a wavelet transform. Compress them at
various rates using your favourite coding technique, and you
tend to get curves like those in the figure below. They show that
at low rates, there is high distortion (or error) and at high
rates there is low distortion, and there is generally a
smooth(ish) curve between these points with a nice convex
shape.
Figure: Rate-distortion curves for two signal
components
Now suppose that we assign B1 bits to component X and B2
bits to component Y. Look at the slope of the rate-distortion
curves at these points. At B1 the slope of X's distortion with
respect to bit rate is much steeper than the slope at B2, which
measures the rate of change of Y's distortion with respect to
bit rate. This is not the most efficient allocation of bits. To
see why, increase B1 by a small amount to B1+Δ and decrease B2 to
B2-Δ. The total distortion is then reduced even though the total
bit rate hasn't changed, because the disproportionately large drop
in X's distortion outweighs the small increase in Y's.
The conclusion is therefore that for a fixed total bit rate,
the error or distortion is minimised by selecting bit rates for
X and Y at which the rate-distortion curves have the same
slope. Likewise, the problem can be reversed: for a fixed
level of distortion, the total bit rate is minimised by
finding points with the same slope.
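As a concrete illustration, here is a minimal numerical sketch, assuming the
textbook high-rate model D(R) = σ²·2^(-2R) for each component; the variances
and the bit budget are purely illustrative and nothing here is taken from the
Dirac encoder itself. Sweeping the split of a fixed budget between the two
components picks out the allocation at which the two slopes match.

```python
# A numerical sketch of the equal-slopes argument, assuming the textbook
# high-rate model D(R) = sigma^2 * 2^(-2R) for each component. All numbers
# are illustrative.

def distortion(sigma2, rate):
    """Model distortion of one component coded at 'rate' bits per sample."""
    return sigma2 * 2.0 ** (-2.0 * rate)

def slope(sigma2, rate, eps=1e-6):
    """Numerical slope dD/dR of the model rate-distortion curve."""
    return (distortion(sigma2, rate + eps) - distortion(sigma2, rate)) / eps

SIGMA2_X, SIGMA2_Y = 16.0, 1.0   # component X is much 'busier' than Y
TOTAL_BITS = 6.0                 # fixed overall budget, in bits per sample

# Sweep the split of the budget between X and Y and keep the best one.
best = None
steps = 600
for i in range(steps + 1):
    b1 = TOTAL_BITS * i / steps
    b2 = TOTAL_BITS - b1
    d_total = distortion(SIGMA2_X, b1) + distortion(SIGMA2_Y, b2)
    if best is None or d_total < best[0]:
        best = (d_total, b1, b2)

d_total, b1, b2 = best
print(f"best split: B1={b1:.2f}, B2={b2:.2f}, total distortion={d_total:.4f}")
print(f"slope of X at B1: {slope(SIGMA2_X, b1):.4f}")
print(f"slope of Y at B2: {slope(SIGMA2_Y, b2):.4f}")
# At the optimum the two slopes come out (nearly) equal, so moving a small
# amount of rate from one component to the other no longer helps.
```

With this particular model the best split turns out to be B1 = 4 and B2 = 2,
at which point both curves have the same slope.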
Two questions arise in practice: firstly, how does one find
points on these curves with the same slope; and secondly, how
does one hit a fixed overall bit budget? The first question is
answered by the figure below: the tangent to the rate-distortion
curve at the point (R0,D0) intercepts the D-axis at the
value D0+λR0, where -λ is the slope at the point
(R0,D0). Furthermore, because the curve is convex, this is the
smallest value of D+λR over all points (R,D) that lie on the
curve. So in selecting, for example, a quantiser for a given block
or subband, one minimises the value D(Q)+λR(Q) over all candidate
quantisers Q, where D(Q) is the error produced by quantising with Q
and R(Q) is the rate implied.
Figure: Minimisation of the Lagrangian cost
function
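The following is a minimal sketch of this selection step, assuming a plain
uniform quantiser, mean squared error for D(Q) and a zeroth-order entropy
estimate for R(Q); the coefficients, candidate steps and λ values are made up
for illustration, and this is not Dirac's actual quantiser search.

```python
import math
from collections import Counter

def quantise(coeffs, step):
    """Uniform quantisation: integer indices and reconstructed values."""
    indices = [int(round(c / step)) for c in coeffs]
    recon = [q * step for q in indices]
    return indices, recon

def mse(a, b):
    """Mean squared error between original and reconstructed coefficients."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def entropy_bits(indices):
    """Zeroth-order entropy estimate of the rate, in bits per coefficient."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_quantiser(coeffs, steps, lam):
    """Pick the quantiser step minimising the Lagrangian cost D + lambda*R."""
    best_step, best_cost = None, float("inf")
    for step in steps:
        indices, recon = quantise(coeffs, step)
        cost = mse(coeffs, recon) + lam * entropy_bits(indices)
        if cost < best_cost:
            best_step, best_cost = step, cost
    return best_step, best_cost

# Illustrative subband: a few large coefficients amid many small ones.
coeffs = [0.3, -0.1, 5.2, 0.7, -4.8, 0.2, 0.05, 9.1, -0.4, 0.6, 1.9, -2.3]
candidate_steps = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]

for lam in (0.1, 1.0, 10.0):
    step, cost = best_quantiser(coeffs, candidate_steps, lam)
    print(f"lambda={lam:5.1f}: chosen step={step}, cost={cost:.3f}")
# A larger lambda penalises rate more heavily, so a coarser step is chosen.
```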
In order to hit an overall bit budget, one needs to iterate
over values of the Lagrangian parameter λ in order to
find the one that gives the right rate. In practice, this
iteration can be done in slow time given any decent encoding
buffer size, and by modelling the overall rate-distortion curve
based on the recent history of the encoder. Rate-distortion
optimisation (RDO) is used throughout Dirac, and it has a very beneficial effect on
performance. The example Dirac encoder is controlled by a single
parameter ("-qf") that effectively sets Lagrangian parameters
for each part of the encoding process.
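As a sketch of the kind of iteration involved, the following assumes only that
the achieved rate decreases monotonically as λ increases, and uses a toy rate
model in place of actually running an encoder; the function names and numbers
are illustrative and not part of any Dirac API.

```python
def rate_for_lambda(lam):
    """Stand-in for running the encoder (or a model of it) at a given lambda
    and returning the resulting bit rate. A simple decreasing model is used
    here purely for illustration."""
    return 100.0 / (1.0 + lam)

def find_lambda(target_rate, lo=1e-4, hi=1e4, tol=0.01, max_iter=60):
    """Bisection on lambda: the rate falls as lambda rises, so bracket the
    target rate and halve the interval until close enough."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        r = rate_for_lambda(mid)
        if abs(r - target_rate) <= tol:
            return mid
        if r > target_rate:
            lo = mid      # rate too high: increase lambda
        else:
            hi = mid      # rate too low: decrease lambda
    return (lo + hi) / 2.0

lam = find_lambda(target_rate=25.0)
print(f"lambda ~ {lam:.3f}, rate ~ {rate_for_lambda(lam):.2f}")
```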
This description makes RDO sound like a science:
in fact it isn't, and the reader will be pleased to learn that
there is plenty of scope for engineering ad-hoc-ery of all
kinds. This is because there are some practical problems in
applying the procedure:
1) There may be no common measure of distortion. For
example: quantising a high-frequency subband is, in general, less
visually objectionable than quantising a low-frequency subband.
So there is no direct comparison between the significance
of the distortion produced in one subband and that produced in
another. This can be overcome by perceptual weighting, in which
the noise in HF bands is given less weight according to an estimate of
the Contrast Sensitivity Function (CSF) of the human eye, and
this is what we have done. The problem occurs even in
block-based coders, however, since quantisation noise can be
successfully masked in some areas but not in others. Perceptual
fudge factors are therefore necessary in RDO in all types of
coders.
2) Rate and distortion may not be directly
measurable. In practice, measuring rate and distortion for,
say, every possible quantiser in a coding block or subband
cannot mean actually encoding with every such quantiser,
counting the bits and measuring the MSE. What one can do is
estimate the values using entropy calculations, or by assuming a
statistical model and calculating, say, the variance. In this
case the R and D values may well be only roughly proportional
to the true values, and some sort of compensating factor is
needed if a common multiplier is to be used across the encoder.
A sketch combining such entropy-based estimates with the perceptual
weighting of the previous point is given after this list.
3) Components of the bitstream will be
interdependent. The model describes a situation where the
different signals X and Y are fully independent. This is often
not true in a hybrid video codec. For example, the rate at
which reference frames are encoded affects how noisy the
prediction from them will be, and so the quantisation in
predicted frames depends on that in the reference frame.
Even if elements of the bitstream are logically independent,
perceptually they might not be. For example, with Intra coding,
each frame could be subject to RDO independently, but this
might lead to objectionably large variations in quantisation
noise between frames at low bit rates and with rapidly changing
content.
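The sketch below illustrates points 1 and 2 together: the distortion of each
subband is scaled by a perceptual weight before entering the Lagrangian cost,
and the rate is estimated from the zeroth-order entropy of the quantised
coefficients rather than by actually coding them. The weights, coefficients
and λ are invented for illustration; they are not Dirac's actual CSF weights.

```python
import math
from collections import Counter

def entropy_bits(indices):
    """Zeroth-order entropy estimate (bits/coefficient) of quantised indices."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def weighted_cost(subbands, weights, step, lam):
    """Lagrangian cost with perceptual weighting: each subband's MSE is scaled
    by a weight (smaller for high frequencies, where noise is less visible),
    and the rate is estimated from entropy rather than by coding the band."""
    total = 0.0
    for band, w in zip(subbands, weights):
        indices = [int(round(c / step)) for c in band]
        recon = [q * step for q in indices]
        mse = sum((c - r) ** 2 for c, r in zip(band, recon)) / len(band)
        rate = entropy_bits(indices)
        total += w * mse + lam * rate
    return total

# Two illustrative subbands: a low-frequency band (weighted fully) and a
# high-frequency band (downweighted, since its quantisation noise is less
# visually objectionable).
lf_band = [10.2, -8.7, 6.1, 12.3, -9.9, 7.4]
hf_band = [0.4, -0.2, 0.9, -1.1, 0.3, 0.6]
weights = [1.0, 0.35]            # hypothetical CSF-style weights

for step in (0.5, 1.0, 2.0):
    cost = weighted_cost([lf_band, hf_band], weights, step, lam=0.5)
    print(f"step={step}: weighted cost={cost:.3f}")
```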
Incorporating motion estimation into RDO is also tricky,
because motion parameters are not part of the content but
have an indirect effect on how the content looks. They also have
a coupled effect on the rest of the coding process, since the
distortion measured by prediction error, say, affects both
the bit rate needed to encode the residuals and the
distortion remaining after coding. This is discussed in more
detail later.