Table of Contents |
Up-conversion: Optimisation: Arithmetic | Up-conversion: Optimisation: Other
The CPU has to store the images in memory and when dealing with up-conversion the amount of memory for successive images can rise quite rapidly. With large IO requirements careful attention should be paid to the amount of memory used and how. Whilst images are stored in two-dimensional arrays the PC's memory is not two-dimensional and can be visualised as a one-dimensional array or line. How the PC performs the 2D to 1D mapping is very important when trying to optimise code that loops over large amounts of data.
With a two-dimensional array there are two easy ways to convert the information for one-dimensional storage. The most obvious way is to simply concatenate all the lines, which is effectively what we do when reading lines of words. Unfortunately this is not the way in which C++ stores its standard arrays.
A two-dimensional C++ array is an array of arrays which is accessed by the syntax My_Array[x_index][y_index]. The value x_index gives an index into a one-dimensional array of pointers to other arrays. Having resolved the pointer the value y_index is an index into the 'sub-array' which actually contains a column of pixel values. This means that C++ stores two-dimensional array with columns of data contiguous in memory and rows of data non-contiguous.
Loops that access successive values in memory are efficient, as the CPU simply needs to increment a pointer each time round the loop. Using standard arrays we would like to loop down each column so we can take advantage of this fact. The other memory issue that must be considered arises because the PC has several layers of memory.
When the CPU needs a variable it will first look in its own registers, then the level one cache followed by level two cache, RAM and then swap space. As it progresses down this hierarchy the amount of time taken to retrieve the variable increases dramatically. If the CPU cannot find the data in the cache it issues a command to fetch it from slower memory. However, it doesn't just fetch the variable, it will grab a chunk of memory of which the variable is only a small part, this is pre-fetching.
If we know that a column is stored contiguously in memory then it is likely that when the CPU requests a variable the whole column (and possibly neighbouring columns) will be brought up the memory hierarchy. In this case it would be sensible to process a column at a time to make optimal use of the memory. Processing the image a line at a time would mean that the CPU is constantly jumping through memory trying to grab the relevant data.
Memory access is the main bottleneck in the up-conversion code so it can be used to justify the choice to only up-convert by a factor of two. Benchmarks on a machine similar to the test machine clocked the RAM's maximum transfer rate at a little over 600 MB/s. Consider up-converting a CIF image (352 x 288 pixels) by a factor of eight, assume that we only address each pixel in the final image once. This means we have are dealing with about 26MB data (assuming integer storage). If we use half of the RAM transfer rate we could expect to up-convert around 12 images every second. This is not going to give us a real-time coder as our frame rate is low and we haven't even considered the operation of the rest of the codec.
The loops in the up-conversion code are quite complicated so the best way to describe them is (as they were developed) in stages. Initial code versions used Blitz arrays as I was already familiar with the syntax involved. Whilst trying out different methods and implementations I converted the code to use standard C++ arrays. The syntax for standard arrays is a lot more cumbersome but this forces the developer to think about how they are organised in memory. The change from Blitz to standard arrays had little effect on the execution time so I stuck with the standard type. The standard arrays should be more acceptable to developers who want to modify the code. Insisting that the developer use a certain array container for no good reason will not win many fans.
Firstly, a quick reminder of how the basic method works. Figure 2 shows how the filter works along a line, copying one pixel to the new image and then working out the next new pixel by multiplying pixels from the old image by the filter taps. In the example I assume an eleven-tap half-band filter so only six pixels from the old image form part of the calculation. The remaining 5 taps have a value of zero and apply to the pixels in-between the numbered ones (which only exist in the new image).
Figure 2 - The up-conversion process
The first version of the loop simply applied this process to each line producing an image of twice the width that needed to be stored in an intermediate array. Then the same process was applied to each column in the intermediate array to generate the final image. This second stage is quite efficient as the loop is filtering each column, which is contiguous in memory. The next change was to make the first stage equally efficient.
For the moment we retain the idea of filtering all the lines and writing an intermediate image before filtering the columns. This time altering the first stage so it is now as follows:
Again the result is an intermediate image that is twice as wide and now needs filtering by columns to generate the final image. However this time the first stage process looped over the columns of the original image even though it was actually filtering the lines. Each calculation in the first stage will use several pixel values from a line of the original image. The example in figure 2 used six pixels in the calculation so for this to be quick the CPU needs to have 6 columns of data in the cache. For a CIF image this would involve approximately 7Kb of space which is perfectly reasonable.
For the final optimised version we can eliminate the intermediate array and make even better use of the cache. The extra optimisation involves combing the two filtering stages into one and the process becomes a little more complicated. Combing the stages means that we can perform all the operations that need a particular column of data whilst it is still in the cache. Breaking the process down into stages:
Figure 3 - Writing the new image