Optimised C++ routines for image up-conversion and down-conversion

Table of Contents | Up-conversion: Optimisation: Arithmetic | Up-conversion: Optimisation: Other

Up-conversion

Optimising the conversion: IO

The CPU has to store the images in memory and when dealing with up-conversion the amount of memory for successive images can rise quite rapidly. With large IO requirements careful attention should be paid to the amount of memory used and how. Whilst images are stored in two-dimensional arrays the PC's memory is not two-dimensional and can be visualised as a one-dimensional array or line. How the PC performs the 2D to 1D mapping is very important when trying to optimise code that loops over large amounts of data.

With a two-dimensional array there are two easy ways to convert the information for one-dimensional storage. The most obvious way is to simply concatenate all the lines, which is effectively what we do when reading lines of words. Unfortunately this is not the way in which C++ stores its standard arrays.

A two-dimensional C++ array is an array of arrays which is accessed by the syntax My_Array[x_index][y_index]. The value x_index gives an index into a one-dimensional array of pointers to other arrays. Having resolved the pointer the value y_index is an index into the 'sub-array' which actually contains a column of pixel values. This means that C++ stores two-dimensional array with columns of data contiguous in memory and rows of data non-contiguous.

Loops that access successive values in memory are efficient, as the CPU simply needs to increment a pointer each time round the loop. Using standard arrays we would like to loop down each column so we can take advantage of this fact. The other memory issue that must be considered arises because the PC has several layers of memory.

When the CPU needs a variable it will first look in its own registers, then the level one cache followed by level two cache, RAM and then swap space. As it progresses down this hierarchy the amount of time taken to retrieve the variable increases dramatically. If the CPU cannot find the data in the cache it issues a command to fetch it from slower memory. However, it doesn't just fetch the variable, it will grab a chunk of memory of which the variable is only a small part, this is pre-fetching.

If we know that a column is stored contiguously in memory then it is likely that when the CPU requests a variable the whole column (and possibly neighbouring columns) will be brought up the memory hierarchy. In this case it would be sensible to process a column at a time to make optimal use of the memory. Processing the image a line at a time would mean that the CPU is constantly jumping through memory trying to grab the relevant data.

Memory access is the main bottleneck in the up-conversion code so it can be used to justify the choice to only up-convert by a factor of two. Benchmarks on a machine similar to the test machine clocked the RAM's maximum transfer rate at a little over 600 MB/s. Consider up-converting a CIF image (352 x 288 pixels) by a factor of eight, assume that we only address each pixel in the final image once. This means we have are dealing with about 26MB data (assuming integer storage). If we use half of the RAM transfer rate we could expect to up-convert around 12 images every second. This is not going to give us a real-time coder as our frame rate is low and we haven't even considered the operation of the rest of the codec.

The loops in the up-conversion code are quite complicated so the best way to describe them is (as they were developed) in stages. Initial code versions used Blitz arrays as I was already familiar with the syntax involved. Whilst trying out different methods and implementations I converted the code to use standard C++ arrays. The syntax for standard arrays is a lot more cumbersome but this forces the developer to think about how they are organised in memory. The change from Blitz to standard arrays had little effect on the execution time so I stuck with the standard type. The standard arrays should be more acceptable to developers who want to modify the code. Insisting that the developer use a certain array container for no good reason will not win many fans.

Firstly, a quick reminder of how the basic method works. Figure 2 shows how the filter works along a line, copying one pixel to the new image and then working out the next new pixel by multiplying pixels from the old image by the filter taps. In the example I assume an eleven-tap half-band filter so only six pixels from the old image form part of the calculation. The remaining 5 taps have a value of zero and apply to the pixels in-between the numbered ones (which only exist in the new image).


Figure 2 - The up-conversion process

The first version of the loop simply applied this process to each line producing an image of twice the width that needed to be stored in an intermediate array. Then the same process was applied to each column in the intermediate array to generate the final image. This second stage is quite efficient as the loop is filtering each column, which is contiguous in memory. The next change was to make the first stage equally efficient.

For the moment we retain the idea of filtering all the lines and writing an intermediate image before filtering the columns. This time altering the first stage so it is now as follows:

  1. Start on the first line of the original image at the first pixel.
  2. Copy the pixel to the intermediate image.
  3. Calculate the next pixel in the intermediate image by filtering the original image.
  4. Move down one line in the original image and apply the same process.
  5. When the process has been applied to the last line - move back to the top line but start one pixel further along the line than the previous time.
  6. Go to instruction 2 and continue until every pixel in the original image has been copied.

Again the result is an intermediate image that is twice as wide and now needs filtering by columns to generate the final image. However this time the first stage process looped over the columns of the original image even though it was actually filtering the lines. Each calculation in the first stage will use several pixel values from a line of the original image. The example in figure 2 used six pixels in the calculation so for this to be quick the CPU needs to have 6 columns of data in the cache. For a CIF image this would involve approximately 7Kb of space which is perfectly reasonable.

For the final optimised version we can eliminate the intermediate array and make even better use of the cache. The extra optimisation involves combing the two filtering stages into one and the process becomes a little more complicated. Combing the stages means that we can perform all the operations that need a particular column of data whilst it is still in the cache. Breaking the process down into stages:

  1. As with the previous method. Start at the first pixel on the first line and copy a pixel before calculating the next. This time write the results to the final image.
  2. Continue the process moving down the column of the old image but write the result to alternate lines in the final image, see figure 3(a).
  3. When the bottom of the column is reached both columns that have been written to the new image can be filtered. Apply the filter to both columns to fill in the blank pixels, see figure 3(b).
  4. Go back to instruction 1 but start on the next pixel on the first line.


Figure 3 - Writing the new image

SourceForge.net Logo