[theora-dev] What goes to Hardware ?

Sun Jul 2 15:10:00 PDT 2006

Hi people,

As I said before, I implemented the IDCT on the FPGA.

My friends from university implemented the Reconstruction routines on the FPGA.
I'm helping with the LoopFilter, and it is almost done.

(All in VHDL.)

I did some quick profiling of libtheora running on an Altera Stratix II device.

The processor was a NIOS II with 8 KB of data and instruction
cache, branch prediction, and a hardware divider (this is the most
robust NIOS II version).

I decoded some frames of a 320x240 Theora stream.

Decoding entirely in software (without the hardware modules),
I got 44 ms per frame.

Of these 44 ms:
    The IDCT takes 7 ms
    The Reconstruction routines take 6 ms

As you know, the ReconRefFrames routine calls the IDCT,
Reconstruction, and LoopFilter routines.

ReconRefFrames takes 31 ms of the total 44 ms.
This is about 70% of the decoding time.

If I run libtheora with the IDCT hardware module instead of the
software IDCT,
I get 46 ms of decoding time per frame.

You might say that this makes no sense: why would the time increase
with the help of a hardware module?

The increase can be explained by two factors:

1) The overhead of data transfer on the bus is too high; this bus
is also shared with the processor's normal memory accesses.

2) I ran a sequential test: the software sends data to the IDCT, waits
for the result to be ready, and then reads it back from the IDCT.
This is bad, because it throws away the hardware parallelism.
But this is just a small test; the final version will have a buffer for
sending to and receiving from the IDCT, so the software does not have
to stop.

So you have to account for about 2 ms of data-transfer overhead and
7 ms of IDCT processing time.
We could save 5 ms per frame compared with the software-only decode
(44 - 7 + 2 = 39 ms) if the IDCT hardware module ran in parallel.
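
One way to read these numbers, as a rough sketch in Python: the software IDCT (7 ms) is removed, the bus transfers add about 2 ms, and in the blocking test the CPU also waits about 7 ms for the hardware IDCT. The function names are made up for this illustration, and the idea that the hardware IDCT time can be hidden completely behind other software work is an optimistic assumption, not a measurement.

```python
# All times are in ms per frame, taken from the measurements in this post.
SOFTWARE_TOTAL = 44      # full software-only decode
SOFTWARE_IDCT = 7        # time spent in the software IDCT
TRANSFER_OVERHEAD = 2    # approximate bus cost (send + receive)
HARDWARE_IDCT_WAIT = 7   # time the CPU idles while the hardware IDCT works


def blocking_time():
    """Software sends data, waits for the IDCT, then reads the result."""
    return SOFTWARE_TOTAL - SOFTWARE_IDCT + TRANSFER_OVERHEAD + HARDWARE_IDCT_WAIT


def overlapped_time():
    """Assumption: the IDCT wait is fully hidden behind other software work."""
    return SOFTWARE_TOTAL - SOFTWARE_IDCT + TRANSFER_OVERHEAD


print("blocking  :", blocking_time(), "ms per frame")
print("overlapped:", overlapped_time(), "ms per frame")
print("saved vs software only:", SOFTWARE_TOTAL - overlapped_time(), "ms")
```

Under this model the blocking case reproduces the measured 46 ms, and the overlapped case comes out at 39 ms, which is where the 5 ms saving comes from.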

But the important thing to see from these numbers is this:

Even if the hardware IDCT had no data-transfer overhead,
we could save only 7 ms (about 16%) of decoding time per frame.

But if we put the whole ReconRefFrames routine in hardware, we can
save 31 ms (about 70%).
That would be very good: only about 30% of the algorithm's CPU time
would be running in software.
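
The comparison between the two offload options can be checked with a small Python sketch, computed directly from the millisecond figures. It assumes an idealized offload with zero transfer overhead and zero CPU wait, which is only an upper bound on the gain; the helper name is hypothetical.

```python
# Measured times in ms per frame, from the profiling in this post.
TOTAL_MS = 44
IDCT_MS = 7
RECON_REF_FRAMES_MS = 31


def offload_estimate(total_ms, offloaded_ms):
    """Return (remaining ms, fraction saved, speedup) for a free offload.

    Idealized: assumes the offloaded part costs the CPU nothing at all.
    """
    remaining = total_ms - offloaded_ms
    return remaining, offloaded_ms / total_ms, total_ms / remaining


for name, ms in (("IDCT", IDCT_MS), ("ReconRefFrames", RECON_REF_FRAMES_MS)):
    remaining, saved, speedup = offload_estimate(TOTAL_MS, ms)
    print(f"{name}: {remaining} ms left, {saved:.0%} saved, {speedup:.2f}x faster")
```

With these inputs, offloading only the IDCT leaves 37 ms of software work per frame (roughly 16% saved), while offloading all of ReconRefFrames leaves 13 ms (roughly 70% saved).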

Even better:
If we have ReconRefFrames in hardware, we can send the output of
the ReconRefFrames hardware module directly to the screen (without
passing through software).

This way, the libtheora software will just copy the data to the
hardware module, and the hardware output will be sent directly to a
screen buffer (another hardware module, like a video board, will present
the frame on a video monitor).

This approach needs the overhead of only 1 transfer (just send),
not 2 as in my IDCT test (send and receive).

To put the whole ReconRefFrames routine in hardware, I will need at
least 3 big buffers:

Current Frame
Last Frame
Golden Frame

On a 320x240 stream, each buffer takes about 150 Kbytes, so I
will need about 500 Kbytes of memory.
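
A rough Python sketch of this buffer arithmetic, under assumptions that are not stated in this post: 8-bit samples, 4:2:0 chroma subsampling, and an optional border of padding pixels around each plane (the 16-pixel value used below is only a guess at why the estimate is above the bare frame size). The function name is hypothetical.

```python
def yuv420_frame_bytes(width, height, border=0):
    """Bytes for one 8-bit 4:2:0 frame.

    `border` is an assumed padding, in luma pixels, added on every side;
    the chroma planes are half the padded luma size in each dimension.
    """
    padded_w = width + 2 * border
    padded_h = height + 2 * border
    luma = padded_w * padded_h
    chroma = (padded_w // 2) * (padded_h // 2)
    return luma + 2 * chroma


NUM_BUFFERS = 3  # Current, Last, and Golden frames

bare = yuv420_frame_bytes(320, 240)
padded = yuv420_frame_bytes(320, 240, border=16)  # border size is an assumption

print("bare frame  :", bare, "bytes")
print("padded frame:", padded, "bytes")
print("3 padded buffers:", NUM_BUFFERS * padded, "bytes")
```

Under these assumptions a bare frame is 115200 bytes and a padded one is 143616 bytes, so three padded buffers need 430848 bytes. That is in the same range as the estimate of about 150 Kbytes per buffer and about 500 Kbytes in total.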

That is too much for the FPGA's internal memory.
So I'm planning to use an external SRAM of 500 Kbytes.
SRAM data sheet: http://www.olimex.com/dev/pdf/71V416_DS_74666.pdf

Another alternative is to use a PC100 SDRAM of 16 MB (128 Mbit):
http://download.micron.com/pdf/datasheets/dram/sdram/128MbSDRAMx32.pdf
http://www.altera.com/literature/ds/ds_sdram_ctrl.pdf

My Altera Stratix II Dev. Kit has both this SRAM and this SDRAM.
See:
http://www.altera.com/literature/manual/mnl_nios2_board_stratixII_2s60.pdf

Comments and suggestions are welcome.

Best Regards,
felipe

-- 
________________________________________
Felipe Portavales <portavales at gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br