[Theora] FPGA implementation in the camera

Thu Sep 9 09:38:45 PDT 2004

> As j indicated, the spec is your best option here. Please let us know if
> you find any discrepancies/bugs/unclear sections.
>
> Good luck, and keep us posted!
>
>  -r

I've read all those 206 pages a couple of times. My only complaint so far
is that the spec is written only for the decoder, so some amount of
"reverse engineering" is still required.

Given the limited resources of the hardware and the specifics of network
cameras (the most common application is probably a digital replacement for
video security cameras, which usually don't move, so the background is
stable), I'm trying to simplify the scope of the first implementation as
much as possible. And since I'm using a reconfigurable FPGA (not a custom
ASIC), I do not need to have a perfect implementation at the time the
hardware is released; it can be implemented incrementally.

So I'm thinking of starting with the following:

1. No motion vectors.
2. Always use the INTER_NOMV coding mode (except in golden frames)
3. No loop filter (is it the same as putting zeros in the limits?)
4. EOB runs are limited to a single block

Does this make sense?
I can work on the loop filter if there are enough resources left and the
external (to the FPGA) memory bandwidth is not saturated (32 MB, 16M x 16,
DDR at 120 MHz, peak 480 MB/s). The same is true for always referencing
the previous frame: it will be easy to change if the memory bandwidth is
enough to transfer both references (golden and previous) to the FPGA.
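
As a back-of-the-envelope check of the memory figures above, here is a
small Python sketch. The capacity and peak numbers come straight from the
message; the traffic estimate is only illustrative. It assumes 8-bit
samples, counts only the Bayer tile reads, the reference reads and the
reconstructed-frame writes (sensor writes and token traffic are ignored),
and uses a hypothetical 1280x1024 sensor at 15 fps that is not stated
anywhere in the message.

```python
# Frame buffer figures quoted in the message.
WORDS = 16 * 2**20          # 16M words
BYTES_PER_WORD = 2          # 16-bit data bus
CLOCK_HZ = 120_000_000      # 120 MHz
TRANSFERS_PER_CLOCK = 2     # DDR: data moves on both clock edges

capacity_bytes = WORDS * BYTES_PER_WORD
peak_bytes_per_s = CLOCK_HZ * TRANSFERS_PER_CLOCK * BYTES_PER_WORD


def encoder_traffic(width, height, fps, golden_too=False):
    """Rough bytes/s moved between FB and FPGA (assumes 8-bit samples)."""
    pixels = width * height
    # Each 16x16 macroblock is produced from an overlapping 20x20 Bayer tile.
    bayer_read = pixels * (20 * 20) / (16 * 16)
    # 4:2:0 data: 1 luma byte + 0.5 chroma byte per pixel.
    ref_read = pixels * 1.5       # previous frame fetched for subtraction
    recon_write = pixels * 1.5    # reconstructed frame written back
    per_frame = bayer_read + ref_read + recon_write
    if golden_too:
        per_frame += ref_read     # second reference (golden frame)
    return per_frame * fps


if __name__ == "__main__":
    print("capacity: %d MB" % (capacity_bytes // 2**20))
    print("peak:     %d MB/s" % (peak_bytes_per_s // 10**6))
    # Hypothetical sensor geometry and frame rate, for illustration only.
    for golden in (False, True):
        load = encoder_traffic(1280, 1024, 15, golden_too=golden)
        print("golden=%s: %.1f%% of peak"
              % (golden, 100.0 * load / peak_bytes_per_s))
```

Peak bandwidth is never sustained on real DDR (refresh, bank switching and
short bursts all cost cycles), so the percentage is optimistic rather than
a guarantee.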

I see the following structure of the compressor implemented in the FPGA
(Xilinx Spartan 3 1000K gates):
1. Data from the external frame buffer (FB) memory goes to the
Bayer-to-YCbCr (4:2:0) converter in overlapping 20x20 tiles, each of which
produces six 8x8 blocks (one macroblock) at the output.
2. The corresponding six blocks from the previous frame are fetched from
the same FB in parallel, subtracted from the new frame (if it is not a
golden frame) and processed by the DCT and the quantizer.
3. After the quantizer, the data in one branch goes through the
dequantizer and IDCT, then back to the FB to be fetched with the next frame.
4. In parallel with (3), the 64 coefficients are RLL encoded and saved to
the FB. At least at first, there will be no EOB runs covering several
blocks, as the blocks will be processed in a single pass in macroblock
order, not plane order.
5. A separate process will fetch the tokens (or just their fixed-length
RLL-encoded equivalents) from the FB in index order; the bitstream will be
built and transferred to the system memory (separate from the FB) over a
DMA channel. The CPU will run software to add all the required headers,
encapsulate the stream and send it out.
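
Steps 2 to 4 for a single 8x8 block can be sketched in Python as below.
This is only a software model under loud assumptions, not the FPGA design
and not a bit-exact Theora encoder: it uses a floating-point DCT instead
of the integer transform in the spec, a single flat quantizer step
(`step`, hypothetical) instead of the real quantization matrices, a
conventional zig-zag scan, and plain `(run, value)` pairs with a per-block
`"EOB"` marker instead of the actual token set. The name `encode_block`
is made up for illustration.

```python
import math

N = 8
# Orthonormal DCT-II basis (floating-point stand-in for the spec's
# integer transform).
C = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)] for k in range(N)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    return matmul(matmul(C, block), transpose(C))


def idct2(coef):
    return matmul(matmul(transpose(C), coef), C)


# Conventional zig-zag scan: walk anti-diagonals, alternating direction.
ZIGZAG = sorted(((r, c) for r in range(N) for c in range(N)),
                key=lambda rc: (rc[0] + rc[1],
                                rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def encode_block(cur, prev, step=16, inter=True):
    """Return (tokens, recon) for one 8x8 block of 8-bit samples."""
    # Step 2: subtract the previous-frame block unless this is a golden
    # frame (level offsets for intra blocks are omitted in this sketch).
    ref = prev if inter else [[0] * N for _ in range(N)]
    resid = [[cur[r][c] - ref[r][c] for c in range(N)] for r in range(N)]
    q = [[round(v / step) for v in row] for row in dct2(resid)]

    # Step 3: dequantize + IDCT, giving the block the decoder will see.
    rec = idct2([[v * step for v in row] for row in q])
    recon = [[min(255, max(0, round(rec[r][c] + ref[r][c])))
              for c in range(N)] for r in range(N)]

    # Step 4: run-length code the 64 coefficients; the EOB never spans
    # more than this one block.
    tokens, run = [], 0
    for r, c in ZIGZAG:
        if q[r][c] == 0:
            run += 1
        else:
            tokens.append((run, q[r][c]))
            run = 0
    tokens.append("EOB")
    return tokens, recon


if __name__ == "__main__":
    prev = [[100] * N for _ in range(N)]
    cur = [[116] * N for _ in range(N)]
    print(encode_block(prev, prev)[0])  # static background block
    print(encode_block(cur, prev)[0])   # uniformly brighter block
```

With a static background, which is the security-camera case described
earlier, most blocks reduce to a lone end-of-block marker in this model.
That is also where EOB runs spanning several blocks would save the most
bits once they are added later.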