[Theora-dev] Theora Decoding on FPGA

Wed May 31 23:47:08 PDT 2006

Hello people

My name is Felipe. I submitted a proposal to the Google Summer of Code
whose goal is to get an FPGA embedded system decoding Theora streams
in real time.
It was accepted, and the mentor is Ralph Giles.

The proposal can be viewed here:

http://atlas.lsc.ic.unicamp.br/~portavales/wp-content/uploads/2006/05/soc_proposal.txt

There is also a presentation with a better division of the hardware modules:

http://svn.xiph.org/trunk/theora-fpga/doc/hard_theora.pdf

I'm now working on it, and today I wrote a simple implementation of the
IDctSlow procedure as a VHDL module.

This module runs and decodes samples correctly, but it consumes a lot of
FPGA resources (logic cells, multipliers, etc.).
I will optimize this module for area to get better results.

The testbench uses the GHDL tool for simulation and can be downloaded from svn:

http://svn.xiph.org/trunk/theora-fpga/idctslow/

Just run:
$ make
$ make run
$ make compare
to see the testbench run and validate the module's output data.

This IDctSlow implementation was synthesized for the Altera Stratix II
FPGA. The report is below:

------------------------------------
Analysis & Synthesis Status : Successful - Thu Jun  1 02:15:09 2006
Quartus II Version : 5.1 Build 176 10/26/2005 SJ
Revision Name : idctslow
Top-level Entity Name : IDctSlow
Family : Stratix II
Total combinational functions : 13782
Total registers : 3451
Total pins : 54
Total virtual pins : 0
Total memory bits : 2,048
DSP block 9-bit elements : 230
Total PLLs : 0
Total DLLs : 0
------------------------------------

These numbers are not good.
In this first version I'm using the RAM like a plain array, accessing it
at any time without worrying about how it gets implemented.
But this infers flip-flops for each memory position, and big muxes to control them.

So, to solve this problem, I will use a synchronous memory model that
will infer block RAMs (specialized FPGA blocks). These are like small
SRAMs inside the FPGA chip.
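
As a rough sketch of what such a synchronous memory model could look like
(this is not the code in svn; the entity name sync_ram, the 16-bit width and
the 64-word depth are assumptions chosen for illustration), the key point is
that both the write and the read happen inside one clocked process, so the
read data is registered. Whether a given synthesis tool actually maps this
to block RAM depends on the tool and the device:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port synchronous RAM (names and sizes are assumptions).
entity sync_ram is
  generic (
    DATA_WIDTH : natural := 16;  -- assumed coefficient width
    ADDR_WIDTH : natural := 6    -- 2**6 = 64 words, e.g. one 8x8 block
  );
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(ADDR_WIDTH - 1 downto 0);
    din  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    dout : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity sync_ram;

architecture rtl of sync_ram is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      -- Registered read: dout is valid one clock after addr is presented.
      -- On a simultaneous write it returns the word stored before this edge.
      dout <= ram(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```

The cost of this style is one cycle of read latency and only one access per
clock, so the surrounding logic can no longer read many positions at once.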

I think that by using it, the area can drop to 3% to 5% of the
Stratix FPGA slices (estimated by looking at other detailed synthesis reports).

I'm also using a lot of multipliers to do all the calculations in just one
clock cycle (this is easier), but to save multipliers I can spread
the operations over several clock cycles and reuse the same multiplier
across them.
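
To illustrate the idea of reusing one multiplier over several cycles, here
is a minimal sketch that computes a sum of four products in four clock
cycles with a single multiplier. It is only an illustration, not the
IDctSlow data path: the entity name shared_mult, the four operand pairs and
the 16-bit widths are all assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: result = a0*b0 + a1*b1 + a2*b2 + a3*b3,
-- computed with one multiplier over four cycles.
-- The inputs must be held stable from start until done.
entity shared_mult is
  port (
    clk            : in  std_logic;
    start          : in  std_logic;  -- one-cycle pulse to begin
    a0, a1, a2, a3 : in  signed(15 downto 0);
    b0, b1, b2, b3 : in  signed(15 downto 0);
    result         : out signed(33 downto 0);
    done           : out std_logic   -- high for one cycle when result is valid
  );
end entity shared_mult;

architecture rtl of shared_mult is
  signal step         : integer range 0 to 3 := 0;
  signal busy         : std_logic := '0';
  signal acc          : signed(33 downto 0) := (others => '0');
  signal a_sel, b_sel : signed(15 downto 0);
begin
  -- Operand muxes feeding the single multiplier.
  with step select a_sel <= a0 when 0, a1 when 1, a2 when 2, a3 when others;
  with step select b_sel <= b0 when 0, b1 when 1, b2 when 2, b3 when others;

  process (clk)
    variable prod : signed(31 downto 0);
  begin
    if rising_edge(clk) then
      done <= '0';
      if start = '1' and busy = '0' then
        busy <= '1';
        step <= 0;
        acc  <= (others => '0');
      elsif busy = '1' then
        prod := a_sel * b_sel;           -- the only multiplier in the design
        acc  <= acc + resize(prod, 34);  -- accumulate one product per cycle
        if step = 3 then
          busy <= '0';
          done <= '1';
        else
          step <= step + 1;
        end if;
      end if;
    end if;
  end process;

  result <= acc;
end architecture rtl;
```

The trade-off is latency: the sum now takes four cycles plus the start
cycle instead of one, in exchange for using one multiplier instead of four.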

Now I'm working on these optimizations.

Bye
--felipe

-- 
________________________________________
Felipe Portavales <portavales at gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br