[Theora-dev] Theora Decoding on FPGA
Felipe Portavales Goldstein
portavales at gmail.com
Wed May 31 23:47:08 PDT 2006
Hello people
My name is Felipe, and I submitted a proposal to the Google Summer of Code
whose goal is to get an FPGA embedded system decoding Theora streams
in real time.
It was accepted, and the mentor is Ralph Giles.
The proposal can be viewed here:
http://atlas.lsc.ic.unicamp.br/~portavales/wp-content/uploads/2006/05/soc_proposal.txt
There is also a presentation with a better division of the hardware modules:
http://svn.xiph.org/trunk/theora-fpga/doc/hard_theora.pdf
I'm now working on it, and today I wrote a simple implementation of the
IDctSlow procedure as a VHDL module.
This module runs and decodes samples correctly, but it consumes a lot of
FPGA resources (logic cells, multipliers, etc.).
I will optimize this module for area to get better results.
The testbench uses the GHDL tool for simulation and can be downloaded from svn:
http://svn.xiph.org/trunk/theora-fpga/idctslow/
Just run:
$ make
$ make run
$ make compare
to see the testbench working and validating the module data output.
This IDctSlow implementation was synthesized to the Altera Stratix II
FPGA. The report is below:
------------------------------------
Analysis & Synthesis Status : Successful - Thu Jun 1 02:15:09 2006
Quartus II Version : 5.1 Build 176 10/26/2005 SJ
Revision Name : idctslow
Top-level Entity Name : IDctSlow
Family : Stratix II
Total combinational functions : 13782
Total registers : 3451
Total pins : 54
Total virtual pins : 0
Total memory bits : 2,048
DSP block 9-bit elements : 230
Total PLLs : 0
Total DLLs : 0
------------------------------------
These numbers are not good.
In this first version I'm using the RAM like an array, accessing it at
any time without restriction.
As a result, the synthesizer infers flip-flops for each memory position,
plus big muxes to control them.
To solve this problem, I will use a synchronous memory model, which
will infer block RAMs (specialized FPGA blocks). These are like small
SRAMs inside the FPGA chip.
I think that with this change the area can drop to 3% to 5% of the
Stratix FPGA slices (estimated by looking at other detailed synthesis reports).
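For reference, a minimal sketch of the kind of synchronous memory model I mean. This is not the code in svn: the entity name sync_ram, the 64-entry depth and the 16-bit width are assumptions chosen only for illustration. The important part is that both the write and the read happen inside the clocked process, so the read data is registered, which is the pattern synthesis tools generally recognize as a RAM block.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port RAM with synchronous write and registered read.
entity sync_ram is
  generic (
    ADDR_WIDTH : natural := 6;   -- 64 entries (assumed: one 8x8 block)
    DATA_WIDTH : natural := 16   -- sample width (assumed)
  );
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    addr  : in  unsigned(ADDR_WIDTH - 1 downto 0);
    wdata : in  signed(DATA_WIDTH - 1 downto 0);
    rdata : out signed(DATA_WIDTH - 1 downto 0)
  );
end entity sync_ram;

architecture rtl of sync_ram is
  type mem_t is array (0 to 2**ADDR_WIDTH - 1)
    of signed(DATA_WIDTH - 1 downto 0);
  signal mem : mem_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(addr)) <= wdata;
      end if;
      -- Registered read: the data appears one cycle after the address.
      rdata <= mem(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```

The cost is one cycle of read latency and only one access per port per cycle, so the IDCT control logic has to schedule its accesses instead of touching any position at any time.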
I'm also using a lot of multipliers to do all the calculations in just one
clock cycle (this is easier), but to save multipliers I can break
the operations into several clock cycles and reuse the same multiplier
across them.
I'm working on these optimizations now.
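To illustrate the multiplier sharing, here is a rough sketch of a sequential multiply-accumulate unit. It is only an example under assumptions, not the IDctSlow code: the entity name shared_mac, the 16-bit operands and the fixed count of 8 products are all hypothetical. One operand pair is presented per clock cycle, so a sum of 8 products uses a single multiplier over 8 cycles instead of 8 multipliers in one cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical unit: sums 8 products using one multiplier over 8 cycles.
entity shared_mac is
  port (
    clk    : in  std_logic;
    start  : in  std_logic;             -- high together with the first pair
    a, b   : in  signed(15 downto 0);   -- one operand pair per cycle
    done   : out std_logic;             -- high for one cycle when result is valid
    result : out signed(34 downto 0)    -- 32-bit products plus 3 growth bits
  );
end entity shared_mac;

architecture rtl of shared_mac is
  signal acc   : signed(34 downto 0) := (others => '0');
  signal count : unsigned(2 downto 0) := (others => '0');
  signal busy  : std_logic := '0';
begin
  process (clk)
    variable prod : signed(31 downto 0);
  begin
    if rising_edge(clk) then
      done <= '0';
      prod := a * b;                    -- the only multiplier in the unit
      if start = '1' then
        -- First product of a new sum: load the accumulator.
        acc   <= resize(prod, acc'length);
        count <= to_unsigned(1, count'length);
        busy  <= '1';
      elsif busy = '1' then
        acc   <= acc + resize(prod, acc'length);
        count <= count + 1;
        if count = 7 then               -- eighth product is being added
          busy <= '0';
          done <= '1';
        end if;
      end if;
    end if;
  end process;

  result <= acc;
end architecture rtl;
```

The trade-off is throughput: each sum now takes 8 cycles, so the clock has to be fast enough to still decode in real time.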
Bye
--felipe
--
________________________________________
Felipe Portavales <portavales at gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br