<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=us-ascii" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

Aaron Colwell wrote:

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap="">On Mon, Apr 18, 2005 at 12:19:52PM -0400, Steve Kann wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">   Aaron Colwell wrote:

 On Mon, Apr 18, 2005 at 11:30:26AM -0400, Steve Kann wrote:

 Hi, List,

    I've been working on building an implementation of a

 video-conferencing endpoint using Theora, and have been working with the

 draft-kerr-avt-theora-rtp-00 spec.

    I've also read the archives of this list, about some of the proposed

 changes.   I'd like to describe here what I'm planning on doing, and see

 how this might fit into your design.

    Basically, what I'm working with is a project called "iaxclient". 

 iaxclient is a library for a VoIP softphone, which presently supports

 only audio, but I am extending to support video as well.  It uses the

 IAX2 protocol, which is a lightweight VoIP protocol that does _not_ use

 RTP.  However,  the payload format for IAX2 is generally compatible with

 the payload format for RTP.  Asterisk (the open-source PBX) includes

 support for RTP-based VoIP protocols (SIP, H.323, etc), as well as

 non-RTP-based VoIP protocols (IAX2, others).

    There are basically two use cases for users making videoconferencing

 calls using the application:

     1) Point-to-Point calls:  This case seems to be pretty easy to

 handle, and fits into most of the designs I've seen so far:

    2) Multi-party conferences:  This is where some of the designs I've

 seen so far seem to work well, and some of them do not.

    The basic idea for multi-party conferences is that each user

 maintains a virtual connection to a "conference engine"  (this is

 already in place for audio conferences).  The conference engine

 intelligently  receives audio from the clients and sends audio to the

 clients, so each client can hear the audio of any other speaking

 participants.

    The idea for video is that the clients each send their video to the

 conference engine, and the conference engine will send zero or one video

 stream to each participant, in one of two "modes"

       a) Automatic mode:   The conference engine will use some

 heuristics to decide whose video should be shown to the participant --

 Generally, this will be the only participant who is presently speaking 

 (in the case of multiple active speakers, or zero active speakers, there

 will be some secondary criteria).

      b) Request mode:  The client itself will notify the conference

 engine (perhaps out-of-band) and request to see a particular speaker's

 video.

    What this means for the video stream (and this works just fine for

 any other video format, e.g. h.26x) is that we would like to be

 able to change the video source at any time (or, at any keyframe at

 least).

 The whole setup headers business, of course, makes this design

 particularly difficult.   With the present draft-kerr-avt-theora-rtp-00

 format, though, I think I could probably (with a great deal of

 unnecessary overhead), send the setup headers occasionally, and then

 switch at any time.  The clients could then use "header caching", and,

 if they've seen these headers before (matching CRC32), they could use

 their cached copy, and if not, they'd just have to wait a few seconds to

 get them before they could start decoding.

    *Note:  I also suspect, but I haven't researched, that if all the

 clients are using the same version of the theora encoder, and the same

 settings, that their setup headers would likely be the same;  If this is

 the case, then their CRC32's would be the same, and they could start

 decoding at any keyframe..

 With the latest idea I've read, though, it makes this process much more

 inconvenient, because _each_ client would have their own 16bit "chain

 ID", and these chain ID's would be duplicated in the streams sent by

 each client, and therefore the server would need to deeply understand

 and parse each of the streams in order to put them together, etc.

 What did you do in the case where the CRC32 was different from each of the

 clients? This is basically the same scenario isn't it?

   In that case, the "setup header ident" of the payloads would be different,

   and the receiver would know that it needs to wait until it receives setup

   headers matching the CRC of these frames before decoding.  So, the only

   time we could accidentally try to decode frames using an incorrect set of

   setup headers would be in the case of a CRC collision (P ~= 0).

   If the payload format only includes a "chain ID", then the chances of two

   streams having the same "chain ID" when coming from different sources is

   pretty much P==1.  So, the server that's doing the switching would need to

   actually muck with the payload in order to give each sender a different

   Chain ID, and then keep track of which was which, etc.  It makes it

   impossible to just switch senders in the server without the server

   understanding the internals of the codec payload.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Ok now I see.

  </pre>

  <blockquote type="cite">

    <pre wrap=""> I know nothing about IAX2, but I would assume that it has some sort of

 offer/answer model to negotiate codec parameters and such. You could easily

 put the chain ID in this negotiation so that all users in the conference use

 the same codebook.

   Presently, it's pretty simple, where it allows negotiation of the codec,

   but not codec parameters.   In practice, it hasn't been necessary to do

   that.  In the future, it might need to be extended to do so.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

I see. I suppose you could append the codebook hash to the codec name. Instead

of just "Theora" you could have "Theora-2982394872479842". I don't know if

there are any limitations on codebook name.

  </pre>

</blockquote>

See below..<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">   But, consider that users will join and leave the conference at arbitrary

   times, so the conference engine can't know in advance all the codebooks

   that might be used. 

    </pre>

  </blockquote>

  <pre wrap=""><!---->

This isn't necessarily a problem. It would just need to know the codebooks that

it allows to be used in the conference. That's basically what happens with all

other codecs. It's just implicit in their case instead of explicit like it 

would have to be for Theora.

  </pre>

</blockquote>

Right, but at present there's no way to force the encoder to use a

particular codebook.&nbsp; It seems that the encoder currently always

uses the same codebook, fixed at compile time rather than at run time

(I'm not sure about this, though).<br>

<br>

<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <blockquote type="cite">

    <pre wrap="">   Also, as you allude to below, there's no way to seed an encoder with a

   particular codebook (AFAIK).

 I think that my use case isn't all that unusual though; it's somewhat

 like the properties you might have in multicasting, I think.

 1) It would be ideal if the RTP payload format could be made independent

 of SDP.

 It is currently independent of SDP if you use inline codebook transmission. The

 info in the SDP just allows you to know ahead of time what the info and setup

 headers are going to be for each chain. It also provides a mechanism to grab

 the codebooks ahead of time. You also can save bits if you don't want

 to periodically transmit codebooks.

   I don't think so, I thought the latest proposal called for replacing the

   "setup ident" field (32 bits) with a "chain ID" field (16 bits or so),

   where the "chain ID" field would refer to a "chain-info" item in the SDP.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The chain ID really just represents a group of packets that use a particular

ident &amp; codebook pair. You don't have to refer to the SDP. If the client

knows that there is going to be inline header and codebook transmission it

can just wait for the ident and codebook for that chain to arrive. In most

situations those would arrive before any data packets, but in your switching

situation, that might not happen.

  </pre>

  <blockquote type="cite">

    <pre wrap="">   This would mean, that even for a RTP and SDP based conference application

   like mine, if a client joined the conference with a different codebook,

   then all the clients would need to re-fetch the SDP in order to identify

   the codebook that's needed. 

    </pre>

  </blockquote>

  <pre wrap=""><!---->

No it wouldn't. It can just wait for the inline headers to arrive just like it

would in the CRC32 case.

  </pre>

  <blockquote type="cite">

    <pre wrap="">   But, I guess, I haven't seen (maybe it hasn't been written yet), how the

   inline codebook transfer would work.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The way I envision it is that ident and codebook packets would be transmitted

just like data packets. They would have a chain ID associated with them as 

well. This allows you to determine what headers go with which stream.

  </pre>

</blockquote>

OK, if this is the case, switching can happen without needing to

reference SDP, but then the server _still_ needs to understand, and

modify the streams that it sends out to each client.&nbsp; In particular, it

would need to:<br>

<br>

1) Parse all the packets looking for in-line setup headers.<br>

2) Keep a mapping between a "conference" chain ID, and the

&lt;sender&gt;&lt;chain-ID&gt; for all the codebooks it has seen.<br>

3) For each frame that comes in, it would need to re-write the

chain-IDs for each video frame, as well as each setup-header,

translating the sender's chain-ID to a conference chain-ID.<br>

<br>

This seems like it would be quite a lot of work.&nbsp; If, instead, the

CRC-32 of the codebook set were used (i.e., as it is now, except computed

over both the codebooks and the "info" headers), none of this would be

necessary.<br>
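<br>

To make the bookkeeping in steps 2) and 3) concrete, here is a minimal Python sketch of the table the server would have to maintain.&nbsp; It is only an illustration: the class name ChainIdRemapper, its methods, and the assumption that the chain ID is the first two bytes of the payload (big-endian) are all hypothetical, since the chain-ID packet layout has not been written down yet.<br>

<pre wrap="">
```python
class ChainIdRemapper:
    """Hypothetical table mapping (sender, sender chain ID) to a conference chain ID."""

    def __init__(self):
        self._table = {}
        self._next_id = 0

    def conference_id(self, sender, sender_chain_id):
        # Allocate a fresh conference-wide ID the first time a pair is seen.
        key = (sender, sender_chain_id)
        if key not in self._table:
            if self._next_id == 0x10000:
                raise OverflowError("out of 16-bit conference chain IDs")
            self._table[key] = self._next_id
            self._next_id += 1
        return self._table[key]

    def rewrite(self, sender, packet):
        # ASSUMPTION: the chain ID occupies the first two bytes, big-endian.
        # The same rewrite would apply to setup-header packets and video frames.
        old_id = int.from_bytes(packet[:2], "big")
        new_id = self.conference_id(sender, old_id)
        return new_id.to_bytes(2, "big") + packet[2:]
```
</pre>

Two senders that both picked chain ID 0 come out with different conference chain IDs, but only because the server opened and modified every packet, which is exactly the work the CRC-32 ident avoids.<br>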

<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <blockquote type="cite">

    <pre wrap="">   In general, I don't think that my issues are unique to IAX2;  Nor do I

   think that they are things that can't be made to work with whatever format

   you have.   But the questions are 1) how complex will these

   implementations need to be, and 2) how will they perform.

   Most video codecs have the property where a "switch" (which is basically

   what my conferencing application is), can "switch" between streams from

   different sources, at any keyframe, as long as the width, height, and

   framerates are the same (and in some cases even if they're not), without

   needing to negotiate with the receiver at all.  This can make a switch

   fairly simple;  It only needs to know, for each frame, whether it's a

   keyframe or not, and then treat the whole thing as opaque data.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

In these cases the switch needs to know datatype specific info about the codec.

I'm assuming it cracks open the payload to determine the frame size and frame

rate. All Theora does is add codebook to the switch criteria. How are you

doing switching right now for Theora?</pre>

</blockquote>

I'm not doing this switching at all yet.&nbsp; At the moment, app_conference

(the switch) handles audio only, not video, and the released version of

my client also supports audio only.&nbsp; In my development code, I have

video capture, encoding, packetization, depacketization, decoding and

display working [with plenty of shortcuts still present, like implicitly

supporting only the YUV420P format, etc].&nbsp; <br>

<br>

The "switch" is a module that goes into asterisk.&nbsp; Asterisk does know a

bit about some audio codecs (it includes translators

[encoders/decoders] for some), but for other audio formats, and for

video data, it just treats them as opaque, and will pass them through

if both sides agree that they support them.<br>
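<br>

As a rough sketch of the kind of switch discussed in this thread (payload treated as opaque, source changed only at a keyframe), something like the following Python would do.&nbsp; The class VideoSwitch and its methods are hypothetical, and the is_keyframe flag is assumed to be supplied alongside each frame by the transport; nothing here is taken from app_conference or Asterisk.<br>

<pre wrap="">
```python
class VideoSwitch:
    """Hypothetical per-viewer switch that forwards one source's frames unmodified."""

    def __init__(self):
        self.current = None   # source currently being forwarded
        self.pending = None   # source requested, waiting for its next keyframe

    def select(self, source):
        # Request a new source; the change takes effect at that source's next keyframe.
        self.pending = None if source == self.current else source

    def on_frame(self, source, is_keyframe, payload):
        # Returns the payload to forward to the viewer, or None to drop the frame.
        if source == self.pending and is_keyframe:
            self.current = source
            self.pending = None
        if source == self.current:
            return payload    # opaque bytes, never inspected or rewritten
        return None
```
</pre>

The only per-frame knowledge the switch needs is the keyframe flag; anything that forces it to look inside the payload (such as rewriting chain IDs) breaks this model.<br>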

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap=""> The server would still need to keep state

for each client since the frame size &amp; frame rate is not in the frame data.

  </pre>

</blockquote>

I think that for other video codecs, it is [frame rate and size] in the

frame data (I'm not sure about this, though).&nbsp; I'm not actually sure

that the frame rate is needed at all, though, since every frame carries

a timestamp synchronized to the audio.&nbsp; <br>

<br>

Apparently (although I haven't played with this stuff myself), asterisk

is able to (a) connect party-to-party calls, and (b) store and play

back "video voice mail", for video codecs H.261, H.263, H.263+, without

knowing anything at all about the video stream other than the

timestamps on individual packets, and the format for the stream.<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap="">Are you not enforcing that criteria for Theora? Is this something that is

determined at the time the client connects to the server?

  </pre>

</blockquote>

<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">   Theora has already moved away from this goal a bunch with the whole

   codebook thing, but it would be nice to at least minimize the

   inconvenience of dealing with the codebooks as much as possible.

     [the present theora rtp format exhibits this property; if you use

 periodic inline

        setup header transmission]

 2) It would be ideal if the RTP payload format continued to allow inline

 setup header

    transmission.

 To my knowledge we weren't going to get rid of inline transmission. I had

 always intended to keep it.

   Would the format be the same as it is now (+- the setup header ident

   field)?  Would there be some way outside of SDP to indicate which

   codebooks belonged to which "chain id?"

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The ident and codebook packets would have a chain ID in them.

  </pre>

  <blockquote type="cite">

    <pre wrap="">  

 It would be most convenient, if there were a "fixed setup" mode for

 theora, where you could ask the theora-encoder to use fixed setup header

 set, and have it act like other codecs in this respect.  I understand

 the flexibility that the setup headers give you in encoder design, but

 it would be nice if there were a way to configure it otherwise..

 If the encoder allowed you to specify a codebook on initialization, you could

 effectively do this. Basically your app could just always specify the same

 codebook to the encoder and then send the hash to the other participants.

 They would then verify that your hash matches the hash of their codebook and

 then you're done. This is basically the codebook cache hit scenario. If you get

 a miss then you just make the connection to the conference fail.

   Right. Something like this would allow for the most bit-efficient method,

   because we could (rarely, if ever) retransmit codebooks if we can control

   all the clients, and force them to use the same codebook.

   One of the other things I'll need to do eventually is to "record" these

   conferences, into some container, and make that format

   forward-compatible;  If all the clients use the same codebooks, that also

   makes things much simpler, because we could write this all out as one

   "chain".

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Thanks for bringing this up. It's nice to have input from a different use case.

I'm fine with making changes to the current thinking, but I just want to make

sure that we have a good understanding of the problems.

  </pre>

</blockquote>

Thanks for discussing it with me as well.&nbsp; I'm not necessarily set in

my thinking about things, and the idea that I had in mind might not be

the best.&nbsp; Basically, I think that the whole setup-header business is

going to make integrating Theora into programs a lot more complicated

than dropping in another codec that doesn't require all this extra

stuff to happen.&nbsp; <br>

<br>

In one particular use case (off-line encoding to .ogg files), all this

isn't much of a headache.&nbsp; But for use cases like mine, and perhaps for

many others, it is quite a headache.&nbsp; For example, if I had all this

working with h.263 (or h.264), and I wanted to switch to theora, it

would be quite a job, because compared to the design of most video

codecs, theora is a square peg where you might have a round hole.<br>

<br>

Of course, the upside is, patent licensing headaches are probably

bigger headaches than codebook transmission stuff :)<br>

<br>

<blockquote cite="mid20050418170226.GC18963@real.com" type="cite">

  <pre wrap="">The whole reason that we went from the CRC32 -&gt; chain ID thinking was because

there was concern about collisions in this value. Unique chain IDs fix that

problem, but cause a problem for you because you want a system that doesn't 

have to worry about the chainID -&gt; codebook mapping.

What isn't clear to me is whether your problem space actually needs to allow

arbitrary codebooks. It seems to me that allowing this causes more headaches

than it's worth since you could potentially waste a ton of bits on codebook

transmission if every client in the conference uses a slightly different 

codebook.

  </pre>

</blockquote>

Absolutely, it would be much easier to do, if I could just use the

theora implementation with fixed codebooks, and not have to worry about

any of this stuff.&nbsp;&nbsp; If VP3 codebooks were an option, that would be

excellent.&nbsp; <br>

<br>

I suspect that if all the clients are using the same theora

implementation, and the same settings (framerate, frame size, etc),

then, even as theora improves, they'll end up with the same codebooks.&nbsp;

In that case, with the CRC32 method, they would be able to avoid

getting codebooks altogether (the codebooks they'd generate themselves

would have the same CRC32 as the codebook they get from the very first

packet, and they'd be able to feed themselves).<br>
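<br>

Here is a minimal Python sketch of that header-caching idea, under stated assumptions.&nbsp; zlib.crc32 stands in for whatever CRC32 the payload format ends up specifying, the exact bytes covered by the ident are not settled in this thread, and the names setup_ident and HeaderCache are hypothetical.<br>

<pre wrap="">
```python
import zlib


def setup_ident(setup_headers):
    # ASSUMPTION: the ident is a plain CRC32 over the raw setup header bytes.
    return zlib.crc32(setup_headers)


class HeaderCache:
    """Hypothetical cache of setup headers keyed by their CRC32 ident."""

    def __init__(self, own_setup_headers=None):
        self._by_ident = {}
        # Seeding with the headers our own encoder produces lets us "feed
        # ourselves" when every client generates identical headers.
        if own_setup_headers is not None:
            self.store(own_setup_headers)

    def store(self, setup_headers):
        ident = setup_ident(setup_headers)
        self._by_ident[ident] = setup_headers
        return ident

    def lookup(self, ident):
        # None means: wait for inline setup headers with this ident.
        return self._by_ident.get(ident)
```
</pre>

A receiver would call lookup() with the ident carried in each frame: on a hit it can start decoding at the next keyframe, and on a miss it waits for the inline headers and then calls store().<br>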

<br>

-SteveK<br>

<br>

</body>

</html>