[theora-dev] First steps towards a simple text stream format.

Philip Jägenstedt philipj at telia.com
Tue Aug 12 10:49:54 PDT 2003

Hello, good evening!

On Tue, 12 Aug 2003 02:32:49 -0400
Arc <arc at indymedia.org> wrote:

> Hey gang,
> I have to put my .02 cents here somewhere, this conversation is
> frustrating me as far as seeing my needs in development not be met.

Great, more input is always a Good Thing(tm).

> First, I have a problem with XML being put into Ogg.  Ogg is not a
> format that you can simply open with a text editor, thus the plaintext
> aspect of XML is completely lost, but more importantly it seems like it
> will only add to the requirements of the players while making it more
> difficult to get all the functionality we need into it.

Well, I believe that a player using annodex would rely only on
libannodex, and a player using my theoretical format would likewise rely
on only one library. However, the extra baggage that CMML and the XML
parser bring might be a disadvantage if one ever hopes to have
hardware support for Ogg Theora video with text subtitles.

> Second, I strongly feel this needs to be in Ogg itself, with Unicode as
> a standard.  Embedded font or not, with or without colors, having it be
> multilingual (in the same logical bitstream, otherwise you multiply the
> Ogg overhead by the number of languages) and support non-romanic
> languages is important for the uses we need out of Ogg Theora.

I also think it has to be UTF-8, and preferably UTF-8 should be the
only allowed encoding. annodex allows any encoding that XML does, I
believe, and obviously some characters need to be escaped, so that <
becomes &lt; and so on.
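
To illustrate the escaping point, here is a minimal sketch in Python
(chosen only for illustration). It uses the standard library's
xml.sax.saxutils.escape; the helper name encode_subtitle_text is made up
for this example and is not part of annodex or any Xiph tool.

```python
from xml.sax.saxutils import escape


def encode_subtitle_text(text):
    """Escape XML metacharacters (&, <, >), then encode as UTF-8.

    Hypothetical helper: it only shows what an XML-based carrier has to
    do to each subtitle line before it is stored.
    """
    return escape(text).encode("utf-8")


print(encode_subtitle_text("if a < b & b > c"))
# prints b'if a &lt; b &amp; b &gt; c'
```

A format that carries raw UTF-8 with no markup would skip the escape
step entirely and store text.encode("utf-8") as is.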

As for storing all subtitle translations in the same stream, I totally
agree. I didn't figure this out until a few days ago, but now it is my
fullest intention to have them all in the same logical stream.

At any rate, both annodex and the format I've been drafting should meet
the requirements you mention.

> Third, it needs to be text, not graphic (ie MNG).  It's far easier to
> have a translator team go through a text transcript and piece by piece
> translate it than to have to export graphics that they need to read and
> make new graphics from.  Also, the text can be used for a transcript
> search engine, whereas MNG subtitles would have to be OCR'ed which makes
> it much less realistic.

Yes, there certainly will not be any images in my _simple_ text stream
format, nor are there any in annodex.

> > Then you use the related Theora video file and run it through "anxenc" 
> > and you've got the synchronised file. 
> That sounds like a good XML "source" format.  anxenc could do what I
> described above, compiling the XML into a binary-unicode based Ogg
> bitstream that's synced with the Vorbis and Theora pages.

For the record, I was thinking that if (yes, if) I create a new format,
it'd have a simple encoder, and then I'd add support for it to oggmerge
so that it is simply one among the other Xiph formats.

> We (Indymedia) already have a large group of translation volunteers for
> more than 25 languages (everything from Chineese to Bulgarian).  What I
> plan to facilitate the job of translating all these video programs to
> all these languages with is a web-based translation tool, something that
> will open a document, quickly extract the source language's text, and
> allow those subtitles to be translated line for line to a new language.
> This tool would then edit the source file by changing the Ogg pages for
> the bitstream with the additional data included.

The way I was thinking my format could work is that the header packet
would enumerate the available streams. Each has a language tag in the
form of RFC 3066 and a description string, so that could for example be
(en-US, "American English"), (i-klingon, "Klingon") or (sv, "Swedish").
The description string is in UTF-8, so if you like you can write the
name of the language in the native language, like whatever symbols mean
"Chinese" in Chinese (although the description string can be something
other than the language name).
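
A rough sketch of such a header packet, in Python purely for
illustration. Nothing here is a defined format: the MAGIC signature, the
one-byte stream count, the length-prefixed fields, the little-endian
byte order, and the names pack_header and unpack_header are all
assumptions made up for this example.

```python
import struct

MAGIC = b"txtstrm0"  # hypothetical signature, not taken from any spec


def pack_header(streams):
    """Serialize [(language_tag, description), ...] into one header packet."""
    out = [MAGIC, struct.pack("<B", len(streams))]
    for tag, description in streams:
        tag_bytes = tag.encode("ascii")           # RFC 3066 tags are ASCII
        desc_bytes = description.encode("utf-8")  # free-form UTF-8 text
        out.append(struct.pack("<B", len(tag_bytes)) + tag_bytes)
        out.append(struct.pack("<H", len(desc_bytes)) + desc_bytes)
    return b"".join(out)


def unpack_header(packet):
    """Parse a packet produced by pack_header back into a list of tuples."""
    if packet[:len(MAGIC)] != MAGIC:
        raise ValueError("not a text stream header")
    pos = len(MAGIC)
    (count,) = struct.unpack_from("<B", packet, pos)
    pos += 1
    streams = []
    for _ in range(count):
        (tag_len,) = struct.unpack_from("<B", packet, pos)
        pos += 1
        tag = packet[pos:pos + tag_len].decode("ascii")
        pos += tag_len
        (desc_len,) = struct.unpack_from("<H", packet, pos)
        pos += 2
        description = packet[pos:pos + desc_len].decode("utf-8")
        pos += desc_len
        streams.append((tag, description))
    return streams


streams = [
    ("en-US", "American English"),
    ("i-klingon", "Klingon"),
    ("sv", "Svenska"),
    ("zh", "中文"),
]
assert unpack_header(pack_header(streams)) == streams
```

A player would read this packet once, show the description strings in a
menu, and then pick out only the text belonging to the chosen entry.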

> It's sounding like alot of these ideas are close to what I above
> described, but it feels like the discussion is moving away from all of
> this in some aspects.

Thanks for this feedback. I'm unsure if I should go ahead with my own
format, or make use of annodex -- I'm still tending towards my own since
they are different things. My format would be a simple text stream
format that is nothing other than a text stream format. Annodex is a lot
of other cool things, and it could be used for subtitles, but I would
say that its real strength lies in structuring an Ogg file for temporal
linking and searchability, not subtitling.

If there is anyone else who feels strongly one way or the other about
annodex vs. a new format, please speak up so that I may make an informed
decision that isn't just going to lead to a lot of confusion in the end.

// Philip Jägenstedt