[theora-dev] First steps towards a simple text stream format.

Silvia.Pfeiffer at csiro.au
Sat Aug 23 20:01:00 PDT 2003

Hi Arc, all,

Arc wrote:
> First, I have a problem with XML being put into Ogg.  Ogg is not a
> format that you can simply open with a text editor, thus the plaintext
> aspect of XML is completely lost,

No, not quite. XML is nice as an authoring language, and a "grep"-like 
search on an annodexed Ogg file is actually quite attractive. :)

> but more importantly it seems like it
> will only add to the requirements of the players while making it more
> difficult to get all the functionality we need into it.

With annotation streams there is a need to extend functionality of 
players anyway. Whatever way you format the annotations, some parsing 
will have to be done. XML seems a simple solution to parsing text.
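
To illustrate how little parsing code an XML-based annotation needs, here is a 
minimal Python sketch using only the standard library. The element and 
attribute names (annotations, clip, start, lang, desc) are assumptions made up 
for this example and are not the actual CMML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation markup; element and attribute names are
# illustrative assumptions, not the actual CMML schema.
SAMPLE = """<annotations>
  <clip start="0.0" lang="en"><desc>Opening titles</desc></clip>
  <clip start="12.5" lang="en"><desc>Interview begins</desc></clip>
</annotations>"""


def parse_clips(xml_text):
    """Return a list of (start_seconds, language, text) tuples."""
    root = ET.fromstring(xml_text)
    clips = []
    for clip in root.iter("clip"):
        start = float(clip.get("start", "0"))
        lang = clip.get("lang", "")
        text = clip.findtext("desc", default="").strip()
        clips.append((start, lang, text))
    return clips


if __name__ == "__main__":
    for start, lang, text in parse_clips(SAMPLE):
        print(f"{start:7.2f}s [{lang}] {text}")
```

A player would still have to decide when and how to render each entry; the 
sketch only covers the text-parsing step.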

I guess creating a time-synchronous annotation stream for Theora is an 
immediate problem, and that's fair enough. OggWrit seems to fulfill that 
need (thanks for pointing me to it). We looked a bit beyond that and 
therefore our stuff solves a somewhat broader problem. XML is part of 
that problem because it is the standard way to create mark-up for files 
on the Web. URIs are part of that problem as they are the standard way 
to address content (such as media clips) on the Web. Ogg is part of the 
problem as it is an excellent bitstream format for multiple logical but 
time-synchronised bitstreams.

> Second, I strongly feel this needs to be in Ogg itself, with Unicode as
> a standard.  Embedded font or not, with or without colors, having it be
> multilingual (in the same logical bitstream, otherwise you multiply the
> Ogg overhead by the number of languages) and support non-romanic
> languages is important for the uses we need out of Ogg Theora.

As for addressing these requirements with Annodex, I believe CMML with a 
style sheet covers them. UTF-8 is the default encoding for XML anyway. Also, 
we have tried to make sure CMML supports i18n (please let us know if 
something is missing!).

> Third, it needs to be text, not graphic (ie MNG).  It's far easier to
> have a translator team go through a text transcript and piece by piece
> translate it than to have to export graphics that they need to read and
> make new graphics from.  Also, the text can be used for a transcript
> search engine, whereas MNG subtitles would have to be OCR'ed which makes
> it much less realistic.

Hmm, that was also what we thought was necessary.

> On Tue, Aug 12, 2003 at 12:45:19PM +1000, Silvia.Pfeiffer at csiro.au wrote:
>  >
>  > another aim for annodex was to make things as simple as possible for
>  > users and application programmers. XML solves the problems of language
>  > handling and character sets (Unicode is default). So, you won't have to
>  > worry about these any more. Annodex solves the problem of synchronising
>  > text with media bitstreams using ogg so you won't have to worry about
>  > this any more.  Annodex players are rare yet but we're working on it :)
>
> I think having some tools export XML for editing would be cool, and I
> think graphical editors would be common too.  You could have an XML source
> file and a command-line "encoder" that plugs the data into an existing
> Ogg file, or have an ogg encoder grab the XML source file and "encode"
> it at the same time it encodes the Vorbis and Theora bitstreams.

Yes, that was the idea.

>  Other
> subtitle editors could edit this data as-is without ever needing to
> touch XML.

You still have to keep the structure of the data in some way and whether 
you offer that in an XML format or some other way doesn't really matter. 
OggWrit has a struct for that, too, I guess. With Annodex, you can also 
offer CMML data in a struct to a subtitle editor program without ever 
needing to touch XML. So, the use of XML is not really a disadvantage there.
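
As a rough sketch of that point in Python: an editor can work on plain 
records while the XML stays behind two helper functions. The Subtitle type, 
the from_xml/to_xml helpers, and the element names are all hypothetical and 
chosen for this example; they are not the Annodex or OggWrit API.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Subtitle:
    """Plain record an editor can work with; no XML knowledge needed."""
    start: float  # seconds from the start of the stream
    lang: str     # language tag, e.g. "en"
    text: str     # subtitle text


def from_xml(xml_text):
    """Load Subtitle records from the (hypothetical) annotation markup."""
    root = ET.fromstring(xml_text)
    return [
        Subtitle(
            start=float(clip.get("start", "0")),
            lang=clip.get("lang", ""),
            text=clip.findtext("desc", default=""),
        )
        for clip in root.iter("clip")
    ]


def to_xml(subtitles):
    """Serialise Subtitle records back to the same markup."""
    root = ET.Element("annotations")
    for sub in subtitles:
        clip = ET.SubElement(root, "clip", start=str(sub.start), lang=sub.lang)
        ET.SubElement(clip, "desc").text = sub.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    subs = [Subtitle(0.0, "en", "Hello"), Subtitle(2.5, "de", "Hallo")]
    subs[0].text = "Hi"            # the "editor" touches only the record
    print(to_xml(subs))            # XML appears only at the boundary
```

The editor code in the last lines never sees an XML element; swapping the two 
helpers for a different on-disk format would leave it unchanged.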

> We (Indymedia) already have a large group of translation volunteers for
> more than 25 languages (everything from Chinese to Bulgarian).  What I
> plan to facilitate the job of translating all these video programs to
> all these languages with is a web-based translation tool, something that
> will open a document, quickly extract the source language's text, and
> allow those subtitles to be translated line for line to a new language.
> This tool would then edit the source file by changing the Ogg pages for
> the bitstream with the additional data included.

Cool! Whether it is built on Annodex or on an upcoming implementation of 
libOggWrit, it'll be great to see the tool that you're talking about.


