[vorbis] xml transcript stream proposal

Ralph Giles giles at snow.ashlu.bc.ca
Mon Aug 28 02:53:50 PDT 2000

Ok, here's what I've been thinking of in terms of the scrolling lyrics
format for Ogg. It's an xml stream, and it matches the
head-body-[body-]-tail structure I suggested for packetization.

I'm happy with the lyrics aspect, and it maps cleanly onto the existing
formats. I also think it will handle the talk transcript, subtitle, and
karaoke requirements well. I call it a 'transcript' stream as the most
general of these terms.

We treat synchronized and static transcripts on an equal footing. The
timestamps themselves are marked by optional attributes on just about any
xml tag, so they can easily be added to a static transcript, or ignored
for static display of a synchronized one. The difference probably will be
flagged in the stream description metadata, though.

I think it wise to partition the timestamp attributes into a separate xml
namespace, to facilitate their use in other xml streams.
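
To make the namespace idea concrete, here is a minimal sketch of how a
reader might pick out namespaced timestamp attributes. The namespace URI,
the "t" prefix, and the attribute name "start" are placeholders of mine,
not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the proposal does not define one.
TIME_NS = "http://example.org/ns/transcript-time"

# Hypothetical sample: the "t:start" attribute name is an assumption.
SAMPLE = f"""
<transcript xmlns:t="{TIME_NS}">
  <line t:start="0:12">First line of the song</line>
  <line>An unstamped line</line>
</transcript>
"""

def stamped_elements(xml_text):
    """Return (tag, start) pairs for every element carrying a start stamp."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree spells namespaced names as "{uri}local".
    key = f"{{{TIME_NS}}}start"
    return [(el.tag, el.get(key)) for el in root.iter() if el.get(key) is not None]

print(stamped_elements(SAMPLE))
```

Because the attributes live in their own namespace, the same lookup would
work on any other xml stream that borrows them, whatever its element names.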

What I'm not happy with is the extension of this to scripts and
screenplays. I think it would be really cool if we could move
transparently from a screenplay or script to subtitles, just by changing a
stylesheet, but their structure is more complicated and it will mean a
number of additional tags. Not least is the problem of the various (yet
allegedly rigidly specified) formatting and structure conventions for
screenplays, stage plays, and radio scripts.

At this point, you might as well look at the attached examples. Below is a
summary of the tags, but if you're familiar with xml, the examples should
be enough.

After the xml declaration comes a <transcript> tag containing the entire
document. Stream type identification would happen here if it isn't
obvious from the declaration line.

Next is an <info> element with basic metadata, similar in spirit to the
vorbis comment header. Probably with the same tags, too, plus things like
"transcriber" and "translator".

The close of the <info> element ends the "head" part of the document. An
ogg packet boundary will probably occur here, defining the tree depth for
any further splits.

The coarse structure of the transcript is given by a nestable <section>
element. We use the general 'class' attribute to distinguish the various
levels and types of grouping: verse from chorus, scene from act.
Typically, class is used only as a formatting hook for stylesheets, but
this is imposing some semantic content on the value. A more traditional
SGML approach would be to have different elements for each level of
grouping: <act><scene>...</scene>...</act>, <chorus> and <bridge>. 

I'm essentially trading for simplicity here, on the assumption that most
files will be very flat anyway, and dumb parsers will mostly ignore the
hierarchical structure. The <section> tags are for pretty-printing and
machinability. The former is entirely covered by the class attribute, but
I'm not sure of the propriety for the latter.

The first thing inside the <section> element is an optional header, with
things like who's speaking the lines, or the location of the shot if it's
a scene. This is a bit ugly, in that we can't say what goes in the scene
heading of a screenplay versus the chorus of a song. Hence the traditional

Inside the innermost section come a series of <line> elements, each
marking a line of the song, or a line of dialog. This is the most specific
of the structural elements and cannot be nested. At the same level we'll
probably want things like <action> to describe the blocking and maybe
something like <sfx>, though that could be handled as another actor.

Inside the line-level tags, we have inline markup: <emph> for emphasis,
something like <character> (can you think of a better name?) for marking
names and props, which are often specially formatted for clarity, and a
<peren> for parenthetical direction. That sort of thing. All of these
could have a class attribute, of course.

We also have a <span> tag that exists pretty much exclusively for adding
attributes to bits of text. This can be used for additional formatting
keyed to a stylesheet, or to put syllable-by-syllable timestamps on
karaoke tracks.

We also generally allow an 'id' attribute on any element for
cross-referencing and unique identification. Allowing XLink/XPointer would
also be a reasonable idea, though I wouldn't require that the parser
support following the links.


I suggest three timestamp attributes: a start time, and optionally either
a stop time or a duration. If there's only a start time, the player can
just display it until the next stamped element comes up.
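
A rough sketch of how a player might turn those three attributes into a
display interval. It assumes the values have already been converted to
seconds, and that giving both a stop time and a duration is an error; the
function and parameter names are mine, not part of the proposal.

```python
def display_interval(start, stop=None, duration=None, next_start=None):
    """Return (begin, end) in seconds.

    end is None when nothing bounds the display: no stop, no duration,
    and no later stamped element.
    """
    if stop is not None and duration is not None:
        # Assumption: the two are alternatives, so both at once is rejected.
        raise ValueError("give either a stop time or a duration, not both")
    if stop is not None:
        return (start, stop)
    if duration is not None:
        return (start, start + duration)
    # Only a start time: show until the next stamped element comes up.
    return (start, next_start)

print(display_interval(10.0, stop=14.5))
print(display_interval(10.0, duration=3.0))
print(display_interval(10.0, next_start=18.0))
```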

The timestamps can be nested where their associated element tags can be.
In these cases the higher-level stamps should encompass the lower ones,
and the lower-level ones take precedence in display as more specific.
Exactly how this is handled is up to the player, but for example, a
karaoke application might use <line> level timestamps to display a line at
a time, but highlight each syllable as it goes by according to the span

For the value format I'd like to allow just a few options. Normal time
relative to the start of the track, with precision given by
extension to decimal seconds or smpte subframes: "2:32" or "0:34:63:14".
A raw integer, defaulting to the elapsed time in milliseconds, possibly 
in units of an arbitrary 'timebase' specified as an attribute of the
opening transcript tag. Finally, I want to include absolute ISO
timestamps for marking live events. "1999-10-15T17:54:19.78Z" or

That's about it. Yes, I'm invoking stylesheets and lots of other
complicated machinery. I want to have that power for future flexibility
and high-quality output; that's the whole point of xml. But I maintain we
can still write a small dedicated parser that just throws up the <line>
tags as they come. That's a pretty broad spectrum. Comments welcome!
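
For the "small dedicated parser" end of that spectrum, something like the
following sketch would do. The sample document is made up for
illustration (it is not one of the attached examples), and the plain
"start" attribute and the <title> tag inside <info> are my placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; attribute and metadata names are assumptions.
SAMPLE = """<transcript>
  <info><title>Example Song</title></info>
  <section class="verse">
    <line start="0:05">Some <emph>words</emph> to sing</line>
    <line start="0:09"><span start="0:09">Syl</span><span start="0:10">la</span>bles</line>
  </section>
</transcript>"""

def iter_lines(xml_text):
    """Yield (start, text) for each <line>, ignoring sections and inline markup."""
    root = ET.fromstring(xml_text)
    for line in root.iter("line"):
        # itertext() flattens <emph>, <span>, etc. down to their text.
        text = "".join(line.itertext())
        yield line.get("start"), " ".join(text.split())

for start, text in iter_lines(SAMPLE):
    print(start, text)
```

This ignores the <section> hierarchy, the class attributes, and the
span-level stamps entirely, which is all a dumb display needs.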


giles at ashlu.bc.ca

-------------- next part --------------
A non-text attachment was scrubbed...
Name: example-2.xml
Type: application/octet-stream
Size: 2925 bytes
Desc: not available
Url : http://lists.xiph.org/pipermail/vorbis/attachments/20000828/75a1ffa7/example-2-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example-3.xml
Type: application/octet-stream
Size: 3910 bytes
Desc: not available
Url : http://lists.xiph.org/pipermail/vorbis/attachments/20000828/75a1ffa7/example-3-0001.obj
