[Icecast] Use case question

Thu Sep 24 09:40:56 UTC 2015

Good morning,

On Wed, 2015-09-23 at 15:26 -0500, Orion Jensen wrote:
> Hi Guys,
> 
> 
> Can anyone provide more details about where the lag would occur if I
> did try to pursue a push to talk scenario with Icecast.  I'm not sure
> exactly how it works now so I'll outline the two potential discussion
> flows I have in mind.
> 
> 
> Perhaps someone could elaborate on between which steps the 5-10
> seconds of lag would come into play.
> 
> 
> Mumble does look like it might be a better fit for what I'm trying to
> do, but I still am trying to get a rough understanding of the
> limitations of the different options.

As this and similar questions are asked often, let's take a little
bit of time to look into the differences and see why there is no
'universal' solution.

First have a look at a classical Icecast2 Setup:

[Signal Source] -*> [OS] -syscalls> [Encoder] -*> [libshout] -HTTP/TCP>
[Icecast2] -HTTP/TCP> [HTTP Engine] -*> [Player] -syscalls> [OS] -*>
[Signal Sink]

Connections marked '-*>' depend on the actual situation.

Here a [Signal Source] could be anything from an ADC (e.g. a sound/video
card) to a flat file. The signal source and the interface may have a
delay as well as jitter. E.g. a soundcard may sample a given time span
and provide it as a block of data (RAM) to the OS. A file on a hard
disk has this too: as the disc is spinning at a finite speed it needs
time to read a block, and jitter may e.g. be caused by seeking
(fragmentation, switching between files, or just multiple access in
multi-tasking systems). Depending on the Setup those times may be
below one ms or well over a few hundred seconds.
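
To put a number on the soundcard example above, here is a minimal
sketch. The block size and sample rate are values I picked for
illustration, not properties of any specific card:

```python
def block_delay_ms(samples_per_block: int, sample_rate_hz: int) -> float:
    """Time span covered by one block of samples, in milliseconds."""
    return 1000.0 * samples_per_block / sample_rate_hz

# The first sample of a block cannot be handed to the OS before the last
# sample of that block has been captured, so this is a lower bound.
# 1024 samples at 48 kHz are assumed example values.
print(f"{block_delay_ms(1024, 48000):.1f} ms")  # prints "21.3 ms"
```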

The next part is the [OS]. The OS will also introduce delay and jitter.
E.g. it may change the buffers (such as by copying from IO space to
Userland space) and it may also be busy doing other tasks. Depending on
the OS and Hardware as well as configuration this may introduce delay
and jitter in the few-ms range.

The next block is the [Encoder]. It typically reads the data from the OS
via syscalls[0A]. For this little explanation we consider them to take
no time. The encoder introduces delay and jitter for two reasons:
0) It needs CPU time to actually do the work. The time needed
   depends on the data.
1) The codec used adds its own delay.
There are many codecs. Some target high quality and good compression and
therefore tend to have higher delay. Others target lower signal
quality and low delay. This is one of the main sources of delay and
jitter in the system. Codec specific encoding delay may be between a few
ms and many seconds. There is a nice chart on the delay of codecs
here[1].

Besides the codec there is another step within the encoder that
'amplifies' the codec specific delays and jitter: Muxing into the
container. The container is used to give meta information (such as:
which codec is used? What options are used? What time into the stream is
it?) and to add protection (detection of transmission/storage errors).
Often several 'packets'[2A] from the encoder will be combined into
'frames'[2A] of the container before such a 'frame' is passed to the
next step. Thus the delay after this step may be a multiple of the
original delay of the encoder.
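
The 'amplification' by the muxer can be sketched as a simple
multiplication. The packet length and the number of packets per frame
below are made-up example values; real codecs and containers differ:

```python
def muxed_delay_ms(packet_ms: float, packets_per_frame: int) -> float:
    """Minimum time the muxer waits until a container frame is complete."""
    return packet_ms * packets_per_frame

# Assumed example: 20 ms codec packets, 50 packets per container frame.
# The frame can only be passed on once all 50 packets exist.
print(muxed_delay_ms(20.0, 50))  # prints 1000.0
```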

[libshout] is the reference implementation to send data to an Icecast2
server[3A]. It handles the protocol (HTTP) and also takes care of
timing. libshout ensures that the stream is sent to the server
at the right speed. This helps to eliminate jitter from the
encoder/muxer and also allows streaming of files. To do so libshout has
a little buffer. This buffer is normally very small and should have
nearly no effect on the total delay if used correctly.
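
The idea of sending 'at the right speed' can be sketched like this.
This is not libshout's code or API, just a hypothetical helper that
assumes a constant bitrate to keep the example simple:

```python
def send_schedule(chunk_sizes, bitrate_bps):
    """Offsets in seconds (from stream start) at which each chunk is due.

    chunk_sizes is a list of chunk sizes in bytes. A sender would sleep
    until the offset of a chunk is reached before writing it out.
    """
    offsets = []
    sent_bits = 0
    for size in chunk_sizes:
        offsets.append(sent_bits / bitrate_bps)
        sent_bits += size * 8
    return offsets

# Assumed example: three chunks of 4000 bytes on a 128 kbit/s stream.
print(send_schedule([4000, 4000, 4000], 128000))  # prints [0.0, 0.25, 0.5]
```

Even if a file could be read from disk in no time, a sender following
such a schedule would still feed the server in real time.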

The next connection is a problem: the HTTP stream.
HTTP is a standard protocol as e.g. spoken by every web browser. It is
well developed and works fine for the tasks Icecast is made for. It adds
a little overhead at the start of the stream but that's about it.
However HTTP uses TCP as transport. TCP is a protocol that allows
continuous, ordered and protected streams of data. 'Raw' network is very
unreliable: packets may get lost or re-ordered or even duplicated. TCP
includes ways to eliminate this and allows us to use it as a smooth
stream of bytes without taking care of what is below. It also does this
very well[4A]. The problem is that to do this it needs... another buffer
and it may even go as far as re-requesting packets that have been lost.
So the performance of TCP depends heavily on the physical link. It may
add a delay from the ms-range (e.g. local connections) to several
seconds (just try a machine on the other side of a uni campus) or even
hundreds of seconds (try a machine in a country with a bad connection to
the worldwide backbone like St. Helena[6]).
This is another main source of latency. And it happens twice: to and
from Icecast2. See below.
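
Why the ordering guarantee costs time can be shown with a small model.
This is a simplified sketch, not how a real TCP stack is written: it
only assumes that a packet can not be handed to the application before
all packets in front of it have arrived. The arrival times are invented:

```python
def in_order_delivery(arrival_times):
    """When each packet can be handed to the application.

    arrival_times[i] is the time packet i finally arrives, including any
    retransmission. Packet i is held back until packets 0..i are there.
    """
    delivered = []
    latest = 0.0
    for t in arrival_times:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered

# Assumed example: packet 2 was lost and its retransmit arrives at 0.30 s.
# Packets 3 and 4 arrived long before that but have to wait for it.
print(in_order_delivery([0.00, 0.02, 0.30, 0.06, 0.08]))
# prints [0.0, 0.02, 0.3, 0.3, 0.3]
```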

Next we have [Icecast2] itself. Icecast2 does not really touch the data.
It just reads it from the source and distributes it to the listeners.
To do so it has a little buffer so it can deal with jitter on both ends.
As this buffer is a hybrid of fixed-bytes and fixed-time its exact
parameters again depend on the signal and encoder parameters used. If
both the source and all clients write/read as expected the delay will
get arbitrarily close to zero[5A].

Next we have the HTTP/TCP link between [Icecast2] and the [HTTP Engine].
This is about the same situation as above. The only real difference is
that from this point on each client has its own path and so its own
delay and jitter. E.g. a close by (read: local) client may have close to
no extra delay and jitter from this point while a remote client (think
about St. Helena again) may have significant delay and jitter. So a
specific feature of the stream may arrive at a different wall-clock
time for every client[7].
Another interesting thing happens here: Up to [Icecast2] the source is
pushing data and the stream depends on the clock of the source[8].
Icecast also tries to push the data to the client but that requires that
the client reads it as the TCP buffer is of limited size[10][11A].
The diagram for the influence of the clocks looks like this:
[Source Clock] -> [Icecast2]    ->  <- [Client Clock]
|Source Domain    | Icecast Domain |   Client Domain|

The next part is the [HTTP Engine]: It's basically the opposite of
[libshout] and also behaves much the same. It usually holds a
little buffer to handle jitter on both ends and to provide a smooth
interface to the next part. If used correctly this adds close to zero
delay but helps with jitter.

The [Player] is the next part. It actually consists of several
components that depend on what kind of software is used. Normally it
has a frontend buffer, a decoder and a backend buffer. The frontend
buffer is there to even out most of the network and some of the
[Encoder] delay. Normally its size is given in bytes. So again the delay
depends on the size and everything that is before that point (e.g. the
[Encoder] but also the actual network jitter).
The decoder then needs some CPU time to decode the signal.
Then the data is pushed to the backend buffer. The job of this buffer is
to provide a smooth stream to the [Signal Sink] as the decoder will work
in blocks (See 'packets' in [Encoder] details).
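
As the frontend buffer is sized in bytes, the time it represents
depends on the bitrate of the stream. A minimal sketch, with a buffer
size and bitrates I assumed for the example:

```python
def buffer_delay_s(buffer_bytes: int, bitrate_bps: int) -> float:
    """Seconds of stream that fit into a buffer of the given size."""
    return buffer_bytes * 8 / bitrate_bps

# Assumed example: the same 64 KiB buffer on two different streams.
print(buffer_delay_s(65536, 128000))  # 128 kbit/s stream
print(buffer_delay_s(65536, 32000))   # 32 kbit/s stream, four times longer
```

So a player with a fixed byte-sized buffer adds more delay the lower
the bitrate of the stream is.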

This is then passed using syscalls to the [Signal Sink] which behaves
exactly like the [Signal Source] in terms of delay and jitter.

So what have we got so far?
We have a streaming system that can handle bad links (mainly due to TCP
and the buffer in [Icecast2]) and can provide a bit-perfect link under
those conditions and thus full quality. This happens at the expense of
delay.
What will the overall delay be like? Well, it depends on all those many
factors. On a typical Setup with first world networking between source,
[Icecast2] and sink you will end up with something <60s. There is
normally no problem running at <30s. If you are in a very controlled
situation (like you control source, client and network as well as the
system [Icecast2] runs on) you may reach <3s. If you try that on 'the
wild internet' you will likely end up with some clients unable to
use your stream.

OK, the good news: you are more than halfway through my mail.

Now let's look at the typical VoIP Setup:

There is a wide range of different VoIP solutions. This is why I can
give only a rough explanation on how they work. It's not accurate and it
may be totally different for a specific solution. My intention here is
just to give you a rough idea of how that works to show the difference.

A typical VoIP Setup may look like this:
[Signal Source] -*> [OS] -syscalls> [Encoder] -*> [Protocol logic] -*>
[VoIP Network] -*> [Protocol logic] -*> [Decoder] -syscalls> [OS] -*>
[Signal Sink]

The following blocks are the same as above: [Signal Source], [Signal
Sink], [OS], [Encoder], [Decoder].

Also note that this is a unidirectional connection. VoIP applications
normally are bidirectional. So each client contains both source and the
sink components. They may interact with each other e.g. to optimize.
Here we keep it to the simple unidirectional case.

Ok, let's go from left to right again:

While the [Encoder] is the same as above there is a quantitative (not
qualitative) difference: the selection of codecs and containers.
In a VoIP application no codecs or containers will be used that add
significant delay or jitter. Humans notice delay in a conversation very
early. A delay of only 20ms may be noticed. Over 200ms you will end up
with those calls you may know. They sound like two people constantly
asking 'Are you still on the line?'. So we need to limit the codecs to
those with only a few ms of delay.

The next step is the [Protocol logic] and the protocol used to
communicate with the [VoIP Network]. There is a wide range of protocols.
Typically those are implemented in a different way than what we did
above with HTTP streaming. They split signalling and data apart.
Signalling is control traffic such as 'I want to call xyz.' or 'The
phone is ringing!'. The data traffic is sent on its own channel.
While the signalling needs some kind of reliability the data channel is
often used with nearly none. This means that if part of the signal is
lost on the network (packet loss), is late, or reaches the other side
multiple times it is ignored. This again requires a codec which can
handle holes in the data.
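
The 'ignore it' behaviour of the data channel can be sketched with
sequence numbers. This is a hypothetical receiver, not taken from any
real VoIP stack. It assumes every packet carries an increasing sequence
number and it leaves out things like number wrap-around and a jitter
buffer:

```python
def accept_packets(seq_numbers):
    """Sequence numbers that get played, in the order they arrive.

    A packet is dropped if it is a duplicate or older than one that was
    already played. A lost packet simply leaves a hole.
    """
    played = []
    highest = -1
    for seq in seq_numbers:
        if seq <= highest:
            continue  # duplicate or late packet: ignore it, never wait
        played.append(seq)
        highest = seq
    return played

# Assumed example: packet 2 arrives late (after 3), packet 3 arrives twice.
print(accept_packets([0, 1, 3, 2, 3, 4]))  # prints [0, 1, 3, 4]
```

Nothing is ever re-requested, so a bad packet costs a bit of signal
but no time.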

What happens if data is lost? I'm sure you know that. It's a little
crackle or other distortion you can hear. As long as only a small number
of packets get lost there is no problem. Our speech contains more than
enough redundancy to compensate for this.

What do we get for all this? We get a link that is very fast. The total
delay depends mostly on the distance between the two ends of the
communication. If you are lucky the signal on your link may travel at up
to about c/3[12]. (If you want to reach a person on the other side of
earth (20Mm) it will take about 200ms for the packet to reach that
point. This is the reason why intercontinental calls are no fun.)
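
The 200ms figure follows directly from the distance and the signal
speed. A small sketch to check the arithmetic, using the c/3 from above
as the assumed effective speed on the link:

```python
C = 299_792_458  # speed of light in vacuum, m/s

def propagation_delay_ms(distance_m: float, speed_fraction: float = 1 / 3) -> float:
    """One-way propagation delay for a signal travelling at speed_fraction * c."""
    return 1000.0 * distance_m / (C * speed_fraction)

# Other side of earth: about 20 Mm, one way.
print(f"{propagation_delay_ms(20e6):.0f} ms")  # prints "200 ms"
```

This is only the time on the wire. Codec, OS and any machines in
between come on top of it.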

The next part is the [VoIP Network]:
This could be nothing to anything. Two such clients may communicate
directly or they may use several application layer routers (think:
proxy) to transport the signal from the source to the sink. Depending on
what is used here the delay and the jitter can vary over a wide range.
Also note that if there is more than one such machine in between you
need to keep the distance and the link between them in mind as well.

So, what do we get in this Setup?
We get a much faster link. Normally well <1s. Even on intercontinental
connections we can easily reach <1s. Implementations are already close
to the limits of the known laws of physics.
The price we need to pay for this is that we'll have a stream that may
not be of high quality and that highly depends on the physical
position of all involved parties. It will of course also depend on the
links and network quality (but there is no way around that anyway. How
would you transmit a signal without a link between source and sink?).

Finally let's compare the two. I'll do it as a table so as not to stress
you any longer with my text:

What               | Icecast2               | VoIP
-------------------+------------------------+------------------------
Common delay       | <30s                   | <1s
-------------------+------------------------+------------------------
Bit perfection     | yes                    | no
Narrowband Audio   | yes                    | yes
Wideband Audio     | yes                    | maybe[13]
Superwideband Audio| yes                    | likely no[13]
-------------------+------------------------+------------------------
Handling of bad    | medium to good         | bad to good
 links             |                        |  depending on
                   |                        |  implementation.
Suitable for large | yes (>30000 per server)| no[15]
 number of clients |  [14]                  |
 per stream        |                        |
-------------------+------------------------+------------------------

Thank you for reading this long mail. I hope you got a pawful of useful
input. As I have left out a lot and as I'm sure my text is confusing in
some parts: Feel free to write and tell me your questions! (This is also
true for the readers of this email in the archive in many years. :)

And always keep in mind, the answer is 42!

Have a nice day.

Notes:
[1]   http://www.opus-codec.org/comparison/
[6]   A small island, part of a British Overseas Territory.
[7]   Most will have noticed that already: When there is a big sports
      event on TV some people in your street/block will shout 'Goal'
      earlier than others depending on if they are on analog/digital
      Terrestrial/Cable/Satellite.
[8]   E.g. the oscillator of the sound card or the RTC chip[9A] of the
      machine running libshout.
[10]  This is also why Icecast2 can detect clients not running fast
      enough.
[12]  speed of light/3 is about 99_930_819m/s.
[13]  This is slowly getting better. E.g. have a look at Opus in [1].
[14]  http://icecast.org/loadtest/
[15]  There are efforts to implement it, like WebRTC.

Note for advanced readers:
[0A]  In case of memory mapped access consider traps 'indirect'
      syscalls.
[2A]  The terms used differ depending on the codec and container. I just
      tried to go with generic words.
[3A]  It also supports other protocols.
[4A]  normally...
[5A]  The main concern here is the OS of the server as well as the
      TCP connections.
[9A]  Or any other clock source provided by the OS.
[11A] I may write another one like this about clock desynchronization
      if there is interest.

-- 
Philipp.
 (Rah of PH2)