By Badri Rajasekar, CTO, TokBox
No matter the industry, there is an expectation that business transactions will happen quickly. This is evident in the evolution of e-mail to Slack, blogging to tweeting—communication is becoming more synchronous. We now are pushing that immediacy even further, with the emergence of live video as a medium. Just look at the adoption of apps like Instagram Live, Doctor on Demand, and Houseparty.
Why is this happening now? There are the obvious factors, such as hardware advancements, improvements to network connectivity, and shifts in consumer behavior. But, one of the key catalysts is advances in communications technology—or, more specifically, Web Real-Time Communications (WebRTC).
Live video is where WebRTC is really accelerating innovation. Although there are plenty of great apps for straightforward video chatting available on the market today, developers who are building with WebRTC are completely reinventing the way we think about live video and its impact on app and Web site experiences.
Typically, with live video apps, call sizes fit into one of three buckets:
- One-to-one
- Small group
- Large-scale livestream
Let's take a deep dive into one area in particular that is seeing explosive growth: livestreaming. The livestream experience is changing and becoming more interactive, allowing the audience to participate in the conversation, not just passively consume it. As an example, we have all seen journalists and commentators joining news pundits via Skype, creating a live TV experience—an experience that could be made more seamless to the end user. Imagine using a CNN mobile app to not only view, but to also join in and participate in a live TV experience. If you're a developer who wants to build an interactive livestreaming app, you have several paths you can pursue for your application architecture.
In the traditional broadcasting world, the de facto mechanism for delivering media streams at scale involves streaming CDNs and Multipoint Control Units (MCUs), which mix and route the media streams. This approach works for a pure viewing scenario, but it comes with significant trade-offs.
Even though it is good for traditional, one-to-many distribution of content and can support a large number of users, this approach has some disadvantages. Traditional streaming protocols such as RTMP, HLS, and MPEG-DASH have high latency (sometimes in excess of 30 seconds). These protocols were designed for video-on-demand (VOD) streaming, so they rely on built-in buffering, which is not conducive to low latency. MCUs traditionally decode, mix, and re-encode streams when multiple participants publish media. Although this works for a static number of channels, it is neither useful nor scalable where you have mass user-generated content. MCUs are also mostly compute-bound, making them very expensive to scale to large volumes.
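A back-of-the-envelope sketch of where that latency comes from: segmented protocols like HLS require the player to buffer several whole segments before playback starts, so the delay behind live is at least the segment duration times the number of buffered segments. The function name and numbers below are illustrative, not taken from any real player API.

```typescript
// Why segmented, VOD-style protocols lag behind live: a player typically
// buffers a few whole segments before it begins playback, so end-to-end
// delay is bounded below by segmentDurationSec * segmentsBuffered.
function minSegmentedLatencySeconds(
  segmentDurationSec: number,
  segmentsBuffered: number
): number {
  return segmentDurationSec * segmentsBuffered;
}

// Classic HLS-style defaults: 10-second segments, three segments buffered.
const latency = minSegmentedLatencySeconds(10, 3); // at least 30 s behind live
```

With those defaults, a viewer is never closer than about half a minute to the live edge, which matches the "in excess of 30 seconds" figure above.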
Figure 1: Streaming protocols
Enter the SFU
With the advent of WebRTC and media engines hyper-optimized for low-latency, small-group communication, traditional MCUs have been replaced with Selective Forwarding Units (SFUs). Think of SFUs as media routers: none of the incoming streams are mixed; rather, the packets are forwarded to the appropriate subscribers. Given that no real transcoding happens here, it is computationally efficient to scale a cluster of SFUs to service a large number of media sessions. This is useful, but it's really hard to scale a single SFU instance to tens of thousands of participants, and at that scale the model breaks down. Think about a crowd of people watching a Super Bowl game while also talking to each other about it; in this scenario, you want all of the media streams to be as low latency as possible. Another disadvantage of this scheme is that, because each viewer pulls a copy of the stream, it can be quite bandwidth-intensive, especially when the consumers are on mobile devices on constrained cellular networks.
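To make the forwarding idea concrete, here is a minimal, hypothetical sketch of SFU-style routing: incoming packets are never decoded or mixed, just copied to each subscriber of that stream. The `Sfu` and `Packet` names are illustrative, not a real media-server API.

```typescript
// Illustrative SFU-style forwarding: no decode, no mix, no re-encode --
// each incoming packet is simply handed to every subscriber of its stream.

type Packet = { streamId: string; seq: number; payload: Uint8Array };

class Sfu {
  // streamId -> set of subscriber delivery callbacks
  private subscribers = new Map<string, Set<(p: Packet) => void>>();

  subscribe(streamId: string, onPacket: (p: Packet) => void): void {
    if (!this.subscribers.has(streamId)) {
      this.subscribers.set(streamId, new Set());
    }
    this.subscribers.get(streamId)!.add(onPacket);
  }

  // Forwarding is O(number of subscribers) copies -- computationally cheap
  // compared to an MCU's decode/mix/encode pipeline.
  route(packet: Packet): void {
    const subs = this.subscribers.get(packet.streamId);
    if (!subs) return;
    for (const deliver of subs) deliver(packet);
  }
}
```

The cheapness of `route` is exactly why a cluster of SFUs scales well; the bandwidth cost shows up instead on the network, because every subscriber receives its own copy of the packet.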
Figure 2: SFU technology
As with most technologies, you usually want to pick and choose from each of the approaches to come up with a hybrid solution. The first step is to be able to separate your endpoints into the following distinct groups:
- On-Stage: Low-latency media path for the people on-stage. Essentially, this is the group of publishers (participants) that live in the session. Think of these endpoints as the group of people on the panel.
- In-Line: Low-latency media path for the people off-stage who would like to participate. This is an intermediate tier where you align these viewers' latency with the livestream, so that anyone who is about to jump in is already in sync.
- Off-Stage: High-latency media path for the people off-stage. This is the group of people who are in pure view-only mode but potentially could be elevated to the next stage.
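The tier-to-path mapping above can be sketched in a few lines. `Tier` and `mediaPathFor` are hypothetical names, assuming the SFU carries the low-latency path and a streaming CDN carries the high-latency one.

```typescript
// The three endpoint tiers described above. On-stage and in-line endpoints
// need the low-latency SFU path; off-stage viewers take the buffered CDN path.
type Tier = "on-stage" | "in-line" | "off-stage";

function mediaPathFor(tier: Tier): "sfu" | "cdn" {
  return tier === "off-stage" ? "cdn" : "sfu";
}
```

The key design point is that in-line viewers are already on the SFU path before they speak, so promoting them to the stage requires no renegotiation of their media transport.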
Figure 3: Creating the three-stage architecture
Now, the entire architecture of the system reduces to the problem of making sure everyone in the on-stage and in-line groups is on the lowest possible latency path, while the people in the off-stage group are on a relatively high-latency path. Everyone in the on-stage and in-line groups connects to the SFU, using a protocol like RTP to get the lowest possible media path. The SFU then can forward the streams to an MCU that mixes and routes them through a traditional streaming CDN to deliver media to a very large number of participants. The application logic makes sure people in the in-line group get queued up to go on-stage and that people who leave the stage are pushed back into the view-only mode of the system.
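A minimal sketch of that application logic, assuming a hypothetical `StageManager` that caps the number of on-stage publishers: in-line viewers queue for the stage, and anyone who leaves is demoted back to view-only mode.

```typescript
// Hypothetical stage-management logic: in-line viewers wait in a FIFO queue,
// get promoted when an on-stage slot frees up, and leavers fall back to the
// high-latency, view-only path.
class StageManager {
  private onStage = new Set<string>();
  private inLineQueue: string[] = [];

  constructor(private maxOnStage: number) {}

  // An off-stage viewer asks to participate: they join the in-line queue.
  requestStage(userId: string): void {
    this.inLineQueue.push(userId);
    this.promoteNext();
  }

  // A participant leaves the stage and is pushed back to view-only mode.
  leaveStage(userId: string): "off-stage" {
    this.onStage.delete(userId);
    this.promoteNext();
    return "off-stage";
  }

  stage(): string[] {
    return [...this.onStage];
  }

  private promoteNext(): void {
    while (this.onStage.size < this.maxOnStage && this.inLineQueue.length > 0) {
      this.onStage.add(this.inLineQueue.shift()!);
    }
  }
}
```

When a slot frees up, the longest-waiting in-line viewer is promoted automatically, which keeps the on-stage group full without any manual renegotiation.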
Can We Do Even Better?
Not to hearken back to the days of database architecture, but with some ingenious systems engineering, it is possible to do even better. You can chain a number of SFUs into a loosely coupled, B-tree-like structure where publishers connect to the root and subscribers pull streams from the leaf nodes. Because this is an n-ary tree, it is possible to deliver a massive number of streams to participants in near real time. There are a few issues to solve in these cases, though. For example, if you're using RTP for media and RTCP for feedback, it is quite possible that the RTCP feedback quickly overwhelms the pipe as the number of participants grows. Even if some sort of intelligent RTCP termination were implemented, you would need techniques like simulcast/SVC to deliver a high-quality experience.
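To see why a cascaded tree scales, consider the fan-out arithmetic. This is a sketch under assumed numbers, not a benchmark: with each SFU forwarding to at most `fanout` child SFUs, a tree of a given depth reaches fanout^depth leaf SFUs, each serving its own local subscribers.

```typescript
// Capacity of an n-ary SFU tree: publishers feed the root, each SFU forwards
// to `fanout` children, and subscribers attach to the leaves.
function reachableViewers(
  fanout: number,
  depth: number,
  viewersPerLeaf: number
): number {
  return Math.pow(fanout, depth) * viewersPerLeaf;
}

// e.g. fan-out of 10, three levels deep, 100 subscribers per leaf SFU:
const capacity = reachableViewers(10, 3, 100); // 100,000 viewers
```

The tree multiplies capacity per level, which is why even modest per-node fan-out reaches livestream-scale audiences while every hop stays on a low-latency RTP path.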
It's worth noting that, earlier this year, the "last piece of the puzzle" fell into place when Apple announced WebRTC support for Safari. Developers now can leverage WebRTC in all major browsers; this has opened up nearly endless possibilities to reinvent the way we think about live video and its impact on the user experience. The alignment of hardware advancements, improvements to network connectivity, and shifts in consumer behavior, as well as enabling technologies like WebRTC, signals that the market is ready. But, to capitalize on this momentum, it's critical that developers leverage one of the thoughtful approaches to architecture outlined above so that interactive livestreaming remains a seamless video experience for the end user. We're only just starting to scratch the surface of what is possible.