In this article by Anthony Minessale and Giovanni Maruzzelli, authors of Mastering FreeSWITCH, we will cover the following topics:
- What WebRTC is and how it works
- Encryption and NAT traversing (STUN, TURN, etc)
- Signaling and media
- Interconnection with PSTN and SIP networks
- FreeSWITCH as a WebRTC server, gateway, and application server
(For more resources related to this topic, see here.)
Finally something new! How refreshing it is to be learning and experimenting again, especially if you're an old hand! After at least ten years of linear evolution, here we are with a quantum leap, the black swan that truly disrupts the communication sector.
Browsers are already out there, waiting
With an installed base of hundreds of millions, and soon to be in the billions ballpark, browsers (both on PCs and on smart phones) are now complete communication terminals, audio/video endpoints that do not need any additional software, plugins, hardware, or whatever. Browsers now incorporate, per default and in a standard way, all the software needed to interact with loudspeakers, microphones, headsets, cameras, screens, etc.
Browsers are the new endpoints, the CPEs, the phones. They have an API, they're updated automatically, and are compatible with your system. You don't have to procure, configure, support, or upgrade them. They're ready for your new service; they just work, and are waiting for your business.
Web Real-Time Communication is coming
There are two completely separated flows in communication: Signaling and media. Signaling is a flow of information that defines who is calling whom, taking what paths, and which technology is used to transmit which content. Media is the actual digitized content of the communication, for example, audio, video, screen-sharing, etc.
Media and signaling often take completely unrelated paths to go from caller to callee, for example, their IP packets traverse different gateways and routers. Also, the two flows are managed by separate software (or by different parts of the same application) using different protocols.
WebRTC defines how a browser accesses its own media capture, how it sends and receives media from a peer through the network and how it renders the media stream that it receives. It represents this using the same Session Description Protocol (SDP) as SIP does.
So, WebRTC is all about media, and doesn't prescribe a signaling system. This is a design decision, embedded in the standard definition. Popular signaling systems include SIP, XMPP, and proprietary or custom protocols. Also, WebRTC is all about encryption. All WebRTC media streams are mandatorily encrypted.
Chrome, Firefox, and Opera (together they account for more than 70 percent of the browsers in use) already implement the standard; Edge is announcing the first steps in supporting WebRTC basic features, while only Safari is still holding its cards (Skype and FaceTime on WebRTC with proprietary signaling? Wink wink).
Under the hood
More or less, WebRTC works like this:
- The WebRTC Api Media object will contain the capabilities of all devices and codecs available, for example, definition, sample rate, and so on, and it will permit the user to choose their own capabilities preferences (for example, use QVGA video to minimize CPU and bandwidth)
- Webpage will interface with browser's user, getting some input for signing in the webserver's communication service (if any)
- Once signed up in the service, a call can be made and received. Signaling will give the protocol address of the peer (for example, sip:firstname.lastname@example.org)
These points are represented in the following image:
- Then, WebRTC Net API will exchange ICE candidates with the peer, until they both find the most "rational" triplets of IP address, port and transport (udp, dtls, and so on), for each stream (for example, audio, video, screen share, and so on)
- Once they get the best addresses, the signaling will establish the call.
These points are represented in the following image:
- Once signaling communication with the peer is established, media capabilities are exchanged in SDP format (exactly as in SIP), and the two peers agree on media formats (sample rates, codecs, and so on)
- Additionally or in alternative to media, peers can establish one or more data channels, through which they bidirectionally exchange raw or structured data (file transfers, augmented reality, stock tickers, and so on)
These points are represented in the following image:
This is a high level, but complete, view of how a WebRTC system works.
Encryption – security
Please note that in normal operation everything is encrypted, uses real PKI certificates from real Certification Authorities, actual DNS names, SSL, TLS, HTTPS, WSS, DTLS-SRTP. This is how it is supposed to work. In WebRTC, security is not an afterthought: It is mandatory.
To make signaling work without encryption (for example, for debugging signaling protocols) is not so easy, but it is possible. Browsers will often raise security exceptions, and will ask for permission each time they access a camera or microphone. Some hiccups will happen, but it is doable. Signaling is not part of WebRTC standard, as you know.
On the contrary, it is not possible to have the media or data streams to leave the browser in the clear, without encryption.
The use of plain RTP to transmit media is explicitly forbidden by the standard. Media is transmitted by SRTP (Secure RTP), where encryption keys are pre-exchanged via DTLS (Datagram Transport Layer Security, a version of TLS for Datagrams), basically a secure version of UDP.
Beyond peer to peer – WebRTC to communication networks and services
WebRTC is a technique for browsers to send media to each other via Internet, peer to peer, perhaps with the help of a relay server (TURN), if they can't reach each other directly. That's it.
No directories, no means to find another person, and also no way to "call" that person if we know "where" to call her.
No way to transfer calls, to react to a busy user or to a user that does not pickup, and so on.
Let's say WebRTC is a half-built phone: It has the handset, complete with working microphone and speaker, from which it comes out, the wiring left loose. You can cross join that wiring with the wiring of another half-built phone, and they can talk to each other.
Then, if you want to talk to another device, you must find it and then join the wires anew.
No dial pad, no Telecom Central Office, no interconnection between Local Carriers, and with International Carriers. No PBX. No way to call your grandma, and no possibilities to navigate the IVR at Federal Express' Customer Care.
We need to integrate the media capabilities and the ubiquity of WebRTC with the world of telecommunication services that constitute the planet's nervous system.
Enter the "WebRTC Gateway" and the "WebRTC Application Server"; in our case both are embodied by FreeSWITCH
WebRTC gateways and application servers
At the network plane, WebRTC uses ICE protocol to traverse NAT via STUN and TURN servers. ICE has been developed as Internet standard to be the ultimate tool to solve all NAT problems, but has not yet been implemented in either telco infrastructure, nor in most VoIP clients. Also, ICE candidates (the various different addresses the browser thinks they would be reachable at) need to be passed in SDP and negotiated between peers, in the same way codecs are negotiated. Being able to pass through corporate firewalls (UDP blocked, TCP open only on ports 80 and 443, and perhaps through protocol-aware proxies) is an absolute necessity for serious WebRTC deployment.
At media plane, WebRTC specific codecs (V8 for video and Opus for audio) are incompatible with the telco world, with audio G711 as the only common denominator.
Worst yet, all media are encrypted as SRTP with DTLS key exchange, and that's unheard of in today's telco infrastructure.
So, we need to create the signaling plane, and then convert the network transport, convert the codecs, manage the ICE candidates selection in SDP, and allow access to the wealth of ready-made services (PSTN calls, IVRs, PBXs, conference rooms, etc), and then complement the legacy services with special features and new interconnected services enabled by the unique capabilities of WebRTC endpoints.
Yeah, that's a job for FreeSWITCH.
Which architecture? Legacy on the Web, or Web on the Telco?
Real-time communication via the Web: From the building blocks we just saw, we can implement it in many ways.
We have one degree of freedom: Signaling. I mean, media will be anyway agreed about via SDP, transmitted via websockets as SRTP packets, and encrypted via DTLS key exchange.
We still have the task to choose how we will find the peer to exchange media with. So, this is an exercise in directory, location, registration, routing, presence, status, etc. You get the idea.
Actually it boils down to different possibilities:
- XMPP (eg: jabber)
- In-house signaling implementation
- VERTO (open source)
SIP and XMPP make today's world spin around. SIP is mostly known for carrying the majority of telephone and VoIP signaling traffic. The biggest implementations of instant messaging and chatting are based on XMPP. And there is more: Those two signaling protocols are often used together, although each one of them has extensions that provide the other one's functionality.
Both SIP and XMPP have been designed to be expandable and modular, and SIP particularly is an abstract protocol, for the management of "sessions" (where a "session" can be whatever has a beginning and an end in time, as a voice or video call, a screen share, a whiteboard, a collaboration platform, a payment, a message, and so on).
If your company has considerable investments and/or expertise in those protocols, then it makes sense to expand their usage on the web too.
FreeSWITCH accommodates them ALL
FreeSWITCH implements all of WebRTC low-level protocols, codecs and requirements. It's got encryption, SRTP, DTLS, RTP, websocket and secure websocket transports (ws:// and wss://). Having got it all, it is able to serve SIP endpoints over WebRTC via mod_sofia (they'll be just other SIP phones, exactly like the rest of soft and hard SIP phones), and it interacts with XMPP via mod_jingle.
Crucially, FreeSWITCH has been designed since its inception to be able to manage and message high-definition media, both audio and video. Support for OPUS audio codec (8 up to 48 khz, enough for actual audio-cd quality) started years ago as a pioneering feature, and has evolved over the years to be so robust and self-healing as to sustain a loss of more than 40% (yep, as in FORTY PERCENT) packets and maintain understandability. WebRTC's V8 video codec is routinely carrying our mixed video conferences in FullHD (as in 1920x1080 pixel), and we're looking forward to investing in fiber and in some facial cream to look good in 4K.
That's why FreeSWITCH can be the pivot of your next big WebRTC project: its architecture was designed from the start to be a multimedia powerhouse.
For the remainder of this article we'll write about VERTO, a FreeSWITCH proposal especially dedicated to Web development.
What is Verto (module and jslib)?
Verto is a FreeSWITCH module (mod_verto) that allows for JSON interaction with FreeSWITCH, via secure websockets (wss). All the power and complexity of FreeSWITCH can be harnessed via Verto: Session management, call control, text messaging, and user data exchange and synchronization. Take a note for yourself: "User data exchange and synchronization". We'll be back to this later.
Verto is like Event Socket Layer (ESL) on steroids: Anything you can do in ESL (subscribe, send and receive messages in FS core message pumps/queues) you can do in Verto, but Verto is actually much more and can do much more. Verto is also made for high-level control of WebRTC!
Also, Verto allows for the simplest way to extend your existing SIP services to WebRTC browsers.
The added benefit of "user data exchange and synchronization" (see, I'm back to it) is not to be taken lightly: You can create data structures (for example, in JSON) and have them synchronized on server and all clients, with each modification made by the client or server to be automatically, immediately and transparently reflected on all other clients.
Imagine a dynamic list of conference participants, or a chat, or a stock ticker, or a multiuser ping pong game, and so on.
Mod_verto is installed by default by standard FreeSWITCH implementation. Let's have a look at its configuration file, verto.conf.xml.
The most important parameter here, and the only one I had to modify from the stock configuration file, is ext-rtp-ip. If your server is behind a NAT (that is, it sits on a private network and exchanges packets with the public internet via some sort of port forwarding by a router or firewall), you must set this parameter to the public IP address the clients are reaching for.
Other very important parameters are the codec strings. Those two parameters determine the absolute string that will be used in SDP media negotiation. The list in the string will represent all the media formats to be proposed and accepted. WebRTC has mandatory (so, assured) support for vp8 video codec, while mandatory audio codecs are opus and pcmu/pcma (eg, g711). Pcmu and pcma are much less CPU hungry than opus. So, if you are willing to set for less quality (g711 is "old PSTN" audio quality), you can use "pcmu,pcma,vp8" as your strings, and have both clients and server use far less CPU power for audio processing.
This can make a real difference and very much sense in certain setups, for example, if you must cope with low-power devices. Also, if you route/bridge calls to/from PSTN, they will have no use for opus high definition audio; much better to directly offer the original g711 stream than decode/recode it in opus.
Test with Communicator
If it's not already done, copy Verto Communicator distribution directory (/usr/src/freeswitch.git/html5/verto/verto_communicator/dist/) into a directory served by your web server in SSL (be sure you got all the SSL certificates right).
In this article we delved in WerbRTC design, what infrastructure it requires, in what is similar and in what is different from known VoIP.
We understood that WebRTC is only about media, and leave the signaling to the implementor.
Also, we get the specific of WebRTC, its way to traverse NAT, its omnipresent encryption, its peer to peer nature.
We witnessed going beyond peer to peer, connecting with the telecommunication world of services needs gateways that do transport, protocol and media translations.
FreeSWITCH is the perfect fit as WebRTC server, WebRTC gateway, and also as application server.
And then we saw how to implement Verto, a signaling born on WebRTC, a JSON web protocol designed to exploit the additional features of WerbRTC and of FreeSWITCH, like real time data structure synchronization, session rehydration, event systems, and so on.
Resources for Article:
- Configuring FreeSWITCH for WebRTC [article]
- Architecture of FreeSWITCH [article]
- FreeSWITCH 1.0.6: SIP and the User Directory [article]