Secure voice communication on Android

While the topic of secure voice communication on mobile is hardly new, it has been getting a lot of media attention following the the official release of the Blackphone, Consequently, this is a good time to go back to basics and look into how secure voice communication is typically implemented. While this post focuses on Android, most of the discussion applies to other platforms too, with only the mobile clients presented being Android specific.

Voice over IP

Modern mobile networks already encrypt phone calls, so voice communication is secure by default, right? As it turns out, the original GSM encryption protocol (A5/1) is quite weak and can be attacked with readily available hardware and software. The somewhat more modern alternative (A5/3) is also not without flaws, and in addition its adoption has been fairly slow, especially in some parts of the world. Finally, mobile networks depend on a shared key, which while protected by hardware (UICC/SIM card) on mobile phones, can be obtained from MNOs (via legal or other means) and used to enable call interception and decryption.

So what's the alternative? Short of building your own cellular network, the alternative is to use the data connectivity of the device to transmit and receive voice. This strategy is known as Voice over IP (VoIP) and has been around for a while, but the data speeds offered by mobile networks have only recently reached levels that make it practical on mobiles.

Session Initiation Protocol

Different technologies and standards that enable VoIP are available, but by far the most widely adopted one relies on the Session Initiation Protocol (SIP). As the name implies, SIP is a signalling protocol, whose purpose is establish a media session between endpoints. A session is established by discovering the remote endpoint(s), negotiating a media path and codec, and establishing one or more media streams between the endpoints. Media negotiation is achieved with the help of the Session Description Protocol (SDP), and typically transmitted using the Real-time Transport Protocol (RTP). While a SIP client, or more correctly a user agent (UA), can connect directly to a peer, peer discovery usually makes use of one or more well-known registrars. A registrar is a SIP endpoint (server) which accepts REGISTER requests from a set of clients in the domain(s) it is responsible for, and offers a location services to interested parties, much like DNS. Registration is dynamic and temporary: each client registers its SIP URI and IP address with the registrar, thus making it possible for other peers to discover it for the duration of the registration period. The SIP URI can contain arbitrary alphanumeric characters (much like an email address), but the username part is typically limited to numbers for backward compatibility with existing networks and devices (e.g., sip:0123456789@mydomain.org).

A SIP call is initiated by a UA sending an INVITE message specifying the target peer, which might be mediated by multiple SIP 'servers' (registrars and/or proxies). Once a media path has been negotiated, the two endpoints (Phone A and Phone B in the figure below) might communicate directly (as shown in the figure) or via a one or more media proxies which help bridge SIP clients that don't have a publicly routable IP address (such as those behind NAT), implement conferencing, etc.

SIP on mobiles

Because SIP calls are ultimately routed using the registered IP address of the target peer, arguably SIP is not very well suited for mobile clients. In order to receive calls, clients need to remain online even when not actively used and keep a constant IP address for fairly long periods of time. Additionally, because public IP addresses are rarely assigned to mobile clients, establishing a direct media channel between two mobile peers can be challenging. The online presence problem is typically solved by using a complementary, low-overhead signalling mechanism such as Google Cloud Messaging (GCM) for Android in order to "wake up" the phone before it can receive a call. The requirement for a stable IP address is typically handled by shorter registration times and triggering registration each time the connectivity of the device changes (e.g., from going from LTE to WiFi). The lack of a public IP address is usually overcome by using various supporting methods, ranging from querying STUN servers to discover the external public IP address of a peer, to media proxy servers which bridge connections between heavily NAT-ed clients. By combining these and other techniques, a well-implemented SIP client can offer an alternative voice communication channel on a mobile phone, while integrating with the OS and keeping resource usage fairly low.

Most Android devices have included a built-in SIP client as part of the framework since version 2.3 in the android.net.sip package. However, the interface offered by this package is very high level, offers few options and does not really support extension or customization. Additionally, it hasn't received any new features since the initial release, and, most importantly, is optional and therefore unavailable on some devices. For this reason, most popular SIP clients for Android are implemented using third party libraries such as PJSIP, which support advanced SIP features and offer a more flexible interface.

Securing SIP

As mentioned above, SIP is a signalling protocol. As such, it does not carry any voice data, only information related to setting up media channels. A SIP session includes information about each of the peers and any intermediate servers, including IP addresses, supported codecs, user agent strings, etc. Therefore, even if the media channel is encrypted, and the contents of a voice call cannot be easily recovered, the information contained in the accompanying SIP messages -- who called whom, where the call originated from and when, can be equally important or damaging. Additionally, as we'll show in the next section, SIP can be used to negotiate keys for media channel encryption, in which case intercepting SIP messages can lead to recovering plaintext voice data.

SIP is a transport-independent text-based protocol, similar to HTTP, which is typically transmitted over UDP. When transmitted over an unencrypted channel, it can easily be intercepted using standard packet capture software or dumped to a log file at any of the intermediate nodes a SIP message traverses before reaching its destination. Multiple tools that can automatically correlate SIP messages with the associated media streams are readily available. This lack of inherent security features requires that SIP be secured by protecting the underlying transport channel.

VPN

A straightforward method to secure SIP is to use a VPN to connect peers. Because most VPNs support encryption, signalling, as well as media streams tunneled through the VPN are automatically protected. As an added benefit, using a VPN can solve the NAT problem by offering directly routable private addresses to peers. Using a VPN works well for securing VoIP trunks between SIP servers which are linked using a persistent, low-latency and high-bandwidth connection. However, the overhead of a VPN connection on mobile devices can be too great to sustain a voice channel of even average quality. Additionally, using a VPNs can result in highly variable latency (jitter), which can deteriorate voice quality even if jitter buffers are used. That said, many Android SIP clients can be setup to automatically use a VPN if available. The underlying VPN used can be anything supported on Android, for example the built-in IPSec VPN or a third-party VPN such as OpenVPN. However, even if a VPN provides tolerable voice quality, typically it only ensures an encrypted tunnel to a SIP proxy, and there are no guarantees that any SIP messages or voice streams that leave the proxy are encrypted. That said, a VPN can be a usable solution, if all calls are terminated within a trusted private network (such as a corporate network).

Secure SIP

Because SIP is transport-independent it can be transmitted over any supported protocol, including a connection-oriented one such as TCP. When using TCP, a secure channel between SIP peers can be established with the help of the standard TLS protocol. Peer authentication is handled in the usual manner -- using PKI certificates, which allow for mutual authentication. However, because a SIP message typically traverses multiple servers until it reaches its final destination, there is no guarantee that the message will be always encrypted. In other words, SIP-over-TLS, or secure SIP, does not provide end-to-end security but only hop-to-hop security.

SIP-over-TLS is relatively well supported by all major SIP servers, including open source once like Asterisk and FreeSWITCH. For example, enabling SIP-over-TLS in Asterisk requires generating a key and certificate, configuring a few global tls options, and finally requiring peers to use TLS when connecting to the server as described here. However, Asterisk does not currently support client authentication for SIP clients (although there is some limited support for client authentication on trunk lines).

Most popular Android clients support using the TLS transport for SIP, with some limitations. For example the popular open source CSipSimple client supports TLS, but only version 1.0 (as well as SSL v2/v3). Additionally, it does not use Android's built-in certificate and key stores, but requires certificates to be saved on external storage in PEM format. Both limitations are due to the underlying PJSIP library, which is built using OpenSSL and requires keys and certificates to be stored as files in OpenSSL's native format. Additionally, server identity is not checked by default and the check needs to be explicitly enabled in order for server identity to be verified, as shown in the screenshot below.

Another popular VoIP client, Zoiper, doesn't use a pre-initialized trust store at all, but requires peer certificates to be manually confirmed and cached for each SIP server. The commercial Bria Android client (by CounterPath) does use the system trust store and automatically verifies peer identity.

When a secure SIP connection to a peer is established, VoIP clients indicate this on the call setup and call screens as shown in the CSipSimple screenshot below.

SIP Alternatives

While SIP is a widely adopted standard, it is also quite complex and supports many extensions that are not particularly useful in a mobile environment. Instead of SIP the RedPhone secure VoIP client uses a simple custom signalling protocol based on a RESTful HTTP (with some additional verbs) API. The protocol is secured using TLS with server certificates issued by a private CA, which RedPhone clients implicitly trust.

Securing the media channel

As mentioned in our brief SIP introduction, the media channel between peers is usually implemented using the RTP protocol. Because the media channel is completely separated from SIP, even if all signalling is carried out over TLS, media streams are unprotected by default. RTP streams can be secured using the Secure RTP (SRTP) profile of the RTP protocol. SRTP is designed to provide confidentiality, message authentication, and replay protection to the underlying RTP streams, as well as to the supporting RTCP protocol. SRTP uses a symmetric cipher, typically AES in counter mode, to provide confidentiality and a message authentication code (MAC), typically HMAC-SHA1, to provide packet integrity. Replay protection is implemented by maintaining a replay list which received packets are checked against to detect possible replay.

When a voice channel is encrypted using SRTP the transmitted data looks like random noise (as any encrypted data should), as shown below.

SRTP defines a pseudo-random function (PRF) which is used to derive the session keys (used for encryption and authentication) from a master key and master salt. What SRTP does not specify is how the master key and salt should be obtained or exchanged between peers.

SDES

SDP Security Descriptions for Media Streams (SDES) is an extension to the SDP protocol which adds a media attribute that can be used to negotiate a key and other cryptographic parameters for SRTP. The attribute is simply called crypto and can contain a crypto suite, key parameters, and, optionally, session parameters. A crypto attribute which includes a crypto suite and key parameters might look like this:

a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:VozD8O2kcDFeclWMjBOwvVxN0Bbobh3I6/oxWYye

Here AES_CM_128_HMAC_SHA1_80 is a crypto suite which uses AES in counter mode with an 128-bit key for encryption and produces an 80-bit SRTP authentication tag using HMAC-SHA1. The Base64-encoded value that follows the crypto suite string contains the master key (128 bits) concatenated with the master salt (112 bits) which are used to derive SRTP session keys.

SDES does not provide any protection or authentication of the cryptographic parameters it includes, and is therefore only secure when used in combination with SIP-over-TLS (or another secure signalling transport). SDES is widely supported by both SIP servers, hardware SIP phones and software clients. For example, in Asterisk enabling SDES and SRTP is as simple as adding encryption=yes to the peer definition. Most Android SIP clients support SDES and can automatically enable SRTP for the media channel when the INVITE SIP message includes the crypto attribute. For example, in the CSipSimple screenshot above the master key for SRTP was received via SDES.

The main advantage of SDES is its simplicity. However it requires that all intermediate servers are trusted, because they have access to the SDP data that includes the master key. Even though the SRTP media stream might be transmitted directly between two peers, SRTP effectively provides only hop-to-hop security, because compromising any of the intermediate SIP servers can result in recovering the master key and eventually session keys. For example, if the private key of a SIP server involved in SDES key exchange is compromised, and the TLS session that carried SIP messages session did not use forward secrecy, the master key can easily be extracted from a packet capture using Wireshark, as shown below.

ZRTP

ZRTP aims to provide end-to-end security for SRTP media streams by using the media channel to negotiate encryption keys directly between peers. It is essentially a key agreement protocol based on a Diffie-Hellman with added Man-in-the-Middle (MiTM) protections. MiTM protection relies on the so called "short authentication strings" (SAS), which are derived from the session key and are displayed to each calling party. The parties need to confirm that they see the same SAS by reading it to each other over the phone. As an additional MiTM protection, ZRTP uses a form of key continuity, which mixes in previously negotiated key material into the shared secret obtained using Diffie-Hellman when deriving session keys. Thus ZRTP does not require a secure signalling channel or a PKI in order to establish a SRTP session key or protect against MiTM attacks.

On Android, ZRTP is supported both by VoIP clients for dedicated services such as RedPhone and Silent Phone, and by general-purpose SIP clients like CSipSimple. On the server side, ZRTP is supported by both FreeSWITCH and Kamailio (but not by Asterisk), so it its fairly easy to set up a test server and test ZRTP support on Android.

ZRTP support in CSipSimple can be configured on a per account basis by setting the ZRTP mode option to "Create ZRTP". It must be noted however, that ZRTP encryption is opportunistic and will fall back to cleartext communication if the remote peer does not support ZRTP. When the remote party does support ZRTP, CSipSimple shows an SAS confirmation dialog only the first time you connect to a particular peer and then displays the SAS and encryption scheme in the call dialog as shown below.

In this case, the voice channel is direct and ZRTP/SRTP provide end-to-end security. However, the SIP proxy server can also establish a separate ZRTP/SRTP channel with each party and proxy the media streams. In this case, the intermediate server has access to unencrypted media streams and the provided security is only hop-to-hop, as when using SDES. For example, when FreeSWITCH establishes a separate media channel with two parties that use ZRTP, CSipSimple will display the following dialog, and the SAS values at both clients won't match because each client uses a separate session key. Unfortunately, this is not immediately apparent to end users which may not be familiar with the meaning of the "EndAtMitM" string that signifies this.

The ZRTP protocol supports a "trusted MiTM" mode which allows clients to verify the intermediate server after completing a key enrollment procedure which establishes a shared key between the client and a particular server. This features is supported by FreeSWITCH, but not by common Android clients, including CSipSimple.

Summary

Android supports the SIP protocol natively, but the provided APIs are restrictive and do not support advanced VoIP features such as media channel encryption. Most major SIP client apps support voice encryption using SRTP and either SDES or ZRTP for key negotiation. Popular open source SIP severs such as Asterisk and FreeSWITCH also support SRTP, SDES, and ZRTP and make it fairly easy to build a small scale secure VoIP network that can be used by Android clients. Hopefully, the Android framework will be extended to include the features required to implement secure voice communication without using third party libraries, and integrate any such features with other security services provided by the platform.

Search This Blog

Android Explorations