Mitakuye Oyasin,
this seems a fine day to write about how I propose to use Diffie-Hellman (DH) key exchange within SIP to enable secure media exchange over the public Internet, and how this can change the way we think of secure VoIP, video communications, and secure media streaming, not just for individuals seeking privacy but also for commercial organizations.
I chose to start with the underlying idea in Diffie-Hellman key exchange, which is that two independent nodes can each compute the same final value for a session key, which can then be used with a symmetric cipher. Each node picks a secret number and raises a shared base to that power modulo a large prime, which yields a pair of values, one private and one public. The public number of the pair is shared and the private number is kept secret. The other node computes its own secret/public pair, combines its secret with the public number it received to generate the final cipher key, and returns its own public factor so the originator can do the same. The core idea is that the values publicly exchanged are insufficient to compute the final key value without the unshared numbers. Hence, observing the exchange does not allow you to know or compute the final cipher key used.
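The exchange described above can be sketched in a few lines of Python. This is only an illustration: the 64-bit prime is far too small for real use, and the generator `G` and the helper names `make_keypair` and `shared_secret` are my own assumptions rather than anything from an existing implementation.

```python
import secrets

# Toy parameters for illustration only. A real exchange would use a prime
# of 2048 bits or more. The generator G is an assumption of this sketch.
P = 0xFFFFFFFFFFFFFFC5  # 2**64 - 59, a small prime
G = 5

def make_keypair(p=P, g=G):
    """Return an ephemeral (private, public) pair of key factors."""
    private = secrets.randbelow(p - 3) + 2   # uniform in [2, p - 2]
    public = pow(g, private, p)              # g ** private mod p
    return private, public

def shared_secret(their_public, my_private, p=P):
    """Combine the other side's public factor with our private factor."""
    return pow(their_public, my_private, p)

callee_private, callee_public = make_keypair()
caller_private, caller_public = make_keypair()

# Each side uses only its own private factor and the other's public factor.
caller_key = shared_secret(callee_public, caller_private)
callee_key = shared_secret(caller_public, callee_private)
print(caller_key == callee_key)
```

Both sides arrive at the same value because (g^a)^b and (g^b)^a are equal modulo the prime, while an observer sees only the two public factors.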
The first problem, which comes even before key exchange, is cipher negotiation, and it is really similar to the SDP media codec negotiation that SIP already does. I mention it now because, while it is not an interesting or difficult problem, a number of national governments and organizations have developed their own symmetric ciphers. Basically, I choose to advertise the caller's cipher choices in the initial SIP INVITE message (which also indicates that a secure call is requested), and the called party selects a cipher and returns its choice in the SIP answer message.
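As a rough sketch of that negotiation, assuming the caller lists its ciphers in order of preference, the called party could pick the first one it also supports. The function name `choose_cipher` and the cipher labels are illustrative only; how the list is actually carried in the SIP messages is not specified here.

```python
def choose_cipher(offered, supported):
    """Return the first cipher in the caller's preference order that the
    called party also supports, or None if there is no overlap."""
    for name in offered:
        if name in supported:
            return name
    return None

# The caller advertises its choices in the INVITE; the called party
# replies with one of them in the answer.
offered_by_caller = ["CAMELLIA-256", "AES-256", "AES-128"]
supported_by_callee = {"AES-128", "AES-256", "SEED"}
print(choose_cipher(offered_by_caller, supported_by_callee))
```

If the result is None, the called party has no cipher in common with the caller and the secure call cannot be set up as requested.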
The second problem is with persistent private keys, which can be raided from participating nodes and then used to recompute past session keys. My answer for this is simple: I always generate unique ephemeral key factors for each call session. The prime used in a DH key exchange is actually a less interesting number, and as long as the ephemeral keys come out of unbiased random data they are secure from dictionary attacks. Similarly, the public and private key factors used to compute the final symmetric key are also ephemeral values, and hence the final media cipher key is unique for each session as well, so that nothing retained on a node afterwards can be used to recover an earlier session's key.
The third problem is simply assuring that a valid and unbiased random number source is used to create the ephemeral keys. This is actually outside the scope of my own work, as I simply reuse existing high-entropy random number sources, whether offered by the Linux kernel, the OpenSSL library, or by other means.
I further note that a traditional public/private key pair is simply a static prime together with private and public factors that are computed only once and retained permanently, and with large enough numbers this is considered secure for many years by itself. While the prime can be entirely static, I think recomputing a new random prime every few weeks is far more than sufficient, especially given that ephemeral key factors are used for the initial exchange, making every session unique.
While the prime is consistent, and hence need not be exchanged in the final reply, what is important are the two ephemeral public factors, one computed by the called party (returned as part of the SIP "answer" message), and the other by the inviting party (returned in the SIP ACK). What happens then is this:
In the SIP INVITE I will advertise for key exchange and advertise what cipher(s) we support. The INVITE message may already be very large, and may also include authentication headers as well as the advertised media descriptions, so we do not wish to add much more to it. This is especially important when using SIP over UDP, where a SIP message cannot be larger than a single datagram. Advertising only this minimal amount of information keeps the INVITE from growing much larger.
The called party creates an ephemeral (random) private key factor and, using the prime, computes from it a public factor that can then be shared. The prime and the public key factor can be sent in the SIP answer message. Since the answer may have a smaller media descriptor and never has authentication headers, adding the public key is not an impossible burden. Indeed, the public key is likely in the same size range as a SIP authorization header might be.
In response to the SIP answer message, the calling (inviting) party computes its own private (ephemeral) value, which is then used to compute a public factor that the called party needs for computing the final media cipher key. The public factor it computes is returned in the SIP ACK message, which also completes SIP media call session setup.
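The three steps above can be sketched as follows, with each SIP message modelled as a plain Python dictionary. This is a sketch under stated assumptions: the field names (`ciphers`, `prime`, `generator`, `public`), the generator value, and the toy 64-bit prime are all hypothetical, and nothing here reflects an actual SIP header layout.

```python
import secrets

P = 0xFFFFFFFFFFFFFFC5  # toy prime (2**64 - 59), far too small for real use
G = 5                   # generator, an assumption of this sketch

def invite(ciphers):
    """Step 1: the caller only advertises key exchange and its ciphers."""
    return {"method": "INVITE", "secure": True, "ciphers": list(ciphers)}

def answer(invite_msg, supported, p=P, g=G):
    """Step 2: the called party picks a cipher and returns the prime and
    its ephemeral public factor. Its private factor stays local."""
    cipher = next((c for c in invite_msg["ciphers"] if c in supported), None)
    private = secrets.randbelow(p - 3) + 2
    msg = {"status": 200, "cipher": cipher, "prime": p,
           "generator": g, "public": pow(g, private, p)}
    return msg, private

def ack(answer_msg):
    """Step 3: the caller returns its own public factor in the ACK and
    can already compute the final media cipher key."""
    p, g = answer_msg["prime"], answer_msg["generator"]
    private = secrets.randbelow(p - 3) + 2
    msg = {"method": "ACK", "public": pow(g, private, p)}
    return msg, pow(answer_msg["public"], private, p)

def callee_finish(ack_msg, callee_private, p=P):
    """The called party computes the same key once the ACK arrives."""
    return pow(ack_msg["public"], callee_private, p)

invite_msg = invite(["AES-256", "AES-128"])
answer_msg, callee_private = answer(invite_msg, {"AES-128"})
ack_msg, caller_key = ack(answer_msg)
print(callee_finish(ack_msg, callee_private) == caller_key)
```

Note that the INVITE carries no key material at all and the ACK carries only one public factor, which is what keeps the largest message small.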
Once both sides have both shared public factors, they can independently compute the final cipher key to use for the media. But Diffie-Hellman is itself vulnerable to a very basic man-in-the-middle (MITM) attack. If someone in the middle answers the invite, they can compute and send their own ephemeral key factor. It is not going to be the same as the one the final node would generate, however.
The MITM would then also have to send the invite on to the final destination. When the final destination answers, it gives an ephemeral factor. The MITM then simply computes a new random factor for the final cipher key and returns to the answering party the factor needed to compute a final cipher key between the MITM and the destination.
Similarly, the MITM sends its own answer message with the initial computed random (ephemeral) key factor. The caller then computes a factor for the final cipher key and sends this back in its SIP ACK. The man in the middle then computes the final media cipher key used between itself and the caller, and a different final media cipher key with the destination. Media that is sent by the calling party is then intercepted and decrypted using the caller's cipher key, and re-encrypted for the final destination using the cipher key the MITM negotiated with it. Neither end knows there is someone decrypting and listening to everything that is being sent, and doing so in real time. This is the classic problem with pure Diffie-Hellman key exchange.
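To make the attack concrete, here is a small simulation of it, again with a toy prime and an assumed generator. The function `simulate_mitm` and its dictionary keys are names invented for this sketch. The point it shows is that the attacker ends up sharing one key with the caller and a separate key with the destination, while each endpoint has seen a public factor that the other endpoint never sent.

```python
import secrets

P = 0xFFFFFFFFFFFFFFC5  # toy prime (2**64 - 59), illustration only
G = 5                   # generator, an assumption of this sketch

def keypair():
    private = secrets.randbelow(P - 3) + 2
    return private, pow(G, private, P)

def simulate_mitm():
    caller_priv, caller_pub = keypair()
    callee_priv, callee_pub = keypair()
    # The attacker makes one ephemeral pair per leg of the call.
    to_caller_priv, to_caller_pub = keypair()   # sent in the fake answer
    to_callee_priv, to_callee_pub = keypair()   # sent in the fake ACK
    return {
        # Keys as computed by the two honest endpoints.
        "caller_key": pow(to_caller_pub, caller_priv, P),
        "callee_key": pow(to_callee_pub, callee_priv, P),
        # Keys as computed by the attacker for each leg.
        "mitm_key_with_caller": pow(caller_pub, to_caller_priv, P),
        "mitm_key_with_callee": pow(callee_pub, to_callee_priv, P),
        # The pair of public factors each endpoint believes was exchanged,
        # written as (called party's factor, calling party's factor).
        "factors_seen_by_caller": (to_caller_pub, caller_pub),
        "factors_seen_by_callee": (callee_pub, to_callee_pub),
    }

result = simulate_mitm()
print(result["caller_key"] == result["mitm_key_with_caller"])
print(result["callee_key"] == result["mitm_key_with_callee"])
print(result["factors_seen_by_caller"] == result["factors_seen_by_callee"])
```

The first two comparisons always hold, which is why the attacker can decrypt and re-encrypt the media. The last comparison fails except by an astronomically unlikely coincidence, since the attacker cannot reproduce factors whose private halves it does not know.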
One thing we do know is that if both endpoints receive and hold the same key factors, there could not be a man in the middle. This is because, while the MITM will know the prime used, if the ephemeral private factors are sufficiently unbiased it has no means of knowing these private (secret) factors that were not shared. Hence it can only generate its own new and unique private key factors and then compute a new (and different) public value that it then gives to each party in the call.
Hence, if we can prove the factors have the same values on both ends, we can validate the exchange. There are two ways to do this.
The first, which is perfect for purely anonymous communication, is to compute a hash or human-readable string based on the two key factors. Both sides can do this, and present it to the user if in a real-time call. The users can then read this hash to each other, and if they both have the same value, they know the key factors are the same on both ends. This is essentially what ZRTP does also. This methodology we will keep, as it is particularly useful for anonymous calls.
Of course one may not wish to read hashes to each other in each and every call. Nor have we "proven" a user identity or allowed for automated transactions. However, if these are important there is one additional step we can also take beyond what ZRTP does.
The second way is for each end to sign the hash it computed with its own private key and send the signed hash to the other party. As each end receives the signed hash, it can be automatically compared with the locally generated hash to validate that the keys were exchanged clean. The hashes are as trustworthy as the signing key used. This requires the destination to have the public key for the user's SIP From identity in its key chain, of course. But this mode is seamless and very appropriate to a public or private organization deploying distributed self-managed secure networks, or for enabling generic voice commerce and secure B2B communications over the public Internet.
Entirely automatic exchange is also possible, because one could create a dummy or special account filled with GPG keys for this purpose that an application can use unattended. For example, a /root GPG keychain could be filled with a special "mail@xxxx" key pair, and this could even be published through existing key servers. Then, if this were used to automatically transfer SMTP messages encapsulated in a cryptographic context initiated through SIP, the entire means of signing and verification also becomes automatic. This is somewhat analogous to SSL, except that we have eliminated the role of the certificate authority, and hence all keys are entirely self-managed, without requiring third parties or trusting some third-party certificate.
Now there is one other trick that GNU SIP Witch in particular can do which will make this methodology, and both anonymous and verified secure calling over VoIP, much more widely available. GNU SIP Witch can choose to act as a media proxy. It can do this by rewriting the SDP so that media goes to RTP ports managed by SIP Witch. These ports could then take in unencrypted RTP streams and encrypt them using media cipher keys that GNU SIP Witch computes. This would be done if the calling party or the called party does not already support secure calling on its own. This means any existing SIP VoIP application, including SIP phone devices, can suddenly be used to make entirely secure calls without any modification. Moreover, SIP Witch can selectively use secure calling depending on whether the endpoints are on the same subnet or not, or it can be placed at each workstation as a local proxy to assure that all call traffic, including internal traffic, is always secure, especially if there is concern about internal espionage. This maximizes the range of secure deployment scenarios, all without requiring the introduction of new secure VoIP user agents.
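The SDP rewriting step can be illustrated with a small sketch. This is not GNU SIP Witch's actual code; the function `rewrite_sdp` is hypothetical, it handles only a single IPv4 connection line and audio media lines, and the addresses and ports are made-up example values.

```python
def rewrite_sdp(sdp, proxy_addr, proxy_port):
    """Point the connection address and the audio port of an SDP body
    at a proxy-managed RTP port, leaving every other line untouched."""
    out = []
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            line = "c=IN IP4 " + proxy_addr
        elif line.startswith("m=audio "):
            fields = line.split(" ")
            fields[1] = str(proxy_port)   # second field is the RTP port
            line = " ".join(fields)
        out.append(line)
    return "\r\n".join(out) + "\r\n"

original = (
    "v=0\r\n"
    "o=alice 2890844526 2890844526 IN IP4 10.0.0.5\r\n"
    "s=-\r\n"
    "c=IN IP4 10.0.0.5\r\n"
    "t=0 0\r\n"
    "m=audio 49170 RTP/AVP 0\r\n"
)
print(rewrite_sdp(original, "192.0.2.10", 40000))
```

After the rewrite the endpoint sends its plain RTP to the proxy port, and the proxy is then in a position to encrypt the stream before it leaves for the far end.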
There are other reasons, related to anonymous calling uses, why creating a stand-alone secure SIP user agent application is still desirable. This is especially true when creating a secure client that can, for example, be run at an Internet cafe to offer anonymous secure communications. The core methodology is entirely applicable to a custom-written client application as well as to a SIP intermediary service like GNU SIP Witch. It is also applicable to using SIP for managing all kinds of "media" sessions, including, as noted, things like SMTP email exchange that traditionally used SSL.
GNU SIP Witch was originally brought forward as a means to replace Skype and preserve individual communication privacy by enabling the use of SIP URIs for calling ZRTP users. Solving the core issues of enabling unrelated parties to call each other securely, with provable identities, over the public Internet using SIP URIs alone makes many commercial uses practical as well. This includes business-to-business calling and receiving calls from commercial customers over the public Internet, and use by the medical profession or lawyers, where privacy and security are mandated requirements. Because we preserve privacy and also continue the optional use of social key verification, we also address those needs where privacy and anonymity are essential.
Although I had discussed this with somewhat less clarity previously, I also failed over the past 3 years to find an actually visionary organization outside of a national government that understood the potential of this work. By finally having time to focus exclusively these last two weeks on drafting a detailed NSF grant application, it has become possible for me to improve clarity. As to finding a visionary organization to fund this work and truly bring this idea forward into widespread general use, I think that too may be about to change.