Let's be honest - most telephony features are boring. The top business voice features - transfer, conference, and hold/resume - have been around since nearly the dawn of time, and they haven't changed much since then. Yet, despite their long existence, they remain hard to use. The conference button on most phones has reached a state of infamy. How often have we all heard colleagues tell us over the phone, "OK, hang on, I'm going to try and conference Joe in now, if this doesn't work, I'll call you back"? Why is it that this feature - and others like it - is hard to use?
The answer is simple. When telephones were first invented, they had very limited user interfaces. Users provided input by speaking or pressing a fixed set of buttons on the phone, and output from the system was limited to just voice. Features, like conference and transfer, were designed to use this kind of interface, and people got used to it. People became trained to know that, when you are on a call, and you press "conference", you get a dial tone, and that dial tone means that you should now dial the number of the second party. This implicit training led to the establishment of norms, and because users needed to know these norms in order to use the feature, people were reluctant to change them.
Humans are visual creatures though, and if you look at the modern wonders of user interface design - the iPhone, the Zune HD, even Google's home page - they are wonderful because they are visual perfection. You look at them, and you know exactly what to do. You don't need to understand the norm. You don't need to be trained. It's obvious what to do, and it's obvious with just a glance.
Let me pick on another oft-maligned feature – call hold. Call hold typically plays music of some sort to the other party. This music is there for a reason – there is a fundamental requirement to alert the other party that they shouldn’t hang up, that despite the fact that they cannot be heard, something is happening. If there were no music, a user could not differentiate a dead connection (in which case they should hang up) from hold (in which case they should wait). Given that the only interface available for conveying information to users was voice, designers of this feature decided to play music, and now we’ve all been trained that hearing music means we are on hold. However, most folks hate music-on-hold. The music quality is awful, it’s often music we don’t care to listen to, and it can be incredibly disruptive when played into conference calls.
Now that technology has evolved, enabling rich visual interfaces on devices large and small, it becomes possible to re-imagine these features with visual interfaces. What would call hold look like if it were redesigned today with the iPhone in mind? To invoke it, I'd still have a button that says "hold". But when I press it, the other participant gets a visual cue that lets them know they are on hold. This would be a permanent indicator, right there in the main call pane, that tells them clearly - in their native language - that they are on hold. In some cases, I may need audible cues – for example, when my cell phone is in a screen saver mode. However, these can be played locally and would never be rendered into a conference call. The audible cue can become contextual - only used when it is needed. Context is another hallmark of modern user interface design.
Visual refreshes of the most common features aren't just academic; they can have a real impact on the bottom line. Consider how often invocation of these features fails - how often a held call is dropped by accident, how often a 3-way conference doesn't work - and those wasted minutes start to add up. Those minutes are even more valuable when dealing with colleagues in other businesses, where time is more valuable than ever. Some simple math bears this out. If every user wastes just 30 seconds a day, three days a week, on failed interactions with these features, this costs an enterprise with 50,000 employees about 1.5 million dollars a year. That is 1.5 million dollars it can save by modernizing these features and making them visual.
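To see where a figure like 1.5 million dollars could come from, here is a rough back-of-the-envelope sketch. The 30 seconds, three days a week, and 50,000 employees come from the paragraph above; the 50 working weeks per year and the loaded cost of 24 dollars per employee-hour are my own assumptions, picked because they reproduce the total, not numbers stated in the estimate itself.

```python
# Back-of-the-envelope check of the wasted-time estimate.
SECONDS_WASTED_PER_DAY = 30      # from the text
DAYS_PER_WEEK_AFFECTED = 3       # from the text
EMPLOYEES = 50_000               # from the text
WORKING_WEEKS_PER_YEAR = 50      # assumption, not stated in the text
HOURLY_COST_USD = 24.0           # assumption: loaded cost per employee-hour


def annual_cost_usd(seconds_per_day, days_per_week, weeks_per_year,
                    employees, hourly_cost):
    """Yearly cost of the time lost across the whole enterprise."""
    hours_per_employee = seconds_per_day * days_per_week * weeks_per_year / 3600
    return hours_per_employee * employees * hourly_cost


cost = annual_cost_usd(SECONDS_WASTED_PER_DAY, DAYS_PER_WEEK_AFFECTED,
                       WORKING_WEEKS_PER_YEAR, EMPLOYEES, HOURLY_COST_USD)
print(f"${cost:,.0f} per year")  # prints "$1,500,000 per year"
```

Under these assumptions each employee loses only 1.25 hours a year, but across 50,000 people that is 62,500 hours. A different hourly cost or number of working weeks scales the total proportionally.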
In many cases, the way these features are made visual is to take information which was previously being conveyed to users in speech, and instead, convey it to their devices through semantic protocol messages that can be understood by their device. Once received, their device can interpret them and provide a visual rendering of the result. In the case of call hold, instead of playing music-on-hold through the voice channel, we'd send a signaling message to the user, which their phone could use to render a visual cue in a way that is appropriate to the user interface of their device. Creating features in this way is called semantic signaling.
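As a minimal sketch of the idea, the hold state can travel as a small structured message, and each device decides how to present it. Everything here is hypothetical and for illustration only: the JSON layout, the `make_hold_event` and `render` functions, and the `HOLD_TEXT` table are not part of SIP or of any real product.

```python
import json
from dataclasses import dataclass

# Hypothetical localized strings a device might ship with.
HOLD_TEXT = {
    "en": "You are on hold",
    "fr": "Appel en attente",
}


@dataclass
class Device:
    locale: str      # user's preferred language
    screen_on: bool  # False when the screen is off or in screen saver mode


def make_hold_event(call_id: str, on_hold: bool) -> str:
    """Build the semantic message sent instead of music on the voice channel."""
    return json.dumps({"event": "hold", "call_id": call_id, "on_hold": on_hold})


def render(device: Device, message: str) -> dict:
    """Decide locally how this device presents the hold state."""
    event = json.loads(message)
    if event.get("event") != "hold" or not event.get("on_hold"):
        return {"banner": None, "local_tone": False}
    banner = HOLD_TEXT.get(device.locale, HOLD_TEXT["en"])
    # An audible cue is played only on this device, and only when the
    # visual cue cannot be seen; nothing is mixed into the call audio.
    return {"banner": banner, "local_tone": not device.screen_on}


msg = make_hold_event("call-1", True)
print(render(Device(locale="fr", screen_on=True), msg))
print(render(Device(locale="en", screen_on=False), msg))
```

The sender states only what happened ("this call is on hold"). Language, layout, and whether to play a tone are left to the receiving device, which is what keeps the cue contextual.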
If it is so obvious that this is a better way to do it, why isn't it there today? The reason is that the Public Switched Telephone Network (PSTN) provides a fixed and fundamentally non-extensible set of signaling features between parties. To go beyond its limitations, it is necessary to switch to protocols like SIP, and to use SIP in ways which allow us to easily add new features in the future. With more and more enterprises deploying IP-based communications within their boundaries, the groundwork is being built for this switch.
Once that switch arrives, things will change. Our experiences will broaden, and they will become visual. We'll look back on the days when it was actually hard to set up a conference call, and we'll laugh.