It seems that ‘bots’ are ‘hot’ in the sense that the topic’s really attracting a lot of attention. If I had to pick out one great article that I’ve read in this area then I’d say it was this one on “Conversational Commerce” by Uber’s Chris Messina, which really brought it home to me. I was late in coming to it but I think it’s a really good read.
The rest of this post does not relate to anything that Microsoft does or makes, it’s more my own brain dump from trying to think through some of the pieces that might be part of a platform that provides an ability to have these types of conversations.
I’ve been thinking on this topic of ‘bots’ for a little while and I wanted to;
- start to try and get my thoughts in order and write them up so that I can come back to them and refine them
- have a framework that I can use to look at particular platforms for ‘bots’ and try to evaluate whether that platform covers some of the areas that I’d expect a ‘bot’ platform to address.
Beyond here, I’m going to use the term ‘Agent’ rather than ‘Bot’ to avoid getting tied up in any particular implementation or anything like that.
Once again, it’s just a brain dump and a pretty sketchy one but you have to start somewhere.
We’ve been having conversations of different forms with software Agents for the longest time. You could argue that when I do something like;
then I’m having a ‘conversation’ with the command prompt.
I “say” something and it “says” something back. It’s a short conversation. It’s not a very natural conversation but, nonetheless, it’s a form of conversation.
It also doesn’t offer much in the way of choice around the input/output mechanisms here. As far as I know, I can’t speak to the command prompt and it doesn’t speak back although it may have accessibility capabilities that I’m unaware of.
At a more advanced level, I can have a conversation with an Agent on one of my devices today and I can actually say something like;
- “Agent, can you play the song [SONG TITLE] for me?”
- “Did you mean this one or that one?”
- “The second one”
- “Ok, playing it”
This one definitely feels a bit more “conversational” and an Agent that accepts speech like this usually accepts at least typing as another possible input mechanism and displaying on a screen as another possible output mechanism.
Implicit in there is the idea that the Agent that I’m speaking to knows of some kind of app or service that can actually find music and get it played for me and it’s debatable as to whether that app/service does or doesn’t display a graphical interface as maybe sometimes it should and sometimes it shouldn’t depending on the context.
What’s interesting though would be that if I then continued the conversation with something like;
- “Agent, remember that song you played for me just before lunch? Play it again”
then I don’t know whether there are any platforms out there today that can handle even a simple conversation like that, or the notion that conversations might last a while and might have related history.
The context has been lost at that point and we have to start again. It feels like even the simplest elements of human conversations are going to need quite a lot of work if they’re going to be brought to a ‘conversation platform’ and, naturally, this will be done incrementally with ever-growing value along the way.
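To make that idea of conversational history a bit more concrete, here’s a minimal sketch (all names are my own invention, not any real platform’s API) of an Agent keeping a timestamped record of what it did, so that a request like “the song you played just before lunch” could at least be looked up:

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# Hypothetical sketch: an in-memory history that would let an Agent answer
# "remember that song you played for me just before lunch?"
@dataclass
class ConversationHistory:
    events: list = field(default_factory=list)  # (timestamp, action, detail)

    def record(self, when: datetime, action: str, detail: str) -> None:
        self.events.append((when, action, detail))

    def last_before(self, action: str, cutoff: time):
        # Most recent matching event that happened before the cutoff time.
        matches = [e for e in self.events
                   if e[1] == action and e[0].time() < cutoff]
        return max(matches, key=lambda e: e[0], default=None)

history = ConversationHistory()
history.record(datetime(2016, 4, 1, 9, 30), "play_song", "Song A")
history.record(datetime(2016, 4, 1, 11, 45), "play_song", "Song B")
history.record(datetime(2016, 4, 1, 14, 0), "play_song", "Song C")

# "the song you played just before lunch" -> the latest play before 12:00
event = history.last_before("play_song", time(12, 0))
print(event[2])  # Song B
```

The hard part, of course, isn’t storing the history – it’s understanding that “just before lunch” maps to a time-based query in the first place.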
With that in mind, I was trying to think of some of the pieces that might make up that kind of platform and mostly so that I can come back to them at a later point. Some of these pieces I’ve seen referred to in other articles, videos, etc. and others I’m perhaps conjuring out of the air.
I scratched my head for quite a while and, thinking of conversations in a broad sense, this list dropped out of some of the pieces that might be involved in a conversational platform;
- The Conversational Host or Canvas
- The Agent
- Language Processing
- User Knowledge
That list isn’t ordered in any fashion.
I did a quick sketch below and you’ll soon realise from the number of arrows on the diagram that I haven’t reached any kind of clarity on this just yet and am still fumbling a bit and ‘making it up’ but, again, it’s a starting point that can be refined.
The Conversational Host or Canvas
This feels like a very broad term but it seems that there’s a need for “something” to host the conversation and it might be something like an app that hosts a voice call or an SMS conversation. It might be a chat app or an email client. It might be a part of an operating system “shell”.
It’s the “host” of the conversation and, naturally, I might want to move from one host to another and have a conversation follow me which almost certainly comes with a set of technical challenges.
Some conversational hosts might serve a specific purpose. For example, a device on a kitchen table that is hard-wired to play music.
Others might broker between many agents – for example a chat application that can both book train tickets and return flight information.
It seems to me that it’s likely that the Canvas will control the modes of input and output, perhaps offering some subset of those available on the device that it’s running on. It also seems to me that it’s unlikely that developers will want to build for every Canvas out there and so, over time, perhaps some canvases will be specifically targeted whereas others might somehow be treated generically.
The Agent
The Agent is the piece of software that the user is having the conversation with through the Canvas in question. The Canvas and the Agent might sometimes be one and the same thing and/or might be produced by the same company. I guess, though, that in the general case the Canvas (e.g. an IM chat window) could be separate from the Agent (e.g. a travel Agent) which might itself rely on separate Services (e.g. a weather service, a train booking service, a plane booking service) in order to get work done.
How does the user discover that a particular (complex) Canvas has an Agent available and, beyond that, how do they discover what that Agent can do?
It’s the age-old ‘Clippy’ style problem. A Canvas (e.g. a chat app) can broker conversations with N Agents but the user doesn’t know that and we see this today with personal assistants offering menus via “Tell me what I can say” type commands.
It seems to me that there’s a general need for discovery and it might involve things like;
- Reading the instructions that came with the Canvas
- Asking the Canvas before…
- Asking the Agent.
- Looking up services in a directory.
- Being prompted by the Canvas (hopefully with some level of intelligence) when an appropriate moment seems to arrive – e.g. “did you know that I can book tickets for you?”.
and no doubt more, but you need to know that you can start a conversation before you start a conversation.
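The “looking up services in a directory” option could be sketched as each Agent publishing a small capability manifest that a Canvas (or a directory service) can query. This is purely illustrative – the agent names, manifest shape and `agents_for_intent` lookup are all assumptions of mine, not any real platform:

```python
# Hypothetical sketch: each Agent publishes a capability manifest that a
# Canvas or directory service could query to answer "what can you do?"
AGENTS = {
    "travel-agent": {
        "description": "Books train tickets and looks up flights",
        "intents": ["book_train_ticket", "flight_info"],
    },
    "music-agent": {
        "description": "Finds and plays songs",
        "intents": ["play_song"],
    },
}

def agents_for_intent(intent: str) -> list:
    """Directory lookup: which Agents claim to handle this intent?"""
    return [name for name, manifest in AGENTS.items()
            if intent in manifest["intents"]]

print(agents_for_intent("book_train_ticket"))  # ['travel-agent']
```

A manifest like this would also give the Canvas something to draw on when prompting the user “intelligently” at an appropriate moment.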
There’s lots of ways to converse. We can do it by voice, by typing, by SMS. We might even stretch to include things like pointing with gamepads or waving our arms to dismiss dialogs but maybe that’s pushing things a bit far.
Equally, there’s many ways to display outputs and a lot of this is going to depend on the Canvas and device in question.
For example, if I have an Agent that knows how to find photos, I might input;
“Agent, go get me some high quality, large size images of people enjoying breakfast”
What should the output be? Maybe a set of hyperlinks? Maybe open an imaging app and display the photos themselves ready for copy/paste? Maybe offer to send me an email with all the details in so I can read it later?
I’d argue that it depends on the Canvas, the device and what I’m currently doing. If I’m walking down the street then the email option might be a good one. If I’m sitting at my PC then maybe open up an app and show me the results.
I suspect that this might get complex over time but I/O options seem to be a big part of trying to have a conversation.
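That “it depends on the Canvas, the device and what I’m currently doing” decision could be sketched as a simple policy function. This is a toy under my own assumptions (the activity labels and channel names are invented), but it shows the shape of the choice:

```python
# Hypothetical sketch: picking an output mechanism from what the Canvas
# offers and from what the user is currently doing.
def choose_output(available: set, user_activity: str) -> str:
    if user_activity == "walking" and "email" in available:
        return "email"          # send the details so they can be read later
    if "screen" in available:
        return "screen"         # open an app and show the results
    if "speech" in available:
        return "speech"
    return "text"

print(choose_output({"email", "screen", "speech"}, "walking"))   # email
print(choose_output({"email", "screen", "speech"}, "at_desk"))   # screen
```

A real platform would presumably need something far richer than a hard-coded rule chain, but even this toy makes the point that the choice lives outside the Agent’s core logic.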
How to start a conversation?
At what point does a conversation with an Agent begin in the sense that the Agent tracks the flow of interactions back and forth such that it can build up Context and start to offer some kind of useful function?
Most Agents support some kind of “Hey, how are you?” type interaction but that’s not really the conversation opener, it perhaps comes more at the point where someone says “I need a train” or “I need a ticket” or similar.
Conversations are stateful and could potentially span across many devices and Canvases and so there’s going to need to be some kind of conversational identifier that can be (re)presented to the agent at a later point in time. The analogy in the human world would be something like;
Remember when we were talking about that holiday in Spain last week?
and, no doubt, if we’re to make conversations work in the virtual world then there is likely to be an equivalent.
An identifier for a conversation is one thing but it’s pretty much useless without a notion of the user who was involved in the conversation.
You’d imagine that this is perhaps one of the things that a Canvas can do for a user – e.g. an IM Canvas has presumably already authenticated the user so it might be able to provide some kind of token representing that identity to an Agent such that the Agent can know the difference between conversations with Mike and conversations with Michelle.
If a conversation then moves from one Canvas to another then the Agent has to be able to understand whatever identity token might come from the second Canvas as well.
I suspect that this is a roundabout way of saying that it feels to me like identity is going to be an important piece in a platform that does conversations.
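As a very rough sketch of that token idea – and a real platform would use something standard like OAuth/OpenID Connect rather than anything home-grown like this – the Canvas could hand the Agent a signed token naming the user, which the Agent verifies before keying any conversation state. The shared secret here is purely an assumption for the sketch:

```python
import hashlib, hmac

# Hypothetical sketch: a Canvas hands the Agent a signed token naming the
# user, so the Agent can key conversations per user. Real platforms would
# use a standard identity protocol rather than this toy scheme.
SHARED_SECRET = b"canvas-agent-secret"   # assumption: pre-shared for the sketch

def issue_token(user_id: str) -> str:
    sig = hmac.new(SHARED_SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}:{sig}"

def user_from_token(token: str):
    user_id, sig = token.rsplit(":", 1)
    expected = hmac.new(SHARED_SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return user_id if hmac.compare_digest(sig, expected) else None

token = issue_token("mike")
print(user_from_token(token))            # mike
print(user_from_token("mike:forged"))    # None
```

The point of the sketch is the division of labour: the Canvas authenticates, the Agent only verifies and uses the identity – which is exactly what gets tricky when the conversation moves to a second Canvas with a different identity scheme.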
I’m obsessed with context and I guess that a conversation with an Agent is, in some ways, about building up the context to the point where some ‘transaction’ can be completed.
That context needs to be associated with the conversation and with the identity of the user and perhaps needs to have some kind of associated lifetime such that it doesn’t stay around for ever in a situation where a conversation starts but never completes.
There’s then the question of whether that context can be;
- pre-loaded with some of the knowledge that either the Agent or the Canvas has about the user.
- used to add to the knowledge that either the agent or the Canvas keeps about the user after the conversation.
For example – if a user has a conversation with an Agent about a train journey then part of the context might be the start/end locations.
If one of those locations turns out to be the user’s home then that might become part of the future knowledge that an Agent or a Canvas has about the user such that in the future it can be smarter. Naturally, that needs to remain within the user’s control in terms of the consent around where it might be used and/or shared.
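The “associated lifetime” idea could look something like the sketch below – context that quietly forgets itself once the conversation has gone stale. The class and the 30-minute figure are both my own assumptions, just to illustrate the idea:

```python
import time

# Hypothetical sketch: conversation context that expires, so an abandoned
# conversation doesn't keep state around forever.
class ConversationContext:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.data = {}
        self.touched = time.monotonic()

    def set(self, key, value):
        self.data[key] = value
        self.touched = time.monotonic()

    def get(self, key):
        if time.monotonic() - self.touched > self.ttl:
            self.data.clear()            # lifetime elapsed: forget everything
        return self.data.get(key)

ctx = ConversationContext(ttl_seconds=1800)   # e.g. a 30 minute lifetime
ctx.set("journey_from", "home")
ctx.set("journey_to", "work")
print(ctx.get("journey_from"))  # home
```

Pre-loading the context from User Knowledge, or promoting parts of it into User Knowledge afterwards, would then just be reads and writes against a store like this – subject to the consent questions above.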
User Knowledge
I’m unsure whether knowledge about a user lives with an Agent, with a Canvas, with a Service or with all of them and I suspect it’s perhaps the latter – i.e. all of them.
No doubt, this is related to Identity, Context and Trust in the sense that if I use some Canvas on a regular basis (like a favourite chat app) and if it comes from a vendor that I trust then I might be prepared to share more personal data with that Canvas than I do perhaps with an Agent which does (e.g.) concert-ticket bookings and which I only use once every 1-2 years.
The sort of knowledge that I’m thinking of here stems from personal information like age, date-of-birth, gender, height, weight, etc. through into locations like home, work and so on and then perhaps also spans into things like friends/family.
You can imagine scenarios like;
“Hey, can you ask the train ticketing service to get me a ticket from home to my brother’s place early in the morning a week on Saturday and drop him an SMS to tell him I’m coming?”
and a Canvas (or Agent) that can use knowledge about the user to plug in all the gaps around the terms ‘home’ and ‘brother’ in order to work out the locations and phone numbers is a useful thing.
Now, whether it’s the Canvas that turns these terms into hard data or whether it’s the Agent that does it, I’m not sure.
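Whichever of the two does it, the mechanics would be some kind of slot filling against a store of User Knowledge. A minimal sketch, with the knowledge store and addresses entirely made up for illustration:

```python
# Hypothetical sketch: filling the gaps around terms like 'home' and
# 'brother' from a store of knowledge about the user. Whether the Canvas
# or the Agent holds that store is the open question in the text above.
USER_KNOWLEDGE = {
    "home": {"location": "23 Acacia Avenue"},
    "brother": {"location": "14 High Street", "phone": "+44 7700 900123"},
}

def resolve_slots(slots: dict) -> dict:
    """Replace personal terms with hard data where we have it."""
    resolved = {}
    for name, term in slots.items():
        known = USER_KNOWLEDGE.get(term)
        resolved[name] = known["location"] if known else term
    return resolved

print(resolve_slots({"from": "home", "to": "brother"}))
# {'from': '23 Acacia Avenue', 'to': '14 High Street'}
```

The unresolved case (a term the store doesn’t know) is exactly where the prompting and consent questions in the next section kick in.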
Trust is key here. As a user, I don’t want to have a conversation with an Agent that is then keeping or sharing data that I didn’t consent to but, equally, conversations that are constantly interrupted by privacy settings aren’t likely to progress well.
In a conversation between User<->Canvas<->Agent<->Service it’s not always going to be easy to know where the trust boundaries are being drawn or stretched and perhaps it becomes the responsibility of the Canvas/Agent to let the user know what’s going on as knowledge is disseminated? For example, in a simple scenario of;
“Hey, can you buy me train tickets from home to work next Tuesday?”
there’s a question around whether the Agent needs to prompt if it’s not aware of what ‘home’ and ‘work’ might mean and doesn’t have a means to purchase the ticket.
Also, does the Canvas attempt to help in fleshing out those definitions of ‘home’ and ‘work’ and does it do so with or without the user’s explicit consent?
It feels like a conversational platform needs to have the ability to define dialogs for how a conversation is supposed to flow between the user and the agent.
I suspect that it probably shouldn’t be a developer who defines what the structure and the content of these dialogs should be.
I also suspect that they shouldn’t really be hard-wired into some piece of code somewhere but should, somehow, be open to being created and revised by someone who has a deep understanding of the business domain and who can then tweak the dialogs based on usage.
That would imply a need for some sort of telemetry to be captured which lets that Agent be tweaked in response to how users are actually interacting with it.
Part of defining dialogs might tie in with inputs and outputs in that you might define different sets of dialogs depending on the input/output capabilities of the Canvas that’s hosting the conversation with the Agent. It’s common to use different techniques when using speech input/output versus (say) textual input/output and so dialogs would presumably need to cater for those types of differences.
Another part of defining dialogs might be to plug in decision logic around how dialogs flow based on input from the user and responses from services.
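Pulling those last few paragraphs together, dialogs-as-data might look something like this sketch – per-modality wording plus a simple branching hook, all of it editable without touching the Agent’s code. Every name here is hypothetical:

```python
# Hypothetical sketch: dialogs defined as data (not hard-wired into code)
# with per-modality wording and simple branching, so someone who knows the
# business domain could tweak them based on usage.
DIALOGS = {
    "ask_destination": {
        "speech": "Where would you like to go?",
        "text": "Destination?",
        "next": lambda answer: "confirm" if answer else "ask_destination",
    },
    "confirm": {
        "speech": "Shall I book that for you?",
        "text": "Book it? (y/n)",
        "next": lambda answer: None,     # the conversation ends here
    },
}

def prompt_for(step: str, modality: str) -> str:
    return DIALOGS[step][modality]

print(prompt_for("ask_destination", "speech"))       # Where would you like to go?
print(DIALOGS["ask_destination"]["next"]("London"))  # confirm
```

In a real platform the branching logic would presumably be richer than a lambda – pulling in service responses and User Knowledge as well as the raw answer – but the separation of dialog content from Agent code is the point.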
Language Understanding and Intent
One of the challenges of defining those dialogs is trying to cater for all the possible language variations in the ways in which a user might frame or phrase a particular question or statement. There are so many ways of achieving the same result that it’s practically impossible to define dialogs that cater for everything. For example;
- “I want to book a taxi”
- “I need a lift to catch my plane”
- “Can I get a car to the airport”
are simple variations of possibly the very same thing and so it feels to me like there’s a very definite need here for a service which can take all of these variations and turn them into more of a canonical representation which can report the intent that’s common across all three of them.
Without that, I think all developers are going to be building their own variant of that particular wheel and it’s not an easy wheel to build.
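Just to pin down what “canonical representation” means, here’s a deliberately naive sketch. A real language-understanding service would use trained models rather than keyword matching, but even this toy shows the shape of the mapping from many phrasings to one intent:

```python
# Hypothetical sketch: mapping many phrasings onto one canonical intent.
# A real platform would use a trained language-understanding service; this
# toy keyword match just illustrates the shape of the problem.
INTENT_KEYWORDS = {
    "book_taxi": {"taxi", "lift", "car", "cab"},
}

def intent_of(utterance: str):
    words = set(utterance.lower().replace("?", "").split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None

for phrase in ["I want to book a taxi",
               "I need a lift to catch my plane",
               "Can I get a car to the airport"]:
    print(intent_of(phrase))   # book_taxi, three times
```

The gap between this toy and something that copes with real language is exactly the wheel that developers shouldn’t each have to build for themselves.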
Just for completeness, if there is an “initiation” step to a conversation with an Agent then I guess there should be a point at which a conversation is “over” whether that be due to a time-out or whether it be that the user explicitly states that they are done.
I can see a future scenario analogous to clearing out your cookies in the browser today where you want to make sure that whatever you’ve been conversing about with some Agent via some Canvas has truly gone away.
An Agent is a representative for some collection of Services. These might be a bunch of RESTful services or similar and it’s easy to think of some kind of travel Agent that provides a more natural and interactive front end to a set of existing back-end services for booking hotels, flights, trains, ferries, etc. and looking up timetables.
A platform for conversations would probably want to make it as easy as possible to call services, bring back their data and surface it into decision-logic.
Sticking with decisions – there’s likely to be a need for making decisions in all but the simplest of conversations and those decisions might well steer the conversational flow in terms of the ‘dialogs’ that are presented to the user.
Those decisions might be based on the user’s input, the responses that come back from services invoked by the Agent or might be based on User Knowledge or some ambient context like the current date and time, weather, traffic or similar.
Some of that decision making might be influenced by use of the Agent itself – e.g. if the Agent uses telemetry to figure out that 95% of all users go with the default options across step 1 to step 5 of a conversation then maybe it can start to adapt and offer the user shortcuts based on that knowledge?
I’d expect an Agent to be gathering telemetry as it was progressing such that aggregate data was available across areas like;
- Agent usage – i.e. where traffic is coming from, how long conversations last, etc.
- Dialog flow – which paths through the Agent’s capabilities are ‘hot’ in the sense of followed by most users.
- Dialog blockers – points where conversations are consistently abandoned.
- Choices – the sorts of options that users are choosing as they navigate the Agent.
- Failures – places where an agent isn’t understanding the user’s intent.
I’m sure that there’s probably a lot more telemetry that an Agent would gather – it’s definitely an important part of the picture.
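The mechanics of that telemetry could be as simple as the sketch below – recording events as the conversation progresses and aggregating them afterwards. The event names are my own invention, echoing the list above:

```python
from collections import Counter

# Hypothetical sketch: recording telemetry events as a conversation
# progresses, then aggregating them to spot hot paths and blockers.
events = []

def track(kind: str, detail: str) -> None:
    events.append((kind, detail))

track("dialog_step", "ask_destination")
track("dialog_step", "confirm")
track("abandoned", "confirm")
track("dialog_step", "ask_destination")
track("intent_not_understood", "play my holiday mix")

summary = Counter(kind for kind, _ in events)
print(summary["dialog_step"])             # 3
print(summary["intent_not_understood"])   # 1
```

In practice this would be aggregated across all users and fed back into refining the dialogs, rather than sitting in a list in memory.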
It’s common in the real world to refer to a previous conversation from a current one, perhaps reaching back months or years, and I think that over time a conversation platform needs to cater for this.
That needs to fit with Trust but I think it would add a lot of value to an Agent to be able to cope with the idea of something like;
“I need to re-order the same lightbulbs that I ordered from you six months ago”
or similar. Whether that needs to be done by the Agent “remembering” the conversation or whether it needs to be done by one of its supporting Services taking on that responsibility, I’m not sure.
That’s my initial ramble over. I need to go away and think about this some more. Please feel free to give feedback as these are just my rough notes but there’s a few things in here that I think are probably going to stick with me as I think more on this topic of conversations…