Sketchy Thoughts on Bots/Agents/Conversations

It seems that ‘bots’ are ‘hot’ in the sense that the topic is attracting a lot of attention right now. If I had to pick out one great article that I’ve read in this area, I’d say it was this one on “Conversational Commerce” by Uber’s Chris Messina, which really brought it home to me – it’s a really good read, although I was late in coming to it.

You can read about some of Microsoft’s plans in the area of bots here [Bot Framework] and here [Cortana] and there were quite a few sessions at //build that relate to this area too.

The rest of this post does not relate to anything that Microsoft does or makes, it’s more my own brain dump from trying to think through some of the pieces that might be part of a platform that provides an ability to have these types of conversations.

I’ve been thinking on this topic of ‘bots’ for a little while and I wanted to;

  • start to try and get my thoughts in order and write them up so that I can come back to them and refine them
  • have a framework that I can use to look at particular platforms for ‘bots’ and try to evaluate whether that platform covers some of the areas that I’d expect a ‘bot’ platform to address.

Beyond here, I’m going to use the term ‘Agent’ rather than ‘Bot’ to avoid getting tied up in any particular implementation or anything like that.

Once again, it’s just a brain dump and a pretty sketchy one, but you have to start somewhere.

Conversations

We’ve been having conversations of different forms with software Agents for the longest time. You could argue that when I do something like;

[image: a short command and its response at the command prompt]

then I’m having a ‘conversation’ with the command prompt.

I “say” something and it “says” something back. It’s a short conversation. It’s not a very natural conversation but, nonetheless, it’s a form of conversation.

It also doesn’t offer much in the way of choice around the input/output mechanisms here. As far as I know, I can’t speak to the command prompt and it doesn’t speak back although it may have accessibility capabilities that I’m unaware of.

At a more advanced level, I can have a conversation with an Agent on one of my devices today and I can actually say something like;

    • “Agent, can you play the song [SONG TITLE] for me?”
    • “Did you mean this one or that one?”
    • “The second one”
    • “Ok, playing it”

This one definitely feels a bit more “conversational” and an Agent that accepts speech like this usually accepts at least typing as another possible input mechanism and displaying on a screen as another possible output mechanism.

Implicit in there is the idea that the Agent I’m speaking to knows of some kind of app or service that can actually find music and get it played for me. It’s debatable whether that app/service should display a graphical interface – maybe sometimes it should and sometimes it shouldn’t, depending on the context.

What’s interesting, though, is what would happen if I then continued the conversation with something like;

    • “Agent, remember that song you played for me just before lunch? Play it again”

then I don’t know whether there are any platforms out there today that can handle even a simple conversation like that, with its notion that conversations might last a while and might have related history.

The context has been lost at that point and we have to start again. It feels like even the simplest elements of human conversations are going to need quite a lot of work if they’re going to be brought to a ‘conversation platform’ and, naturally, this will be done incrementally, with ever-growing value along the way.
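As a way of making that concrete for myself, here’s a minimal sketch (in Python, with entirely invented names – Turn, Conversation, last_before) of the kind of timestamped history an Agent would need if it were ever to resolve “the song you played just before lunch”;

```python
from dataclasses import dataclass, field
from datetime import datetime, time

@dataclass
class Turn:
    when: datetime
    action: str   # e.g. "play_song"
    detail: str   # e.g. the song title

@dataclass
class Conversation:
    turns: list = field(default_factory=list)

    def record(self, when, action, detail):
        self.turns.append(Turn(when, action, detail))

    def last_before(self, action, cutoff: time):
        # the most recent matching turn that happened
        # before the given time of day
        matches = [t for t in self.turns
                   if t.action == action and t.when.time() < cutoff]
        return max(matches, key=lambda t: t.when, default=None)

convo = Conversation()
convo.record(datetime(2016, 4, 1, 11, 50), "play_song", "Song A")
convo.record(datetime(2016, 4, 1, 14, 30), "play_song", "Song B")

# "the song you played just before lunch" -> the 11:50 one
before_lunch = convo.last_before("play_song", time(12, 0))
```

Even a toy like this raises the platform questions of where that history lives and how long it’s kept for.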

With that in mind, I was trying to think of some of the pieces that might make up that kind of platform and mostly so that I can come back to them at a later point. Some of these pieces I’ve seen referred to in other articles, videos, etc. and others I’m perhaps conjuring out of the air.

Conversational Pieces

I scratched my head for quite a while and, thinking of conversations in a broad sense, this list of some of the pieces that might be involved in a conversational platform dropped out;

  • The Conversational Host or Canvas
  • The Agent
  • Discoverability
  • Inputs
  • Outputs
  • Initiation
  • Identity
  • Dialogs
  • Language Processing
  • User Knowledge
  • Trust
  • Context
  • Termination
  • Services
  • Decisions
  • Telemetry
  • History

That list isn’t ordered in any fashion.

I did a quick sketch below and you’ll soon realise from the number of arrows on the diagram that I haven’t reached any kind of clarity on this just yet and am still fumbling a bit and ‘making it up’ but, again, it’s a starting point that can be refined.

[diagram: rough sketch of the conversational pieces and the arrows between them]

The Conversational Host or Canvas

This feels like a very broad term but it seems that there’s a need for “something” to host the conversation and it might be something like an app that hosts a voice call or an SMS conversation. It might be a chat app or an email client. It might be a part of an operating system “shell”.

It’s the “host” of the conversation and, naturally, I might want to move from one host to another and have a conversation follow me which almost certainly comes with a set of technical challenges.

Some conversational hosts might serve a specific purpose. For example, a device on a kitchen table that is hard-wired to play music.

Others might broker between many agents – for example a chat application that can both book train tickets and return flight information.

It seems to me that it’s likely that the Canvas will control the modes of input and output, perhaps offering some subset of those available on the device that it’s running on. It also seems unlikely that developers will want to build for every Canvas out there and so, over time, perhaps some canvases will be specifically targeted whereas others might somehow be treated generically.

The Agent

The Agent is the piece of software that the user is having the conversation with through the Canvas in question. The Canvas and the Agent might sometimes be one and the same thing and/or might be produced by the same company, but I guess that in the general case the Canvas (e.g. an IM chat window) could be separate from the Agent (e.g. a travel Agent), which might itself rely on separate Services (e.g. a weather service, a train booking service, a plane booking service) in order to get work done.

Discoverability

How does the user discover that a particular (complex) Canvas has an Agent available and, beyond that, how do they discover what that Agent can do?

It’s the age-old ‘Clippy’ style problem. A Canvas (e.g. a chat app) can broker conversations with N Agents, but the user doesn’t know that, and we see this today with personal assistants offering menus via “Tell me what I can say” type commands.

It seems to me that there’s a general need for discovery and it might involve things like;

  1. Reading the instructions that came with the Canvas
  2. Asking the Canvas before…
  3. Asking the Agent.
  4. Looking up services in a directory.
  5. Being prompted by the Canvas (hopefully with some level of intelligence) when an appropriate moment seems to arrive – e.g. “did you know that I can book tickets for you?”.

and no doubt more, but you need to know that you can start a conversation before you start a conversation.
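Option 4 in particular could be sketched as a simple capability directory – the capability names and Agent names below are entirely made up for illustration;

```python
# hypothetical directory mapping capability keywords onto Agents
DIRECTORY = {
    "train-tickets": ["RailAgent"],
    "flights": ["TravelAgent"],
    "weather": ["TravelAgent", "WeatherAgent"],
}

def discover(capability):
    # return the Agents (if any) registered for a capability
    return DIRECTORY.get(capability, [])
```

A Canvas could consult something like this before prompting the user that an Agent is available for what they seem to be trying to do.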

Inputs/Outputs

There are lots of ways to converse. We can do it by voice, by typing, by SMS. We might even stretch to include things like pointing with gamepads or waving our arms to dismiss dialogs, but maybe that’s pushing things a bit far.

Equally, there are many ways to display outputs and a lot of this is going to depend on the Canvas and device in question.

For example, imagine I have an Agent that knows how to find photos. I might input;

“Agent, go get me some high quality, large size images of people enjoying breakfast”

What should the output be? Maybe a set of hyperlinks? Maybe open an imaging app and display the photos themselves, ready for copy/paste? Maybe offer to send me an email with all the details in it so I can read it later?

I’d argue that it depends on the Canvas, the device and what I’m currently doing. If I’m walking down the street then the email option might be a good one. If I’m sitting at my PC then maybe open up an app and show me the results.

I suspect that this might get complex over time but I/O options seem to be a big part of trying to have a conversation.
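To sketch the sort of thing I mean – purely a guess at a heuristic, with made-up canvas/state names – the choice of output channel might look something like;

```python
def choose_output(canvas_outputs, user_state):
    # pick an output channel given what the Canvas supports and
    # what the user is doing; the preference order is invented
    if user_state == "walking" and "email" in canvas_outputs:
        return "email"
    if user_state == "at_pc" and "app" in canvas_outputs:
        return "app"
    # otherwise, fall back to the first thing the Canvas can do
    return canvas_outputs[0]

on_the_street = choose_output(["links", "email"], "walking")
at_the_desk = choose_output(["links", "app"], "at_pc")
```

Even this toy shows that the decision depends on inputs coming from at least two places – the Canvas’s capabilities and the user’s current situation.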

Initiation

How to start a conversation?

At what point does a conversation with an Agent begin in the sense that the Agent tracks the flow of interactions back and forth such that it can build up Context and start to offer some kind of useful function?

Most Agents support some kind of “Hey, how are you?” type interaction, but that’s not really the conversation opener; it perhaps comes more at the point where someone says “I need a train” or “I need a ticket” or similar.

Conversations are stateful and could potentially span across many devices and Canvases and so there’s going to need to be some kind of conversational identifier that can be (re)presented to the agent at a later point in time. The analogy in the human world would be something like;

Remember when we were talking about that holiday in Spain last week?

and, no doubt, if we’re to make conversations work in the virtual world then there is likely to be an equivalent.
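In code terms, that probably means something hands out a conversational identifier that can be re-presented later. A toy sketch, with all names invented;

```python
import uuid

class ConversationStore:
    # hands out conversation ids that can be re-presented later
    def __init__(self):
        self._conversations = {}

    def initiate(self, topic):
        conversation_id = str(uuid.uuid4())
        self._conversations[conversation_id] = {"topic": topic}
        return conversation_id

    def resume(self, conversation_id):
        # "remember when we were talking about that holiday in Spain?"
        return self._conversations.get(conversation_id)

store = ConversationStore()
cid = store.initiate("holiday in Spain")
```

The interesting platform question is then where that identifier lives so that a different Canvas, on a different device, can present it back to the Agent.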

Identity

An identifier for a conversation is one thing but it’s pretty much useless without a notion of the user who was involved in the conversation.

You’d imagine that this is perhaps one of the things that a Canvas can do for a user – e.g. an IM Canvas has presumably already authenticated the user, so it might be able to provide some kind of token representing that identity to an Agent such that the Agent can know the difference between conversations with Mike and conversations with Michelle.

If a conversation then moves from one Canvas to another then the Agent has to be able to understand whatever identity token might come from the second Canvas as well.

I suspect that this is a roundabout way of saying that it feels to me like identity is going to be an important piece in a platform that does conversations.
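A toy sketch of that idea – the Canvas authenticates and hands the Agent an opaque token, and the Agent keys its per-user state off it (the readable token format here is deliberately fake);

```python
def canvas_issue_token(user_name):
    # a real Canvas would hand out a signed, opaque token –
    # this readable string is just for illustration
    return f"token-for-{user_name}"

class Agent:
    def __init__(self):
        self.turns_by_user = {}

    def converse(self, token, utterance):
        # keep state per identity so Mike's conversation
        # never mixes with Michelle's
        self.turns_by_user.setdefault(token, []).append(utterance)
        return len(self.turns_by_user[token])

agent = Agent()
mike = canvas_issue_token("Mike")
michelle = canvas_issue_token("Michelle")
```

The hard part that this glosses over is exactly the one in the text above – a second Canvas would issue a different token, and the Agent somehow has to know it’s the same person.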

Context

I’m obsessed with context and I guess that a conversation with an Agent is, in some ways, about building up the context to the point where some ‘transaction’ can be completed.

That context needs to be associated with the conversation and with the identity of the user and perhaps needs to have some kind of associated lifetime such that it doesn’t stay around for ever in a situation where a conversation starts but never completes.

There’s then the question of whether that context can be;

  • pre-loaded with some of the knowledge that either the Agent or the Canvas has about the user.
  • used to add to the knowledge that either the Agent or the Canvas keeps about the user after the conversation.

For example – if a user has a conversation with an Agent about a train journey then part of the context might be the start/end locations.

If one of those locations turns out to be the user’s home then that might become part of the future knowledge that an Agent or a Canvas has about the user such that in the future it can be smarter. Naturally, that needs to remain within the user’s control in terms of the consent around where it might be used and/or shared.
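The lifetime idea could be sketched like this – a context that simply refuses to answer once it has expired (the clock is injected so the example doesn’t have to sleep; all the names are invented);

```python
class ConversationContext:
    # context that expires so an abandoned conversation's state
    # doesn't stay around for ever
    def __init__(self, lifetime_seconds, clock):
        self.lifetime = lifetime_seconds
        self.clock = clock
        self.created = clock()
        self.slots = {}

    def set(self, key, value):
        self.slots[key] = value

    def get(self, key):
        if self.clock() - self.created > self.lifetime:
            return None   # expired: the user has to start again
        return self.slots.get(key)

now = [0.0]
ctx = ConversationContext(lifetime_seconds=60, clock=lambda: now[0])
ctx.set("origin", "home")
now[0] = 30    # still inside the lifetime
still_there = ctx.get("origin")
now[0] = 120   # past the lifetime
gone = ctx.get("origin")
```

Choosing the right lifetime is itself a design decision – a train booking might reasonably survive a lunch break, but probably not a fortnight.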

User Knowledge

I’m unsure whether knowledge about a user lives with an Agent, with a Canvas, with a Service or with all of them and I suspect it’s perhaps the latter – i.e. all of them.

No doubt, this is related to Identity, Context and Trust in the sense that if I use some Canvas on a regular basis (like a favourite chat app) and if it comes from a vendor that I trust, then I might be prepared to share more personal data with that Canvas than I would with an Agent which does (e.g.) concert-ticket bookings and which I only use once every 1-2 years.

The sort of knowledge that I’m thinking of here stems from personal information like age, date-of-birth, gender, height, weight, etc. through into locations like home, work and so on and then perhaps also spans into things like friends/family.

You can imagine scenarios like;

“Hey, can you ask the train ticketing service to get me a ticket from home to my brother’s place early in the morning a week on Saturday and drop him an SMS to tell him I’m coming?”

and a Canvas (or Agent) that can use knowledge about the user to plug in all the gaps around the terms ‘home’ and ‘brother’ in order to work out the locations and phone numbers is a useful thing.

Now, whether it’s the Canvas that turns these terms into hard data or whether it’s the Agent that does it, I’m not sure.
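Whoever does it, the mechanics might look a bit like a lookup from ‘soft’ terms to hard data – the knowledge store below is invented purely for illustration;

```python
# hypothetical per-user knowledge that a Canvas (or Agent) might hold
USER_KNOWLEDGE = {
    "home": {"kind": "location", "value": "1 Example Street"},
    "brother": {"kind": "contact", "value": "+44 7700 900123"},
}

def resolve(term):
    # turn a soft term like 'home' into hard data, if we know it
    entry = USER_KNOWLEDGE.get(term)
    return entry["value"] if entry else None
```

The case where resolve() comes back empty is exactly where a dialog would have to fall back to asking the user – which ties into Trust and consent below.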

Trust

Trust is key here. As a user, I don’t want to have a conversation with an Agent that is then keeping or sharing data that I didn’t consent to but, equally, conversations that are constantly interrupted by privacy settings aren’t likely to progress well.

In a conversation between User<->Canvas<->Agent<->Service, it’s not always going to be easy to know where the trust boundaries are being drawn or stretched, and perhaps it becomes the responsibility of the Canvas/Agent to let the user know what’s going on as knowledge is disseminated? For example, in a simple scenario of;

“Hey, can you buy me train tickets from home to work next Tuesday?”

there’s a question around whether the Agent needs to prompt if it’s not aware of what ‘home’ and ‘work’ might mean and doesn’t have a means to purchase the ticket.

Also, does the Canvas attempt to help in fleshing out those definitions of ‘home’ and ‘work’ and does it do so with or without the user’s explicit consent?

Dialogs

It feels like a conversational platform needs to have the ability to define dialogs for how a conversation is supposed to flow between the user and the Agent.

I suspect that it probably shouldn’t be a developer who defines what the structure and the content of these dialogs should be.

I also suspect that they shouldn’t really be hard-wired into some piece of code somewhere but should, somehow, be open to being created and revised by someone who has a deep understanding of the business domain and who can then tweak the dialogs based on usage.

That would imply a need for some sort of telemetry to be captured which lets that Agent be tweaked in response to how users are actually interacting with it.

Part of defining dialogs might tie in with inputs and outputs in that you might define different sets of dialogs depending on the input/output capabilities of the Canvas that’s hosting the conversation with the Agent. It’s common to use different techniques when using speech input/output versus (say) textual input/output and so dialogs would presumably need to cater for those types of differences.

Another part of defining dialogs might be to plug in decision logic around how dialogs flow based on input from the user and responses from services.
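As a sketch of ‘dialogs as data’ – a flow that a domain expert could revise without touching code, with different prompt wording per input/output modality (all the step names and prompts here are invented);

```python
# a dialog defined as data rather than hard-wired into code
DIALOG = {
    "ask_destination": {
        "prompt": {"speech": "Where would you like to go?",
                   "text": "Destination?"},
        "next": "ask_date",
    },
    "ask_date": {
        "prompt": {"speech": "When are you travelling?",
                   "text": "Travel date?"},
        "next": None,   # end of this flow
    },
}

def run_dialog(dialog, start, modality, answers):
    # walk the dialog, picking the prompt variant that suits
    # the Canvas's modality
    transcript, step = [], start
    while step is not None:
        node = dialog[step]
        transcript.append((node["prompt"][modality], answers[step]))
        step = node["next"]
    return transcript

speech_run = run_dialog(DIALOG, "ask_destination", "speech",
                        {"ask_destination": "Leeds", "ask_date": "Tuesday"})
```

Because the flow is just data, the “next” links could in principle be revised from usage telemetry rather than by shipping new code.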

Language Understanding and Intent

One of the challenges of defining those dialogs is trying to cater for all the possible language variations in the ways in which a user might frame or phrase a particular question or statement. There are so many ways of achieving the same result that it’s practically impossible to define dialogs that cater for everything. For example;

  • “I want to book a taxi”
  • “I need a lift to catch my plane”
  • “Can I get a car to the airport”

are simple variations of possibly the very same thing, and so it feels to me like there’s a very definite need here for a service which can take all of these variations, turn them into more of a canonical representation and report the intent that’s common across all three.

Without that, I think all developers are going to be building their own variant of that particular wheel and it’s not an easy wheel to build.
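A toy stand-in for that kind of service – real language understanding is far harder than keyword matching, but it shows the shape of ‘many phrasings in, one intent out’ (the intent names and keyword lists are invented);

```python
# invented intent names and keyword lists, purely for illustration
INTENT_KEYWORDS = {
    "book_transport": ["taxi", "lift", "car", "cab"],
}

def to_intent(utterance):
    # map many phrasings onto one canonical intent
    words = utterance.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in words for k in keywords):
            return intent
    return "unknown"
```

All three of the example phrases above land on the same canonical intent, which is what a dialog would then branch on.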

Termination

Just for completeness, if there is an “initiation” step to a conversation with an Agent then I guess there should be a point at which a conversation is “over” whether that be due to a time-out or whether it be that the user explicitly states that they are done.

I can see a future scenario analogous to clearing out your cookies in the browser today where you want to make sure that whatever you’ve been conversing about with some Agent via some Canvas has truly gone away.

Services

An Agent is a representative for some collection of Services. These might be a bunch of RESTful services or similar and it’s easy to think of some kind of travel Agent that provides a more natural and interactive front end to a set of existing back-end services for booking hotels, flights, trains, ferries, etc. and looking up timetables.

A platform for conversations would probably want to make it as easy as possible to call services, bring back their data and surface it into decision-logic.

Decisions

Sticking with decisions – there’s likely to be a need for making decisions in all but the simplest of conversations and those decisions might well steer the conversational flow in terms of the ‘dialogs’ that are presented to the user.

Those decisions might be based on the user’s input, the responses that come back from services invoked by the Agent or might be based on User Knowledge or some ambient context like the current date and time, weather, traffic or similar.

Some of that decision making might be influenced by use of the Agent itself – e.g. if the Agent uses telemetry to figure out that 95% of all users go with the default options across step 1 to step 5 of a conversation then maybe it can start to adapt and offer the user shortcuts based on that knowledge?

Telemetry

I’d expect an Agent to gather telemetry as it progresses so that aggregate data is available across areas like;

  • Agent usage – i.e. where traffic is coming from, how long conversations last, etc.
  • Dialog flow – which paths through the Agent’s capabilities are ‘hot’ in the sense of followed by most users.
  • Dialog blockers – points where conversations are consistently abandoned.
  • Choices – the sorts of options that users are choosing as they navigate the Agent.
  • Failures – places where an Agent isn’t understanding the user’s intent.

I’m sure that there’s probably a lot more telemetry that an Agent would gather – it’s definitely an important part of the picture.
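The ‘dialog blockers’ item, for instance, might be as simple as counting where conversations get abandoned – a sketch with invented step names;

```python
from collections import Counter

class Telemetry:
    # count where conversations are abandoned so that 'blocker'
    # dialog steps show up in aggregate
    def __init__(self):
        self.abandoned_at = Counter()

    def record_abandon(self, step):
        self.abandoned_at[step] += 1

    def worst_blocker(self):
        if not self.abandoned_at:
            return None
        return self.abandoned_at.most_common(1)[0][0]

t = Telemetry()
for step in ["payment", "dates", "payment", "payment"]:
    t.record_abandon(step)
```

That aggregate is what would let whoever owns the dialogs see that (say) the payment step is where most users give up.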

History

It’s common to refer to a previous conversation from a current one, and I think that over time a conversation platform needs to think about this – in the real world it’s pretty common to refer back to conversations that happened at some earlier point in time, perhaps reaching back months or years.

That needs to fit with Trust but I think it would add a lot of value to an Agent to be able to cope with the idea of something like;

“I need to re-order the same lightbulbs that I ordered from you six months ago”

or similar. Whether that needs to be done by the Agent “remembering” the conversation or whether it needs to be done by one of its supporting Services taking on that responsibility, I’m not sure.

Done

That’s my initial ramble over. I need to go away and think about this some more. Please feel free to give feedback as these are just my rough notes, but there are a few things in here that I think are probably going to stick with me as I think more on this topic of conversations…

Winter 2015 – Blog Migration

I spent a bit of time over the Xmas holidays 2015 migrating this blog from its original home to a new hosted WordPress site.

This blog has a long and chequered history – I began posting back in 2003 when I had the blog hosted on a system called Movable Type. Somewhere in 2007, I wrote some code that moved the content across to CommunityServer, which I ran on a hosted server, and it stayed there until late 2015.

During that time CommunityServer had pretty much died, but I’d got a couple of thousand posts in its database and there never seemed to be a great utility for exporting content and so, like a lot of people running old-but-useful software, I was a little bit stuck.

In late 2015, I decided that it was time to bite the bullet and move the content somewhere that didn’t involve me directly managing a server.

I settled on (hosted) WordPress.

Moving my materials out of CommunityServer wasn’t completely painless and it took me around 4-5 days of effort, going around the loop of writing code and SQL to analyse the content that I had in CommunityServer and making sure that it would land ‘reasonably’ on WordPress.

What that largely involved was;

  1. Taking all the content from my hosted web server like downloads, images, videos, etc. and moving it across to an Azure web server, largely preserving the folder structure to make it easier for my code to rewrite hyperlinks.
  2. Getting a local copy of my CommunityServer database and writing custom code in order to attempt to;
    1. Rewrite hyperlinks that were pointing at the old locations for images, downloads, videos, etc.
    2. Rewrite hyperlinks that linked from one post to another.
    3. Fix up posts which contained code snippets – over the years, a number of different ‘code formatters’ had crept in to my blog and patching them up after the event wasn’t so much fun.
    4. Fix up videos – my blog site had allowed posts to contain object tags and iframes and that wasn’t appropriate for WordPress posts which (generally) have a better way of handling this.

In doing this work, I ended up writing a C# console application (!) which did old-school ADO.NET into the CommunityServer database, and then I used the Html Agility Pack in order to pull apart the HTML and attempt to put it back together again with various links rewritten.

That code then uploaded posts to WordPress by using the WordSharp wrapper that I found to be a nice, convenient way of getting content posted into WordPress.

Some of that process involved digging through the WordPress REST APIs and I found that, generally, the developer docs are pretty good and the APIs seem to have the common scenarios covered and it felt good to be following in very well-trodden footsteps rather than trying to blaze a trail.

What I haven’t attempted to do as part of this migration is to preserve external hyperlinks into my blog site, and so links that come into my site are likely to break. It would have been ‘nice’ to be able to do work to redirect those links, but I don’t think it’s possible under a hosted model and so that’s the trade-off that I’ve made.

Hopefully, WordPress can be this blog’s home for the next few years and I won’t have to write any more one-time migration code again in a hurry.

Is Your Next Phone Your Next PC?

It’s getting increasingly difficult to differentiate a phone from a tablet from a PC and, today, while watching the Microsoft stream from Mobile World Congress one of those differentiators for “Windows Phone” seemed to disappear – the absence of a physical keyboard.

Along with the launch of another large-screen Lumia, the 640XL to go alongside the 1320 and the 1520;

came the announcement of the new Universal Foldable Keyboard;

and a demonstration of that keyboard working with a phone running Windows 10 with the “Detail” section on that product website noting that the keyboard will also work on Windows Phone 8.1 Update 2.

[image: the Universal Foldable Keyboard product page]

Part of the same event demo’d more of running the forthcoming Universal Office Apps on Windows 10 on a phone. I haven’t captured a screenshot of Excel in use on the phone but it was part of this demo section and it was pretty impressive as was the Outlook demo;

and so a typical laptop worker could sit at a 6” phone running Windows 10, run a version of Office and bang away on their foldable, Bluetooth keyboard. All they’d really need is a big screen to put their content onto and, presumably, they could make use of something like;

as detailed in this blog post to wirelessly project this onto an external monitor and now they’ve effectively constructed a ‘desktop PC’.

I’m not sure that in Windows Phone 8.1 this would provide the ‘perfect’ experience when compared to working on a laptop, and I don’t know how it’ll work in Windows 10 on a small mobile device as I’m not yet running Windows 10 on anything under 8” and, of course, it’s a preview right now rather than a finished product.

But…it seems that all the pieces are there to make this work today and to refine it into a more mainstream idea in the future?