This is another post-shower note on interface considerations for using ChatGPT.
The primary insight, restated to be clear, is that ChatGPT Voice as it exists today is the automatic version.
voice vs manual
It’s like an automatic transmission in a car versus a manual one.
You technically do have some control, some choices you can make, but the set of options available to you is severely limited.
With a stick shift, you can actually go back and forth at will.
When you’re working with a manual interface, with text and keyboard and mouse, you can go to any point in the tree and create a new branch from there.
But the automatic voice interface could not be more linear. You cannot even regenerate a response. There is no command layer at all.
It’s like Vim versus a regular text editor, except the regular editor can’t even backspace. It’s a typewriter versus Vim.
The gap between ChatGPT Voice and manual keyboard-and-mouse usage is literally that large, and the manual side itself obviously has tons of room for improvement.
desired features
On the voice end of things, the system doesn’t necessarily have to be manual in this sense. It can be automatic the way power steering is: you still have full control. It needs to be the kind of automatic that is an enhancement of your agency, not a replacement for it. The current one is a replacement, and a very poor one at that.
You can’t even pause without touching the screen. You can’t even speak longer without touching the screen. The message just stops if you stop talking for five or ten seconds, which is a really short window. It’s terrible. And even if the timeout were increased in ChatGPT, that would just be a stopgap, a very poor substitute for what could be.
related work
I think it’s worth looking at that interface Andres mentioned: a voice-based coding grammar that uses grammars to manipulate text, kind of like Vim, I think. That’s the direction, but I don’t necessarily want to go as far as having to invent new sounds to say. It would be weird to walk down the street making those noises. It would be the power user’s tool, like Vim, except the difference is that Vim looks kind of cool, and that would not sound cool on the street. You want it to be articulate.
vision
The ultimate feedback loop: engage the parts of your brain that think actively and express well. The more you get out of the system, the more you want to put in, and the more ideas you have. It’s a great cycle.
That’s what I want to define here. I’m not exactly planning out the project itself; I don’t have any specific ideas right now.
reflection
I’ve expressed this similarly, maybe less concisely, maybe less coherently.
I just had a very long voice session where it would have been extremely fun to edit the thing and replay, because sometimes you want to go back and send a new message. You can’t, not even in the way you can in text ChatGPT, where with some very annoying editing techniques you can go back in the tree once.
But it’s like a single rewind button in a game, right?
It’s not the same as having the full tree in front of you to manipulate with whatever interface you’ve got.
With voice you don’t even have access to that single rewind button. And even if you did, going back out of voice to rewind means you have to read the transcript, which totally takes you out of the vibe. In shower-voice mode you don’t want to be on your phone the whole time; you want to be talking to the machine. And the machine isn’t set up so you can tell it, “okay, go back and edit this for me.” Sure, you could build that, and maybe it would be an okay start: the machine talks to you and performs all the edits itself.
ideas
Honestly, that might be the best first approach: a simple LangChain-style agent that does exactly that. It handles its own state and history, calling the OpenAI API for completions, but the agent doing that handling is not necessarily OpenAI. That’s interesting and fun to consider. You could even use a base model prompt for it.
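Concretely, I imagine something like the minimal sketch below: a hypothetical agent that owns the conversation tree itself and only uses the OpenAI API for completions. All the names here (Node, TreeSession, say, back) are mine for illustration, not any existing framework’s.

```python
# Hypothetical sketch: the agent owns the tree; only the chat call is real API.
from openai import OpenAI

client = OpenAI()

class Node:
    def __init__(self, role, content, parent=None):
        self.role, self.content = role, content
        self.parent, self.children = parent, []

class TreeSession:
    """Owns the full conversation tree; the model only ever sees one
    root-to-cursor path, so branching costs nothing."""

    def __init__(self, model="gpt-4o"):
        self.model = model
        self.root = Node("system", "You are a helpful assistant.")
        self.cursor = self.root  # where the next message will branch from

    def _path(self):
        # Walk from the cursor up to the root to rebuild the message list.
        msgs, node = [], self.cursor
        while node is not None:
            msgs.append({"role": node.role, "content": node.content})
            node = node.parent
        return list(reversed(msgs))

    def say(self, text):
        """Branch a new user message off the cursor and generate a reply."""
        user = Node("user", text, parent=self.cursor)
        self.cursor.children.append(user)
        self.cursor = user
        reply = client.chat.completions.create(
            model=self.model, messages=self._path()
        ).choices[0].message.content
        assistant = Node("assistant", reply, parent=user)
        user.children.append(assistant)
        self.cursor = assistant
        return reply

    def back(self, n=1):
        """Rewind the cursor n nodes; abandoned branches stay in the tree."""
        for _ in range(n):
            if self.cursor.parent is not None:
                self.cursor = self.cursor.parent
```

With something like that, “go back two and try again” is just back(2) followed by a new say(), and every abandoned branch is still sitting in the tree to return to.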
But I don’t think that’s what the actual ideal system looks like. I don’t think it’s a traditional (what now passes for traditional) agentic framework.
Instead, I think you have something that feels more like Vim. To an external observer it might look like Star Trek, but it’s far more precise: you’re holding the entire tree in your head, if you want to, while you talk to the thing. You don’t have to, but you have the option to navigate the tree: back, back, back; forward, forward, forward.
Maybe you just use letters, like A, B, C, D, E, F, G, or A, B, C, D, X, Y, Z, 1, 2, 3, that kind of grammar. Maybe that has already been explored; it probably has, with that voice-coding language.
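Here’s a sketch of how little that grammar needs to be, assuming the hypothetical TreeSession above; the vocabulary (“back”, “forward”, single letters for sibling branches) is invented, not taken from any existing tool:

```python
# Hypothetical voice grammar: a stream of spoken tokens becomes tree moves.
# "back back forward" -> rewind twice, then descend into the first child.
def run_tokens(session, tokens):
    for tok in tokens:
        if tok == "back":
            session.back()
        elif tok == "forward":
            if session.cursor.children:          # descend, first child
                session.cursor = session.cursor.children[0]
        elif len(tok) == 1 and tok.isalpha():
            i = ord(tok.lower()) - ord("a")      # "a", "b", ... pick a branch
            if 0 <= i < len(session.cursor.children):
                session.cursor = session.cursor.children[i]

session = TreeSession()
session.say("first draft of the idea")
run_tokens(session, "back back".split())         # rewind to before the draft
session.say("second draft, as a sibling branch") # both drafts are kept
```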
You’d use these little commands almost the way a screen reader works on phones (TalkBack on Android, VoiceOver on iPhone), but fully voice-controlled and, obviously, simplified compared to a touchscreen. That’s what those screen readers do: they simplify the interface down to a set of commands you can navigate fairly easily. That’s the goal, I guess, with that interface.
That abstraction is also very interesting to consider. We don’t think about it when we use these interfaces: they look like they can do anything, but at any given moment there’s really only a set number of things you can do. You can open the text box, you can type in it, you have a back button, maybe some forward buttons. It gets more complex with a web-based interface, but for our needs we don’t even have to consider that.
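That finite-command framing is easy to make concrete: at any position in the tree there are only a handful of valid moves, which is exactly what makes the interface speakable. A hypothetical helper, again on top of the TreeSession sketch:

```python
# At any tree position only a few commands are valid -- the same reduction
# a screen reader performs, applied here to the chat tree.
def available_commands(session):
    cmds = ["say <text>"]                        # you can always branch anew
    if session.cursor.parent is not None:
        cmds.append("back")
    for i, child in enumerate(session.cursor.children):
        label = chr(ord("a") + i)                # letters name the branches
        cmds.append(f"{label}: {child.content[:30]}…")
    return cmds
```

A voice interface can read that list aloud on demand, the way a screen reader announces what is actionable on screen.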
Anyway, yeah.
I wanted to, in just this moment, anchor the idea of having a navigable tree.
If you had hand tracking while you were walking around, you wouldn’t even need voice.
You could just grab the tree.
That’s Apple XR territory. But you wouldn’t even need Apple XR: the voice could give you feedback on your hand tracking. Use a Leap Motion or an Oculus controller, something you just carry around.
Or maybe even simpler: it just needs a gyroscope. You could use a smartphone and a button, or some little device that just says, okay, I’m grabbing, and I’m moving in one of some set number of directions.
Ultimately that’s the same reduction as the screen-reader case, and it’s what you’d be doing with the Apple XR version too: reducing the interface down to that command set.
You might have more freedom visually, moving up and down while you’re moving left and right, but to the API, to the system actually handling your movements, it’s discrete: left/right, forward/back, up/down, in/out, side to side in the tree. I want to navigate the tree with my voice.
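Here’s roughly what I mean by that gesture reduction, as a hypothetical sketch on the same TreeSession: whatever the sensor is, it only has to emit one of a few discrete directions, and each direction is just a tree operation, the same command set the voice grammar uses.

```python
# Hypothetical: any motion sensor collapses to a few discrete directions,
# each of which is just a tree move -- identical to the voice commands.
def move_forward(session):
    if session.cursor.children:                  # deeper into the tree
        session.cursor = session.cursor.children[0]

def move_sibling(session, step):
    parent = session.cursor.parent
    if parent is None:
        return
    i = parent.children.index(session.cursor) + step
    if 0 <= i < len(parent.children):            # previous / next branch
        session.cursor = parent.children[i]

def on_gesture(session, direction):
    # Called by whatever reads the gyroscope or controller while "grabbing".
    {
        "back":    lambda: session.back(),
        "forward": lambda: move_forward(session),
        "left":    lambda: move_sibling(session, -1),
        "right":   lambda: move_sibling(session, +1),
    }[direction]()
```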
conclusion
And with other things. I want to navigate language model trees. I think one of the most important agendas in cyborgism is making trees accessible, making them something anyone can use, even if they have to be, what’s the word…
The Church of the Loom
It’s very annoying that this linear, voice-style way of approaching language model inference is the only thing delivered to most consumers working with AI.
I doubt that most of the consumer chat apps even let you edit your message, even though there’s no reason they couldn’t.
There’s no reason you couldn’t have Loom in those interfaces. It’s just that they don’t care to build it, for various reasons.
I keep echoing this in my notes, but it’s one of my great frustrations of the last year that this has been the case. I really didn’t expect it. And I guess it’s naive to expect them to support something a power user would want, but come on, it shouldn’t be a power-user feature. It’s not really complex. Well, I guess it is to many, but it’s a very achievable understanding.
And that’s what my interface is for.
It’s for those people who see things that way.
I know that there are many who do.
I know that if I wanted to sell this, for example, there would be a market for it. And beyond the financial marketplace, there is a market for ideas, where simply building this and executing it well, with good or even decent design, demonstrates so much, the same way Linus’s projects demonstrated so much in the times he built them, especially as he moved into language models. This is that kind of thing, directionally, and I think it’s quite important to realize that.