Voice Input Is Awesome
Introduction: The Promise and Disappointment of Early Voice Recognition
In the 90s, my mum bought a copy of IBM ViaVoice from our local Staples store. She told me that we’d be able to talk to our computer and it would understand our voices and write up whatever we said, so that we wouldn’t have to type it in. I thought that sounded amazing. Very sci-fi. “The future of computing is now,” or so I thought.
Then the reality of that software struck us. My mum told me I would have to sit and train the software to understand my voice. As a young child I didn’t have the patience for that sort of thing; it sounded boring. I remember Mum sitting and training the system to respond to her speech over several hours while I kept asking her, “Does it work yet?” She got very frustrated that the system was not getting better at understanding her voice, and in the end she gave up. That’s not to disparage the incredible work that must have gone into those early probabilistic models for speech. It’s just that the technology and the computing power weren’t there yet.
A photo of the ViaVoice box, shamelessly lifted from an eBay listing
In 2025 that future has finally arrived, and I’m here for it…
The Future Arrives
I have been obsessively interacting with computers for the best part of 30 years, so believe me when I say that I can type very fast… but I can also write pretty quickly with a fountain pen. In the last couple of years I’ve been getting into physical notepads: I write down my thoughts and feelings in a journal, then use image recognition models to get those notes into my Obsidian vault. Handwriting my thoughts feels quite a lot more natural than typing them out, and I like to sit first thing in the morning with my coffee and do some journaling.
One thing feels more natural than writing, and that’s speaking. Before the invention of paper, or even cave paintings, our ancestors used their voices to tell their stories and speak their minds. Voice dictation technology allows me to do just that: a trail of conscious thought spills out onto my screen as it occurs to me. It feels revolutionary to be able to just say stuff and have it come out as text on my computer, and for that text to be pretty damn accurate. All of this without the hours of painstaking training that ViaVoice required nearly 30 years ago.
How It Works (and Where It Runs)
On my desktop computer and my work laptop I can use local voice models like Whisper to do this high-quality transcription in real time. Google has also added local voice transcription to recent Pixel phones like my Pixel 8. Pretty much all of my devices support this in one way or another. It comes at a bit of a cost in compute, so it doesn’t necessarily work well on older computers or those without a GPU or Apple Silicon. That said, the smaller Whisper models, which are still reasonably accurate, can even run on a Raspberry Pi. It’s not real-time, but it’s still quite fast and quite high quality.
You can also pay OpenAI to use a hosted version of their Whisper model, and app companies like the makers of SuperWhisper provide remote model capability too. To me, though, part of the allure of all of this is the ability to run it locally. If I’m dictating my innermost private thoughts, I don’t like the idea of that happening via some third party over which I have no power.
I’ve been writing a little app for my Intel laptop, which doesn’t have a GPU. It lets me lean on the Whisper setup running on my home server’s GPU whenever I’m travelling without a device capable of running the largest, most accurate models in real time.
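The thin-client idea behind that app can be sketched in a few lines: record audio locally, then POST it to a Whisper service on the home server and read the text back. The `/transcribe` endpoint and the JSON response shape below are assumptions for illustration, not the real app’s API:

```python
# Sketch of a thin client that offloads transcription to a home server.
# The endpoint path and {"text": ...} response shape are hypothetical.
import json
import urllib.request

def send_for_transcription(wav_path: str, server_url: str) -> str:
    """POST a WAV file to a (hypothetical) transcription endpoint and
    return the transcribed text from its JSON response."""
    with open(wav_path, "rb") as f:
        audio = f.read()
    req = urllib.request.Request(
        server_url,
        data=audio,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

The nice property of this split is that the laptop only needs a microphone and a network connection; all the heavy lifting stays on hardware I own.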
Talking to the Machine: A Strange Intimacy
One weird side effect of all of this is that I’ve started to notice I feel quite self-conscious about dictating to my computer, particularly when my wife is around. It feels quite intimate and personal; even with the person I’ve chosen to spend my life with, it feels a bit strange to be speaking to my computer. I find that I need some privacy, so as a psychological crutch I come up to my office and talk to my computer alone.
When I’m in the zone, I can write, or rather speak, very quickly and produce notes, blog posts, README files and email responses without ever lifting a finger. Or rather, only occasionally lifting a finger to correct the odd word or slightly change some phrasing after the fact. I’ve found that it lowers the barrier to writing blog posts, and I echo Hamel Husain’s advice about using voice-to-text pipelines to make it easier to write and share. FWIW, I’m not a fan of the reductionist notion that blogs should be “content” or that writing needs a “content” pipeline, but I do buy the idea that voice makes regular publishing much easier, for me at least.
Give it a Try!
I highly recommend giving voice-to-text a go. You can try out SuperWhisper for free to get started. Another free, cross-platform option is Whispering. Once my app is in a better place, I’ll share that too.
Originally published at https://brainsteam.co.uk on April 14, 2025.