UI and Product Decisions on AI-Powered Dictation Apps

The interaction that isn't

The core interaction is push-to-talk: hold a key, speak, release. We obsessed over making this feel like a non-interaction, something your muscles learn and your conscious mind forgets. Once you are used to it, it becomes such a habit that it's very hard not to use it. I wanted the user to be able to use just one key (the function key, the right Option key, or the right Command key) because those are among the least-used keys on any Mac keyboard. That presented its own set of challenges, because the standard macOS keyboard shortcut system doesn't easily let you bind an action to a single modifier key.

Details matter: the grace period

Here's a detail that took longer than expected to get right. When you release the hotkey, we don't stop recording immediately. We wait 200 milliseconds.

/// Grace period after releasing push-to-talk hotkey before stopping audio capture.
private let pushToTalkReleaseGracePeriod: TimeInterval = 0.2

The problem: people release the key at the same moment as their last syllable. I was doing it myself while testing the app, without even realising it. Without the grace period, the final word gets clipped. "Send me the report" becomes "Send me the repor."

200 milliseconds is long enough to capture trailing audio, short enough that the user doesn't notice the delay. If the user re-presses the hotkey during the grace period, the timer cancels and recording continues seamlessly.
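A minimal sketch of that release logic, assuming a small state machine that is told the current time rather than owning a real timer. The type, its phases, and its method names are hypothetical and only illustrate the behaviour described above; this is not Yakki's actual code.

```swift
import Foundation

/// Hypothetical model of the release grace period. Time is passed in
/// explicitly so the logic can be exercised without real timers.
struct PushToTalkSession {
    enum Phase: Equatable {
        case idle
        case recording
        case releasing(deadline: TimeInterval) // key is up, audio still captured
    }

    let gracePeriod: TimeInterval
    private(set) var phase: Phase = .idle

    init(gracePeriod: TimeInterval = 0.2) {
        self.gracePeriod = gracePeriod
    }

    /// Hotkey went down: start recording, or cancel a pending stop.
    mutating func keyDown() {
        phase = .recording
    }

    /// Hotkey went up: keep capturing until the deadline passes.
    mutating func keyUp(at now: TimeInterval) {
        guard phase == .recording else { return }
        phase = .releasing(deadline: now + gracePeriod)
    }

    /// Called periodically; returns true at the moment capture should stop.
    mutating func tick(at now: TimeInterval) -> Bool {
        guard case .releasing(let deadline) = phase, now >= deadline else { return false }
        phase = .idle
        return true
    }
}
```

Re-pressing the key simply overwrites the pending deadline, so there is no separate "cancel" path to get wrong.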

This is the kind of decision that goes unnoticed when it works and that everyone notices when it doesn't. Invisible design is often about the absence of frustration rather than the presence of delight.

But by far the hardest part of being invisible has been the text injection. When transcription finishes, the text needs to appear wherever the cursor is: in the email, the Slack message, the code editor, Google Docs. The user didn't switch apps. They didn't copy anything. Text just has to materialise as if typed.

We achieve this by simulating keystrokes at the system level. Each character is passed individually to the system, which routes it to whatever application has focus. From the user's perspective, it looks as if they were typing very fast. The delays between characters are adaptive. Spaces get extra time because many applications (word processors, chat apps) do word-boundary processing on spaces: autocorrect, spell check, URL detection.
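To make the timing idea concrete, here is a rough sketch of how per-character delays could be planned. The struct, its names, and both delay values are assumptions chosen for illustration; the post does not give the real numbers, and the actual posting of keystrokes to the system is left out.

```swift
import Foundation

/// Hypothetical per-character timing. The delay values are illustrative guesses.
struct InjectionTiming {
    var baseDelay: TimeInterval = 0.002      // assumed pause after an ordinary character
    var boundaryDelay: TimeInterval = 0.010  // assumed longer pause after a space

    /// Pause to wait after sending `character` before sending the next one.
    func delay(after character: Character) -> TimeInterval {
        character == " " ? boundaryDelay : baseDelay
    }

    /// Pairs each character with the pause that follows it.
    func schedule(for text: String) -> [(character: Character, delay: TimeInterval)] {
        text.map { (character: $0, delay: delay(after: $0)) }
    }
}
```

Keeping the schedule separate from the code that sends keystrokes makes it easy to tune the numbers per application without touching the injection path.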

We also had to resolve a few edge cases, like certain apps that were causing trouble or password fields where the user cannot easily paste. We could call this design by subtraction: removing every moment where the user would have to think about the tool, and getting out of the way so the job gets done.

The Icon That Breathes

This was way more fun to code than I expected. I wanted the menu bar to participate in that philosophy of "Hey, I'm not here, but I can communicate that I'm here and what's going on." The idea is that while idle, the icon simply shows state, but during recording it does something very subtle: it spins in response to your voice. A physics-inspired system drives the animation. When audio levels exceed a threshold, spin velocity increases. When you pause or lower your voice, velocity decays naturally.

The need came from a request from one user who did not always realise that Yakki was still recording after their meetings had finished. There were a couple of technical ways of stopping the recording when the meeting wrapped up, but those involved workarounds I didn't want to touch at that point. An alternative approach was: what if we communicate the state of your recording through the menu bar? What is happening with the job you are trying to do?

The spin is visible only if you look at the menu bar. It's not there to demand attention. It's there so that if you glance up, you get immediate confirmation that the app is hearing you, proportional to how loudly you're speaking. When you stop, it coasts to stillness. It breathes with your voice.
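The spin behaviour can be sketched as a tiny simulation: audio above a threshold adds velocity, and exponential damping brings it back to rest. Every constant and name below is an assumption made for illustration; the post only describes the threshold and the decay, not the actual values or the rendering code.

```swift
import Foundation

/// Hypothetical spin model. All constants are illustrative guesses.
struct SpinModel {
    var threshold: Double = 0.1     // audio level (0...1) below which voice adds no spin
    var gain: Double = 40.0         // acceleration per unit of level above the threshold
    var damping: Double = 3.0       // exponential decay rate, per second
    var maxVelocity: Double = 12.0  // cap, in radians per second

    private(set) var velocity: Double = 0  // radians per second
    private(set) var angle: Double = 0     // radians

    /// Advances the animation by `dt` seconds given the current audio level.
    mutating func step(level: Double, dt: Double) {
        if level > threshold {
            velocity += (level - threshold) * gain * dt
        }
        velocity *= exp(-damping * dt)         // decay applies on every frame
        velocity = min(velocity, maxVelocity)
        angle = (angle + velocity * dt).truncatingRemainder(dividingBy: 2 * Double.pi)
    }
}
```

Calling `step` once per frame with the latest audio level gives a velocity that rises while you speak and coasts toward zero when you stop. Louder speech yields a faster spin because the added velocity scales with the level above the threshold.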

The Philosophy Behind the Decisions

Every design decision in Yakki flows from a set of principles we internalised early:

Every click is a speed bump. Modals, confirmation dialogs, and "are you sure?" prompts exist to make developers feel safe. They make users feel slow. We have zero confirmation dialogs in the recording flow. You press, speak, release. Done.

Design for states, not screens. We don't have a "recording screen" or an "injection screen." We have states—idle, listening, processing, injecting, finished—and the indicator reflects those states through animation. The user sees a continuous flow, not a sequence of pages.

The tool should follow the user, never the reverse. The indicator follows your active screen. The text injection targets whatever app is focused. The hotkey works everywhere, even in full-screen apps. At no point does the user need to arrange their workspace around our app.

Disappearance is a spectrum. Some users want the glass indicator. (Some users definitely didn't like the glass indicator though.) Some want just the menu bar icon. Some want nothing at all. We support all three. The most invisible configuration is: no indicator, modifier-only hotkey, clipboard injection. At that point, the app has no visual presence at all. You hold a key, speak, release, and text appears. Nothing else happens.
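The "states, not screens" principle above maps naturally onto an enum. The five state names come from this post; the transition table and the types around it are assumptions, sketched only to show how an indicator could be driven from a single source of truth.

```swift
import Foundation

/// The five states named in the post. The transition table is an assumption.
enum DictationState: String, CaseIterable {
    case idle, listening, processing, injecting, finished

    /// States reachable from this one in the assumed flow.
    var allowedNext: Set<DictationState> {
        switch self {
        case .idle:       return [.listening]
        case .listening:  return [.processing, .idle]  // .idle: assumed cancel path
        case .processing: return [.injecting, .idle]
        case .injecting:  return [.finished]
        case .finished:   return [.idle, .listening]
        }
    }
}

/// Holds the current state and rejects transitions the table does not allow.
struct DictationFlow {
    private(set) var state: DictationState = .idle

    @discardableResult
    mutating func transition(to next: DictationState) -> Bool {
        guard state.allowedNext.contains(next) else { return false }
        state = next
        return true
    }
}
```

An indicator that only observes `state` never needs to know which "screen" it is on; it just animates from one state to the next.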

Yakki is a macOS dictation app that turns your voice into text, wherever your cursor is. It runs on-device, respects your privacy, and tries very hard to stay out of your way. Learn more at yakki.ai.