Aloud — Swapping the Speech Engine for Doubao Streaming ASR
Why I Built It
Speaking is faster than typing, but macOS dictation is useless for mixed Chinese-English speech. A sentence like “写个 Python retry decorator” comes back as a pile of homophones with wrong word breaks. What I wanted was simple: tap to talk, tap to commit, inject into whatever field is focused, understand code terms and code-switching, and ideally have an LLM quietly fix the obvious mis-hears — fix only what was misheard, never rewrite or polish.
yetone has an open-source voice-input. The menu-bar interaction and text injection are already there. The problem is the engine: it uses Apple’s SFSpeechRecognizer, and mixed-language plus technical vocabulary is exactly that engine’s weak spot. Right shell, wrong engine. So I forked it and replaced the engine.
What It Is
A macOS menu-bar tool. Tap Fn to start recording, a capsule overlay shows the live waveform and incoming text, tap again to stop, the final text is injected into the focused field, and your clipboard is restored. Optional LLM correction pass (Doubao seed-lite), on by default, can be turned off.
The recognition backend is Volcano/Doubao streaming ASR 2.0 over WebSocket. Triggered entirely locally, credentials stored on-machine, nothing routed through a third party.
How It Works
The interesting part is all in the engine swap.
Volcano’s streaming ASR is not REST. It’s a custom binary WebSocket protocol: a 4-byte header, a uint32 payload size, then the body, big-endian, with frame type encoded in the high/low nibbles of the second header byte (full client request / audio / server response / error). I hand-wrote the codec (VolcProtocol.swift); the last audio packet carries a special flag to close the stream.
The audio pipeline had to change. The macOS input node defaults to 44.1/48 kHz Float32; Doubao only takes 16 kHz 16-bit mono. AVAudioConverter resamples, pushing 200 ms chunks (6400 bytes) upstream. Buffers accumulated before the handshake completes have to be flushed in the handshake callback, not dropped.
The deepest trap: the resource id for Doubao streaming 2.0 is volc.seedasr.sauc.duration, while 1.0 is volc.bigasr.sauc.duration. Get it wrong and the server returns 403 not granted — the error points at “service not provisioned,” when the real cause is one wrong word in the resource id, bigasr should be seedasr. The endpoint also has to be bigmodel_async; /bigmodel only accepts 1.0. That one word took a while to find.
The LLM correction layer runs Doubao seed-lite, and the key was disabling thinking — with deep reasoning on, correcting one sentence takes 30 seconds; off, 3 seconds. The prompt is locked to fixing only obvious speech mis-recognition (“配森”→Python, “杰森”→JSON), no rewriting, polishing, or deleting.
State convergence funnels all of SpeechEngine’s reads, writes, and send decisions onto one serial queue, with idempotent guards on teardown and removeTap. This part went through a Codex review that caught and fixed several re-entrancy crashes and cross-thread races (P1s).
How It Gets to You
Same as marklite: no GitHub Release. The .app is zipped and lives on Cloudflare R2 in its own bucket behind aloud-releases.openedon.com, and the download page is part of this site. R2 has zero egress fees — for a single-developer project this is the cheapest distribution channel that exists.
The base shell (the Apple Speech version) is open-sourced by yetone at https://github.com/yetone/voice-input-src; the Doubao engine layer is my own change.
Get It
Download Aloud for macOS — Apple Silicon only.
Before using it, provision Doubao streaming ASR in the Volcano Engine console and put the AppID / Access Token into the app’s Voice Engine Settings. It’s unsigned — on first launch, right-click and choose Open, or use System Settings → Privacy & Security → Open Anyway. It also needs Microphone and Accessibility permission — both are required to monitor the Fn key and inject text.
This is a tool I built for myself: early, unsigned, no auto-updater. It works, but don’t expect it to feel like a finished product. If it breaks, [email protected].