Running EchoKit With Fully Local AI (with Claude)
2026-04-20 22:19:43 PDT
Bart Massey 2026-04-20
I got EchoKit DIY running with all-local AI on my home box — sort of. I still need to sort out a networking thing to try the device itself, but the browser simulation works fine.
EchoKit is a fun personal assistant device from Second State — think Google Home or Alexa. It's got a good speaker and microphone, and connects to external AI via WiFi. It is intended primarily as a learning tool for AI, Rust and embedded development.
Second State gave me an EchoKit (US$50) to evaluate back in November or so, and I'm really embarrassed to only be posting about it now. I unboxed the EchoKit fairly quickly, got it running, and printed a bracket/case thingy from a model supplied by Second State. Then I got sidetracked by life. Now I'm getting ready to run a workshop centered around the EchoKit, so I've got back to playing with it.
(I've left the unboxing video unpublished until someone other than me has had a chance to look at it and say that it looks [barely] usable.)
The firmware and the external server for EchoKit are both written in Rust: everything is open source, and the hardware is mostly published. There's a plethora of first-party open source packages available — they seem to have been heavily AI co-written.
The core hardware for the DIY EchoKit is an ESP32-S3, and the device
runs ESP-IDF (and thus FreeRTOS). The Rust bindings have
std support, which makes "embedded" development pretty easy
— there's 8MB of RAM available for a decent-sized heap or whatever. The
speaker is powered by an I2S amplifier, and the MEMS microphone is also
I2S. The speaker is very loud and clear, and the microphone works really
well. There is a 1.25" TFT screen, which currently isn't used much by
the firmware.
Out of the box, the Automatic Speech Recognition (ASR), Text-To-Speech (TTS) and the LLM assistant service are all supplied indirectly via ChatGPT. For both ethical and practical reasons, I would prefer not to do this; the supplied alternative is Groq, which is not desirable either. I thus set out to see what this device would be like with all the services running locally.
After about four hours of fooling around with Claude Code, I managed to achieve my local-AI goal. tl;dr: I'm running a local EchoKit server, Qwen 3.5 9B with 4-bit quantization on Ollama for the LLM on my GPU, Whisper for ASR on my CPU, and Piper for TTS on my CPU. Running ASR/TTS on the CPU is both an attempt to save my 12GB of VRAM for the LLM + my normal desktop, and a way to work around the vagaries of trying to get ASR and TTS interfaces to use CUDA. It seems plenty efficient, the ASR quality is great, and the TTS is good enough to be plenty usable.
The whole story is… large. I had Claude write a summary of our session: see below. There were a lot of adventures here, but I think the end goal was worth it. Enjoy.
Fully-Local EchoKit on Debian
Claude Code with Bart Massey
EchoKit is an open-source voice agent platform: an ESP32-based device (DIY or pre-assembled) talks to a WebSocket server over WiFi, and the server runs a three-stage pipeline — speech recognition, language model, text-to-speech — to hold a voice conversation with the user.
This guide sets up that server entirely from local components on Debian: no cloud APIs, no tokens, no network calls off the box. It reflects what was made to work end-to-end on one Debian machine. Approaches that were tried and abandoned are recorded at the end for the benefit of others considering them.
Scope
What this guide covers:
- Local LLM via Ollama and Qwen3.5
- Local ASR via the Debian whisper.cpp package's whisper-server
- Local TTS via native Piper fronted by a small Python HTTP wrapper
- EchoKit config.toml wiring all three together
What this guide does not cover:
- Flashing and configuring the EchoKit device firmware. See the upstream docs at https://echokit.dev/setup/ — a Bluetooth-pair workflow through a browser page.
Tested environment
- Debian Trixie with some Sid, kernel 6.x, x86_64
- AMD Ryzen 9 5900X, 64 GB RAM
- NVIDIA RTX 3060 (12 GB), proprietary driver, CUDA 12.x
- 802.11 WiFi shared with the EchoKit device
Other recent Debian / Ubuntu installs should work the same way. Only the LLM uses the GPU in this setup; ASR and TTS run on CPU.
Architecture
EchoKit device (ESP32)
|
| WebSocket over local WiFi
v
echokit_server (localhost:8080)
|
+---> ASR: whisper-server (localhost:9092)
|
+---> LLM: ollama (localhost:11434)
|
+---> TTS: piper via Flask (localhost:9094)
All three backing services expose OpenAI-compatible HTTP endpoints,
which is the contract EchoKit's config.toml is built
around.
1. LLM: Ollama with Qwen3.5
Ollama is the lowest-effort way to serve an OpenAI-compatible
/v1/chat/completions endpoint locally.
## Install
curl -fsSL https://ollama.com/install.sh | sh
## Pull a model. Qwen3.5 9B at Q4_K_M is a good fit for a 12 GB GPU;
## it will run on CPU too, just slower.
ollama pull qwen3.5:9b-q4_K_M
## Ollama runs as a systemd service on install; nothing else to start.
## Sanity-check:
curl http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3.5:9b-q4_K_M","messages":[{"role":"user","content":"Say hi."}]}'
If the JSON response includes the model's reply, the LLM is ready.
Smaller variants (qwen3.5:4b, qwen3.5:2b)
are worth trying on machines with less GPU memory or no GPU.
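The same OpenAI-style chat contract can be exercised from code rather than curl. A minimal sketch of the round trip, using only the standard library; the URL and model name are the ones configured above, and `build_payload`/`extract_reply`/`chat` are illustrative helper names, not part of any EchoKit or Ollama API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port
MODEL = "qwen3.5:9b-q4_K_M"  # the model pulled above

def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str, url: str = OLLAMA_URL) -> str:
    """One-shot question/answer against the local endpoint."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With Ollama up, `chat("Say hi.")` should return the model's reply as a string.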
2. ASR: whisper.cpp from Debian
The Debian whisper.cpp package ships
whisper-server, which exposes an OpenAI-compatible
/v1/audio/transcriptions endpoint. It is CPU-only in this
package (Debian main does not allow CUDA-linked binaries), which is fine
for voice-assistant use on a modern CPU.
Install and run
sudo apt update
sudo apt install whisper.cpp
## Fetch a model. small.en with q5_1 quantization is a good default for
## conversational use — clear English voice-assistant input transcribes
## accurately and the model is ~3x faster than unquantized small.en.
mkdir -p ~/echokit/asr && cd ~/echokit/asr
curl -L -o ggml-small.en-q5_1.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en-q5_1.bin
## Find the exact binary name (upstream renamed it at some point):
dpkg -L whisper.cpp | grep -E 'bin/whisper.*server'
## Start the server. --threads should match your CPU's physical core count
## (not logical / SMT count) — whisper.cpp does not benefit from SMT and
## using all logical cores typically hurts due to cache contention.
whisper-server \
--model ~/echokit/asr/ggml-small.en-q5_1.bin \
--host 127.0.0.1 --port 9092 \
--inference-path /v1/audio/transcriptions \
--threads 12
--threads defaults to 4, which noticeably underutilises
any modern desktop CPU. For a 12-core Ryzen 9 5900X,
--threads 12 is correct; adjust for your machine. Omit
--convert if your incoming audio is already 16 kHz mono
16-bit PCM (as EchoKit's device audio is) — that flag shells out to
ffmpeg on every request and adds latency for no gain.
Smoke-test from another terminal with any short 16 kHz mono WAV:
time curl -s http://localhost:9092/v1/audio/transcriptions \
-F file=@/path/to/some.wav -F model=whisper
A JSON {"text": "..."} response confirms it works. On a
5900X with the stock Debian package and small.en-q5_1,
expect around 1.5s of wall-clock for 5s of audio. That works, but is
slower than necessary for reasons described in the next subsection. For
a comfortable EchoKit experience you want sub-second; read on.
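If no suitable WAV is on hand for the smoke test, a short sine-wave clip is easy to synthesize from the standard library. It will transcribe as silence or noise, but it exercises the endpoint in exactly the 16 kHz mono 16-bit PCM format the guide assumes; the path and parameters are just examples:

```python
import math
import struct
import wave

def write_test_wav(path: str, seconds: float = 2.0, freq: float = 440.0,
                   rate: int = 16000) -> None:
    """Write a mono 16-bit PCM WAV at 16 kHz -- the format EchoKit device
    audio arrives in, so no --convert/ffmpeg step is involved."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * t / rate)))
        for t in range(n)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit samples
        w.setframerate(rate)  # 16 kHz
        w.writeframes(frames)

write_test_wav("/tmp/test16k.wav")
```

Then point the curl smoke test at /tmp/test16k.wav.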
(Recommended) Rebuild libggml with native CPU flags
whisper-server is thin plumbing around the
libggml library, which is where all the compute-intensive
matrix kernels live. Debian's libggml package is built for
a conservative CPU baseline (Haswell on x86_64) for portability across
the distro's supported hardware. On anything newer — Zen 3, Zen 4,
modern Intel — this leaves most of the ISA unused: no AVX-512, no VNNI,
no FMA3 beyond Haswell's subset. Rebuilding libggml with
-march=native typically yields a ~4x speedup on these CPUs.
whisper.cpp itself does not need rebuilding, because it
links libggml dynamically.
mkdir -p ~/src/ggml-deb && cd ~/src/ggml-deb
apt source ggml
sudo apt build-dep ggml
cd ggml-*/
## Edit debian/rules to add native flags to the dh_auto_configure override.
## Find the existing `override_dh_auto_configure:` block and add:
## -DGGML_NATIVE=ON \
## -DCMAKE_C_FLAGS="-march=native -O3" \
## -DCMAKE_CXX_FLAGS="-march=native -O3"
## (Mind the tab indentation — it's a Makefile.)
$EDITOR debian/rules
## Bump the version so apt can track and upgrade cleanly later:
dch -l +native "Rebuild with GGML_NATIVE=ON and -march=native"
## Build (unsigned, binary-only):
dpkg-buildpackage -us -uc -b
cd ..
## Install the resulting .debs and pin them so apt doesn't silently replace
## them with the stock ones on upgrade:
sudo dpkg -i libggml*+native*.deb
sudo apt-mark hold libggml libggml-dev
Restart whisper-server, rerun the timing test. On a
5900X with small.en-q5_1, expect a few hundred ms for 5s of
audio — comfortably inside EchoKit's latency budget.
If your CPU is Haswell-era or older, the rebuild buys little; the stock package is already tuned for you.
Voice activity detection (optional)
EchoKit supports an optional VAD service that detects when a speaker
has finished talking, so the server can hand audio to whisper at the
right moment instead of sending arbitrary-length buffers. Once whisper
is fast enough (sub-second with the native libggml
rebuild), VAD is not strictly needed for the pipeline to function — the
device-side turn-detection is usually adequate.
If you do want server-side VAD, options are:
- silero_vad_server from Second State (https://github.com/second-state/silero_vad_server). Rust + libtorch (CPU is fine). Add vad_url = "http://localhost:9093/v1/audio/vad" to the [asr] section of config.toml.
- A Python wrapper around silero-vad, analogous to the Piper wrapper described below. Avoids the libtorch dependency at the cost of a Python process.
- The secondstate/echokit:latest-server-vad Docker image, which bundles EchoKit and Silero VAD together — useful if you'd rather containerize the whole thing.
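For intuition about what a VAD service decides, here is a toy energy-threshold end-of-utterance check. This is a deliberately simplified stand-in — Silero is a trained neural model, not a threshold — and the frame size, threshold, and silence window are illustrative numbers only:

```python
def utterance_ended(frames, threshold=500.0, trailing_silence=10):
    """Toy end-of-speech check: True once at least one frame of speech has
    been seen and the last `trailing_silence` frames are all below an RMS
    energy threshold. `frames` is a list of per-frame sample lists
    (e.g. one inner list of 16-bit PCM samples per ~30 ms frame)."""
    def rms(frame):
        return (sum(s * s for s in frame) / len(frame)) ** 0.5

    energies = [rms(f) for f in frames]
    spoke = any(e >= threshold for e in energies)
    tail = energies[-trailing_silence:]
    return (spoke
            and len(tail) == trailing_silence
            and all(e < threshold for e in tail))
```

A real VAD service applies essentially this decision, with a learned speech-probability model in place of the RMS threshold, and tells the server when to hand the buffered audio to whisper.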
3. TTS: Piper with a Python HTTP wrapper
Piper is an ONNX-based neural TTS from the Rhasspy project. It runs as a native binary with per-voice model files and is fast on CPU. It has no built-in HTTP server, so a small Python wrapper bridges it to EchoKit's OpenAI-style TTS endpoint.
Install Piper and download a voice
mkdir -p ~/echokit/tts && cd ~/echokit/tts
## Binary release
curl -LO https://github.com/rhasspy/piper/releases/download/2023.11.14-2/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz
## This produces ./piper/piper (binary) and ./piper/espeak-ng-data/
## Voice model. Ryan-high is a clear, natural American English male voice.
## Browse https://rhasspy.github.io/piper-samples/ to pick a different one.
mkdir -p voices && cd voices
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx.json
cd ..
Quick CLI smoke test:
echo "Hello, this is Piper." | \
./piper/piper --model voices/en_US-ryan-high.onnx --output_file /tmp/test.wav
aplay /tmp/test.wav
Install the Python wrapper
One dependency:
python3 -m pip install --user flask
Save the following as ~/echokit/tts/server.py:
#!/usr/bin/env python3
"""Minimal OpenAI-compatible HTTP wrapper around the Piper TTS binary."""
import os
import subprocess
import tempfile
from flask import Flask, request, Response
PIPER_BIN = os.path.expanduser("~/echokit/tts/piper/piper")
MODEL = os.path.expanduser("~/echokit/tts/voices/en_US-ryan-high.onnx")
app = Flask(__name__)
@app.post("/v1/audio/speech")
def speech():
body = request.get_json(force=True)
text = body.get("input", "")
if not text:
return ("missing 'input'", 400)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
out_path = f.name
try:
subprocess.run(
[PIPER_BIN, "--model", MODEL, "--output_file", out_path],
input=text.encode("utf-8"),
check=True,
capture_output=True,
)
with open(out_path, "rb") as g:
wav = g.read()
finally:
os.unlink(out_path)
return Response(wav, mimetype="audio/wav")
if __name__ == "__main__":
app.run(host="0.0.0.0", port=9094)
Run it:
python3 ~/echokit/tts/server.py
Smoke-test from another terminal:
curl -X POST http://localhost:9094/v1/audio/speech \
-H 'Content-Type: application/json' \
-d '{"input":"Hello from Piper via the wrapper."}' \
--output /tmp/piper.wav
aplay /tmp/piper.wav
Leave the wrapper running.
4. EchoKit config.toml
With all three services up, EchoKit's configuration points at each by URL:
addr = "0.0.0.0:8080"
hello_wav = "hello.wav"
[asr]
platform = "openai"
url = "http://localhost:9092/v1/audio/transcriptions"
api_key = "NONE"
model = "whisper"
lang = "en"
prompt = "Hello\n(noise)\n(silence)\n"
[llm]
platform = "openai_chat"
url = "http://localhost:11434/v1/chat/completions"
api_key = "NONE"
model = "qwen3.5:9b-q4_K_M"
history = 20
[[llm.sys_prompts]]
role = "system"
content = """
You are a helpful, concise voice assistant. Keep answers short —
one or two sentences unless more is clearly needed.
"""
[tts]
platform = "openai"
url = "http://localhost:9094/v1/audio/speech"
api_key = "NONE"
model = "piper"
voice = "en_US-ryan-high"
Build and run the EchoKit server from source (https://github.com/second-state/echokit_server):
git clone https://github.com/second-state/echokit_server
cd echokit_server
## Put your config.toml and a hello.wav in the working directory.
cargo build --release
RUST_LOG=info ./target/release/echokit_server
On the device side, pair via the https://echokit.dev/setup/ page over Bluetooth and set the WebSocket server URL to:
ws://<your-machine-ip>:8080/ws/
Note the ws:// (not http://) scheme and the
trailing /ws/ path — both are required. The same URI can be
pasted into the browser tester at https://echokit.dev/chat/ to
validate the server without the device.
5. Bringing it up
Start services in this order; each one leaves a foreground process, so use separate terminals or a multiplexer:
- Ollama — already running as a systemd service after install.
- whisper-server on port 9092 (section 2).
- Piper Flask wrapper on port 9094 (section 3).
- echokit_server on port 8080 (section 4).
If something misbehaves, curl each local endpoint in
isolation to pinpoint the stage. RUST_LOG=debug on the
EchoKit server will show which upstream call failed.
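The "curl each endpoint in isolation" step can be scripted as a quick port probe. A sketch using only the standard library; the service names, port numbers, and function names are taken from this guide's configuration, not from any EchoKit tooling:

```python
import socket

# Ports as configured in this guide.
SERVICES = {
    "ollama (LLM)": 11434,
    "whisper-server (ASR)": 9092,
    "piper wrapper (TTS)": 9094,
    "echokit_server": 8080,
}

def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def report(services):
    """Map each service name to whether its port answers."""
    return {name: port_open(port) for name, port in services.items()}

if __name__ == "__main__":
    for name, up in report(SERVICES).items():
        print(f"{'UP  ' if up else 'DOWN'}  {name}")
```

A port that answers can still serve errors, so this only localizes the dead stage; follow up with the per-service curl tests above.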
Alternative paths that did not pan out
Recorded here for the benefit of anyone else considering similar approaches.
qwen3_tts_rs on GPU
Second State's own Qwen3-TTS Rust port is an attractive option — same maintainers as EchoKit itself, safetensors-based, no baked-in TorchScript device or dtype assumptions. The CPU path works out of the box (the installer's default pick) and produces excellent voice quality, including built-in named speakers like "Vivian" (named after a Second State engineer, a nice demo detail). The catch is that it is very slow on CPU — several seconds per short phrase on a 5900X — making it impractical for conversational use.
The GPU path did not work in this environment. The CLI binary
hardcodes Device::Cpu; patching it to
Device::cuda_if_available() and rebuilding produces a
binary that panics at model load with:
Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend.
Diagnosis: tch's torch-sys build script
correctly emits cargo:rustc-link-lib=torch_cuda, but those
link directives do not propagate to the final binary's link step.
ldd on the resulting binary shows no libtorch
.so files in its DT_NEEDED list — only
libcudart.so.12 as a transitive dependency. RUSTFLAGS with
--no-as-needed and --copy-dt-needed-entries do
not help: there is nothing to save, because the libtorch libraries are
never passed to the linker to begin with.
This is plausibly a tch/torch-sys packaging
bug surfacing under recent Rust toolchains, which default to
rust-lld as the linker on Linux. An issue has been filed
upstream.
gsv_tts (GPT-SoVITS)
EchoKit's first-party local TTS option. Builds cleanly on Debian but
the published v2pro model weights have baked-in fp16 and
cuda:0 device assignments in the exported TorchScript
graph. Any GPU path fails with either a complex_half FFT
error on Ampere or a CUDAHalf / CUDAFloat type mismatch depending on
which model variant is used. Only works with
CUDA_VISIBLE_DEVICES="" and the .cpu.pt model
variants, i.e. CPU-only. The reference-audio voice cloning also wants a
cleaner recording than the shipped examples provide. Issue filed
upstream. Not recommended unless voice cloning is specifically needed
and the model-export problems are acceptable.
Piper via LlamaEdge
tts-api-server.wasm
The official-looking path for Piper in the EchoKit/LlamaEdge
ecosystem. It requires a specific WasmEdge plugin
(wasi_nn-piper) that the Debian wasmedge
package does not ship. Upstream plugin tarballs are built for Ubuntu
20.04 / 22.04 glibc, which does not cleanly match Debian Sid. The Python
wrapper used here is simpler and more portable across distros, at the
cost of depending on Python and Flask — a trade worth making.
Known limitations
- Whisper ASR is CPU-only via the Debian package. With the
native-libggml rebuild described above, latency on a Zen 3-class CPU is
a few hundred ms for a 5s utterance — fast enough that GPU ASR offers no
practical benefit. For older CPUs where the rebuild doesn't help as
much, consider building whisper.cpp from source with
-DGGML_CUDA=ON if you have an NVIDIA GPU to spare.
- Piper voice quality is audibly synthetic. It is clearly intelligible and has a consistent speaker identity, but it is not on par with Qwen3-TTS or commercial TTS. It is fast, reliable, and free, which is why it wins for real-time use here.
- The Python wrapper is a foreground process with no restart logic. For a durable setup, wrap it in a systemd --user unit.
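A sketch of such a unit, assuming the paths used in this guide; save it as ~/.config/systemd/user/piper-tts.service, then run systemctl --user daemon-reload followed by systemctl --user enable --now piper-tts:

```ini
[Unit]
Description=Piper TTS HTTP wrapper for EchoKit

[Service]
# %h expands to the user's home directory in systemd units.
ExecStart=/usr/bin/python3 %h/echokit/tts/server.py
Restart=on-failure

[Install]
WantedBy=default.target
```

The same pattern works for whisper-server and echokit_server if you want the whole stack to survive reboots.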
Summary
| Component | Backend | Device | Status |
|---|---|---|---|
| LLM | Ollama + qwen3.5:9b-q4_K_M | GPU | Works |
| ASR | whisper.cpp (Debian package) | CPU | Works |
| TTS | Piper via Python Flask wrapper | CPU | Works |
| End-to-end voice conversation | — | — | Works |
Start all four services, point the EchoKit device at
ws://<your-ip>:8080/ws/, and talk.