Running EchoKit With Fully Local AI (with Claude)

2026-04-20 22:19:43 PDT

Bart Massey 2026-04-20

I got EchoKit DIY running with all-local AI on my home box — sort of. I still need to sort out a networking thing to try the device itself, but the browser simulation works fine.

EchoKit is a fun personal assistant device from Second State — think Google Home or Alexa. It's got a good speaker and microphone, and connects to external AI via WiFi. It is intended primarily as a learning tool for AI, Rust and embedded development.

Second State gave me an EchoKit (US$50) to evaluate back in November or so, and I'm really embarrassed to only be posting about it now. I unboxed the EchoKit fairly quickly, got it running, and printed a bracket/case thingy from a model supplied by Second State. Then I got sidetracked by life. Now I'm getting ready to run a workshop centered around the EchoKit, so I've got back to playing with it.

(I've left the unboxing video unpublished until someone other than me has had a chance to look at it and say that it looks [barely] usable.)

The firmware and the external server for EchoKit are both written in Rust: everything is open source, and the hardware is mostly published. There's a plethora of first-party open-source packages available — they seem to have been heavily AI co-written.

The core hardware for the DIY EchoKit is an ESP32-S3, and the device runs ESP-IDF (and thus FreeRTOS). The Rust bindings have std support, which makes "embedded" development pretty easy — there's 8MB of RAM available for a decent-sized heap or whatever. The speaker is powered by an I2S amplifier, and the MEMS microphone is also I2S. The speaker is very loud and clear, and the microphone works really well. There is a 1.25" TFT screen, which currently isn't used much by the firmware.

Out of the box, the Automatic Speech Recognition (ASR), Text-To-Speech (TTS) and the LLM assistant service are all supplied indirectly via ChatGPT. For both ethical and practical reasons, I would prefer not to do this; the supplied alternative is Groq, which is not desirable either. I thus set out to see what this device would be like with all the services running locally.

After about four hours of fooling around with Claude Code, I managed to achieve my local-AI goal. tl;dr: I'm running a local EchoKit server, Qwen3.5 9B with 4-bit quantization on Ollama for the LLM on my GPU, Whisper for ASR on my CPU, and Piper for TTS on my CPU. Running ASR/TTS on the CPU is both an attempt to save my 12GB of VRAM for the LLM + my normal desktop, and a way to work around the vagaries of trying to get ASR and TTS interfaces to use CUDA. It seems plenty efficient, the ASR quality is great, and the TTS is good enough to be quite usable.

The whole story is… large. I had Claude write a summary of our session: see below. There were a lot of adventures here, but I think the end goal was worth it. Enjoy.


Fully-Local EchoKit on Debian

Claude Code with Bart Massey

EchoKit is an open-source voice agent platform: an ESP32-based device (DIY or pre-assembled) talks to a WebSocket server over WiFi, and the server runs a three-stage pipeline — speech recognition, language model, text-to-speech — to hold a voice conversation with the user.

This guide sets up that server entirely from local components on Debian: no cloud APIs, no tokens, no network calls off the box. It reflects what was made to work end-to-end on one Debian machine. Approaches that were tried and abandoned are recorded at the end for the benefit of others considering them.

Scope

What this guide covers:

What this guide does not cover:

Tested environment

  - Debian (Sid at the time of writing), x86_64
  - AMD Ryzen 9 5900X (12 cores)
  - NVIDIA GPU with 12 GB of VRAM

Other recent Debian / Ubuntu installs should work the same way. Only the LLM uses the GPU in this setup; ASR and TTS run on CPU.

Architecture

  EchoKit device (ESP32)
         |
         |  WebSocket over local WiFi
         v
  echokit_server                       (localhost:8080)
         |
         +---> ASR:  whisper-server    (localhost:9092)
         |
         +---> LLM:  ollama            (localhost:11434)
         |
         +---> TTS:  piper via Flask   (localhost:9094)

All three backing services expose OpenAI-compatible HTTP endpoints, which is the contract EchoKit's config.toml is built around.
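The turn-taking contract can be pictured as a straight pipeline. Here is a schematic sketch of one conversational turn, with the endpoint URLs taken from the diagram above; `converse_once` and the stage functions are placeholders for illustration, not EchoKit's actual code:

```python
# Endpoint URLs from the architecture diagram above.
ASR_URL = "http://localhost:9092/v1/audio/transcriptions"
LLM_URL = "http://localhost:11434/v1/chat/completions"
TTS_URL = "http://localhost:9094/v1/audio/speech"


def converse_once(wav_bytes, transcribe, chat, synthesize):
    """One conversational turn: device audio in, reply audio out.

    Each stage is an OpenAI-style HTTP call in the real server; here the
    stages are passed in as functions so the shape of the pipeline is clear.
    """
    text = transcribe(ASR_URL, wav_bytes)   # ASR: speech -> text
    reply = chat(LLM_URL, text)             # LLM: user text -> assistant text
    return synthesize(TTS_URL, reply)       # TTS: assistant text -> speech
```

Because each stage speaks the same OpenAI-compatible protocol, any of the three backends can be swapped out independently by changing one URL in config.toml.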


1. LLM: Ollama with Qwen3.5

Ollama is the lowest-effort way to serve an OpenAI-compatible /v1/chat/completions endpoint locally.

## Install
curl -fsSL https://ollama.com/install.sh | sh

## Pull a model. Qwen3.5 9B at Q4_K_M is a good fit for a 12 GB GPU;
## it will run on CPU too, just slower.
ollama pull qwen3.5:9b-q4_K_M

## Ollama runs as a systemd service on install; nothing else to start.
## Sanity-check:
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3.5:9b-q4_K_M","messages":[{"role":"user","content":"Say hi."}]}'

If the JSON response includes the model's reply, the LLM is ready.

Smaller variants (qwen3.5:4b, qwen3.5:2b) are worth trying on machines with less GPU memory or no GPU.
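The same sanity check can be scripted from Python using only the standard library. A minimal sketch, assuming the endpoint and model name above; the `ask` helper is mine, not part of EchoKit or Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model: str, user_text: str) -> dict:
    # OpenAI-style chat-completions payload, as in the curl example above.
    return {"model": model, "messages": [{"role": "user", "content": user_text}]}


def ask(model: str, user_text: str) -> str:
    # POST the payload to the local Ollama endpoint and extract the reply text.
    body = json.dumps(build_chat_request(model, user_text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("qwen3.5:9b-q4_K_M", "Say hi."))
```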


2. ASR: whisper.cpp from Debian

The Debian whisper.cpp package ships whisper-server, which exposes an OpenAI-compatible /v1/audio/transcriptions endpoint. It is CPU-only in this package (Debian main does not allow CUDA-linked binaries), which is fine for voice-assistant use on a modern CPU.

Install and run

sudo apt update
sudo apt install whisper.cpp

## Fetch a model. small.en with q5_1 quantization is a good default for
## conversational use — clear English voice-assistant input transcribes
## accurately and the model is ~3x faster than unquantized small.en.
mkdir -p ~/echokit/asr && cd ~/echokit/asr
curl -L -o ggml-small.en-q5_1.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en-q5_1.bin

## Find the exact binary name (upstream renamed it at some point):
dpkg -L whisper.cpp | grep -E 'bin/whisper.*server'

## Start the server. --threads should match your CPU's physical core count
## (not logical / SMT count) — whisper.cpp does not benefit from SMT and
## using all logical cores typically hurts due to cache contention.
whisper-server \
  --model ~/echokit/asr/ggml-small.en-q5_1.bin \
  --host 127.0.0.1 --port 9092 \
  --inference-path /v1/audio/transcriptions \
  --threads 12

--threads defaults to 4, which noticeably underutilises any modern desktop CPU. For a 12-core Ryzen 9 5900X, --threads 12 is correct; adjust for your machine. Omit --convert if your incoming audio is already 16 kHz mono 16-bit PCM (as EchoKit's device audio is) — that flag shells out to ffmpeg on every request and adds latency for no gain.
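The physical core count can be read from /proc/cpuinfo rather than guessed. A small sketch, Linux-specific; the `physical_cores` helper is mine, not part of whisper.cpp:

```python
import os


def physical_cores(cpuinfo_path: str = "/proc/cpuinfo") -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo.

    SMT siblings share both fields, so they collapse to one entry,
    giving the physical core count rather than the logical one.
    """
    cores = set()
    phys = core = None
    with open(cpuinfo_path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, _, value = line.partition(":")
            key = key.strip()
            if key == "physical id":
                phys = value.strip()
            elif key == "core id":
                core = value.strip()
                cores.add((phys, core))
    # Fall back to the logical count if the fields are absent (some VMs).
    return len(cores) or (os.cpu_count() or 1)
```

On a 5900X this returns 12, which is the value to pass to --threads.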

Smoke-test from another terminal with any short 16 kHz mono WAV:

time curl -s http://localhost:9092/v1/audio/transcriptions \
  -F file=@/path/to/some.wav -F model=whisper

A JSON {"text": "..."} response confirms it works. On a 5900X with the stock Debian package and small.en-q5_1, expect around 1.5s of wall-clock for 5s of audio. That works, but is slower than necessary; for a comfortable EchoKit experience you want sub-second, and the rebuild described next gets you there.

Rebuilding libggml for your CPU

whisper-server is thin plumbing around the libggml library, which is where all the compute-intensive matrix kernels live. Debian's libggml package is built for a conservative CPU baseline (Haswell on x86_64) for portability across the distro's supported hardware. On anything newer — Zen 3, Zen 4, modern Intel — this leaves most of the instruction set unused: no AVX-512, no VNNI, nothing beyond Haswell's feature set. Rebuilding libggml with -march=native typically yields a ~4x speedup on these CPUs. whisper.cpp itself does not need rebuilding, because it links libggml dynamically.

mkdir -p ~/src/ggml-deb && cd ~/src/ggml-deb
apt source ggml
sudo apt build-dep ggml
cd ggml-*/

## Edit debian/rules to add native flags to the dh_auto_configure override.
## Find the existing `override_dh_auto_configure:` block and add:
##     -DGGML_NATIVE=ON \
##     -DCMAKE_C_FLAGS="-march=native -O3" \
##     -DCMAKE_CXX_FLAGS="-march=native -O3"
## (Mind the tab indentation — it's a Makefile.)
$EDITOR debian/rules

## Bump the version so apt can track and upgrade cleanly later:
dch -l +native "Rebuild with GGML_NATIVE=ON and -march=native"

## Build (unsigned, binary-only):
dpkg-buildpackage -us -uc -b
cd ..

## Install the resulting .debs and pin them so apt doesn't silently replace
## them with the stock ones on upgrade:
sudo dpkg -i libggml*+native*.deb
sudo apt-mark hold libggml libggml-dev

Restart whisper-server, rerun the timing test. On a 5900X with small.en-q5_1, expect a few hundred ms for 5s of audio — comfortably inside EchoKit's latency budget.

If your CPU is Haswell-era or older, the rebuild buys little; the stock package is already tuned for you.

Voice activity detection (optional)

EchoKit supports an optional VAD service that detects when a speaker has finished talking, so the server can hand audio to whisper at the right moment instead of sending arbitrary-length buffers. Once whisper is fast enough (sub-second with the native libggml rebuild), VAD is not strictly needed for the pipeline to function — the device-side turn-detection is usually adequate.

If you do want server-side VAD, options are:


3. TTS: Piper with a Python HTTP wrapper

Piper is an ONNX-based neural TTS from the Rhasspy project. It runs as a native binary with per-voice model files and is fast on CPU. It has no built-in HTTP server, so a small Python wrapper bridges it to EchoKit's OpenAI-style TTS endpoint.

Install Piper and download a voice

mkdir -p ~/echokit/tts && cd ~/echokit/tts

## Binary release
curl -LO https://github.com/rhasspy/piper/releases/download/2023.11.14-2/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz
## This produces ./piper/piper (binary) and ./piper/espeak-ng-data/

## Voice model. Ryan-high is a clear, natural American English male voice.
## Browse https://rhasspy.github.io/piper-samples/ to pick a different one.
mkdir -p voices && cd voices
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx
curl -LO https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx.json
cd ..

Quick CLI smoke test:

echo "Hello, this is Piper." | \
  ./piper/piper --model voices/en_US-ryan-high.onnx --output_file /tmp/test.wav
aplay /tmp/test.wav

Install the Python wrapper

One dependency (note that recent Debian pip refuses to install outside a virtual environment; use a venv or pass --break-system-packages):

python3 -m pip install --user flask

Save the following as ~/echokit/tts/server.py:

#!/usr/bin/env python3
"""Minimal OpenAI-compatible HTTP wrapper around the Piper TTS binary."""
import os
import subprocess
import tempfile

from flask import Flask, request, Response

PIPER_BIN = os.path.expanduser("~/echokit/tts/piper/piper")
MODEL = os.path.expanduser("~/echokit/tts/voices/en_US-ryan-high.onnx")

app = Flask(__name__)


@app.post("/v1/audio/speech")
def speech():
    body = request.get_json(force=True)
    text = body.get("input", "")
    if not text:
        return ("missing 'input'", 400)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        out_path = f.name
    try:
        subprocess.run(
            [PIPER_BIN, "--model", MODEL, "--output_file", out_path],
            input=text.encode("utf-8"),
            check=True,
            capture_output=True,
        )
        with open(out_path, "rb") as g:
            wav = g.read()
    finally:
        os.unlink(out_path)

    return Response(wav, mimetype="audio/wav")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9094)

Run it:

python3 ~/echokit/tts/server.py

Smoke-test from another terminal:

curl -X POST http://localhost:9094/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"Hello from Piper via the wrapper."}' \
  --output /tmp/piper.wav
aplay /tmp/piper.wav

Leave the wrapper running.
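Rather than keeping a terminal open forever, the wrapper can run as a systemd user service. A sketch — the unit name and paths are mine, so adjust to taste — saved as ~/.config/systemd/user/piper-tts.service:

```ini
[Unit]
Description=Piper TTS HTTP wrapper for EchoKit

[Service]
# %h expands to the user's home directory in user units.
ExecStart=/usr/bin/python3 %h/echokit/tts/server.py
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with systemctl --user enable --now piper-tts.service, and check on it with systemctl --user status piper-tts.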


4. EchoKit config.toml

With all three services up, EchoKit's configuration points at each by URL:

addr = "0.0.0.0:8080"
hello_wav = "hello.wav"

[asr]
platform = "openai"
url = "http://localhost:9092/v1/audio/transcriptions"
api_key = "NONE"
model = "whisper"
lang = "en"
prompt = "Hello\n(noise)\n(silence)\n"

[llm]
platform = "openai_chat"
url = "http://localhost:11434/v1/chat/completions"
api_key = "NONE"
model = "qwen3.5:9b-q4_K_M"
history = 20

[[llm.sys_prompts]]
role = "system"
content = """
You are a helpful, concise voice assistant. Keep answers short —
one or two sentences unless more is clearly needed.
"""

[tts]
platform = "openai"
url = "http://localhost:9094/v1/audio/speech"
api_key = "NONE"
model = "piper"
voice = "en_US-ryan-high"

Build and run the EchoKit server from source (https://github.com/second-state/echokit_server):

git clone https://github.com/second-state/echokit_server
cd echokit_server
## Put your config.toml and a hello.wav in the working directory.
cargo build --release
RUST_LOG=info ./target/release/echokit_server

On the device side, pair via the https://echokit.dev/setup/ page over Bluetooth and set the WebSocket server URL to:

ws://<your-machine-ip>:8080/ws/

Note the ws:// (not http://) scheme and the trailing /ws/ path — both are required. The same URI can be pasted into the browser tester at https://echokit.dev/chat/ to validate the server without the device.
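The URI format is easy to get wrong when typing it into the setup page. A quick sanity check on the two requirements above — the `valid_echokit_uri` helper is mine, not part of EchoKit:

```python
from urllib.parse import urlparse


def valid_echokit_uri(uri: str) -> bool:
    # EchoKit wants a ws:// (not http://) scheme and the trailing /ws/ path.
    parsed = urlparse(uri)
    return (
        parsed.scheme == "ws"
        and parsed.path == "/ws/"
        and bool(parsed.hostname)
    )
```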


5. Bringing it up

Start services in this order; each one leaves a foreground process, so use separate terminals or a multiplexer:

  1. Ollama — already running as a systemd service after install.
  2. whisper-server on port 9092 (section 2).
  3. Piper Flask wrapper on port 9094 (section 3).
  4. echokit_server on port 8080 (section 4).

If something misbehaves, curl each local endpoint in isolation to pinpoint the stage. RUST_LOG=debug on the EchoKit server will show which upstream call failed.
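A quick way to pinpoint which stage is down is to probe each port before reaching for curl. A sketch — the port list matches the architecture diagram above, and the `port_open` helper is mine:

```python
import socket

# Service ports from the architecture diagram.
SERVICES = {
    "echokit_server": 8080,
    "whisper-server": 9092,
    "ollama": 11434,
    "piper wrapper": 9094,
}


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    # True if something is listening on host:port.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if port_open("127.0.0.1", port) else "DOWN"
        print(f"{name:16} :{port:<6} {status}")
```

Anything reported DOWN is the stage to restart; anything up but misbehaving is the stage to curl directly.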


Alternative paths that did not pan out

Recorded here for the benefit of anyone else considering similar approaches.

qwen3_tts_rs on GPU

Second State's own Qwen3-TTS Rust port is an attractive option — same maintainers as EchoKit itself, safetensors-based, no baked-in TorchScript device or dtype assumptions. The CPU path works out of the box (the installer's default pick) and produces excellent voice quality, including built-in named speakers like "Vivian" (named after a Second State engineer, a nice demo detail). The catch is that it is very slow on CPU — several seconds per short phrase on a 5900X — making it impractical for conversational use.

The GPU path did not work in this environment. The CLI binary hardcodes Device::Cpu; patching it to Device::cuda_if_available() and rebuilding produces a binary that panics at model load with:

Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend.

Diagnosis: tch's torch-sys build script correctly emits cargo:rustc-link-lib=torch_cuda, but those link directives do not propagate to the final binary's link step. ldd on the resulting binary shows no libtorch .so files in its DT_NEEDED list — only libcudart.so.12 as a transitive dependency. RUSTFLAGS with --no-as-needed and --copy-dt-needed-entries do not help: there is nothing to save, because the libtorch libraries are never passed to the linker to begin with.

This is plausibly a tch/torch-sys packaging bug surfacing under recent Rust toolchains, which default to rust-lld as the linker on Linux. An issue has been filed upstream.

gsv_tts (GPT-SoVITS)

EchoKit's first-party local TTS option. Builds cleanly on Debian but the published v2pro model weights have baked-in fp16 and cuda:0 device assignments in the exported TorchScript graph. Any GPU path fails with either a complex_half FFT error on Ampere or a CUDAHalf / CUDAFloat type mismatch depending on which model variant is used. Only works with CUDA_VISIBLE_DEVICES="" and the .cpu.pt model variants, i.e. CPU-only. The reference-audio voice cloning also wants a cleaner recording than the shipped examples provide. Issue filed upstream. Not recommended unless voice cloning is specifically needed and the model-export problems are acceptable.

Piper via LlamaEdge tts-api-server.wasm

The official-looking path for Piper in the EchoKit/LlamaEdge ecosystem. It requires a specific WasmEdge plugin (wasi_nn-piper) that the Debian wasmedge package does not ship. Upstream plugin tarballs are built for Ubuntu 20.04 / 22.04 glibc, which does not cleanly match Debian Sid. The Python wrapper used here is simpler and more portable across distros, at the cost of depending on Python and Flask — a trade worth making.

Known limitations


Summary

  Component     Backend                          Device   Status
  LLM           Ollama + qwen3.5:9b-q4_K_M       GPU      Works
  ASR           whisper.cpp (Debian package)     CPU      Works
  TTS           Piper via Python Flask wrapper   CPU      Works
  End-to-end voice conversation                           Works

Start all four services, point the EchoKit device at ws://<your-ip>:8080/ws/, and talk.