Aethos · Voice · by STK Engineering · Local speech recognition & synthesis · 2026

Your voice. Your infrastructure.
Zero cloud.

Real-time speech recognition with Whisper, neural speech synthesis with F5-TTS and Kokoro, and voice cloning — all running on your hardware, behind your firewall. Not a single second of audio leaves the building.

I · CAPABILITIES Full-stack voice, fully local

Real-Time Speech Recognition

Whisper on NPU with streaming. Multilingual, with a word-error rate below 5% on domain-adapted data. Medical, legal and financial terminology supported out of the box.

Neural Speech Synthesis

F5-TTS and Kokoro engines. Natural prosody, emotional range, breathing patterns. Indistinguishable from human speech.

Voice Cloning

Clone any voice from a short sample. Your CEO's voice for internal comms, your brand voice for customer interactions. Consent-based, auditable, sovereign.

Multilingual by Default

30+ languages, code-switching within sentences. Accent preservation, dialect awareness. No per-language licensing.

NPU-Accelerated

Runs on Ryzen AI NPU for always-on, low-power inference. GPU not required for standard workloads. Scales from laptop to data centre.

Streaming Pipeline

Sub-200ms first-token latency. Bidirectional streaming via WebSocket and gRPC. Interruption handling, barge-in detection built in.
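
Barge-in handling can be pictured as a small state machine: while the agent is speaking, sustained user speech cuts playback and hands the turn back to the listener. The sketch below is illustrative only. `TurnManager`, its states and the `min_speech_frames` threshold are hypothetical names chosen for this example, not part of the Aethos API.

```python
from enum import Enum


class Turn(Enum):
    LISTENING = "listening"   # waiting for or receiving user speech
    SPEAKING = "speaking"     # agent audio is being played back


class TurnManager:
    """Hypothetical barge-in controller driven by per-frame VAD decisions."""

    def __init__(self, min_speech_frames: int = 3):
        # Require a few consecutive speech frames so a cough or click
        # does not interrupt the agent (the threshold is an assumed value).
        self.min_speech_frames = min_speech_frames
        self.turn = Turn.LISTENING
        self._speech_run = 0

    def start_speaking(self) -> None:
        """Mark the start of agent playback."""
        self.turn = Turn.SPEAKING
        self._speech_run = 0

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True if playback should be cut."""
        if self.turn is not Turn.SPEAKING:
            return False
        self._speech_run = self._speech_run + 1 if is_speech else 0
        if self._speech_run >= self.min_speech_frames:
            self.turn = Turn.LISTENING
            self._speech_run = 0
            return True
        return False
```

In a streaming server, `on_frame` would be called for every incoming audio frame, and a `True` result would stop the outgoing TTS stream.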

II · USE CASES Where voice creates value

CASE 01

Voice Agents for Customer Service

AI phone agents that listen, understand and respond in natural speech. Handle tier-1 inquiries, route complex cases, document everything. Available 24/7 in 30+ languages.

CASE 02

Dictation & Documentation

Real-time transcription for doctors, lawyers, engineers. Domain-specific vocabulary, automatic formatting, direct integration with EHR and document management systems.

CASE 03

Accessible Interfaces

Screen readers, voice navigation, audio descriptions. Making applications accessible to visually impaired users. Compliance with WCAG and EN 301 549.

CASE 04

Brand Voice & Content

Generate training videos, product announcements, internal communications in your own brand voice. Consistent tone across every channel and every supported language.

III · THE PIPELINE From sound to meaning and back

A six-stage pipeline that captures audio, understands speech, processes intent, and responds in natural voice — all locally, all in real time. Sub-second response time end-to-end.

Audio Capture

Microphone input, telephony stream or file upload. Noise cancellation and gain normalisation applied at source.

VAD & Segmentation

Voice activity detection isolates speech from silence. Segments are chunked for streaming inference.
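
To make the segmentation step concrete, here is a minimal energy-based sketch. It is an assumption for illustration: the page does not say which detector Aethos uses, and production systems typically rely on a trained VAD model. The function names, frame length, threshold and hangover values are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_speech(
    samples: Sequence[float],
    frame_len: int = 480,        # assumed: 30 ms at 16 kHz
    threshold: float = 0.02,     # assumed RMS threshold, tune per microphone
    hangover_frames: int = 2,    # silent frames tolerated inside a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of detected speech segments."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    silent_run = 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_frames:
                # Close the segment at the end of the last speech frame.
                end = (i - silent_run + 1) * frame_len
                segments.append((start, end))
                start, silent_run = None, 0
    if start is not None:
        # Stream ended mid-segment: trim any trailing silent frames.
        segments.append((start, (n_frames - silent_run) * frame_len))
    return segments
```

Each returned segment can then be passed to the recognition stage as one chunk.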

Whisper STT

Speech-to-text via Whisper Large V3. Multilingual transcription with domain-adapted vocabulary.

LLM Processing

Transcribed text is processed by the local LLM for intent recognition, response generation or task execution.

TTS Synthesis

Response text is synthesised into speech via F5-TTS or Kokoro. Voice cloning applied if configured.

Audio Playback

Synthesised audio is streamed back to the client. Sub-200ms latency from text to first audio frame.
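
The six stages above can be sketched as a chain of functions that each enrich a shared record. This is a simplified, sequential illustration, not the Aethos implementation: every stage body is a stand-in (the "audio" is plain UTF-8 text so the example runs without any model), and names such as `Utterance` and `run` are hypothetical. A real deployment would stream between stages rather than run them one after another.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Utterance:
    """State carried through the pipeline for one user utterance."""
    audio_in: bytes
    segments: List[bytes] = field(default_factory=list)
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Utterance], Utterance]


def capture(u: Utterance) -> Utterance:
    u.trace.append("capture")            # stand-in for mic/telephony input
    return u


def vad(u: Utterance) -> Utterance:
    u.trace.append("vad")                # stand-in: treat all input as speech
    u.segments = [u.audio_in] if u.audio_in else []
    return u


def stt(u: Utterance) -> Utterance:
    u.trace.append("stt")                # stand-in for Whisper transcription
    u.transcript = b" ".join(u.segments).decode("utf-8")
    return u


def llm(u: Utterance) -> Utterance:
    u.trace.append("llm")                # stand-in for the local LLM
    u.reply_text = f"You asked: {u.transcript}"
    return u


def tts(u: Utterance) -> Utterance:
    u.trace.append("tts")                # stand-in for F5-TTS or Kokoro
    u.audio_out = u.reply_text.encode("utf-8")
    return u


def playback(u: Utterance) -> Utterance:
    u.trace.append("playback")           # stand-in for streaming to client
    return u


PIPELINE: List[Stage] = [capture, vad, stt, llm, tts, playback]


def run(u: Utterance, stages: List[Stage] = PIPELINE) -> Utterance:
    """Apply each stage in order and return the enriched record."""
    for stage in stages:
        u = stage(u)
    return u
```

Keeping each stage behind the same signature makes it straightforward to swap one engine for another or to insert logging between stages.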

IV · SPECIFICATIONS Technical detail

Built to
specification.

Recognition

Whisper Large V3

<50ms chunk latency. WER <5% on domain-adapted data. Streaming and batch modes.
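
Word error rate (WER) is conventionally computed as the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below shows that standard calculation so the figure can be checked on your own data. The function name and the lowercase, whitespace-based normalisation are assumptions; evaluation suites often apply stricter text normalisation.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring "the patient has a cute bronchitis" against the reference "the patient has acute bronchitis" counts one substitution and one insertion over five reference words, a WER of 0.4.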

Synthesis

F5-TTS / Kokoro

24kHz output. <200ms first-token latency. Natural prosody with emotional control.
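
For capacity planning, the 24 kHz output rate translates into frame sizes and bandwidth as follows. The 16-bit mono PCM format and the 20 ms frame length are assumptions made for this arithmetic; the page specifies only the sample rate.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def bytes_per_frame(
    sample_rate_hz: int,
    frame_ms: int,
    bytes_per_sample: int = 2,   # assumed: 16-bit PCM
    channels: int = 1,           # assumed: mono
) -> int:
    """Payload size of one uncompressed PCM frame."""
    return samples_per_frame(sample_rate_hz, frame_ms) * bytes_per_sample * channels


def bitrate_kbps(
    sample_rate_hz: int,
    bytes_per_sample: int = 2,
    channels: int = 1,
) -> float:
    """Uncompressed PCM bitrate in kilobits per second."""
    return sample_rate_hz * bytes_per_sample * 8 * channels / 1000
```

Under these assumptions, a 20 ms frame at 24 kHz holds 480 samples (960 bytes), and the uncompressed stream runs at 384 kbit/s per voice channel.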

Languages

30+ Languages

Real-time code-switching. Accent preservation. Dialect awareness. No per-language licensing.

Hardware

NPU / GPU / CPU

NPU at 5W for always-on inference. GPU at 50W for peak loads. CPU fallback for maximum compatibility.

Integration

WebSocket, gRPC, REST, SIP

Bidirectional streaming. SIP trunk for telephony. REST for batch processing. gRPC for low-latency pipelines.

Security

Zero-Cloud Architecture

No data egress. Encrypted at rest and in transit. Full audit trail on every inference request.
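
One common way to make an audit trail tamper-evident is to chain entries by hash, so that altering any past record invalidates everything after it. The sketch below illustrates that general technique with Python's standard library. It is not a description of how Aethos stores its audit trail; the field names, `append_entry` and `verify_chain` are hypothetical. Note that only a digest of the audio is logged, not the audio itself.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: Dict[str, str]) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(
    log: List[Dict[str, str]],
    request_id: str,
    model: str,
    audio: bytes,
    timestamp: str,
) -> Dict[str, str]:
    """Append one inference record, linked to the previous entry's hash."""
    body = {
        "request_id": request_id,
        "model": model,
        "timestamp": timestamp,
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

An auditor can run `verify_chain` over an exported log to confirm that no record has been edited or removed from the middle of the chain.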

V · GET STARTED Your voice, on your terms

Let's give your
business a voice.

Ready to deploy sovereign speech AI? Tell us about your use case and we'll set up a demo tailored to your infrastructure.

Vienna, Austria

office@stk-engineering.com
Ferrogasse 59, 1180 Wien

FN 386373x · Handelsgericht Wien
UID: ATU67528106

Belgrade, Serbia

office@stk-engineering.com
Moravska 6, 11000 Beograd

Maticni Broj: 20960671

Chalandri, Greece

office@stk-engineering.com
Nestoros 1, 15231 Chalandri

ΓΕΜΗ: 192581301000
ΑΦΜ: 803229510
