Aethos · Voice · by STK Engineering
Local speech recognition & synthesis · 2026

Your voice. Your infrastructure.
Zero cloud.

Real-time speech recognition with Whisper, neural speech synthesis with F5-TTS and Kokoro, and voice cloning — all running on your hardware, behind your firewall. Not a single second of audio leaves the building.

I · CAPABILITIES

Full-stack voice, fully local.

A

Real-Time Speech Recognition

Whisper on NPU with streaming. Multilingual, with a low word-error rate even on domain-specific vocabulary. Medical, legal and financial terminology supported out of the box.

B

Neural Speech Synthesis

F5-TTS and Kokoro engines. Natural prosody, emotional range, breathing patterns. Built to sound as close to human speech as possible.

C

Voice Cloning

Clone any voice from a short sample. Your CEO's voice for internal comms, your brand voice for customer interactions. Consent-based, auditable, sovereign.

D

Multilingual by Default

30+ languages, code-switching within sentences. Accent preservation, dialect awareness. No per-language licensing.

E

NPU-Accelerated

Runs on Ryzen AI NPU for always-on, low-power inference. GPU not required for standard workloads. Scales from laptop to data centre.

F

Streaming Pipeline

Sub-200ms first-token latency. Bidirectional streaming via WebSocket and gRPC. Interruption handling, barge-in detection built in.
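As an illustration of barge-in detection, here is a minimal Python sketch. It is not the Aethos implementation: the class name `BargeInDetector` and the `min_speech_frames` parameter are hypothetical, and it assumes an upstream voice activity detector already labels each microphone frame as speech or not. Playback is cut once the caller has spoken for a few consecutive frames, so a single noisy frame does not interrupt the agent.

```python
class BargeInDetector:
    """Stops playback when the user speaks over it (hypothetical sketch)."""

    def __init__(self, min_speech_frames: int = 3) -> None:
        # Number of consecutive speech frames required to count as an interruption.
        self.min_speech_frames = min_speech_frames
        self.playing = False
        self._speech_run = 0

    def start_playback(self) -> None:
        """Mark that synthesised audio is currently being played to the user."""
        self.playing = True
        self._speech_run = 0

    def on_mic_frame(self, is_speech: bool) -> bool:
        """Feed one microphone frame; return True if it triggered barge-in."""
        if not self.playing:
            return False
        # Count consecutive speech frames; any non-speech frame resets the run.
        self._speech_run = self._speech_run + 1 if is_speech else 0
        if self._speech_run >= self.min_speech_frames:
            self.playing = False  # the caller should stop the TTS stream here
            self._speech_run = 0
            return True
        return False


if __name__ == "__main__":
    detector = BargeInDetector(min_speech_frames=3)
    detector.start_playback()
    for flag in [True, False, True, True, True]:
        print(flag, detector.on_mic_frame(flag))
```

Raising `min_speech_frames` makes the agent harder to interrupt by accident, at the cost of reacting later to a genuine interruption.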

II · USE CASES

Where voice creates value.

CASE 01

Voice Agents for Customer Service

AI phone agents that listen, understand and respond in natural speech. They handle tier-1 inquiries, route complex cases and document every interaction, 24/7, in any of the 30+ supported languages.

CASE 02

Dictation & Documentation

Real-time transcription for doctors, lawyers, engineers. Domain-specific vocabulary, automatic formatting, direct integration into EHR/DMS systems.

CASE 03

Accessible Interfaces

Screen readers, voice navigation, audio descriptions. Making applications accessible to visually impaired users. Compliance with WCAG and EN 301 549.

CASE 04

Brand Voice & Content

Generate training videos, product announcements and internal communications in your own brand voice. Consistent tone across all channels and all supported languages.

III · THE PIPELINE

From sound to meaning and back.

A six-stage pipeline that captures audio, understands speech, processes intent, and responds in natural voice — all locally, all in real time. Sub-second response time end-to-end.

01

Audio Capture

Microphone input, telephony stream or file upload. Noise cancellation and gain normalisation applied at source.

02

VAD & Segmentation

Voice activity detection isolates speech from silence. Segments are chunked for streaming inference.

03

Whisper STT

Speech-to-text via Whisper Large V3. Multilingual transcription with domain-adapted vocabulary.

04

LLM Processing

Transcribed text is processed by the local LLM for intent recognition, response generation or task execution.

05

TTS Synthesis

Response text is synthesised into speech via F5-TTS or Kokoro. Voice cloning applied if configured.

06

Audio Playback

Synthesised audio is streamed back to the client. Sub-200ms latency from text to first audio frame.
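The six stages above can be sketched as a chain of functions. This is a minimal Python illustration, not the product's code. Stage 02 is shown as a simple energy-threshold VAD, which is an assumption (a production system may use a neural VAD instead). The capture, STT, LLM, TTS and playback stages are placeholder stubs, and all names, thresholds and return values here are hypothetical.

```python
import math
import time
from typing import Any, Callable, Dict, List, Tuple

Frame = List[float]  # one frame of PCM samples scaled to [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def segment_speech(frames: List[Frame], threshold: float = 0.02,
                   hangover: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, judged to be speech.

    A segment closes only after more than `hangover` silent frames in a row,
    so short pauses inside a sentence do not split it.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover:
                segments.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start, len(frames) - silent_run))
    return segments


def run_pipeline(stages: List[Tuple[str, Callable[[Any], Any]]],
                 payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record each stage's duration in ms."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        t0 = time.perf_counter()
        payload = stage(payload)
        timings[name] = (time.perf_counter() - t0) * 1000.0
    return payload, timings


# Placeholder stages: each stands in for a real component.
STAGES = [
    ("capture", lambda frames: frames),
    ("vad", lambda frames: [frames[a:b] for a, b in segment_speech(frames)]),
    ("stt", lambda segments: "hello" if segments else ""),
    ("llm", lambda text: f"You said: {text}"),
    ("tts", lambda text: text.encode("utf-8")),
    ("playback", lambda audio: audio),
]

if __name__ == "__main__":
    silence, speech = [0.0] * 160, [0.5] * 160
    audio = [silence] * 3 + [speech] * 4 + [silence] * 5 + [speech] * 2
    result, timings = run_pipeline(STAGES, audio)
    print(result, timings)
```

Timing each stage separately makes it easier to see which one consumes the end-to-end latency budget.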

IV · SPECIFICATIONS · Technical detail

Built to specification.

Recognition

Whisper Large V3

<50ms chunk latency. WER <5% on domain-adapted data. Streaming and batch modes.
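For readers who want to check a WER figure on their own recordings: word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The Python sketch below is a minimal illustration under that definition. The function name `word_error_rate` is hypothetical, and it skips the text normalisation (casing, punctuation) that real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the patient has acute bronchitis"
    hyp = "the patient has a cute bronchitis"
    print(f"WER: {word_error_rate(ref, hyp):.2f}")
```

In this example "acute" is misrecognised as "a cute", which counts as one substitution plus one insertion against five reference words.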

Synthesis

F5-TTS / Kokoro

24kHz output. <200ms first-token latency. Natural prosody with emotional control.

Languages

30+ Languages

Real-time code-switching. Accent preservation. Dialect awareness. No per-language licensing.

Hardware

NPU / GPU / CPU

NPU at 5W for always-on inference. GPU at 50W for peak loads. CPU fallback for maximum compatibility.
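The hardware tiers above imply a simple selection policy, sketched below in Python. This is an illustration only: the function `choose_backend`, the `peak_load` flag and the availability mapping are hypothetical, and how availability is probed on a given machine is left out.

```python
from typing import Mapping


def choose_backend(available: Mapping[str, bool], peak_load: bool = False) -> str:
    """Pick an inference backend from the ones reported as available.

    Normal operation prefers the low-power NPU; peak load prefers the GPU.
    The CPU is the last resort in both cases.
    """
    order = ("gpu", "npu", "cpu") if peak_load else ("npu", "gpu", "cpu")
    for name in order:
        if available.get(name, False):
            return name
    raise RuntimeError("no inference backend available")


if __name__ == "__main__":
    laptop = {"npu": True, "gpu": False, "cpu": True}
    server = {"npu": False, "gpu": True, "cpu": True}
    print(choose_backend(laptop))
    print(choose_backend(server, peak_load=True))
```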

Integration

WebSocket, gRPC, REST, SIP

Bidirectional streaming. SIP trunk for telephony. REST for batch processing. gRPC for low-latency pipelines.

Security

Zero-Cloud Architecture

No data egress. Encrypted at rest and in transit. Full audit trail on every inference request.

Sovereign by design

Your audio stays yours.

Not a single second of audio leaves your network. Aethos Voice is built for highly regulated industries — enterprise-grade speech AI, delivered inside your existing infrastructure.

V · GET STARTED · Your voice, on your terms

Let's give your business a voice.

Ready to deploy sovereign speech AI? Tell us about your use case and we'll set up a demo tailored to your infrastructure.

Vienna, Austria
office@stk-engineering.com
Ferrogasse 59, 1180 Wien
Belgrade, Serbia
office@stk-engineering.com
Moravska 6, 11000 Beograd
Chalandri, Greece
office@stk-engineering.com
Nestoros 1, 15231 Chalandri