TRAMBA: Practical Bone-Conduction / IMU Speech Enhancement

TRAMBA enhances vibration-based speech (BCM/ACCEL): naturally robust to ambient noise but missing high-frequency components.

Vibration-based speech sensing (e.g., bone-conduction microphones and accelerometers) is inherently robust to ambient noise, but it loses high-frequency details that are critical for intelligibility and naturalness. TRAMBA restores these missing components via practical super-resolution and enhancement, enabling real-time, on-device deployment on a wearable + smartphone pipeline.

Performance vs. efficiency trade-offs compared with state-of-the-art U-Net and GAN baselines (model size, inference time, and fine-tuning time).

Why TRAMBA is practical

Core Idea: Practical Super-Resolution from Low-Rate Vibration

Unlike over-the-air (OTA) audio, BCM/ACCEL signals already suppress much ambient noise, but they contain limited high-frequency energy. TRAMBA targets robust super-resolution from signals sampled directly at low rates (rather than "filter then decimate"), which reduces both the wearable's power consumption and the amount of data it has to transmit.

Effect of sampling rate on quality: TRAMBA remains competitive even at low sampling rates.

Architecture

TRAMBA architecture: modified U-Net with self-attention in down/up blocks and a Mamba bottleneck for efficient long-range modeling.

TRAMBA adopts a modified U-Net backbone with self-attention in the down/up blocks and a Mamba bottleneck for efficient long-range modeling.

Benchmark snapshot (OTA Audio SR, 4 kHz → 16 kHz)

On the OTA audio super-resolution benchmark (512 ms window), TRAMBA achieves:

(These values match Table 1 in the paper.)

Training & Personalization (User-Specific Fine-Tuning)

TRAMBA follows a practical two-step training pipeline:
1) Pre-train using widely available OTA audio (simulate low-rate inputs by attenuating high-frequency components).
2) Fine-tune with a small amount of paired vibration data (BCM/ACCEL ↔ OTA reference) for user-specific adaptation.

This personalization step is key for handling different users and sensor placements.
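
The pre-training step above needs low-rate-like inputs derived from ordinary OTA audio. As a minimal sketch of one way to attenuate high-frequency components, the snippet below applies a windowed-sinc low-pass FIR filter in pure Python. The filter type, tap count, the 2 kHz cutoff, and the helper names (`lowpass_taps`, `attenuate_highs`, `rms`, `tone`) are all illustrative assumptions, not the procedure used in the paper.

```python
import math

def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc (Hamming) low-pass FIR taps, scaled to unit DC gain."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2                 # center tap index (num_taps is odd)
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def attenuate_highs(signal, cutoff_hz, sample_rate_hz, num_taps=101):
    """Same-length copy of `signal` with content above cutoff_hz attenuated."""
    taps = lowpass_taps(cutoff_hz, sample_rate_hz, num_taps)
    half = num_taps // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k                 # centered convolution, zero-padded at the edges
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

SAMPLE_RATE_HZ = 16_000   # full-rate OTA audio
CUTOFF_HZ = 2_000         # assumed: Nyquist limit of a 4 kHz low-rate input

def tone(freq_hz, n=840):
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ) for i in range(n)]

low_out = attenuate_highs(tone(500), CUTOFF_HZ, SAMPLE_RATE_HZ)     # below the cutoff
high_out = attenuate_highs(tone(6000), CUTOFF_HZ, SAMPLE_RATE_HZ)   # above the cutoff

# Compare RMS away from the edges, where zero padding distorts the output.
print("500 Hz tone RMS after filtering: ", rms(low_out[100:740]))
print("6000 Hz tone RMS after filtering:", rms(high_out[100:740]))
```

The 500 Hz tone should pass nearly unchanged while the 6 kHz tone is strongly suppressed, which is the property a simulated low-rate input needs. A production pipeline would more likely use an optimized DSP library than a pure-Python loop.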

System & Prototype

Data collection locations, environments, and our mobile-TRAMBA prototype.

TRAMBA is integrated into an end-to-end wearable/mobile pipeline:

Power + data-rate savings (wearable sensing + streaming)

Sampling/streaming at 4 kHz (64 kbps) consumes 3.21 mW, while streaming full-resolution 16 kHz (256 kbps) consumes 6.48 mW.
At 500 Hz (8 kbps), power drops to 2.49 mW, which corresponds to ~160% longer battery life vs. 16 kHz (6.48 / 2.49 ≈ 2.60×).
(See Table 9.)

The paper also reports that operating at 4 kHz enables 75% less transmitted data and >50% lower power compared to 16 kHz OTA streaming.
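
The figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes 16-bit mono samples (an assumption, chosen because it reproduces the quoted 8, 64, and 256 kbps rates) and assumes battery life scales inversely with the measured sensing/streaming power. The function names are illustrative.

```python
BITS_PER_SAMPLE = 16  # assumption: 16-bit mono, consistent with 4 kHz -> 64 kbps

# Power per sampling rate in mW, as quoted above (Table 9).
POWER_MW = {500: 2.49, 4_000: 3.21, 16_000: 6.48}

def data_rate_kbps(sample_rate_hz, bits_per_sample=BITS_PER_SAMPLE):
    """Raw streaming rate in kbps for uncompressed mono samples."""
    return sample_rate_hz * bits_per_sample / 1000

def percent_saving(low, high):
    """Percentage by which `low` is smaller than `high`."""
    return 100 * (1 - low / high)

def battery_life_ratio(low_power_mw, high_power_mw):
    """Battery life multiplier, assuming a fixed-capacity battery and
    that this power draw dominates consumption (an assumption)."""
    return high_power_mw / low_power_mw

for rate_hz, power_mw in POWER_MW.items():
    print(f"{rate_hz:>6} Hz: {data_rate_kbps(rate_hz):6.1f} kbps, {power_mw} mW")

data_saving = percent_saving(data_rate_kbps(4_000), data_rate_kbps(16_000))
power_saving = percent_saving(POWER_MW[4_000], POWER_MW[16_000])
life_ratio = battery_life_ratio(POWER_MW[500], POWER_MW[16_000])

print(f"4 kHz vs 16 kHz: {data_saving:.0f}% less data, {power_saving:.1f}% less power")
print(f"500 Hz vs 16 kHz: {life_ratio:.2f}x battery life")
```

Under these assumptions, the 4 kHz setting sends a quarter of the data and draws just under half the power of 16 kHz, and the 500 Hz setting gives about a 2.6× battery life multiplier, matching the figures quoted above.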

Real-time latency on phones

TRAMBA processes 512 ms windows and achieves <30 ms inference time on modern iPhones (e.g., iPhone 15 Pro: 19.584 ms; iPhone 12: 27.931 ms), enabling real-time enhancement. (See Table 10.)
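
To see how much headroom these latencies leave, the sketch below computes a real-time factor (inference time divided by the duration of audio processed). It assumes consecutive, non-overlapping 512 ms windows, which the text above does not specify. The function names are illustrative.

```python
WINDOW_MS = 512.0

# Inference times in ms, as quoted above (Table 10).
INFERENCE_MS = {"iPhone 15 Pro": 19.584, "iPhone 12": 27.931}

def real_time_factor(inference_ms, window_ms=WINDOW_MS):
    """Compute time per unit of audio; values below 1.0 keep up with real time."""
    return inference_ms / window_ms

def samples_per_window(sample_rate_hz, window_ms=WINDOW_MS):
    """Number of samples in one window at the given sampling rate."""
    return round(sample_rate_hz * window_ms / 1000)

for phone, ms in INFERENCE_MS.items():
    rtf = real_time_factor(ms)
    print(f"{phone}: RTF = {rtf:.4f} ({100 * rtf:.1f}% of the window duration)")

print("Samples per window at 4 kHz input:  ", samples_per_window(4_000))
print("Samples per window at 16 kHz output:", samples_per_window(16_000))
```

Both phones spend only a small fraction of each window on inference. Note that end-to-end delay also includes buffering a full window before processing can start, so the real-time factor describes throughput rather than total latency.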

Robustness: Noisy Environments & Movement

Code and Media

Publications