Building a Voice-Controlled Motion Recognition Server

I've been building a server that combines voice control with motion recognition, and I wanted to share my journey and the key design decisions that shaped this project. The server acts as the brain for a system that understands both voice commands and motion patterns, making it suitable for sports analysis, fitness tracking, or any application requiring multimodal input processing.

Architecture Overview

The server is built using FastAPI and Python, structured as a modular system with clear separation of concerns. At its core, it handles three main types of data:

  • Voice commands and speech recognition

  • IMU sensor data from Bluetooth devices

  • WebSocket connections for real-time communication

The modular structure includes separate services for each major functionality, making the codebase maintainable and allowing individual components to be updated without affecting others.
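
To make that layout concrete, here is a minimal sketch of how such a structure can be wired up in FastAPI. The module and router names are illustrative, not the project's actual ones:

```python
# main.py - hypothetical entry point wiring the service modules together
from fastapi import FastAPI

from routes import sensors, voice, websocket  # hypothetical router modules

app = FastAPI(title="Motion Recognition Server")

# Each capability lives in its own router/service pair, so individual
# components can be updated without touching the others.
app.include_router(voice.router, prefix="/voice")
app.include_router(sensors.router, prefix="/sensors")
app.include_router(websocket.router, prefix="/ws")
```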

Voice Processing: A Two-Tier Approach

One of my most important design decisions was implementing a two-tier voice processing system. I realized early on that voice processing tasks vary widely in their requirements - keyword spotting needs to be fast and lightweight, while understanding complex commands calls for more sophisticated processing.

Vosk for Keyword Spotting

I chose Vosk for keyword spotting because it offers exceptional speed and accuracy for a focused vocabulary. The system maintains a list of German keywords (numbers, commands like "controller", "start", "stopp") and can detect them with minimal latency. Vosk runs continuously on the audio stream, consuming minimal resources while maintaining high accuracy.

The implementation uses WebRTC VAD (Voice Activity Detection) to gate the audio stream, engaging the recognition engine only when speech is detected. This significantly reduces computational overhead. When a keyword is spotted, the system can trigger more complex processing if needed.
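
A minimal sketch of this gating pattern, assuming 16 kHz mono PCM input; the keyword list and model path are illustrative:

```python
import json

import webrtcvad
from vosk import KaldiRecognizer, Model

SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

# A restricted grammar keeps keyword spotting fast; list is illustrative.
KEYWORDS = ["controller", "start", "stopp", "eins", "zwei", "drei", "[unk]"]

model = Model("models/vosk-model-small-de-0.15")  # path is an assumption
recognizer = KaldiRecognizer(model, SAMPLE_RATE, json.dumps(KEYWORDS))
vad = webrtcvad.Vad(2)  # aggressiveness 0 (permissive) to 3 (strict)

def process_frame(frame: bytes) -> str | None:
    """Feed one 30 ms PCM frame; return keyword text once a result finalizes."""
    assert len(frame) == FRAME_BYTES, "expects exactly one 30 ms frame"
    if not vad.is_speech(frame, SAMPLE_RATE):
        return None  # silence: skip the recognizer entirely
    if recognizer.AcceptWaveform(frame):
        text = json.loads(recognizer.Result()).get("text", "")
        return text or None
    return None
```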

Whisper for Complex Commands

For longer and more nuanced commands, I integrated OpenAI's Whisper model through the pywhispercpp library. This decision was driven by Whisper's superior ability to understand context and handle complex utterances. However, since Whisper is resource-intensive, it's only activated conditionally - specifically when the keyword "controller" is detected.

The Whisper integration is optimized for Apple Silicon (M4) with Metal acceleration, using 6 threads and greedy sampling for optimal performance. I also implemented a filtering system to remove common hallucinated phrases that Whisper sometimes generates, improving the overall accuracy of the transcription.
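
A sketch of the conditional transcription step using pywhispercpp; the model size and the hallucination list are assumptions, not the project's exact configuration:

```python
import numpy as np
from pywhispercpp.model import Model

# Loaded once at startup; whisper.cpp picks up Metal on Apple Silicon
# when built with it enabled. "base" is an illustrative model choice.
whisper = Model("base", n_threads=6)

# Phrases Whisper tends to hallucinate on near-silence; illustrative list.
HALLUCINATED = {"thank you.", "thanks for watching!"}

def transcribe_command(audio: np.ndarray) -> str:
    """Full transcription of a buffered utterance (16 kHz float32 mono)."""
    segments = whisper.transcribe(audio)
    text = " ".join(seg.text.strip() for seg in segments).strip()
    return "" if text.lower() in HALLUCINATED else text
```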

Command Matching with FAISS and Embeddings

Perhaps the most interesting architectural decision was using embedding-based similarity search for command matching. Instead of relying on rigid pattern matching or regular expressions, I implemented a vector search system using FAISS and sentence transformers.

The system uses the Qwen3-Embedding-0.6B model to generate 1024-dimensional embeddings for all possible commands. When a user speaks or types a command, it's converted to an embedding and compared against the command database using FAISS's efficient similarity search. This approach provides several advantages:

  • Fuzzy matching: Users don't need to remember exact command phrases

  • Multilingual support: The embedding model matches paraphrased commands and even utterances in another language

  • Extensibility: New commands can be added without modifying any matching logic

  • Performance: FAISS provides sub-millisecond search times even with thousands of commands

The command service maintains a cache of recent matches to further improve performance, and the entire embedding initialization happens asynchronously during server startup to avoid blocking the main application.
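
A condensed sketch of the matching path; the command list and similarity threshold here are illustrative:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

COMMANDS = ["start recording", "stop recording", "mark rally"]  # illustrative

# Normalized 1024-dimensional embeddings make inner-product search
# equivalent to cosine similarity.
encoder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
command_vecs = encoder.encode(COMMANDS, normalize_embeddings=True)

index = faiss.IndexFlatIP(command_vecs.shape[1])
index.add(np.asarray(command_vecs, dtype=np.float32))

def match_command(utterance: str, threshold: float = 0.6) -> str | None:
    """Return the closest known command, or None if the best score is too low."""
    query = encoder.encode([utterance], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype=np.float32), k=1)
    return COMMANDS[ids[0][0]] if scores[0][0] >= threshold else None
```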

Sensor Data Processing and Storage

The server includes a sophisticated system for handling IMU (Inertial Measurement Unit) sensor data. I designed it to accept data in two formats:

  1. JSON format: For structured data with individual timestamps

  2. Binary format: For efficient bulk data transfer

The binary format was particularly important for reducing bandwidth usage when receiving data from battery-powered Bluetooth devices. The server can decode concatenated binary batches, each containing multiple sensor readings with shared timestamps. The binary protocol uses fixed-point arithmetic (scaling factors of 100 for accelerometer and 20 for gyroscope) to maintain precision while minimizing data size.
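
To illustrate the decoding, here is a simplified sketch. The header layout (a shared uint32 timestamp plus a uint16 sample count) is a stand-in for the real protocol, but the fixed-point scaling matches the description above:

```python
import struct

ACCEL_SCALE = 100.0  # fixed-point scaling factors from the protocol
GYRO_SCALE = 20.0

# Assumed framing: little-endian uint32 base timestamp (ms) and uint16
# sample count, then `count` readings of six int16 values each.
HEADER = struct.Struct("<IH")
READING = struct.Struct("<6h")

def decode_batch(buf: bytes, offset: int = 0):
    """Decode one batch; the returned offset lets callers walk
    concatenated batches in a single buffer."""
    timestamp_ms, count = HEADER.unpack_from(buf, offset)
    offset += HEADER.size
    readings = []
    for _ in range(count):
        ax, ay, az, gx, gy, gz = READING.unpack_from(buf, offset)
        offset += READING.size
        readings.append({
            "timestamp_ms": timestamp_ms,
            "accel": (ax / ACCEL_SCALE, ay / ACCEL_SCALE, az / ACCEL_SCALE),
            "gyro": (gx / GYRO_SCALE, gy / GYRO_SCALE, gz / GYRO_SCALE),
        })
    return readings, offset
```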

All sensor data is stored in a SQLite database with proper indexing on device_id and timestamp for efficient time-series queries. The database service uses SQLAlchemy with bulk insert operations to handle high-frequency sensor data efficiently.
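
A condensed sketch of the storage layer, using an illustrative schema:

```python
from sqlalchemy import Column, Float, Index, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SensorReading(Base):
    __tablename__ = "sensor_readings"
    id = Column(Integer, primary_key=True)
    device_id = Column(String, nullable=False)
    timestamp_ms = Column(Integer, nullable=False)
    ax = Column(Float)
    ay = Column(Float)
    az = Column(Float)
    gx = Column(Float)
    gy = Column(Float)
    gz = Column(Float)
    # Composite index for efficient per-device time-series queries.
    __table_args__ = (Index("ix_device_time", "device_id", "timestamp_ms"),)

engine = create_engine("sqlite:///sensors.db")
Base.metadata.create_all(engine)

def store_readings(rows: list[dict]) -> None:
    """Bulk-insert decoded readings in a single transaction."""
    with Session(engine) as session:
        session.bulk_insert_mappings(SensorReading, rows)
        session.commit()
```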

Machine Learning Pipeline

The ML pipeline represents one of the most complex parts of the system. I structured it using clean architecture principles with clear separation between data sources, preprocessing, feature extraction, and model training.

The pipeline uses an LSTM-CNN hybrid model for motion recognition (sketched after the list below). This architecture was chosen because:

  • LSTMs capture temporal dependencies in motion sequences

  • CNNs extract spatial features from sensor readings

  • The combination provides robust motion classification
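
As a concrete sketch of this hybrid in PyTorch; the layer sizes and number of motion classes are illustrative, not the exact architecture:

```python
import torch
import torch.nn as nn

class MotionNet(nn.Module):
    """CNN front-end for local feature extraction, LSTM for temporal
    context; layer sizes and class count are illustrative."""

    def __init__(self, n_channels: int = 6, n_classes: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels); Conv1d expects channels first
        feats = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (hidden, _) = self.lstm(feats)
        return self.head(hidden[-1])  # classify from the final hidden state
```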

The training pipeline includes:

  • Automatic sequence creation from continuous sensor streams

  • Label generation from voice commands (using keywords like "rally", "start", "playing")

  • Data preprocessing with outlier removal and normalization

  • Support for both training and inference modes

The ML service is designed to be adapter-agnostic, meaning it can work with different data sources (database, files, streams) without modifying the core logic.
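
As an example of the sequence-creation step, here is a simple sliding-window slicer over the continuous stream; the window and stride lengths are assumptions:

```python
import numpy as np

def make_sequences(stream: np.ndarray, window: int = 128, stride: int = 64) -> np.ndarray:
    """Slice a continuous (n_samples, n_channels) stream into overlapping
    fixed-length windows; assumes the stream spans at least one window."""
    starts = range(0, len(stream) - window + 1, stride)
    return np.stack([stream[s:s + window] for s in starts])
```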

Real-time Communication

WebSocket support enables real-time bidirectional communication with clients. The implementation includes:

  • Streaming audio processing for live transcription

  • Real-time sensor data streaming

  • Command execution with immediate feedback

  • Automatic reconnection handling

The WebSocket routes integrate seamlessly with the voice processing services, allowing clients to stream audio and receive transcriptions in real-time.
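
A minimal sketch of such a route, reusing the hypothetical process_frame() keyword spotter from the Vosk sketch above:

```python
from fastapi import APIRouter, WebSocket, WebSocketDisconnect

router = APIRouter()

@router.websocket("/ws/audio")
async def audio_stream(ws: WebSocket):
    """Receive PCM frames and push results back as soon as they finalize."""
    await ws.accept()
    try:
        while True:
            frame = await ws.receive_bytes()
            keyword = process_frame(frame)  # spotter from the Vosk sketch
            if keyword:
                await ws.send_json({"type": "keyword", "text": keyword})
    except WebSocketDisconnect:
        pass  # the client is responsible for reconnecting
```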

Performance Optimizations

Throughout the implementation, I made several key optimizations:

  1. Model Loading: Both Vosk and embedding models are loaded once at startup and reused across requests

  2. Bulk Operations: Database operations use bulk inserts for efficiency

  3. Caching: Command matching results are cached to avoid repeated embedding calculations

  4. Async Processing: Heavy operations like embedding initialization happen asynchronously (see the sketch after this list)

  5. Resource Management: Proper cleanup and resource management prevent memory leaks
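
A sketch of how points 1 and 4 can be combined with FastAPI's lifespan hook; the loader functions are hypothetical placeholders:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load heavy models once at startup and reuse them across requests.
    app.state.vosk = load_vosk_model()  # hypothetical loader
    # Kick off embedding initialization without blocking startup.
    embed_task = asyncio.create_task(init_embedding_index())  # hypothetical
    yield
    embed_task.cancel()  # release resources on shutdown

app = FastAPI(lifespan=lifespan)
```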

Challenges and Solutions

One significant challenge was managing the different processing requirements for various voice tasks. The solution - using Vosk for keywords and Whisper for complex commands - provides an optimal balance between responsiveness and capability.

Another challenge was handling the continuous stream of sensor data without overwhelming the system. The binary protocol and bulk insert operations solved this effectively, allowing the system to handle thousands of sensor readings per second.

Future Enhancements

Looking ahead, I plan to:

  • Expand the motion recognition capabilities to detect more complex movements

  • Integrate more sophisticated audio labeling using the voice commands as ground truth

  • Add support for multiple simultaneous users with speaker identification

  • Implement real-time motion feedback based on the ML model predictions

Conclusion

Building this server taught me the importance of choosing the right tool for each job. Python proved to be the ideal choice due to the excellent library support - pywhispercpp for speech recognition, FAISS for vector search, and PyTorch for machine learning all have first-class Python bindings.

The modular architecture has already proven its worth, allowing me to iterate quickly on individual components without affecting the overall system. Whether you're building a fitness app, a sports analysis tool, or any system requiring voice and motion input, I hope these design decisions and insights prove helpful for your own projects.