Building a Voice-Controlled Motion Recognition Server
I've been building a server that combines voice control with motion recognition, and I wanted to share my journey and the key design decisions that shaped the project. The server acts as the brain of a system that understands both voice commands and motion patterns, making it suitable for sports analysis, fitness tracking, or any application that needs multimodal input processing.
Architecture Overview
The server is built with FastAPI and Python and structured as a modular system with clear separation of concerns. At its core, it handles three main responsibilities:
Voice commands and speech recognition
IMU sensor data from Bluetooth devices
WebSocket connections for real-time communication
The modular structure includes separate services for each major functionality, making the codebase maintainable and allowing individual components to be updated without affecting others.
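To give a feel for the layout, here's a minimal sketch of how the routers could be wired together; the module and router names are illustrative, not the actual project structure.

```python
# main.py -- illustrative wiring only; module and router names are hypothetical
from fastapi import FastAPI

from routes import voice, sensors, ws  # one router module per major service

app = FastAPI(title="Motion Recognition Server")

# Each concern lives behind its own router, so a service can be
# reworked without touching the others.
app.include_router(voice.router, prefix="/voice", tags=["voice"])
app.include_router(sensors.router, prefix="/sensors", tags=["sensors"])
app.include_router(ws.router, tags=["websocket"])
```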
Voice Processing: A Two-Tier Approach
One of my most important design decisions was implementing a two-tier voice processing system. I realized early on that different voice processing tasks have different requirements - keyword spotting needs to be fast and lightweight, while understanding complex commands requires more sophisticated processing.
Vosk for Keyword Spotting
I chose Vosk for keyword spotting because it offers exceptional speed and accuracy for a focused vocabulary. The system maintains a list of German keywords (numbers, commands like "controller", "start", "stopp") and can detect them with minimal latency. Vosk runs continuously on the audio stream, consuming minimal resources while maintaining high accuracy.
The implementation uses WebRTC VAD (Voice Activity Detection) to efficiently process audio, only engaging the recognition engine when speech is detected. This approach significantly reduces computational overhead. When a keyword is spotted, the system can trigger more complex processing if needed.
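Stripped down, the keyword spotter looks roughly like this; the model path, keyword list, and VAD aggressiveness are placeholders rather than the exact production values.

```python
import json

import vosk
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                       # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit mono PCM

# Restricting the recognizer to a small grammar keeps latency and error rate low
KEYWORDS = ["controller", "start", "stopp", "eins", "zwei", "drei", "[unk]"]

model = vosk.Model("models/vosk-model-small-de-0.15")  # path is illustrative
recognizer = vosk.KaldiRecognizer(model, SAMPLE_RATE, json.dumps(KEYWORDS))
vad = webrtcvad.Vad(2)                                 # aggressiveness 0-3

def process_frame(frame: bytes) -> str | None:
    """Feed one 30 ms PCM frame; return a keyword once one is finalized."""
    assert len(frame) == FRAME_BYTES  # webrtcvad requires exact frame sizes
    if not vad.is_speech(frame, SAMPLE_RATE):
        return None                   # silence: the recognition engine never runs
    if recognizer.AcceptWaveform(frame):
        text = json.loads(recognizer.Result()).get("text", "")
        return text or None
    return None
```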
Whisper for Complex Commands
For longer and more nuanced commands, I integrated OpenAI's Whisper model through the pywhispercpp library. This decision was driven by Whisper's superior ability to understand context and handle complex utterances. However, since Whisper is resource-intensive, it's only activated conditionally - specifically when the keyword "controller" is detected.
The Whisper integration is optimized for Apple Silicon (M4) with Metal acceleration, using 6 threads and greedy sampling for optimal performance. I also implemented a filtering system to remove common hallucinated phrases that Whisper sometimes generates, improving the overall accuracy of the transcription.
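In simplified form, the conditional Whisper path looks something like the sketch below; the model size, language setting, and hallucination list are illustrative stand-ins, not verbatim project code.

```python
from pywhispercpp.model import Model

# Phrases Whisper tends to hallucinate on near-silence; this list is illustrative
HALLUCINATED = {"Untertitel der Amara.org-Community", "Vielen Dank."}

# n_threads=6 matches the M4 tuning described above; greedy sampling is the
# whisper.cpp default strategy
whisper = Model("base", n_threads=6, language="de")

def transcribe_command(wav_path: str) -> str:
    """Run full transcription -- only called after 'controller' was spotted."""
    segments = whisper.transcribe(wav_path)
    text = " ".join(segment.text.strip() for segment in segments)
    # Filter known hallucinated phrases before the text reaches command matching
    for phrase in HALLUCINATED:
        text = text.replace(phrase, "")
    return text.strip()
```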
Command Matching with FAISS and Embeddings
Perhaps the most interesting architectural decision was using embedding-based similarity search for command matching. Instead of relying on rigid pattern matching or regular expressions, I implemented a vector search system using FAISS and sentence transformers.
The system uses the Qwen3-Embedding-0.6B model to generate 1024-dimensional embeddings for all possible commands. When a user speaks or types a command, it's converted to an embedding and compared against the command database using FAISS's efficient similarity search. This approach provides several advantages:
Fuzzy matching: Users don't need to remember exact command phrases
Multilingual support: The embedding model matches paraphrased commands and even commands in other languages
Extensibility: New commands can be added without modifying any matching logic
Performance: FAISS provides sub-millisecond search times even with thousands of commands
The command service maintains a cache of recent matches to further improve performance, and the entire embedding initialization happens asynchronously during server startup to avoid blocking the main application.
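Stripped of the cache and error handling, the matching core looks roughly like this; the command list is illustrative.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

COMMANDS = ["start recording", "stop recording", "mark rally", "show score"]

# Qwen3-Embedding-0.6B produces 1024-dimensional vectors; normalizing them
# lets an inner-product index behave as cosine similarity.
encoder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
vectors = encoder.encode(COMMANDS, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
index.add(np.asarray(vectors, dtype=np.float32))

def match_command(utterance: str) -> tuple[str, float]:
    """Return the closest known command and its cosine similarity."""
    query = encoder.encode([utterance], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype=np.float32), 1)
    return COMMANDS[ids[0][0]], float(scores[0][0])
```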
Sensor Data Processing and Storage
The server includes a sophisticated system for handling IMU (Inertial Measurement Unit) sensor data. I designed it to accept data in two formats:
JSON format: For structured data with individual timestamps
Binary format: For efficient bulk data transfer
The binary format was particularly important for reducing bandwidth usage when receiving data from battery-powered Bluetooth devices. The server can decode concatenated binary batches, each containing multiple sensor readings with shared timestamps. The binary protocol uses fixed-point arithmetic (scaling factors of 100 for accelerometer and 20 for gyroscope) to maintain precision while minimizing data size.
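To illustrate the decoding pattern: the scaling factors below come from the real protocol, but the exact field layout (a header with timestamp and count, followed by int16 triplets) is an assumption for the sketch.

```python
import struct

ACCEL_SCALE = 100   # fixed-point scaling factors from the protocol
GYRO_SCALE = 20

# Assumed layout: 8-byte little-endian timestamp, 2-byte reading count,
# then per reading six int16 values (ax, ay, az, gx, gy, gz).
HEADER = struct.Struct("<QH")
READING = struct.Struct("<6h")

def decode_batches(payload: bytes):
    """Yield (timestamp, accel_xyz, gyro_xyz) from concatenated batches."""
    offset = 0
    while offset < len(payload):
        timestamp, count = HEADER.unpack_from(payload, offset)
        offset += HEADER.size
        for _ in range(count):
            ax, ay, az, gx, gy, gz = READING.unpack_from(payload, offset)
            offset += READING.size
            yield (
                timestamp,
                (ax / ACCEL_SCALE, ay / ACCEL_SCALE, az / ACCEL_SCALE),
                (gx / GYRO_SCALE, gy / GYRO_SCALE, gz / GYRO_SCALE),
            )
```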
All sensor data is stored in a SQLite database with proper indexing on device_id and timestamp for efficient time-series queries. The database service uses SQLAlchemy with bulk insert operations to handle high-frequency sensor data efficiently.
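A condensed version of the storage path, with illustrative column names:

```python
from sqlalchemy import Column, Float, Index, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SensorReading(Base):
    __tablename__ = "sensor_readings"
    id = Column(Integer, primary_key=True)
    device_id = Column(String, nullable=False)
    timestamp = Column(Float, nullable=False)
    ax = Column(Float)
    ay = Column(Float)
    az = Column(Float)
    gx = Column(Float)
    gy = Column(Float)
    gz = Column(Float)
    # Composite index backs the device/time range queries described above
    __table_args__ = (Index("ix_device_time", "device_id", "timestamp"),)

engine = create_engine("sqlite:///sensors.db")
Base.metadata.create_all(engine)

def store_batch(rows: list[dict]) -> None:
    """Bulk-insert decoded readings in a single executemany round trip."""
    with Session(engine) as session:
        session.execute(SensorReading.__table__.insert(), rows)
        session.commit()
```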
Machine Learning Pipeline
The ML pipeline represents one of the most complex parts of the system. I structured it using clean architecture principles with clear separation between data sources, preprocessing, feature extraction, and model training.
The pipeline uses an LSTM-CNN hybrid model for motion recognition. This architecture was chosen because:
LSTMs capture temporal dependencies in motion sequences
CNNs extract spatial features from sensor readings
The combination provides robust motion classification
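A minimal PyTorch sketch of such a hybrid follows; the layer sizes and class count are assumptions, not the trained model's actual hyperparameters.

```python
import torch
from torch import nn

class CnnLstmClassifier(nn.Module):
    """CNN front-end for spatial features, LSTM for temporal dependencies."""

    def __init__(self, n_channels: int = 6, n_classes: int = 4):
        super().__init__()
        # Conv1d extracts local patterns across the IMU channels
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # LSTM models temporal structure over the downsampled sequence
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); Conv1d expects (batch, channels, time)
        features = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        _, (hidden, _) = self.lstm(features)
        return self.head(hidden[-1])  # classify from the final hidden state
```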
The training pipeline includes:
Automatic sequence creation from continuous sensor streams
Label generation from voice commands (using keywords like "rally", "start", "playing")
Data preprocessing with outlier removal and normalization
Support for both training and inference modes
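The sequence-creation and labeling steps boil down to sliding a fixed window over the stream and tagging windows that overlap a voice event; the window size, stride, and event format below are illustrative.

```python
import numpy as np

def make_sequences(stream: np.ndarray, window: int = 128, stride: int = 64):
    """Slice a continuous (time, channels) stream into overlapping windows."""
    starts = list(range(0, len(stream) - window + 1, stride))
    return np.stack([stream[s : s + window] for s in starts]), starts

def label_windows(starts, window, events, default=0):
    """Give each window the label of any voice-command event it overlaps.

    `events` is a list of (sample_index, label) pairs derived from keywords
    such as "rally" or "start" -- an assumed representation.
    """
    return [
        next((label for t, label in events if s <= t < s + window), default)
        for s in starts
    ]
```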
The ML service is designed to be adapter-agnostic, meaning it can work with different data sources (database, files, streams) without modification to the core logic.
Real-time Communication
WebSocket support enables real-time bidirectional communication with clients. The implementation includes:
Streaming audio processing for live transcription
Real-time sensor data streaming
Command execution with immediate feedback
Automatic reconnection handling
The WebSocket routes integrate seamlessly with the voice processing services, allowing clients to stream audio and receive transcriptions in real-time.
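A simplified version of the audio-streaming route, reusing the process_frame helper from the keyword-spotting sketch (the route path is illustrative):

```python
from fastapi import APIRouter, WebSocket, WebSocketDisconnect

router = APIRouter()

@router.websocket("/ws/transcribe")
async def transcribe_stream(websocket: WebSocket):
    """Receive PCM audio chunks and push recognized keywords back immediately."""
    await websocket.accept()
    try:
        while True:
            frame = await websocket.receive_bytes()
            keyword = process_frame(frame)  # Vosk/VAD helper sketched earlier
            if keyword:
                await websocket.send_json({"keyword": keyword})
    except WebSocketDisconnect:
        pass  # client disconnected; reconnection is initiated client-side
```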
Performance Optimizations
Throughout the implementation, I made several key optimizations:
Model Loading: Both Vosk and embedding models are loaded once at startup and reused across requests
Bulk Operations: Database operations use bulk inserts for efficiency
Caching: Command matching results are cached to avoid repeated embedding calculations
Async Processing: Heavy operations like embedding initialization happen asynchronously
Resource Management: Proper cleanup and resource management prevent memory leaks
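Several of these come together in FastAPI's lifespan hook, sketched below; load_vosk_model and init_embedding_index are hypothetical stand-ins for the real loaders.

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load heavy models once at startup; embedding initialization runs as a
    # background task so the server accepts requests without waiting for it.
    app.state.vosk_model = load_vosk_model()                        # hypothetical
    app.state.embedding_init = asyncio.create_task(init_embedding_index())
    yield
    app.state.embedding_init.cancel()  # release resources on shutdown

app = FastAPI(lifespan=lifespan)
```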
Challenges and Solutions
One significant challenge was managing the different processing requirements for various voice tasks. The solution - using Vosk for keywords and Whisper for complex commands - provides an optimal balance between responsiveness and capability.
Another challenge was handling the continuous stream of sensor data without overwhelming the system. The binary protocol and bulk insert operations solved this effectively, allowing the system to handle thousands of sensor readings per second.
Future Enhancements
Looking ahead, I plan to:
Expand the motion recognition capabilities to detect more complex movements
Integrate more sophisticated audio labeling using the voice commands as ground truth
Add support for multiple simultaneous users with speaker identification
Implement real-time motion feedback based on the ML model predictions
Conclusion
Building this server taught me the importance of choosing the right tool for each job. Python proved to be the ideal choice due to the excellent library support - pywhispercpp for speech recognition, FAISS for vector search, and PyTorch for machine learning all have first-class Python bindings.
The modular architecture has already proven its worth, allowing me to iterate quickly on individual components without affecting the overall system. Whether you're building a fitness app, a sports analysis tool, or any system requiring voice and motion input, I hope these design decisions and insights prove helpful for your own projects.