Using Natural Language to Change Software Settings

Changing Software Settings Through Natural Language: A Developer's Journey

As a software developer, I've always been fascinated by the idea of making technology more human. Recently, I had the opportunity to work on a project that brought this vision to life: building a system where users can control their app settings simply by speaking naturally, like saying "switch to dark mode" or "connect to my Bluetooth headphones."

The Challenge: Bridging Human Speech and Software Actions

The challenge was intriguing from an engineering perspective. How do you take the messy, imperfect world of human speech and translate it into precise software commands? I needed to build a system that could:

  1. Capture high-quality audio from users

  2. Convert speech to text accurately

  3. Understand the intent behind natural language

  4. Execute the correct system actions

  5. Provide meaningful feedback

Let me walk you through how I designed each component of this natural language command system.

The Audio Pipeline: From Voice to Data

The foundation of any voice command system is rock-solid audio processing. I started with the Android client side, implementing AudioRecordingManager.kt with some interesting design decisions.

Smart Audio Processing with VAD

Rather than sending continuous audio streams (which would drain battery and waste bandwidth), I implemented Voice Activity Detection (VAD) using WebRTC. The system buffers audio during silence and only transmits when it detects speech:

private fun processVADDecision(hasVoice: Boolean, frameData: ByteArray, webSocket: WebSocket): Boolean {
    if (hasVoice) {
        consecutiveVoiceFrames++
        val isSpeechOnset = consecutiveVoiceFrames == minConsecutiveVoiceFrames

        if (isSpeechOnset) {
            // Send buffered pre-speech frames to capture speech beginning
            sendPreSpeechBuffer(webSocket)
        }

        if (consecutiveVoiceFrames >= minConsecutiveVoiceFrames) {
            voiceFramesDetected++
            postSpeechPadding = maxPostSpeechPadding
            return true // Transmit current frame
        }
    }
    // ...
}

This approach reduces data transmission by up to 80% while ensuring we never miss the beginning of a user's command. The pre-speech buffering was crucial - without it, users would have to pause before speaking, making the interaction feel unnatural.
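
To make the buffering concrete, here is a minimal, language-agnostic sketch of the pre-speech ring buffer (written in Python for brevity; the buffer size is an illustrative assumption, not the actual AudioRecordingManager.kt value):

from collections import deque

PRE_SPEECH_FRAMES = 15  # ~300 ms at 20 ms per frame (illustrative)

class PreSpeechBuffer:
    """Keeps the most recent silent frames so the start of speech is never clipped."""

    def __init__(self, max_frames: int = PRE_SPEECH_FRAMES):
        self._frames: deque = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        # Called for every frame while no speech has been detected yet
        self._frames.append(frame)

    def drain(self) -> list:
        # Called once at speech onset: return the buffered frames in order and reset
        frames = list(self._frames)
        self._frames.clear()
        return frames

At speech onset the drained frames are transmitted before the current frame, which is exactly what the sendPreSpeechBuffer call above does on the Kotlin side.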

Intelligent Audio Routing

One detail I'm particularly proud of is the automatic headset detection. The system automatically switches to Bluetooth or wired headset microphones when available:

private fun configureHeadsetMicrophone() {
    // Query the currently available audio input devices
    val inputDevices = audioManager.getDevices(AudioManager.GET_DEVICES_INPUTS)

    val headsetMic = inputDevices.find { device ->
        device.type == AudioDeviceInfo.TYPE_WIRED_HEADSET ||
        device.type == AudioDeviceInfo.TYPE_BLUETOOTH_SCO ||
        device.type == AudioDeviceInfo.TYPE_USB_HEADSET
    }

    if (headsetMic != null && headsetMic.type == AudioDeviceInfo.TYPE_BLUETOOTH_SCO) {
        // Route capture through the Bluetooth SCO channel
        audioManager.startBluetoothSco()
        audioManager.isBluetoothScoOn = true
    }
}

This seemingly small feature dramatically improves speech recognition accuracy in noisy environments, which is critical for reliable natural language processing.

The User Interface: Making Voice Feel Natural

On the frontend, I designed ChatInput.tsx to provide clear visual feedback about the system's state. Users need to know when the system is listening, when it's processing their command, and what action it took.

The interface includes three distinct interaction modes:

  • Text input: For typing commands directly

  • Voice recording: For transcribing speech to text first

  • Direct voice commands: For immediate command execution

// Direct Send Record Button - executes commands immediately
<TouchableOpacity
  className={`p-3 rounded-xl m-1 ${
    isDirectRecording
      ? 'bg-orange-500 border border-orange-400'
      : 'bg-white dark:bg-dark-background-elevated'
  }`}
  onPress={onDirectRecordToggle}
  disabled={isSendingChat || isRecording}>
  <Icon name="send" size={20} color={isDirectRecording ? '#ffffff' : '#10a37f'} />
</TouchableOpacity>

The color-coded buttons (orange for direct commands, red for recording, green for sending) give users immediate visual feedback about which mode they're in. This prevents the confusion that often plagues voice interfaces.

The Brain: Natural Language Understanding

The real magic happens in the server-side processing. I implemented a multi-layered approach in websocket.py that handles different types of voice interactions:

Keyword Spotting for Simple Commands

For basic commands, I used keyword spotting with Vosk, which provides instant recognition for predefined phrases:

if spotted_lower == "controller":
    logger.info("Controller keyword detected - switching to transcription mode")
    controller_mode = True
    controller_transcriber.set_immediate_recording(True)

    await websocket.send_json({
        "type": "controller_mode_activated",
        "message": "Controller mode activated - listening for commands",
    })

This hybrid approach is key to the system's responsiveness. Simple commands like "löschen" (delete) are processed instantly, while complex natural language goes through the full transcription pipeline.
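
The routing decision itself is conceptually simple. Below is a rough sketch of the idea; process(), feed(), and the "delete_last_message" action name are hypothetical placeholders rather than the actual websocket.py interfaces:

# Sketch: fast keyword path vs. full transcription path (names are illustrative)
SIMPLE_COMMANDS = {"löschen": "delete_last_message"}

async def route_audio_chunk(chunk: bytes, websocket, spotter, transcriber) -> None:
    spotted = spotter.process(chunk)  # keyword spotting on the raw audio frame

    if spotted and spotted.lower() in SIMPLE_COMMANDS:
        # Fast path: a predefined phrase was recognized, respond immediately
        await websocket.send_json({
            "type": "command_result",
            "result": {"action_type": SIMPLE_COMMANDS[spotted.lower()]},
        })
        return

    # Slow path: accumulate audio for the Whisper transcription pipeline
    transcriber.feed(chunk)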

Advanced Speech-to-Text with Whisper

For complex commands, I implemented transcription_service.py using Whisper.cpp with Apple Metal acceleration. The key insight here was optimizing for real-time performance while maintaining accuracy:

# Optimized settings for M4 processor
self.model = Model(
    model=WHISPER_MODEL_SIZE,
    n_threads=6,
    params_sampling_strategy=0,  # Greedy sampling for speed
    redirect_whispercpp_logs_to=False,  # Disable logging for performance
)

I also implemented hallucination filtering to remove common Whisper artifacts:

def _filter_hallucinations(self, text: str) -> str:
    # Phrases Whisper tends to emit on silence or noise-only audio
    hallucinated_phrases = {
        "vielen dank", "untertitelung des zdf", "tschüss.", "amen"
    }

    text_lower = text.lower().strip()
    if text_lower in hallucinated_phrases:
        logger.debug(f"Filtered out hallucinated phrase: '{text}'")
        return ""

    return text

This prevents the system from trying to execute nonsensical commands generated by the AI model.

Semantic Understanding with Vector Search

The most sophisticated part of the system is the natural language understanding layer in vector_service.py. Instead of rigid pattern matching, I used semantic embeddings to match user commands to actions:

async def search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
    query_embedding = self._get_embedding(query)
    # FAISS expects a 2D float32 array of shape (n_queries, dim); numpy is imported as np at module level
    query_embedding_np = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
    distances, indices = self.index.search(query_embedding_np, min(k, self.index.ntotal))

    results = []
    for i, idx in enumerate(indices[0]):
        if idx != -1:
            meta = self.metadata[idx]
            result = meta.copy()
            # Map squared L2 distance on normalized embeddings to a 0..1 similarity score
            result["similarity"] = float(1 - distances[0][i] / 2)
            results.append(result)

    return results

This allows the system to understand that "make it darker", "switch to night mode", and "enable dark theme" all refer to the same action, even though they use completely different words.
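
This works because every supported action is indexed ahead of time under several natural phrasings. The sketch below shows how such an index could be built with sentence-transformers and FAISS; the model name and the action catalogue are placeholders, not the project's actual configuration:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder action catalogue: each action is described by example phrasings
ACTIONS = [
    {"action_type": "change_to_dark_theme", "text": "switch to dark mode, make it darker, enable night theme"},
    {"action_type": "change_to_light_theme", "text": "switch to light mode, make it brighter"},
    {"action_type": "scan_bluetooth_devices", "text": "connect my bluetooth headphones, scan for devices"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Normalize embeddings so L2 distance maps cleanly onto cosine similarity
embeddings = model.encode([a["text"] for a in ACTIONS], normalize_embeddings=True)
embeddings = np.asarray(embeddings, dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query time: embed the spoken command and look up the nearest action
query = model.encode(["make it darker"], normalize_embeddings=True).astype("float32")
distances, indices = index.search(query, 1)
print(ACTIONS[indices[0][0]]["action_type"])  # -> change_to_dark_theme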

The Action Layer: Executing Commands Safely

The final piece is CommandService.ts, which bridges the gap between understood intent and actual system actions. I designed this with security and reliability in mind:

private handleCommand(response: TextCommandResponse): void {
  if (response.type === 'command_result' && response.result) {
    const {action_type} = response.result;

    switch (action_type) {
      case 'change_to_dark_theme':
      case 'change_to_light_theme':
        this.handleThemeChange(action_type);
        break;
      case 'scan_bluetooth_devices':
        this.handleBluetoothScan();
        break;
      // ...
    }
  }
}

Each action type has a specific handler that validates the request and executes it safely. The system never executes arbitrary code from voice commands - everything goes through predefined, secure action handlers.

Context-Aware Execution

Another design decision that paid off is the context-aware execution system. The command service only executes actions when it has the proper context functions registered:

private handleThemeChange(action_type: string): void {
  if (!getCurrentThemeFunction || !themeToggleFunction) {
    console.warn('Theme functions not registered');
    return;
  }

  const currentTheme = getCurrentThemeFunction();
  const targetTheme = action_type === 'change_to_dark_theme' ? 'dark' : 'light';

  if (currentTheme !== targetTheme) {
    themeToggleFunction();
    console.log(`Theme changed to ${targetTheme} mode`);
  }
}

This prevents the system from attempting actions when the necessary UI components aren't available, making it much more robust in different app states.

Lessons Learned: The Devil is in the Details

Building a reliable natural language interface taught me several important lessons:

Performance Matters More Than Accuracy

Users will tolerate a 95% accurate system that responds in 500ms, but they'll abandon a 99% accurate system that takes 3 seconds. Every optimization I made prioritized speed while maintaining "good enough" accuracy.

Feedback Loops Are Critical

Users need to know what's happening at every step. The visual indicators, audio feedback, and status messages aren't just nice-to-have features - they're essential for user trust and adoption.

Graceful Degradation Wins

The system works even when individual components fail. If speech recognition fails, users can type. If the vector search is slow, keyword spotting provides fallback functionality. This layered approach creates a robust user experience.
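
As a rough illustration of that layering, the slower semantic lookup can be given a time budget with a cheap keyword table as the fallback. This is only a sketch under assumed names (keyword_table and the timeout value are not from the actual codebase):

import asyncio

VECTOR_SEARCH_TIMEOUT_S = 0.5  # illustrative latency budget

async def resolve_command(text: str, vector_service, keyword_table: dict):
    try:
        # Preferred layer: semantic matching, bounded by the time budget
        results = await asyncio.wait_for(vector_service.search(text, k=1), VECTOR_SEARCH_TIMEOUT_S)
        if results:
            return results[0]
    except asyncio.TimeoutError:
        pass  # fall through to the cheaper layer

    # Fallback layer: exact keyword lookup
    action_type = keyword_table.get(text.lower().strip())
    return {"action_type": action_type} if action_type else None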

Context Is Everything

The same words can mean different things in different situations. Building context awareness into every layer of the system - from audio processing to command execution - was crucial for creating interactions that feel natural and intelligent.

The Future of Natural Interfaces

Working on this project convinced me that natural language interfaces aren't just a novelty - they're the future of human-computer interaction. But the key insight is that they work best when they augment traditional interfaces rather than replacing them entirely.

The system I built allows users to quickly change settings, start actions, and navigate the app using voice, while still providing traditional touch controls for complex operations. This hybrid approach gives users the best of both worlds: the speed and naturalness of voice for simple commands, and the precision of touch for detailed work.

As AI continues to improve and edge computing becomes more powerful, I expect we'll see natural language interfaces become as common as touchscreens are today. The challenge for developers will be building systems that feel magical to users while remaining reliable, secure, and performant under the hood.

The code architecture I've shared here provides a solid foundation for any developer looking to add natural language capabilities to their applications. The key is to start simple, optimize for real-world usage, and always keep the user experience at the center of every design decision.