
Building a Fully Local Voice Assistant with Home Assistant – A Comprehensive Guide

My Journey to a Reliable and Enjoyable Locally Hosted Voice Assistant

crzynik (Nicolas Mowen) – October 27, 2025, 6:03pm

I have been watching Home Assistant's progress with Assist for some time. We previously used Google Home via Nest Minis, and we have since switched to fully local Assist backed by local-first handling plus llama.cpp (previously Ollama). In this post I will share the steps I took to get where I am today, the decisions I made, and why they were the best choices for my use case specifically.

Links to Additional Improvements

Here are links to additional improvements posted about in this thread.

New Features
- Security camera insight / analysis
- Search for a YouTube video and play it on the TV

Fixing Unwanted HA / LLM Behaviors
- Overriding the default HassGetWeather intent to get consistent weather outputs
- Improving handling of unclear requests / false activations
- Automatically handling obvious transcription errors
- Optimizing the prompt to reduce bloat and token usage

Optimizing Performance
- llama.cpp performance optimizations

Hardware Details

I have tested a wide variety of hardware, from a 3050 to a 3090. Most modern discrete GPUs can be used for local Assist effectively; what hardware you need depends on your expectations of capability and speed. I am running Home Assistant on my UnRaid NAS; its specs are not really important, as the NAS has nothing to do with HA Voice.

Voice hardware:
- 1 HA Voice Preview satellite
- 2 Satellite1 Small Squircle enclosures
- 1 Pixel 7a used as a satellite/hub with View Assist

Voice server hardware:
- Beelink MiniPC with USB4 (the exact model isn't important as long as it has USB4)
- USB4 eGPU enclosure

GPUs

The table below shows GPUs that I have tested with this setup. Response time will vary based on the model that is used.

| GPU | Model Class | Response Time (after prompt caching) | Notes |
|-----|-------------|--------------------------------------|-------|
| RTX 3090 24GB | 20B-30B MoE, 9B Dense | 1 – 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RX 7900XTX 24GB | 20B-30B MoE, 9B Dense | 1 – 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RTX 5060Ti 16GB | 20B MoE, 9B Dense | 1.5 – 3 seconds | Quick enough to run models that are optimal for this setup with responses < 3 seconds. |
| RX 9060XT 16GB | 20B MoE, 9B Dense | 1.5 – 4 seconds | Quick enough to run models that are optimal for this setup with responses < 4 seconds. |
| RTX 3050 8GB | 4B Dense | 3 seconds | Good for running small models with basic functionality. |

Music Playlist Automation

The automation:

```yaml
triggers:  # trigger header reconstructed from context; the page capture cut off just before this automation
  - trigger: conversation
    command:
      - "(play | put on) [(some | a)] [{genre}] (music | tunes | tracks | playlist) [in] [the] [{def_area}]"
conditions: []
actions:
  - action: script.get_ma_playlist_id_from_name
    data:
      playlistname: >-
        {{ trigger.slots.genre
          | replace('jacking', 'jackin')
          | replace('old school', 'oldskool')
          | replace('tim liquor', 'tinlicker')
          | replace('tim licker', 'tinlicker')
          | replace('tin licker', 'tinlicker')
          | replace('tin liquor', 'tinlicker')
          | replace('anjunadeep', 'anjuna_deep')
          | replace(' ', '_')
          | lower }}
    response_variable: playlist_info
  - action: media_player.shuffle_set
    metadata: {}
    data:
      shuffle: true
    target:
      area_id: >
        {% if (trigger.slots.def_area | length > 0) %}
          {{ trigger.slots.def_area }}
        {% else %}
          {{ area_id(trigger.device_id) }}
        {% endif %}
  - set_conversation_response: I've put on some {{ trigger.slots.genre }} music.
    enabled: true
  - action: music_assistant.play_media
    metadata: {}
    data:
      media_id: "{{ playlist_info.uri }}"
      enqueue: replace
    target:
      area_id: >
        {% if (trigger.slots.def_area | length > 0) %}
          {{ trigger.slots.def_area }}
        {% else %}
          {{ area_id(trigger.device_id) }}
        {% endif %}
    enabled: true
mode: single
```

And the matching script:

```yaml
alias: Get MA playlist ID from name
description: ""
mode: single
variables:
  playlistname: "{{ playlistname }}"
sequence:
  - action: music_assistant.get_library
    data:
      limit: 10
      search: "{{ playlistname }}"
      media_type: playlist
      config_entry_id: 01JPFPPNTCVYQAA9JSBY4319HS
    response_variable: ma_playlist
  - repeat:
      count: "{{ ma_playlist['items'] | length }}"
      sequence:
        - variables:
            playlistinfo:
              name: "{{ ma_playlist['items'][repeat.index - 1].name }}"
              uri: "{{ ma_playlist['items'][repeat.index - 1].uri }}"
        - if:
            - condition: template
              value_template: >-
                {{ ma_playlist['items'][repeat.index - 1].name | lower == playlistname | lower }}
          then:
            - stop: Returning playlist info as a dictionary.
              response_variable: playlistinfo
```
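To make the slot handling concrete, here is a worked example of what the sentence trigger captures; the utterance and slot values are hypothetical, not from the original post:

```yaml
# Assumed utterance: "play some deep house music in the kitchen"
#   trigger.slots.genre    -> "deep house"
#   trigger.slots.def_area -> "kitchen"
# The playlistname template lowercases the genre and swaps spaces for
# underscores, so the helper script searches Music Assistant for "deep_house".
# If no area is spoken, area_id(trigger.device_id) falls back to the area of
# the satellite that heard the command.
```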
TheOriginal92 (The Original92) – November 12, 2025, 10:54am

How many entities are you exposing to Qwen 4B? I'm using Qwen 14B non-thinking, and exposing just 53 entities makes it behave very unreliably. Sometimes it appears to ignore or forget entities; sometimes features like brightness or volume are not set by the model.

NathanCu (Nathan Curtis) – November 12, 2025, 11:43am

You are describing context overrun. Your entity descriptions plus tool descriptions plus the full prompt cannot exceed the context window set by your model (the default for Qwen is 8K, I think). Look in Ollama and you will see it telling you how much it overran; adjust from there. You can reduce the number of exposed entities, expose fewer tools, shrink your prompt, or, if you have enough VRAM and your model supports it, crank the model's context window up (or all of the above). It sounds like you're in 4K or 8K land, and that would be expected at around 50-something entities, depending on the length of your names, etc.
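Since the thread's backend is llama.cpp, here is a minimal sketch of raising the context size for its OpenAI-compatible server via Docker Compose; the image tag, model file name, and port are assumptions, not details from the thread:

```yaml
# Hypothetical llama.cpp server deployment. --ctx-size sets the context
# window so the system prompt + exposed entities + tool descriptions fit
# without overrunning.
services:
  llama-cpp:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda  # assumed image tag
    command: >
      -m /models/Qwen3-4B-Instruct-2507-Q4_K_M.gguf
      --ctx-size 16384
      --host 0.0.0.0
      --port 8080
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
    # GPU passthrough (e.g. `gpus: all`) omitted for brevity.
```

On Ollama, the equivalent knob is the model's num_ctx parameter.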
crzynik (Nicolas Mowen) – November 12, 2025, 12:21pm

> TheOriginal92: I'm using Qwen 14B non-thinking and exposing just 53 entities makes it behave very unreliably.

Right now I have 32. On top of what Nathan suggested, and depending on which entities you have, consider whether all of those devices really need to be addressed individually. You can create many different types of groups in HA, and a group is only one entity to pass in.

TheOriginal92 (The Original92) – November 13, 2025, 1:10pm

Thanks for the hint. In fact, I'm also using Qwen3 4B Instruct with its base 8K context. Since I'm using an A2000 Ada with 16 GB of VRAM, I have now doubled the context. Results are better but not perfect. For example, "turn on the light in the living room" sometimes turns on a light in another room, or even a fan or a socket. I would love to use a 7-8B Qwen Instruct model. Do you know of any available? By the way, your post helped me a lot; please keep updating it if you make further progress. Thank you!

TheOriginal92 (The Original92) – November 13, 2025, 1:11pm

Good suggestion. I have started grouping all the lights to reduce the number of exposed entities.
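As a minimal sketch of that idea (the entity IDs here are hypothetical), a light group collapses several bulbs into a single exposed entity:

```yaml
# configuration.yaml: one group entity stands in for three individual lights,
# so the LLM sees light.living_room_lights instead of three separate entities.
light:
  - platform: group
    name: Living Room Lights
    unique_id: living_room_lights_group
    entities:
      - light.living_room_lamp
      - light.living_room_ceiling
      - light.living_room_led_strip
```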
NathanCu (Nathan Curtis) – November 13, 2025, 1:15pm

You can absolutely fit gpt-oss:20b on a 16 GB card. It's my mainline local inference model and, to be honest, is WAY more capable than Qwen. You still have to manage the context, but given the same context size I've been more successful there. In Friday's Party (no, you don't need the whole thing) I talk about building context: how, what is needed, and why.

If you're fitting in context and it still misbehaves, then you have a grounding problem. Welcome to the see-saw: too much context, not done right; too little context, not done at all.

crzynik (Nicolas Mowen) – November 13, 2025, 1:42pm

> TheOriginal92: I would love to use a 7-8B Qwen Instruct model. Do you know of any available?

Are you using the base Qwen from Ollama? Those are typically quite heavily quantized, which is why I recommend picking a better build from Hugging Face: unsloth/Qwen3-4B-Instruct-2507 · Hugging Face

TheOriginal92 (The Original92) – November 13, 2025, 2:47pm

Okay, this is getting better and better. I tried loading the Ollama version of gpt-oss:20b onto my 16 GB card, but it did not fit. Any tips on how I can make this work? Also, I am looking for a way for voice Assist to memorize things, like preferences or its own findings. Is there any way to achieve this? Thank you again very much.

TheOriginal92 (The Original92) – November 13, 2025, 2:51pm

Initially I was using the "latest" quant from Hugging Face, which I think is Q4_K_S. Right now I am running Q8_0; I'm not sure if that's optimal. Any recommendations?

NathanCu (Nathan Curtis) – November 13, 2025, 2:58pm

I'll look at your card specifically for gpt-oss:20b, but it was absolutely designed to fit on a 16 GB card... we should be able to figure it out. Whatever model you do end up on, push the context size as big as you can without overrunning, and keep trying models. You want long-context, reasoning, tool-use models. Also, everything you just asked about is in the Friday thread. Sorry, I'm 220 posts deep now, but it's in there; memory needs some specific considerations, and the caveats are in there too.

crzynik (Nicolas Mowen) – December 4, 2025, 7:50pm

Had an interesting issue I ran into. I still prefer to have "local first" enabled, as it is a tad faster, and the chime is more pleasant than a "Turned on the light" response. However, I was noticing some weird behavior when asking "What is the weather?": the answer was nonsensical, while "What is the weather today?" correctly used the llm_intents script. Now that Home Assistant 2025.12 shows you the tools/intents that are called and their responses, I was able to get more insight here.

It turns out that Home Assistant has a weather intent, HassGetWeather, which was being handled locally. I didn't have any weather entities exposed to Assist, though, so it was effectively trying to run that intent, then falling back to the LLM, and the LLM was apparently just making up values based on the sensors it had access to.

For now, I overrode the local intent by creating an automation that triggers on the sentence "What is the weather" and re-implements the logic, using the AI Task service to summarize the information. This is a workaround; I would really love it if Home Assistant exposed all of the intents that are available, along with a way to choose which ones are immediately handed off to the LLM.

Example Automation

```yaml
alias: Override HassGetWeather
description: ""
triggers:
  - trigger: conversation
    command:
      - What is the weather
      - What's the weather
      - How is the weather
conditions: []
actions:
  - action: weather.get_forecasts
    metadata: {}
    target:
      entity_id: weather.forecast
    data:
      type: hourly
    response_variable: hourly_forecast
  - variables:
      items: "{{ 24 - now().hour }}"
      formatted_forecast: >
        {% set forecasts = hourly_forecast["weather.forecast"]["forecast"] %}
        {% for item in forecasts[:items] %}
        - Time: {{ as_timestamp(item.datetime) | timestamp_custom('%-I%p', true) | lower }}-{{ (as_timestamp(item.datetime) + 3600) | timestamp_custom('%-I%p', true) | lower }}
          Temperature: {{ item.temperature | int }}
          General Condition: {{ item.condition }}
          Precipitation: {% if item.precipitation_probability < 20 %}unlikely{% elif item.precipitation_probability
        {# the remainder of this template was cut off in the page capture #}
        {% endfor %}
  - action: ai_task.generate_data  # action header reconstructed from context; the capture resumes at the instructions
    data:
      instructions: >-
        You are a weather forecaster. Below is an hourly weather forecast, and
        your task is to summarize this information in one sentence. Summarize
        the forecast below in one to two sentences: {{ formatted_forecast }}
    response_variable: summary
  - set_conversation_response: "{{ summary.data }}"
mode: single
```
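For context on the template above: weather.get_forecasts returns a mapping keyed by the weather entity ID, which is why the template indexes hourly_forecast["weather.forecast"]["forecast"]. A trimmed sketch of the response shape (the values are made up for illustration):

```yaml
# Illustrative response from weather.get_forecasts with type: hourly.
weather.forecast:
  forecast:
    - datetime: "2025-12-04T15:00:00+00:00"
      condition: cloudy
      temperature: 41
      precipitation_probability: 15
    - datetime: "2025-12-04T16:00:00+00:00"
      condition: rainy
      temperature: 39
      precipitation_probability: 65
```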
crzynik (Nicolas Mowen) – December 31, 2025, 2:01pm

Had some family visiting for the holidays, and that exposed some issues with the current setup. The main problem was wake word activation; I found an improved OpenWakeWord training script for the ViewAssist device, which helped. The bigger problem, however, was that any time there was a false activation, the LLM would always end its response with a question, which created a loop.

I had originally used a silence prompt to respond with " ", but that seemed to cause the speaker to make a static noise, and for some reason the model is less willing to say something like that versus a true word or phrase. We also noticed that when activating with a command, if it heard you wrong, the response was way too wordy; it often gave example device names or areas, which is entirely unnecessary.

I adjusted my prompt for unclear request handling, and this has dramatically improved things.

Handling Unclear Requests prompt section

```
# Handling Unclear Requests
When you receive input, FIRST determine if it is a request directed at you. Follow this decision hierarchy:

## Identify Questions First (Highest Priority)
- If the input contains any question – including question marks, interrogative phrasing ("should I", "am I", "what", "how", "why", "can I", "do you think", etc.), or rhetorical questions – treat it as a request for information and ANSWER IT. Questions are inherently directed at you, regardless of how casual, conversational, or rhetorical they sound.
- Do not treat questions as "conversation not directed at you" even if they don't explicitly address you by name or sound like internal monologue.
- Questions seeking advice, opinions, information, or reassurance should always be answered.

## When to Remain Silent
- If the input is a complete, coherent STATEMENT (not a question) that appears to be part of a conversation not directed at you (someone talking to someone else, a statement that doesn't address you and doesn't seek information): respond "Sorry." and do not ask follow-up questions.
- If the input is clearly not a request or question meant for you (conversation fragments, background noise interpreted as text): respond "Sorry." and do not ask follow-up questions.

## When to Ask for Repetition
- If the input seems garbled, nonsensical, or like you may have misheard, but appears to be an attempt to ask you a question or make a request: respond "Can you repeat that?"
- If the input is incomplete or unclear but seems like it could be a question or request directed at you: respond "Can you repeat that?"
- After asking "Can you repeat that?" once, if the user responds "No" or declines, do not ask again. Simply acknowledge with "Okay" or remain silent.

## When to Ask for Specific Clarification
- If you understand the user wants to do something but don't know which device, room, or area: ask a short, specific follow-up question. For example: "Which room?" or "Which device?" or "What would you like to control?"
- ABSOLUTELY NEVER provide examples, list options, or say "for example" when asking for clarification. Ask only the question itself, such as "Which fan?" or "Which room?" Do not add any additional text after the question.

## General Rules for All Clarification Responses
- Never give long explanations about not understanding. Keep all confusion responses to one short sentence ending with a question mark.
- When the user provides a clear request after you asked for clarification, you MUST use the appropriate tools (weather tool, search tool, device controls, etc.) to fulfill that request. Do not provide answers based on conversation context alone – always use the required tools.
- If the user responds "No", "Nevermind", or declines to provide clarification after you asked for it, simply acknowledge with "Okay" or "Understood" and wait for their next request. Do not ask follow-up questions or offer additional help unless the user makes a new request.
```
crzynik (Nicolas Mowen) – January 5, 2026, 7:13pm

I have created a script which leverages Frigate and its Home Assistant integration to get information about what is happening on the cameras outside. It sends the current camera image to an AI Task (which must use a vision-capable model), along with information from Frigate on the count and activity of detected object types. This enables asking Home Assistant questions like "Who is at my door?" or "I just heard a noise in the backyard, do you see anything?" Note that the response time will be longer, as the vision analysis has to run as well.

Camera Analysis Script

```yaml
sequence:
  - variables:
      camera_snake_case: "{{ camera | lower | replace(' ', '_') }}"
      primary_objects: ["person", "bear"]
      secondary_objects: ["dog", "cat", "raccoon", "squirrel", "car", "bicycle", "rabbit"]
      sensor_info_text: |
        # Information from AI NVR
        # Primary Objects
        {% for obj in primary_objects %}
        {% set sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_count' %}
        {% set sensor_state = states(sensor_id) %}
        {% if sensor_state is not none and sensor_state != 'unknown' and sensor_state != 'unavailable' %}
        {% set active_sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_active_count' %}
        {% set active_state = states(active_sensor_id) | default(0) %}
        - Count of {{ obj }}s: {{ sensor_state }} ({{ active_state }} of which are active).
        {% endif %}
        {% endfor %}
        {% set last_face = states('sensor.' ~ camera_snake_case ~ '_last_recognized_face') | default('') %}
        {% if last_face and last_face != 'unknown' and last_face != 'None' and last_face != '' %}
        - Name of recognized person: {{ last_face }}.
        {% endif %}
        # Secondary Objects
        {% for obj in secondary_objects %}
        {% set sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_count' %}
        {% set sensor_state = states(sensor_id) %}
        {% if sensor_state is not none and sensor_state != 'unknown' and sensor_state != 'unavailable' %}
        {% set active_sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_active_count' %}
        {% set active_state = states(active_sensor_id) | default(0) %}
        - Count of {{ obj }}s: {{ sensor_state }} ({{ active_state }} of which are active).
        {% endif %}
        {% endfor %}
      instructions_text: >
        {{ sensor_info_text }}

        # How to provide analysis

        ## General Guidelines
        The AI NVR sensor data above is authoritative and indicates the actual presence of objects in the camera view. Use these sensor counts as the definitive source of information about what is present.

        ## What to Report
        Report ONLY object types that have an active count greater than zero. Do not describe object types with zero active counts, even if the total count is greater than zero. Focus exclusively on actively moving or present objects.

        ## Response Format
        For each object type with active count greater than zero, provide a concise summary that includes:
        - What the object(s) is/are
        - Location in the frame (e.g., foreground, background, left side, center)
        - Activity or movement being engaged in
        - Any relevant identifying details (only if significant)
        Keep each object type description to 1-3 sentences maximum. Be concise and factual. Do not describe stationary objects, non-active objects, or provide exhaustive lists of every object visible.

        ## What to Exclude
        - Do not describe object types with zero active counts
        - Do not describe stationary or parked objects that are not active
        - Do not provide detailed lists of every object visible
        - Do not describe general scene elements or environmental details
        - Do not use headers, markdown formatting, or structured lists in the response

        ## When No Active Objects
        If all object types have zero active counts, simply state that no active objects are present in the frame.
  - action: ai_task.generate_data
    metadata: {}
    data:
      task_name: Camera Frame Analysis
      instructions: "{{ instructions_text | trim }}"
      attachments:
        media_content_id: media-source://camera/camera.{{ camera_snake_case }}
        media_content_type: application/vnd.apple.mpegurl
        metadata:
          title: Back Deck Cam
          thumbnail: /api/camera_proxy/camera.{{ camera_snake_case }}
          media_class: video
          navigateIds:
            - {}
            - media_content_type: app
              media_content_id: media-source://camera
      entity_id: ai_task.ollama_ai_task
    response_variable: analysis
  - variables:
      response:
        instructions: >
          # Camera Analysis Response Guidelines
          You have received camera analysis data from the vision model. Provide a concise, natural response to the user's question about the camera view.

          ## Response Format
          - Summarize the analysis in a conversational, natural way suitable for text-to-speech
          - Focus on answering the user's specific question (e.g., "who is at the door", "what's in the backyard", "is anyone outside")
          - Keep responses brief and to the point – typically 1-3 sentences
          - Only mention active objects and their relevant details
          - If no active objects are present, state that clearly
          - Do not repeat technical details or sensor counts unless directly relevant to the user's question
          - Use natural language – avoid repeating the analysis verbatim

          ## Example Response Style
          If analysis shows: "One person is visible in the foreground, standing near the front door and appears to be waiting."
          Good response: "There's one person at the front door waiting."
          Bad response: "Based on the camera analysis, there is one person visible in the foreground, standing near the front door and appears to be waiting."
        output: "{{ analysis.data }}"
  - stop: Returning activity on camera
    response_variable: response
fields:
  camera:
    selector:
      select:
        options:
          - Back Deck Cam
          - Back Gate Cam
          - Corner Cam
          - Front Cam
          - Front Door Cam
          - Side Cam
    required: true
alias: Camera Analysis
description: >-
  Analyzes camera feeds to identify active objects, people, and activity. Use
  this tool when users ask about what is happening outside, who is at the
  door, what is in the backyard, or any questions about activity visible on
  security cameras. Provides information about people, animals, vehicles, and
  other objects detected in the camera view.
icon: mdi:camera-metering-matrix
```
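When this script is exposed to Assist, the LLM invokes it as a tool with the camera field filled in. For manual testing, here is a minimal sketch of calling it from another automation; the entity ID script.camera_analysis is an assumption derived from the script's alias:

```yaml
# Hypothetical automation step; the script entity ID is assumed.
- action: script.camera_analysis
  data:
    camera: Front Door Cam   # must be one of the selector options
  response_variable: result
# result.output then carries the vision model's analysis text, and
# result.instructions the response-formatting guidance returned by the stop step.
```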
maglat (Maglat) – January 5, 2026, 8:39pm

That's what I was looking for! Many thanks for setting this up; it saves me the time of doing it on my own. Great!

maglat (Maglat) – January 5, 2026, 8:49pm

Which weather.forecast provider/integration are you using? I have weather.home, and with it the automation breaks on my side with: UndefinedError: 'dict object' has no attribute 'precipitation_probability'.

crzynik (Nicolas Mowen) – January 5, 2026, 8:54pm

I use PirateWeather. It is interesting, though, that the format differs between weather providers.

NathanCu (Nathan Curtis) – January 5, 2026, 8:56pm

Different providers provide different things. Most have temperature and rainfall, but things like wind speed, UV, and so on vary provider by provider. Your tool should account for missing data and inform the LLM what to do when data isn't available.
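One way to apply that advice to the HassGetWeather override above is to guard each forecast attribute before using it. A minimal sketch, assuming the same hourly_forecast structure (the fallback wording is illustrative):

```yaml
# Defensive rewrite of the formatted_forecast variable: only reference
# precipitation_probability when the provider actually supplies it, so
# providers like weather.home do not raise UndefinedError.
- variables:
    formatted_forecast: >
      {% set forecasts = hourly_forecast["weather.forecast"]["forecast"] %}
      {% for item in forecasts[:items] %}
      - Temperature: {{ item.temperature | default('n/a') }}
        Condition: {{ item.condition | default('unknown') }}
        {%- if item.precipitation_probability is defined %}
        Precipitation probability: {{ item.precipitation_probability }}%
        {%- else %}
        Precipitation: no data from this provider
        {%- endif %}
      {% endfor %}
```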
crzynik (Nicolas Mowen) – January 5, 2026, 9:00pm

Yeah, I believe this is probably an issue within skye-harris/llm_intents on GitHub ("Exposes internet search tools for use by LLM-backed Assist in Home Assistant"). I will make an issue for it.

