Our large-scale evaluation reveals fundamental limitations of current Vision-Language Models for safety-critical environments and demonstrates how HD-Guard enables reliable real-time hazard detection.
Our evaluation shows that many state-of-the-art Vision-Language Models struggle with real-world safety detection. Even models with relatively high safety scores often exhibit premature alarm rates of up to 50%, making them impractical for real-world deployment.
Fine-grained error analysis reveals two major limitations: (1) models often miss critical objects in dynamic scenes, causing visual-perception failures when identifying potential hazards; (2) limited multimodal reasoning prevents models from recognizing dangerous actions that cause hidden risks, such as placing flammable materials next to an open flame.
Our hierarchical dual-brain architecture combines fast perception with deep reasoning, enabling reliable hazard detection while maintaining low latency. HD-Guard achieves a strong balance between safety accuracy and real-time responsiveness.
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of commonsense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe-action detection in these settings.
To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations.
Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
Hover over the videos below to reveal the 4 critical keyframes: Intent Onset, Intervention Deadline, PNR (Point-of-No-Return), and Impact.
HomeSafe-Bench is constructed via a hybrid pipeline that guarantees both physical accuracy and visual realism. We combine physical simulation (BEHAVIOR-1K) with generative video synthesis (Veo-3.1) based on real-world NEISS hospital reports.
Figure 1: Pipeline for HomeSafe-Bench dataset construction, combining LLM-driven danger scenario generation with physical simulation and video generation.
Class Distribution
Keyframe Annotation
We propose HD-Guard, a real-time streaming architecture that addresses the latency-accuracy trade-off. It uses a lightweight FastBrain (MiniCPM-o 4.5) for continuous, high-frequency filtering and an asynchronous SlowBrain (Qwen3-VL-30B-Thinking) for deep multimodal reasoning and commonsense verification.
Figure 2: The HD-Guard architecture, illustrating the traffic-light protocol and dynamic camera sampling-rate adjustment between the FastBrain and SlowBrain.
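The coordination described above can be sketched as a small streaming loop: a cheap per-frame screener escalates suspicious frames to an asynchronous deep verifier via a traffic-light state. This is a minimal illustration, not the paper's implementation; the stub scorers and the 0.5 / 0.8 thresholds are assumptions for demonstration only.

```python
import queue
import threading

# Stand-ins for the two models: the real system pairs MiniCPM-o 4.5
# (FastBrain) with Qwen3-VL-30B-Thinking (SlowBrain). These stubs and
# their thresholds are illustrative assumptions, not the paper's values.
def fast_brain_score(frame):
    """Cheap per-frame hazard screening; returns a risk score in [0, 1]."""
    return frame["risk"]

def slow_brain_verify(clip):
    """Deep reasoning over a buffered clip; returns a final verdict."""
    return "red" if max(f["risk"] for f in clip) > 0.8 else "green"

class DualBrainGuard:
    """Minimal sketch of the FastBrain/SlowBrain streaming loop."""

    def __init__(self, yellow_threshold=0.5, buffer_size=8):
        self.yellow_threshold = yellow_threshold
        self.buffer_size = buffer_size
        self.buffer = []          # rolling window of recent frames
        self.state = "green"      # traffic-light state
        self.jobs = queue.Queue()
        self.verdicts = []
        threading.Thread(target=self._slow_loop, daemon=True).start()

    def _slow_loop(self):
        # SlowBrain runs asynchronously so it never blocks frame screening.
        while True:
            clip = self.jobs.get()
            self.verdicts.append(slow_brain_verify(clip))
            self.jobs.task_done()

    def on_frame(self, frame):
        self.buffer = (self.buffer + [frame])[-self.buffer_size:]
        score = fast_brain_score(frame)
        if score >= self.yellow_threshold and self.state == "green":
            self.state = "yellow"  # escalate: hand the clip to SlowBrain
            self.jobs.put(list(self.buffer))
        elif score < self.yellow_threshold:
            self.state = "green"
        return self.state

    def drain(self):
        """Wait for all pending SlowBrain jobs to finish."""
        self.jobs.join()

guard = DualBrainGuard()
states = [guard.on_frame({"risk": r}) for r in (0.1, 0.2, 0.9, 0.95, 0.3)]
guard.drain()
print(states, guard.verdicts)
```

The key design point mirrored here is that the SlowBrain consumes clips from a queue on its own thread, so its latency never stalls the high-frequency screening path.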
Our comprehensive evaluation of open-source and closed-source models reveals that open-source models currently lead in specialized safety tasks. HD-Guard achieves an optimal trade-off between low end-to-end latency and high hazard detection quality.
Table 1: Main results across 4 metrics: Hazard Detection Rate (HDR), Effective Warning Precision (EWP), Phase Distribution Analysis (PDA), and Weighted Safety Score (WSS).
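To make two of these metrics concrete, here is a hedged sketch of how a hazard detection rate and an effective warning precision could be computed from per-case warning timestamps and each case's optimal intervention window. The exact metric definitions are given in the paper; these functions are plausible simplifications, and the field names (`warnings`, `window`) are illustrative.

```python
# Illustrative metric sketches; the paper's definitions are authoritative.
def hazard_detection_rate(cases):
    """Fraction of hazardous cases where the model raised any warning."""
    detected = sum(1 for c in cases if c["warnings"])
    return detected / len(cases)

def effective_warning_precision(cases):
    """Fraction of warnings that land inside the case's optimal window
    (after intent onset, before the point of no return)."""
    warnings = [(t, c["window"]) for c in cases for t in c["warnings"]]
    if not warnings:
        return 0.0
    effective = sum(1 for t, (lo, hi) in warnings if lo <= t <= hi)
    return effective / len(warnings)

cases = [
    {"warnings": [3.5], "window": (3.3, 4.2)},       # timely warning
    {"warnings": [1.0, 3.2], "window": (3.1, 3.7)},  # one premature alarm
    {"warnings": [], "window": (2.0, 3.0)},          # missed hazard
]
print(hazard_detection_rate(cases), effective_warning_precision(cases))
```

A premature alarm (the 1.0 s warning above) counts against precision even though the hazard is eventually flagged, which is why high detection rates can coexist with poor warning precision.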
Figure 3: Fine-grained error analysis. HD-Guard effectively eliminates reasoning deficits in complex physical hazards (D3) while minimizing missed visual detections.
Figure 4: (Left) Danger Severity Assessment Calibration. (Right) Analysis of the Latency-Safety Tradeoff. HD-Guard pushes the Pareto frontier.
Figure 5: Ablation Study on Sampling Frequency. The system achieves an optimal balance at 5 FPS.
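One way the traffic-light protocol can drive the camera sampling rate is a simple state-to-FPS mapping, with 5 FPS as the calm-state default suggested by the ablation. The specific rates for the elevated states are illustrative assumptions, not values from the paper.

```python
# Hypothetical mapping from traffic-light state to camera sampling rate.
# Only the 5 FPS default is grounded in the ablation; the rest is assumed.
SAMPLING_FPS = {"green": 5, "yellow": 15, "red": 30}

def frame_interval(state):
    """Seconds to wait between captured frames in the given alert state."""
    return 1.0 / SAMPLING_FPS[state]

print(frame_interval("green"))  # 0.2 s between frames at 5 FPS
```

Sampling sparsely while calm and densely once the FastBrain escalates is what lets the system keep average compute low without missing fast-developing hazards.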
Comparison of model performance on perception bottlenecks and reasoning bottlenecks. HD-Guard accurately localizes hazards within the optimal window.
Ground Truth Impact: 5.1s
(Optimal Window: 3.3s – 4.2s)
Ground Truth Impact: 8.0s
(Optimal Window: 3.1s – 3.7s)
@misc{pu2026homesafebenchevaluatingvisionlanguagemodels,
title={HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios},
author={Jiayue Pu and Zhongxiang Sun and Zilu Zhang and Xiao Zhang and Jun Xu},
year={2026},
eprint={2603.11975},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.11975},
}