HERO-VQL: Hierarchical, Egocentric and Robust Visual Query Localization

🔥 BMVC 2025 Oral Paper

1Kyung Hee University   2AI R&D Division, CJ Group

* Equally contributed first authors. † Corresponding author.

Abstract

In this work, we tackle egocentric visual query localization (VQL), where a model must localize a query object in a long-form egocentric video. Frequent and abrupt viewpoint changes in egocentric videos cause significant object appearance variations and partial occlusions, making it difficult for existing methods to achieve accurate localization. To address these challenges, we introduce Hierarchical, Egocentric and RObust Visual Query Localization (HERO-VQL), a novel method inspired by the human cognitive process of object recognition. We propose i) Top-down Attention Guidance (TAG) and ii) Egocentric Augmentation based Consistency Training (EgoACT). TAG refines the attention mechanism by leveraging the class token for high-level context and principal component score maps for fine-grained localization. To enhance learning in diverse and challenging matching scenarios, egocentric augmentation (EgoAug) increases query diversity by replacing the query with a randomly selected corresponding object from the ground-truth annotations and simulates extreme viewpoint changes by reordering video frames. In addition, a consistency training (CT) loss enforces stable object localization across different augmentation scenarios. Extensive experiments on the VQ2D dataset validate that HERO-VQL effectively handles egocentric challenges, significantly outperforming the baselines.

Problem

Teaser (a) Given an egocentric video and a query image of an object, the goal is to localize the last occurrence of the query object in the video. (b) Unlike third-person videos, egocentric videos undergo abrupt viewpoint changes due to the camera wearer’s movements. (c) These viewpoint changes introduce significant challenges in VQL, including variations in object appearance across perspectives and partial visibility when objects move out of the frame. For example, the appearance of a banana changes depending on the viewpoint, and a bottle becomes partially visible.


Architecture

Overview Given a video and a query image, we extract feature vectors using a pre-trained visual encoder. We feed the feature vectors through a spatial decoder and a temporal module; a prediction head then outputs per-frame bounding boxes and scores.
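For readers who prefer code, here is a minimal PyTorch sketch of this encoder → spatial decoder → temporal module → prediction head pipeline. The module choices (a linear stub in place of the frozen encoder, generic transformer layers for the spatial decoder and temporal module), the class name `HeroVQLSketch`, and all dimensions are illustrative assumptions, not the official implementation.

```python
# Minimal pipeline sketch (illustrative only; not the official HERO-VQL code).
import torch
import torch.nn as nn


class HeroVQLSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Pre-trained visual encoder stubbed as a projection of 768-d ViT tokens.
        self.encoder = nn.Linear(768, dim)
        # Spatial decoder: each frame's patch tokens cross-attend to the query tokens.
        self.spatial_decoder = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Temporal module: aggregates per-frame features across time.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Prediction head: per-frame box (cx, cy, w, h) and objectness score.
        self.box_head = nn.Linear(dim, 4)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frame_tokens, query_tokens):
        # frame_tokens: (T, N, 768) patch tokens per frame; query_tokens: (M, 768).
        T = frame_tokens.shape[0]
        f = self.encoder(frame_tokens)                 # (T, N, dim)
        q = self.encoder(query_tokens).unsqueeze(0)    # (1, M, dim)
        # Spatial decoding: frames attend to the query.
        decoded = self.spatial_decoder(f, q.expand(T, -1, -1))  # (T, N, dim)
        frame_feat = decoded.mean(dim=1).unsqueeze(0)           # (1, T, dim)
        # Temporal reasoning across frames.
        temporal_feat = self.temporal(frame_feat).squeeze(0)    # (T, dim)
        boxes = self.box_head(temporal_feat).sigmoid()          # (T, 4)
        scores = self.score_head(temporal_feat).squeeze(-1)     # (T,)
        return boxes, scores
```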

Top-down Attention Guidance

TAG TAG guides the spatial decoder’s attention using a high-level cue, derived from the class token, to capture the overall query context, and a mid-level cue, derived from principal component score maps, to enhance the understanding of fine-grained object parts.
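The sketch below illustrates one way such guidance could be realized, assuming the two cues are combined into an additive bias on the spatial decoder's frame-to-query cross-attention logits. The function name `tag_attention_bias`, the number of principal components `k`, and the way the cues are fused are assumptions for illustration; the paper's exact formulation may differ.

```python
# Illustrative sketch of a TAG-style attention bias (assumed additive on attention logits).
import torch
import torch.nn.functional as F


def tag_attention_bias(frame_patches, query_cls, query_patches, k=3):
    """
    frame_patches: (N, D) patch tokens of one frame
    query_cls:     (D,)   class token of the query image (high-level context)
    query_patches: (M, D) patch tokens of the query image (fine-grained parts)
    Returns an (N, M) bias added to the frame-to-query cross-attention logits.
    """
    # High-level cue: how well each frame patch matches the overall query context.
    high = F.cosine_similarity(frame_patches, query_cls.unsqueeze(0), dim=-1)  # (N,)

    # Mid-level cue: principal component score maps over the query patch features,
    # highlighting the dominant object parts.
    centered = query_patches - query_patches.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=k)       # principal directions, (D, k)
    part_scores = query_patches @ v                  # (M, k) score maps
    mid = part_scores.abs().sum(dim=-1)              # (M,) salience of each query patch

    # Frame patches matching the query context attend more to the query's salient parts.
    return high.unsqueeze(1) * mid.unsqueeze(0)      # (N, M)
```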

Egocentric Augmentation based Consistency Training

EgoACT To enable robust matching, we replace the query image with a randomly selected corresponding object instance from the ground-truth annotations. We reorder video frames based on object movement magnitude to simulate abrupt viewpoint changes. We also enforce temporal consistency, improving localization stability in egocentric videos.
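Below is a minimal sketch of how these two augmentations and a consistency objective could be wired together. The helpers `egocentric_augment` and `consistency_loss` are hypothetical, and the assumptions that per-frame motion magnitudes are available and that consistency is applied to per-frame boxes and scores are ours for illustration.

```python
# Illustrative sketch of EgoACT-style augmentation and a consistency objective.
import random
import torch
import torch.nn.functional as F


def egocentric_augment(frames, query_crop, gt_object_crops, motion_magnitude):
    """
    frames:           (T, C, H, W) video frames
    query_crop:       (C, h, w)    original query image
    gt_object_crops:  list of crops of the same object from ground-truth annotations
    motion_magnitude: (T,) per-frame object motion, used to simulate viewpoint jumps
    """
    # Query replacement: swap the query with another annotated view of the same object.
    if gt_object_crops:
        query_crop = random.choice(gt_object_crops)

    # Frame reordering: permute frames by motion magnitude so consecutive frames
    # exhibit abrupt appearance changes, mimicking extreme viewpoint shifts.
    order = torch.argsort(motion_magnitude, descending=True)
    return frames[order], query_crop, order


def consistency_loss(boxes_orig, scores_orig, boxes_aug, scores_aug, order):
    # Map augmented predictions back to the original frame order, then penalize
    # disagreement between the two views of the same video.
    inverse = torch.argsort(order)
    boxes_aug = boxes_aug[inverse]
    scores_aug = scores_aug[inverse]
    return F.l1_loss(boxes_aug, boxes_orig) + F.mse_loss(scores_aug, scores_orig)
```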


Experiments

Comparison with state-of-the-art on VQ2D

We report tAP25, stAP25, recovery (Rec. %), and success rate (Succ.) on the validation and test sets.

SOTA

Ablation study

To validate the effect of each component, we show the results on the VQ2D validation set. We report tAP25, stAP25, recovery (Rec. %), and success rate (Succ.) as percentages.

Ablation Study


Visualization of TAG

TAG We show the query image and four frames from each video with TAG’s high-level and mid-level attention guides. The high-level attention guide emphasizes the region of the target object, while the mid-level attention guide attends to specific object parts in the query feature. Together, they improve the model’s ability to localize objects accurately under varying perspectives or partial visibility.
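For reference, a principal component score map like the mid-level guide above can be visualized from query patch features with a few lines of code. The helper `pc_score_map` and its arguments are hypothetical and shown only to illustrate the idea; this is not the paper's visualization code.

```python
# Illustrative helper for visualizing a principal component score map of query patches.
import torch


def pc_score_map(query_patches, grid_hw, component=0):
    """
    query_patches: (M, D) patch tokens of the query image, M = grid_hw[0] * grid_hw[1]
    Returns an (H, W) score map for the chosen principal component.
    """
    centered = query_patches - query_patches.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=component + 1)
    scores = centered @ v[:, component]                              # (M,)
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)
    return scores.reshape(grid_hw)
```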


Citation

@inproceedings{chang2025herovql,
  title={HERO-VQL: Hierarchical, Egocentric and Robust Visual Query Localization},
  author={Chang, Joohyun and Hong, Soyeon and Lee, Hyogun and Ha, Seong Jong and Lee, Dongho and Kim, Seong Tae and Choi, Jinwoo},
  booktitle={The Thirty-Sixth British Machine Vision Conference},
  year={2025},
}