ECCV 2026

VideoSearch-R1: Iterative Video Retrieval and
Reasoning via Soft Query Refinement

An agentic framework that unifies inter-video retrieval and intra-video reasoning through multi-turn interaction with a video search engine.

1 KAIST 2 Korea University
 Equal contribution    Corresponding author
VideoSearch-R1 multi-turn interaction: a query is verified against a retrieved video, refined via Soft Query Refinement, and re-retrieved at rank 1 for temporal grounding.
An illustrative example of VideoSearch-R1. As an agentic AI system, VideoSearch-R1 enables multi-turn interaction through iterative video retrieval and reasoning, leveraging an external video search engine. At Turn 1 the ground-truth video is ranked 14th; after verifying the mismatch and applying Soft Query Refinement, the engine retrieves it at rank 1 in Turn 2, where the model then performs temporal grounding. This pipeline unifies corpus-level inter-video reasoning (video retrieval) with intra-video reasoning (temporal grounding).
Overview

Abstract

As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning) within the retrieved content, such as temporal grounding. However, existing approaches typically treat retrieval as a preprocessing step, so when the initial retrieval fails there is no mechanism to refine the search, leading to the failure of subsequent fine-grained reasoning. Moreover, recent agentic frameworks typically assume the query-relevant video is already given, thereby bypassing retrieval. To address these limitations, we propose VideoSearch-R1, an agentic framework for iterative video retrieval and reasoning through multi-turn interaction with a video search engine. We introduce Soft Query Refinement (SQR) to refine search query tokens in a continuous latent space rather than rewriting queries in the discrete text space, enabling more efficient and fine-grained adjustments. SQR and its reasoning process are trained with Group Relative Policy Optimization (GRPO), guided by task-level rewards derived from retrieval and downstream tasks. VideoSearch-R1 achieves state-of-the-art performance across three datasets on Video Corpus Moment Retrieval (VCMR), while requiring significantly fewer generated tokens than explicit text-level query refinement.

Contributions

What’s new

Agentic retrieval & reasoning

VideoSearch-R1 iteratively retrieves candidate videos via a search engine, verifies query–video matching, refines the query, and performs intra-video reasoning—all through multi-turn interaction.

Soft Query Refinement

Instead of rewriting queries as text, SQR generates soft query tokens in a continuous latent space for fine-grained refinement, using 8 latent tokens versus 26.8 text tokens for hard refinement.

State-of-the-art VCMR

By jointly optimizing inter-video retrieval and intra-video reasoning with GRPO, VideoSearch-R1 sets a new state of the art on three VCMR benchmarks in both retrieval and temporal grounding.

Method

How VideoSearch-R1 works

VideoSearch-R1 framework: video corpus and search engine return a top-1 video; verification produces a reasoning trace and a match decision that branches to Soft Query Refinement or temporal grounding.
Iterative video retrieval and reasoning of VideoSearch-R1. Given an initial query $q_1$, the model retrieves the top-1 video from a corpus and performs verification, producing a reasoning trace $r_t$ and a matching decision $y_t^{\mathrm{ret}}$. If ‘not match’, it performs SQR by generating soft query tokens $q_t^{\mathrm{soft}}$ to form a refined query $q_{t+1} = [q_t \,\|\, q_t^{\mathrm{soft}}]$; if ‘match’, it conducts temporal grounding to predict the start and end timestamps.
1. Retrieve

Query the video search engine (Qwen3-VL-Embedding-2B) and return the top-1 candidate from a large-scale corpus.

2. Verify

Reason over the retrieved video and decide match / not match, emitting a reasoning trace.

3. Soft Query Refinement

If not matched, generate N=8 soft query tokens in latent space and append them to the original query.

4. Temporal Grounding

On a match, predict the precise start and end timestamps of the query-relevant moment.
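
The four steps above can be read as a single control loop. The following is a minimal sketch of that loop, not the authors' implementation: the four callables (`retrieve_top1`, `verify`, `refine`, `ground`), the `Result` type, and the list-of-vectors query representation are all hypothetical placeholders. The default of three turns and the behaviour when no video is accepted are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Token = Sequence[float]  # one query token embedding (hypothetical representation)


@dataclass
class Result:
    video_id: Optional[str]              # video accepted by verification, if any
    span: Optional[Tuple[float, float]]  # predicted (start, end) in seconds
    turns: int                           # number of retrieval turns used


def iterative_search(
    query: List[Token],
    retrieve_top1: Callable[[List[Token]], str],
    verify: Callable[[List[Token], str], bool],
    refine: Callable[[List[Token], str], List[Token]],
    ground: Callable[[List[Token], str], Tuple[float, float]],
    max_turns: int = 3,  # assumption: a small fixed turn budget
) -> Result:
    """Retrieve, verify, then either ground the moment or refine and retry."""
    q = list(query)
    for turn in range(1, max_turns + 1):
        video = retrieve_top1(q)            # step 1: top-1 candidate from the engine
        if verify(q, video):                # step 2: match / not match
            return Result(video, ground(q, video), turn)  # step 4: grounding
        if turn < max_turns:
            # step 3: append the new soft tokens, q_{t+1} = [q_t || q_t^soft]
            q = q + list(refine(q, video))
    # Assumption: report failure if no candidate was accepted within the budget.
    return Result(None, None, max_turns)
```

Keeping the search engine, verifier, refiner, and grounder behind plain callables makes the turn structure easy to exercise with toy stubs before any model is plugged in.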

Soft vs. hard query refinement. Unlike hard refinement, which rewrites queries in the discrete text space, SQR generates continuous soft query tokens that are appended to the original query for fine-grained adjustment. These tokens are trained with an InfoNCE retrieval objective $\mathcal{L}_{\mathrm{ret}}$, which provides richer discriminative supervision than next-token prediction, and SQR reaches superior retrieval with just 8 latent tokens instead of 26.8 rewritten text tokens.
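
To make the InfoNCE retrieval objective concrete, here is a minimal sketch of the standard InfoNCE loss over one positive video and a set of negatives. The page does not say how embeddings are pooled, which similarity is used, or what temperature is chosen. Cosine similarity, the `temperature=0.07` default, and the function names are assumptions for illustration only.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumption: a commonly used value, not from the source
) -> float:
    """Standard InfoNCE: -log softmax of the positive's similarity over all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

The loss shrinks toward zero as the refined query embedding moves closer to the ground-truth video than to any negative, which gives the soft tokens a direct training signal even though they have no text labels.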

Two-stage training. A Supervised Fine-Tuning cold start (from Qwen3-VL-2B-Instruct) initializes a structured reasoning template and meaningful soft query tokens. It optimizes a verification loss, a temporal-grounding loss, and the InfoNCE retrieval loss $\mathcal{L}_{\mathrm{ret}}$, which supervises the otherwise-unlabeled soft tokens against negative videos. GRPO reinforcement learning then explores reasoning trajectories under four complementary rewards: format, verification, retrieval ($R_{\mathrm{ret}} = \exp(-\mathcal{L}_{\mathrm{ret}})$), and temporal grounding (IoU). Reward is thereby propagated across both inter-video retrieval and intra-video reasoning for holistic optimization.
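
The reward side of the second stage can be sketched as follows. Only the retrieval reward (the exponential of the negative retrieval loss) and the use of IoU for temporal grounding come from the description above. The binary format and verification rewards, the unweighted sum in `total_reward`, and the population standard deviation in the group normalization are assumptions. The group-relative advantage itself follows the usual GRPO recipe of standardizing rewards within a group of sampled trajectories.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def retrieval_reward(loss_ret: float) -> float:
    """Retrieval reward exp(-L_ret): 1.0 at zero loss, decaying toward 0."""
    return math.exp(-loss_ret)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(format_ok: bool, verify_ok: bool, loss_ret: float,
                 pred: Span, gt: Span) -> float:
    # Assumption: binary format/verification rewards and an unweighted sum.
    return (float(format_ok) + float(verify_ok)
            + retrieval_reward(loss_ret) + temporal_iou(pred, gt))


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, a trajectory is reinforced only when it scores better than its siblings for the same query, so no separate value network is needed.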

Experiments

Main results

Method  VCMR (IoU/R@1): 0.3  0.5  0.7  VER Acc  VR R@1

Charades-FIG
CONQUER  1.2  0.7  2.8
SQuiDNet  2.6  0.9  11.7
Qwen3-VL-2B (ZS)  12.2  7.2  2.9  30.0  21.6
Qwen3-VL-2B (FT)  12.9  10.4  7.2  74.7  21.6
VideoSearch-R1  16.5  13.4  8.2  75.7  24.6

DiDeMo-FIG
CONQUER  5.5  3.7  14.8
SQuiDNet  2.9  0.5  16.9
Qwen3-VL-2B (ZS)  22.0  10.6  4.0  62.8  54.8
Qwen3-VL-2B (FT)  23.6  22.1  16.7  73.1  54.8
VideoSearch-R1  33.3  30.2  19.7  74.6  59.0

ActivityNet-FIG
CONQUER  3.0  1.6  13.5
SQuiDNet  4.7  2.1  32.6
Qwen3-VL-2B (ZS)  17.2  10.1  5.8  63.0  55.1
Qwen3-VL-2B (FT)  29.1  19.2  11.4  83.1  55.1
VideoSearch-R1  33.8  22.3  12.3  83.3  61.1


VideoSearch-R1 iteratively refines the query via SQR, lifting video retrieval despite using the same search engine—and consistently improving VCMR and verification over zero-shot and fine-tuned baselines.
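
As a quick check of the retrieval claim, the snippet below computes the absolute VR R@1 gain of VideoSearch-R1 over the Qwen3-VL-2B rows, which share the same search engine and therefore the same VR R@1. The values are read from the last column of the table above. The dictionary layout and variable names are illustrative only.

```python
# VR R@1 per dataset: (same search engine without refinement, VideoSearch-R1)
vr_r1 = {
    "Charades-FIG": (21.6, 24.6),
    "DiDeMo-FIG": (54.8, 59.0),
    "ActivityNet-FIG": (55.1, 61.1),
}

# Absolute gain in percentage points, rounded to the table's one-decimal precision.
gains = {name: round(ours - base, 1) for name, (base, ours) in vr_r1.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} R@1")
```

On these numbers the gain is 3.0, 4.2, and 6.0 points on Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG respectively.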

Analysis

Why Soft Query Refinement works

Soft tokens incrementally steer the query embedding toward the target video; a couple of refinement turns suffice.

Two line plots: R@1 rises as the number of soft tokens increases; VCMR IoU/R@1 improves from turn 1 to 2 and saturates at three turns.
Left — effect of the number of soft tokens. As more soft tokens are appended, average R@1 rises consistently, showing that soft tokens progressively refine the query.  Right — effect of multi-turn inference. Performance jumps from the first to the second turn and saturates at T=3, so a small number of refinement turns balances accuracy and cost.
Qualitative example: as soft tokens increase, the retrieved video shifts from a coarse match to the ground-truth video that matches blonde hair and a light blue wall at rank 1.
Changes in the retrieved video as soft tokens increase. Without soft tokens the engine retrieves a coarse match (‘a woman brushing’). As tokens are added, results capture finer attributes (‘blonde hair’, ‘combed by a person’), and with eight soft tokens the ground-truth video—capturing ‘light blue wall’—is retrieved at rank 1.
Qualitative comparison: hard refinement's verbose rewrite still returns the wrong video, while Soft Query Refinement retrieves the ground-truth video and localizes the moment.
Qualitative comparison between SQR and HQR. Hard refinement’s verbose rewrite still returns the wrong video at rank 1, whereas SQR refines the query representation with eight latent tokens, retrieves the ground-truth video, and localizes the moment (start 0.0s, end 6.6s).
Cite

BibTeX

@inproceedings{lee2026videosearchr1,
  title     = {VideoSearch-R1: Iterative Video Retrieval and Reasoning
               via Soft Query Refinement},
  author    = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and
               Kim, Jongha and Kim, Hyunwoo J.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}