VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

Overview

Abstract

As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning) within the retrieved content, such as temporal grounding. However, existing approaches typically treat retrieval as a preprocessing step, so when the initial retrieval fails there is no mechanism to refine the search, leading to the failure of subsequent fine-grained reasoning. Moreover, recent agentic frameworks typically assume the query-relevant video is already given, thereby bypassing retrieval. To address these limitations, we propose VideoSearch-R1, an agentic framework for iterative video retrieval and reasoning through multi-turn interaction with a video search engine. We introduce Soft Query Refinement (SQR) to refine search query tokens in a continuous latent space rather than rewriting queries in the discrete text space, enabling more efficient and fine-grained adjustments. SQR and its reasoning process are trained with Group Relative Policy Optimization (GRPO), guided by task-level rewards derived from retrieval and downstream tasks. VideoSearch-R1 achieves state-of-the-art performance across three datasets on Video Corpus Moment Retrieval (VCMR), while requiring significantly fewer generated tokens than explicit text-level query refinement.

Contributions

What’s new

Agentic retrieval & reasoning

VideoSearch-R1 iteratively retrieves candidate videos via a search engine, verifies query–video matching, refines the query, and performs intra-video reasoning—all through multi-turn interaction.

Soft Query Refinement

Instead of rewriting queries as text, SQR generates soft query tokens in a continuous latent space for fine-grained refinement. It uses 8 latent tokens per refinement, whereas hard (text-level) refinement generates 26.8 text tokens on average.

State-of-the-art VCMR

By jointly optimizing inter-video retrieval and intra-video reasoning with GRPO, VideoSearch-R1 sets a new state of the art on three VCMR benchmarks in both retrieval and temporal grounding.

Method

How VideoSearch-R1 works

Figure: VideoSearch-R1 framework. **Iterative video retrieval and reasoning of VideoSearch-R1.** Given an initial query `q_1`, the model retrieves the top-1 video from a corpus and performs verification, producing a reasoning trace `r_t` and a matching decision `y_t^ret`. If the decision is *‘not match’*, it performs SQR by generating soft query tokens `q_t^soft` to form a refined query `q_{t+1} = [q_t ‖ q_t^soft]`; if it is *‘match’*, it conducts temporal grounding to predict the start and end timestamps.

Retrieve

Query the video search engine (Qwen3-VL-Embedding-2B) and return the top-1 candidate from a large-scale corpus.

Verify

Reason over the retrieved video and decide match / not match, emitting a reasoning trace.

Soft Query Refinement

If not matched, generate N=8 soft query tokens in latent space and append them to the original query.

Temporal Grounding

On a match, predict the precise start and end timestamps of the query-relevant moment.
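The four steps above can be sketched as a short control loop. This is a minimal illustration rather than the authors' implementation: `retrieve_top1` stands in for the search engine with mean-pooled cosine similarity, and `verify`, `refine`, and `ground` are hypothetical callables standing in for the model's verification, SQR, and temporal grounding outputs. The query update follows the caption's formula `q_{t+1} = [q_t ‖ q_t^soft]`. Returning no span when the turn budget runs out is an assumption, since the page does not say what happens in that case.

```python
import numpy as np


def retrieve_top1(query_tokens, corpus_embs):
    """Toy search engine (assumption): mean-pool the query tokens and
    return the index of the most cosine-similar video embedding."""
    q = query_tokens.mean(axis=0)
    q = q / np.linalg.norm(q)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))


def search_loop(query_tokens, corpus_embs, verify, refine, ground, max_turns=3):
    """Retrieve -> verify -> (refine | ground), for at most max_turns turns.

    verify(video_id) -> bool             hypothetical match decision
    refine(query, video_id) -> (N, d)    hypothetical soft query tokens
    ground(video_id) -> (start, end)     hypothetical temporal grounding
    """
    q = query_tokens
    for turn in range(1, max_turns + 1):
        video_id = retrieve_top1(q, corpus_embs)
        if verify(video_id):
            # Match: localize the moment inside the retrieved video.
            return video_id, ground(video_id), turn
        # Not match: append soft tokens, q_{t+1} = [q_t || q_t^soft].
        soft_tokens = refine(q, video_id)
        q = np.concatenate([q, soft_tokens], axis=0)
    # Assumption: no grounding if no retrieved video was ever verified.
    return video_id, None, max_turns
```

In this sketch the number of soft tokens is whatever `refine` returns; the page uses N=8, and `max_turns=3` mirrors the turn count at which the reported performance saturates.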

Soft vs. hard query refinement. Unlike hard refinement, which rewrites queries in the discrete text space, SQR generates continuous soft query tokens that are appended to the original query for fine-grained adjustment. The soft tokens are trained with an InfoNCE retrieval objective ℒ_ret, which provides richer discriminative supervision than next-token prediction. As a result, SQR reaches superior retrieval with just 8 latent tokens instead of 26.8 rewritten text tokens.
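For readers unfamiliar with InfoNCE, the snippet below shows the standard form of the loss for one query against one positive video and several negatives. It is a sketch under assumptions: the page does not specify the similarity function, the temperature, or how the refined query is pooled into a single embedding, so cosine similarity and `temperature=0.07` are illustrative choices only.

```python
import numpy as np


def info_nce(query_emb, video_embs, pos_index, temperature=0.07):
    """Standard InfoNCE: cross-entropy of the positive video under a
    softmax over query-video similarities.

    query_emb:  (d,)   pooled embedding of the refined query
    video_embs: (K, d) one positive and K-1 negative video embeddings
    pos_index:  row of video_embs holding the positive
    temperature: illustrative value, not taken from the page
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    logits = (v @ q) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[pos_index])
```

The loss is small when the query embedding is much closer to the positive than to every negative, which is the sense in which it supervises the soft tokens discriminatively.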

Two-stage training. A Supervised Fine-Tuning (SFT) cold start, starting from Qwen3-VL-2B-Instruct, initializes a structured reasoning template and meaningful soft query tokens. It optimizes three losses: a verification loss, a temporal-grounding loss, and the InfoNCE retrieval loss ℒ_ret, which supervises the otherwise-unlabeled soft tokens against negative videos. GRPO reinforcement learning then explores reasoning trajectories under four complementary rewards: format, verification, retrieval (R_ret = exp(−ℒ_ret)), and temporal grounding (IoU). These rewards propagate across both inter-video retrieval and intra-video reasoning for holistic optimization.
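The reward terms named above can be made concrete with a few lines of code. The retrieval reward `exp(−ℒ_ret)` and the IoU grounding reward come from the text; everything else here is an assumption made for illustration: the format and verification rewards are treated as 0/1 indicators, the four terms are combined by an unweighted sum, and the advantage uses the usual GRPO group normalization with a population standard deviation.

```python
import math
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) spans in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def retrieval_reward(l_ret):
    """R_ret = exp(-L_ret): equals 1 at zero loss and decays toward 0."""
    return math.exp(-l_ret)


def total_reward(format_ok, verify_ok, l_ret, pred_span, gt_span):
    """Assumption: unweighted sum of the four rewards; the page does
    not state how they are weighted or combined."""
    return (float(format_ok) + float(verify_ok)
            + retrieval_reward(l_ret) + temporal_iou(pred_span, gt_span))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because `exp(−ℒ_ret)` lies in (0, 1] and IoU lies in [0, 1], all four terms share a comparable scale under these assumptions, so no single reward dominates the sum.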

Experiments

Main results

Method	VCMR R@1 (IoU=0.3)	VCMR R@1 (IoU=0.5)	VCMR R@1 (IoU=0.7)	VER Acc	VR R@1
Charades-FIG
CONQUER	–	1.2	0.7	–	2.8
SQuiDNet	–	2.6	0.9	–	11.7
Qwen3-VL-2B (ZS)	12.2	7.2	2.9	30.0	21.6
Qwen3-VL-2B (FT)	12.9	10.4	7.2	74.7	21.6
VideoSearch-R1	16.5	13.4	8.2	75.7	24.6
DiDeMo-FIG
CONQUER	–	5.5	3.7	–	14.8
SQuiDNet	–	2.9	0.5	–	16.9
Qwen3-VL-2B (ZS)	22.0	10.6	4.0	62.8	54.8
Qwen3-VL-2B (FT)	23.6	22.1	16.7	73.1	54.8
VideoSearch-R1	33.3	30.2	19.7	74.6	59.0
ActivityNet-FIG
CONQUER	–	3.0	1.6	–	13.5
SQuiDNet	–	4.7	2.1	–	32.6
Qwen3-VL-2B (ZS)	17.2	10.1	5.8	63.0	55.1
Qwen3-VL-2B (FT)	29.1	19.2	11.4	83.1	55.1
VideoSearch-R1	33.8	22.3	12.3	83.3	61.1


VideoSearch-R1 iteratively refines the query via SQR, which lifts video retrieval R@1 by 3.0 to 6.0 points over the Qwen3-VL-2B baselines despite using the same search engine. It also consistently improves VCMR and verification accuracy over both the zero-shot (ZS) and fine-tuned (FT) baselines.

Analysis

Why Soft Query Refinement works

Soft tokens incrementally steer the query embedding toward the target video; a couple of refinement turns suffice.

Figure: two line plots (R@1 vs. number of soft tokens; VCMR IoU/R@1 vs. inference turn). **Left: effect of the number of soft tokens.** As more soft tokens are appended, average R@1 rises consistently, showing that soft tokens progressively refine the query. **Right: effect of multi-turn inference.** Performance jumps from the first to the second turn and saturates at `T=3`, so a small number of refinement turns balances accuracy and cost.

Figure: qualitative retrieval example. **Changes in the retrieved video as soft tokens increase.** Without soft tokens the engine retrieves a coarse match (‘a woman brushing’). As tokens are added, results capture finer attributes (‘blonde hair’, ‘combed by a person’), and with eight soft tokens the ground-truth video, which also matches ‘light blue wall’, is retrieved at rank 1.

Figure: qualitative comparison of soft and hard refinement. **Qualitative comparison between SQR and hard query refinement (HQR).** Hard refinement’s verbose rewrite still returns the wrong video at rank 1, whereas SQR refines the query representation with eight latent tokens, retrieves the ground-truth video, and localizes the moment (start 0.0s, end 6.6s).
