LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Siyuan Wang*1,2, Gaokai Zhang*1,3, Li Lyna Zhang1, Ning Shang1, Fan Yang1, Dongyao Chen2, Mao Yang1
1Microsoft Research Asia   2Shanghai Jiao Tong University   3Carnegie Mellon University
ICLR 2026 Oral
*Equal Contribution

Abstract

Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce.

In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly.

RL training on KeyChain data induces an emergent plan–retrieve–reason–recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.

Highlights: 100% NIAH retrieval · 16K → 128K length generalization · rivals o3-mini & DeepSeek-R1 · simple and handy KeyChain data synthesis

Method Overview

LoongRL constructs KeyChain training data by inserting UUID key-value chains into long-context documents, hiding the true question behind a chain of linked keys. The model must trace the correct chain, recover the original question, and then reason over the full context to answer.


Figure 1: Overview of KeyChain data construction. Short multi-hop QA is transformed into high-difficulty long-context reasoning tasks.

KeyChain Data Example

Below is the skeleton of a KeyChain-augmented long-context question. UUID chains are inserted among documents, with only one chain leading to the correct question.


Figure 2: A skeleton of KeyChain-augmented training data used in LoongRL.

KeyChain-Augmented Long-Context Question Template
Please read the following text.
<Document 0>
<original text>
{"UUIDB-n": "distracting question"}
<original text>
<Document 1>
{"UUIDA-1": "UUIDA-2"}
<Document 2>
{"UUIDB-1": "UUIDB-2"}
...
{"UUIDA-n": "correct question"}
...

In the context above, there is one correct question to answer. The correct question can only be found by following the correct consecutive chain of key:value pairs encoded with UUID strings, starting from the given starting key "UUIDA-1".
Find the correct question first, then answer it.
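
As a rough illustration of how such examples can be assembled, the sketch below builds a KeyChain-style prompt in Python. It is a minimal, hypothetical version (function names and chain-placement details are ours, not the paper's exact pipeline): the true question is hidden at the end of one UUID chain, distractor chains end in distracting questions, and the shuffled key-value pairs are interleaved with the documents before the instruction is appended.

import json
import random
import uuid

def build_keychain_prompt(documents, true_question, distractor_questions,
                          chain_length=8, num_distractor_chains=3):
    """Hypothetical sketch of KeyChain synthesis: hide the true question behind
    a chain of UUID key-value pairs, add distractor chains ending in distracting
    questions, and scatter all pairs among the documents."""
    def make_chain(question):
        keys = [str(uuid.uuid4()) for _ in range(chain_length)]
        pairs = [{keys[i]: keys[i + 1]} for i in range(chain_length - 1)]
        pairs.append({keys[-1]: question})   # the last key reveals the question
        return keys[0], pairs

    start_key, pairs = make_chain(true_question)
    for q in distractor_questions[:num_distractor_chains]:
        pairs += make_chain(q)[1]            # distractor chains; their start keys go unused
    random.shuffle(pairs)

    # Interleave the shuffled pairs with the documents. (The paper embeds pairs
    # inside document text; interleaving between documents is a simplification.)
    segments = []
    for i, doc in enumerate(documents):
        segments.append(f"<Document {i}>\n{doc}")
        if pairs:
            segments.append(json.dumps(pairs.pop()))
    segments += [json.dumps(p) for p in pairs]  # any leftover pairs

    instruction = (
        "In the context above, there is one correct question to answer. "
        "The correct question can only be found by following the correct "
        "consecutive chain of key:value pairs encoded with UUID strings, "
        f'starting from "{start_key}". '
        "Find the correct question first, then answer it."
    )
    return "Please read the following text.\n" + "\n".join(segments) + "\n\n" + instruction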

Main Results

LoongRL delivers frontier-level long-context reasoning at much smaller scales (7B/14B), rivaling o3-mini and DeepSeek-R1, while preserving general short-context abilities.

Table 1: Long-Context Reasoning & General Short-Context Abilities

Model | Long-Context Avg. | HotpotQA | 2WikiMQA | MuSiQue | NarrativeQA | QASPER | General Avg. | MMLU | MATH | IFEval
o3-mini (medium) | 74.5 | 83.0 | 89.0 | 64.0 | 60.7 | 60.5 | 92.1 | 86.9 | 98.0 | 91.5
DeepSeek-R1 | 74.9 | 82.7 | 91.3 | 72.2 | 66.9 | 61.4 | 90.5 | 90.8 | 97.3 | 83.3
GPT-4o | 64.7 | 82.5 | 78.0 | 54.0 | 60.5 | 48.5 | 82.5 | 88.7 | 74.6 | 84.3
QwQ-32B | 69.6 | 78.5 | 87.4 | 62.7 | 61.1 | 58.5 | 85.9 | 75.7 | 98.0 | 83.9
R1-Distill-LLaMA-70B | 65.4 | 76.1 | 85.0 | 61.9 | 53.4 | 50.5 | 85.4 | 82.4 | 94.5 | 79.3
7B Scale
Qwen2.5-7B-Instruct | 48.9 | 69.5 | 50.5 | 34.0 | 44.5 | 46.0 | 73.5 | 73.4 | 76.0 | 71.2
R1-Distill-Qwen-7B | 31.2 | 40.2 | 53.3 | 11.1 | 8.9 | 42.5 | 69.9 | 62.3 | 92.8 | 54.7
LoongRL-7B | 72.4 | 83.1 | 91.1 | 65.6 | 58.4 | 63.6 | 75.0 | 76.2 | 78.0 | 70.9
14B+ Scale
Qwen2.5-14B-Instruct | 53.1 | 74.0 | 60.5 | 36.5 | 48.5 | 46.0 | 81.3 | 79.4 | 83.4 | 81.0
R1-Distill-Qwen-14B | 64.9 | 77.5 | 87.0 | 58.0 | 51.0 | 51.0 | 81.0 | 76.6 | 93.9 | 72.6
R1-Distill-Qwen-32B | 65.5 | 76.3 | 87.6 | 59.8 | 52.7 | 50.9 | 82.4 | 80.5 | 94.3 | 72.5
QwenLong-L1-32B | 70.1 | 80.7 | 89.1 | 65.2 | 58.6 | 56.7 | 84.1 | 78.5 | 95.2 | 78.6
LoongRL-14B | 74.2 | 82.2 | 93.3 | 67.5 | 63.4 | 64.5 | 80.7 | 80.5 | 83.2 | 78.4

Length Generalization

While trained only on 16K contexts, LoongRL generalizes impressively to contexts up to 128K tokens.

Table 2: Generalization from 16K Training to 128K Evaluation

Model | NarrativeQA 0-16K | NarrativeQA 16K-32K | NarrativeQA 32K-64K | RULER 16K | RULER 32K | RULER 64K | RULER 128K
Qwen2.5-7B-Instruct | 55.7 | 35.2 | 42.4 | 92.3 | 89.5 | 81.8 | 69.4
R1-Distill-Qwen-7B | 55.7 | 35.2 | 42.4 | 18.9 | 4.4 | 1.4 | 0.9
LoongRL-7B | 69.8 | 47.4 | 57.2 | 93.4 | 91.4 | 86.2 | 76.8
Qwen2.5-14B-Instruct | 55.7 | 40.7 | 48.3 | 93.4 | 92.5 | 82.3 | 73.6
R1-Distill-Qwen-14B | 63.0 | 35.9 | 54.6 | 85.7 | 82.0 | 60.2 | 28.2
R1-Distill-Qwen-32B | 57.4 | 44.4 | 58.9 | 90.3 | 88.9 | 71.5 | 40.9
QwenLong-L1-32B | 65.9 | 48.1 | 60.0 | 87.6 | 86.8 | 80.6 | 70.2
LoongRL-14B | 69.5 | 55.2 | 64.3 | 95.4 | 95.1 | 87.1 | 79.9

Ablation Studies

KeyChain Data Effectiveness

Model | HotpotQA | 2WikiMQA | MuSiQue | NarQA | QASPER | Avg.
Qwen2.5-7B-Instruct | 69.5 | 50.5 | 34.0 | 44.5 | 46.0 | 48.9
LoongRL-7B (no KeyChain) | 80.3 | 84.7 | 58.5 | 53.0 | 54.5 | 66.2
LoongRL-7B | 83.1 | 91.1 | 65.6 | 58.4 | 63.6 | 72.4

Answer Verifier Comparison

Reward Verifier | HotpotQA | 2WikiMQA | MuSiQue | NarQA | QASPER | Avg.
F1 score | 79.5 | 86.4 | 58.0 | 46.6 | 55.0 | 65.1
LLM-as-a-judge | 80.0 | 87.6 | 60.0 | 52.3 | 54.5 | 65.2
Exact match | 82.7 | 91.3 | 66.3 | 51.0 | 54.9 | 69.2
Two-way Substring (ours) | 83.1 | 91.1 | 65.6 | 58.4 | 63.6 | 72.4
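
The two-way substring verifier can be read as: after standard answer normalization, the prediction earns full reward if it contains a gold answer or is itself contained in one. The snippet below is a minimal sketch under that reading; the exact normalization and edge-case handling used in LoongRL may differ.

import re
import string

def normalize(text):
    """Standard QA normalization: lowercase, drop punctuation and articles,
    collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def two_way_substring_reward(prediction, gold_answers):
    """Reward 1.0 if the normalized prediction contains a normalized gold
    answer, or a gold answer contains the prediction; otherwise 0.0."""
    pred = normalize(prediction)
    if not pred:
        return 0.0
    for gold in gold_answers:
        g = normalize(gold)
        if g and (g in pred or pred in g):
            return 1.0
    return 0.0

print(two_way_substring_reward("The Sun.", ["the Sun"]))  # 1.0
print(two_way_substring_reward("Jupiter", ["the Sun"]))   # 0.0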

Training Dynamics

Long-context reasoning accuracy and response length improve consistently across each stage of RL training, as shown below.


Figure 3: Long-context reasoning accuracy and training response lengths throughout RL training for both 7B and 14B models.

Emergent Reasoning Pattern

RL training with KeyChain data induces a structured plan–retrieve–reason–recheck reasoning pattern. Below we contrast two models on the same question.


Figure 4: Comparison of reasoning patterns. (a) KeyChain-trained model exhibits structured plan-retrieve-reason-recheck. (b) Non-KeyChain model shows mixed reasoning-with-retrieval.

Trajectory Examples

Question: Gary L. Bennett was a part of the space missions that have a primary destination of what celestial body?

KeyChain-Trained Model (Correct Answer: the Sun)

[Plan] To determine the primary destination of the space missions that Gary L. Bennett was a part of, we need to identify the space missions he worked on from the given passages and then find the primary destination of those missions.

[Retrieve] From Passage 1: "His professional career has included work on the Voyager, Galileo, and Ulysses space missions."

[Reason] Now, we need to find the primary destination of these missions. Let's check Passage 5 for information about the Ulysses mission:

[Retrieve] "Ulysses was a robotic space probe whose primary mission was to orbit the Sun and study it at all latitudes."

Therefore, the primary destination of the Ulysses space mission is the Sun.

[Recheck] Although we should check the other missions mentioned (Voyager and Galileo) for completeness... the question specifically asks about the space missions that Gary L. Bennett was a part of, and the Ulysses mission is clearly stated to have the Sun as its primary destination.

Answer: the Sun ✓

Non-KeyChain Model (Incorrect Answer: Jupiter)

[No systematic decomposition]

First, we need to identify the space missions mentioned in the passage that Gary L. Bennett was involved with... Bennett was involved in several space missions, including the Voyager, Galileo, and Ulysses space missions.

From these statements, we can deduce that Bennett was involved in missions to Jupiter, Saturn, Uranus, Neptune, and the Sun. However, the primary destinations mentioned directly in relation to Bennett's roles are Jupiter for the Galileo and Ulysses missions...

[Premature conclusion without recheck] Given the specific mention of Bennett's role in the Galileo mission to Jupiter... Jupiter stands out as a primary destination.

Answer: Jupiter ✗

Needle-in-a-Haystack Retrieval

LoongRL achieves perfect 100% retrieval accuracy on the Needle-in-a-Haystack benchmark, while other models struggle at various context depths.

Figure 5: Needle-in-a-haystack heatmaps for LoongRL-7B (100% accuracy), R1-Distill-Qwen-7B, Qwen2.5-7B-Instruct, and QwenLong-L1-32B.

Retrieval Trajectory Example

On the RULER variable tracking task, the KeyChain-trained model performs step-by-step, human-readable retrieval.

LoongRL (with KeyChain): Long-Context Retrieval on RULER VT

<think> Let's break it down step by step:

1. VAR QPE = 64886 — assigns 64886 to QPE.

2. VAR SEJ = VAR QPE — assigns QPE (64886) to SEJ.

3. VAR ZQO = VAR SEJ — assigns SEJ (64886) to ZQO.

4-5. ... (chain continues)

6. VAR SGMLJ = 92018

7. VAR PBDME = VAR SGMLJ — assigns 92018 to PBDME.

8. VAR EANSM = VAR PBDME — assigns 92018 to EANSM.

9. VAR QPKBX = VAR EANSM — assigns 92018 to QPKBX.

10. VAR YYZJM = VAR EANSM — assigns 92018 to YYZJM. </think>

Answer: SGMLJ, PBDME, EANSM, QPKBX, YYZJM ✓

Without KeyChain: Long-Context Retrieval on RULER VT

<think> The reasoning process involves identifying the lines where the value 92018 is first assigned to a variable, which is "VAR PBDME = 92018". Then, tracking this value through all subsequent assignments... The variables that inherit this value are EANSM, YYZJM, SGMLJ, and FAI. </think>

Answer: PBDME, EANSM, YYZJM, SGMLJ, FAI ✗ (conflates chains, includes wrong variable FAI)
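
For reference, the RULER variable-tracking task reduces to resolving a chain of assignments and collecting every variable that ultimately holds the queried value. The sketch below illustrates this with a simplified, hypothetical statement format (not the exact RULER prompt); run on statements echoing the trajectory above, it reproduces the correct answer.

def trace_variable_chains(statements, target_value):
    """Resolve chained assignments ('VAR X = 12345' or 'VAR X = VAR Y') and
    return every variable that ultimately holds target_value."""
    values = {}
    for stmt in statements:
        lhs, rhs = (s.strip() for s in stmt.split("=", 1))
        name = lhs.split()[1]                          # variable being assigned
        if rhs.startswith("VAR"):
            values[name] = values.get(rhs.split()[1])  # copy the source value
        else:
            values[name] = rhs                         # literal assignment
    return [v for v, val in values.items() if val == target_value]

# Simplified statements mirroring the trajectory above (distractor chain included).
statements = [
    "VAR QPE = 64886", "VAR SEJ = VAR QPE", "VAR ZQO = VAR SEJ",
    "VAR SGMLJ = 92018", "VAR PBDME = VAR SGMLJ", "VAR EANSM = VAR PBDME",
    "VAR QPKBX = VAR EANSM", "VAR YYZJM = VAR EANSM",
]
print(trace_variable_chains(statements, "92018"))
# ['SGMLJ', 'PBDME', 'EANSM', 'QPKBX', 'YYZJM']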

BibTeX

@misc{wang2025loongrlreinforcementlearningadvanced,
      title={LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts},
      author={Siyuan Wang and Gaokai Zhang and Li Lyna Zhang and Ning Shang and Fan Yang and Dongyao Chen and Mao Yang},
      year={2025},
      eprint={2510.19363},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.19363},
}