Alibaba Group has introduced QwenLong-L1, a new framework that enables large language models (LLMs) to reason over extremely long inputs. This development could unlock a new wave of enterprise applications that require models to understand and draw insights from extensive documents such as detailed corporate filings, lengthy financial statements, or complex legal contracts.
The challenge of long-form reasoning for AI
Recent advances in large reasoning models (LRMs), particularly through reinforcement learning (RL), have significantly improved their problem-solving capabilities. Research shows that when trained with RL fine-tuning, LRMs acquire skills similar to human “slow thinking,” where they develop sophisticated strategies to tackle complex tasks.
However, these improvements are primarily seen when models work with relatively short pieces of text, typically around 4,000 tokens. The ability of these models to scale their reasoning to much longer contexts (e.g., 120,000 tokens) remains a major challenge. Such long-form reasoning requires a robust understanding of the entire context and the ability to perform multi-step analysis. “This limitation poses a significant barrier to practical applications requiring interaction with external knowledge, such as deep research, where LRMs must collect and process information from knowledge-intensive environments,” the developers of QwenLong-L1 write in their paper.
The researchers formalize these challenges into the concept of “long-context reasoning RL.” Unlike short-context reasoning, which often relies on knowledge already stored within the model, long-context reasoning RL requires models to accurately retrieve and ground relevant information from lengthy inputs. Only then can they generate chains of reasoning based on this incorporated information.
Training models for this through RL is tricky and often results in inefficient learning and unstable optimization processes. Models struggle to converge on good solutions or lose their ability to explore diverse reasoning paths.
QwenLong-L1: A multi-stage approach
QwenLong-L1 is a reinforcement learning framework designed to help LRMs transition from proficiency with short texts to robust generalization across long contexts. The framework enhances existing short-context LRMs through a carefully structured, multi-stage process:
Warm-up Supervised Fine-Tuning (SFT): The model first undergoes an SFT phase, where it is trained on examples of long-context reasoning. This stage establishes a solid foundation, enabling the model to ground information accurately from long inputs. It helps develop fundamental capabilities in understanding context, generating logical reasoning chains, and extracting answers.
Curriculum-Guided Phased RL: At this stage, the model is trained through multiple phases, with the target length of the input documents gradually increasing. This systematic, step-by-step approach helps the model stably adapt its reasoning strategies from shorter to progressively longer contexts. It avoids the instability often seen when models are abruptly trained on very long texts.
Difficulty-Aware Retrospective Sampling: The final training stage incorporates challenging examples from the preceding training phases, ensuring the model continues to learn from the hardest problems. This prioritizes difficult instances and encourages the model to explore more diverse and complex reasoning paths.
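
To make the last two stages concrete, here is a minimal Python sketch of a length-based curriculum with retrospective sampling. It is an illustration under stated assumptions, not the QwenLong-L1 implementation: the `Example` type, the phase caps of 20,000 and 60,000 tokens, and the choice to measure difficulty as one minus the average reward are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    doc_tokens: int    # input document length in tokens
    avg_reward: float  # mean reward from earlier rollouts, in [0, 1]


def difficulty(ex: Example) -> float:
    # Assumption: a lower average reward means a harder example.
    return 1.0 - ex.avg_reward


def build_phase(pool, lo, hi, n_retro, rng):
    """Return the training set for a phase covering lengths in (lo, hi]."""
    current = [ex for ex in pool if lo < ex.doc_tokens <= hi]
    earlier = [ex for ex in pool if ex.doc_tokens <= lo]
    if not earlier or n_retro == 0:
        return current
    # Retrospective sampling: draw earlier examples with probability
    # proportional to difficulty (with replacement, for simplicity).
    weights = [difficulty(ex) + 1e-6 for ex in earlier]
    retro = rng.choices(earlier, weights=weights, k=n_retro)
    return current + retro


def curriculum(pool, caps, n_retro, seed=0):
    """Build one training set per phase, with increasing length caps."""
    rng = random.Random(seed)
    phases, lo = [], 0
    for hi in caps:
        phases.append(build_phase(pool, lo, hi, n_retro, rng))
        lo = hi
    return phases


pool = [
    Example(1_000, 0.9),
    Example(3_000, 0.1),
    Example(15_000, 0.5),
    Example(50_000, 0.2),
]
phases = curriculum(pool, caps=[20_000, 60_000], n_retro=2)
```

In this sketch the first phase holds only documents of up to 20,000 tokens, while the second holds the longer documents plus two harder examples drawn again from the first phase.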

Beyond this structured training, QwenLong-L1 also uses a distinct reward system. While training for short-context reasoning tasks often relies on strict rule-based rewards (e.g., a correct answer in a math problem), QwenLong-L1 employs a hybrid reward mechanism. This combines rule-based verification, which ensures precision by checking for strict adherence to correctness criteria, with an “LLM-as-a-judge.” This judge model compares the meaning of the generated answer with the ground truth, allowing for more flexibility and better handling of the diverse ways correct answers can be expressed when dealing with long, nuanced documents.
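
A rough Python sketch of such a hybrid reward is shown below. The article says only that the two signals are combined, so taking the maximum of the two scores is an assumption here. The `toy_judge` function is a hypothetical stand-in for a real call to a judge model, and `normalize`, `rule_reward` and `hybrid_reward` are illustrative names.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return re.sub(r"\s+", " ", text.strip().lower())


def rule_reward(pred: str, gold: str) -> float:
    # Strict rule-based check: exact match after normalization.
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def toy_judge(pred: str, gold: str) -> float:
    # Hypothetical placeholder for an LLM judge: accepts any answer
    # that contains the ground truth. A real judge would be a model call.
    return 1.0 if normalize(gold) in normalize(pred) else 0.0


def hybrid_reward(
    pred: str, gold: str, judge: Callable[[str, str], float]
) -> float:
    r_rule = rule_reward(pred, gold)
    if r_rule == 1.0:
        return 1.0  # exact match, so the judge call can be skipped
    # Assumption: combine the two signals by taking the larger one.
    return max(r_rule, judge(pred, gold))


exact = hybrid_reward("  42 ", "42", toy_judge)
paraphrase = hybrid_reward("The answer is $3.2 million", "$3.2 million", toy_judge)
wrong = hybrid_reward("no idea", "42", toy_judge)
```

Here the paraphrased answer fails the strict rule check but is still accepted by the judge, which is the kind of flexibility the hybrid design aims for.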
Putting QwenLong-L1 to the test
The Alibaba team evaluated QwenLong-L1 using document question-answering (DocQA) as the primary task. This scenario is highly relevant to enterprise needs, where AI must understand dense documents to answer complex questions.
Experimental results across seven long-context DocQA benchmarks demonstrated QwenLong-L1’s capabilities. Notably, the QWENLONG-L1-32B model (based on DeepSeek-R1-Distill-Qwen-32B) achieved performance comparable to Anthropic’s Claude-3.7 Sonnet Thinking, and outperformed models like OpenAI’s o3-mini and Qwen3-235B-A22B. The smaller QWENLONG-L1-14B model also outperformed Google’s Gemini 2.0 Flash Thinking and Qwen3-32B.

An important finding for real-world applications is that RL training leads the model to develop specialized long-context reasoning behaviors. The paper notes that models trained with QwenLong-L1 become better at “grounding” (linking answers to specific parts of a document), “subgoal setting” (breaking down complex questions), “backtracking” (recognizing and correcting their own mistakes mid-reasoning), and “verification” (double-checking their answers).
For instance, while a base model might get sidetracked by irrelevant details in a financial document or get stuck in a loop of over-analyzing unrelated information, the QwenLong-L1-trained model demonstrated an ability to engage in effective self-reflection. It could successfully filter out these distractor details, backtrack from incorrect paths, and arrive at the correct answer.
Techniques like QwenLong-L1 could significantly expand the utility of AI in the enterprise. Potential applications include legal tech (analyzing thousands of pages of legal documents), finance (deep research on annual reports and financial filings for risk assessment or investment opportunities) and customer service (analyzing long customer interaction histories to provide more informed support). The researchers have released the code for the QwenLong-L1 recipe and the weights for the trained models.