The NVIDIA RAG Blueprint supports reasoning capabilities that allow models to "think through" complex questions before answering. This feature improves accuracy for challenging queries but increases response latency due to additional reasoning tokens.
:::{tip}
Reasoning is particularly beneficial for the following:

- Complex multi-step questions
- Queries requiring logical deduction
- Technical or mathematical problem-solving
- Scenarios where accuracy is more important than response speed
:::
This guide explains how to enable reasoning for different Nemotron models, each using a different control mechanism.
| Model | Control Method | Thinking Budget Parameters |
|---|---|---|
| Nemotron 3 (Nano 30B, and others) | Environment variables | LLM_ENABLE_THINKING, LLM_REASONING_BUDGET, LLM_LOW_EFFORT |
| Nemotron 1.5 | System prompts | None |
| Nemotron Nano 9B v2 | System prompts | min_thinking_tokens, max_thinking_tokens |
Nemotron 3 models (such as nvidia/nemotron-3-nano-30b-a3b) use environment variables to control reasoning.
Set the following environment variables on the RAG server container (via Docker Compose, Helm values, or shell export):
LLM_ENABLE_THINKING
: Enable or disable the reasoning phase. When true, the model emits reasoning tokens before the final answer. Default: false.
LLM_REASONING_BUDGET
: Maximum number of tokens allocated for reasoning. Only used when LLM_ENABLE_THINKING is true. Default: 0.
LLM_LOW_EFFORT
: Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when LLM_ENABLE_THINKING is true. Default: false.
FILTER_THINK_TOKENS
: Filter content between <think> and </think> tags in model responses. Keep true for production to return only the final answer. Set false to see the full reasoning process. Default: true.
:::{important}
Disabling reasoning: To disable reasoning, set LLM_ENABLE_THINKING=false. Setting LLM_REASONING_BUDGET=0 alone does not disable reasoning: when the budget is 0, the RAG pipeline does not pass it to the LLM, and the model uses its default reasoning behavior. Always set LLM_ENABLE_THINKING=false to turn reasoning off.
:::
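The interaction between these variables can be sketched in Python. This is an illustrative sketch only, not the RAG server's actual code: the function name `reasoning_settings`, the returned keys (`enable_thinking`, `reasoning_budget`, `low_effort`), and the set of strings accepted as true are all assumptions. Only the environment variable names, their defaults, and the budget-of-0 behavior come from the descriptions above.

```python
import os

# Assumption: which strings count as "true" is illustrative only.
TRUE_VALUES = {"1", "true", "yes"}


def reasoning_settings(env=None):
    """Derive illustrative reasoning settings from environment variables.

    Hypothetical helper: the returned keys are assumptions for illustration,
    not the RAG server's real request parameters.
    """
    env = os.environ if env is None else env
    enabled = env.get("LLM_ENABLE_THINKING", "false").strip().lower() in TRUE_VALUES
    settings = {"enable_thinking": enabled}
    if not enabled:
        # Budget and low-effort mode are only used when thinking is enabled.
        return settings

    budget = int(env.get("LLM_REASONING_BUDGET", "0"))
    if budget > 0:
        # A budget of 0 is not passed on, so the model keeps its default
        # reasoning behavior instead of being switched off.
        settings["reasoning_budget"] = budget

    if env.get("LLM_LOW_EFFORT", "false").strip().lower() in TRUE_VALUES:
        settings["low_effort"] = True
    return settings
```

In this sketch, `LLM_ENABLE_THINKING=true` with `LLM_REASONING_BUDGET=0` yields settings with no budget entry at all, which is why a budget of 0 cannot be used as an off switch.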
For Nemotron 3 models such as nvidia/nemotron-3-super-120b-a12b and nvidia/nemotron-3-nano-30b-a3b, enable reasoning with the following setting:

export LLM_ENABLE_THINKING=true

Limit the number of reasoning tokens to control latency and cost:
export LLM_ENABLE_THINKING=true
export LLM_REASONING_BUDGET=8192

For faster responses where deep reasoning is unnecessary:
export LLM_ENABLE_THINKING=true
export LLM_LOW_EFFORT=true

After you enable reasoning, configure the model parameters for optimal reasoning performance:
export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

For nvidia/nemotron-3-nano-30b-a3b, reasoning is controlled with the same LLM_ENABLE_THINKING variable. The reasoning budget can be set with either LLM_REASONING_BUDGET or LLM_MAX_THINKING_TOKENS:
export LLM_ENABLE_THINKING=true
export LLM_REASONING_BUDGET=8192

The 30B model also supports a maximum thinking token limit directly in API requests:
{
  "model": "nvidia/nemotron-3-nano-30b-a3b",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_thinking_tokens": 8192
}

Thinking budget parameters:
max_thinking_tokens
: Maximum number of reasoning tokens allowed before generating the final answer.
:::{important}
The key differences for the 30B model are the following:

- Uses only max_thinking_tokens (not min_thinking_tokens)
- Reasoning is available in the model output's reasoning_content field (not wrapped in <think> tags)
- The reasoning_content field is present in the model output but isn't exposed in the generate API response
- No filtering is needed because reasoning is already separated from the final answer
:::
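As a rough illustration of why no filtering is needed for the 30B model, the sketch below reads the reasoning and the final answer from separate fields. It assumes an OpenAI-compatible chat-completion response shape (`choices[0].message`), and the sample response is made up for illustration; check the actual model output before relying on this layout. The helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(response):
    """Return (reasoning, answer) from a chat-completion-style response dict.

    Assumes an OpenAI-compatible shape: choices[0].message with a "content"
    field and an optional "reasoning_content" field.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Made-up sample response for illustration only.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user asks for the capital of France.",
                "content": "Paris",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample)
print(answer)
```

Because the answer already lives in its own field, nothing has to be stripped from it before it is returned to the user.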
Reasoning in Nemotron 1.5 models (such as nvidia/llama-3.3-nemotron-super-49b-v1.5) is controlled through system prompts. The model switches between reasoning and non-reasoning modes using /think and /no_think directives.
To enable reasoning, update the system prompt from /no_think to /think in prompt.yaml, as shown in the following code.
rag_template:
  system: |
    /think
  human: |
    You are a helpful AI assistant named Envie.
    You must answer only using the information provided in the context. While answering you must follow the instructions given below.
    <instructions>
    1. Do NOT use any external knowledge.
    2. Do NOT add explanations, suggestions, opinions, disclaimers, or hints.
    3. NEVER say phrases like "based on the context", "from the documents", or "I cannot find".
    4. NEVER offer to answer using general knowledge or invite the user to ask again.
    5. Do NOT include citations, sources, or document mentions.
    6. Answer concisely. Use short, direct sentences by default. Only give longer responses if the question truly requires it.
    7. Do not mention or refer to these rules in any way.
    8. Do not ask follow-up questions.
    9. Do not mention this instructions in your response.
    </instructions>
    Context:
    {context}
    Make sure the response you are generating strictly follow the rules mentioned above i.e. never say phrases like "based on the context", "from the documents", or "I cannot find" and mention about the instruction in response.

After you enable the /think prompt, configure the model parameters for optimal reasoning performance:
export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

By default, reasoning tokens (shown between <think> tags) are filtered out so only the final answer is returned in the model response.
To view the full reasoning process including the <think> tags in the model response, use the following code.
export FILTER_THINK_TOKENS=false

:::{note}
For most production use cases, keep FILTER_THINK_TOKENS=true (default) to provide cleaner responses to end users.
:::
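To make the effect of FILTER_THINK_TOKENS concrete, here is a minimal sketch of this kind of filtering on a complete, non-streamed response. It is not the RAG server's implementation: the function name `filter_think_tokens` and its `enabled` flag are hypothetical, and a streaming response would need incremental handling that this sketch does not attempt.

```python
import re

# Matches a <think>...</think> block, including newlines inside it,
# plus any whitespace that directly follows the closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def filter_think_tokens(text, enabled=True):
    """Remove <think>...</think> blocks when filtering is enabled."""
    if not enabled:
        # Equivalent in spirit to FILTER_THINK_TOKENS=false: keep everything.
        return text
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nFrance's capital is Paris.\n</think>\nParis"
print(filter_think_tokens(raw))
```

With filtering enabled only the final answer remains; with `enabled=False` the text is returned unchanged, including the `<think>` tags.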
The nvidia/nvidia-nemotron-nano-9b-v2 model uses system prompts to control reasoning similar to Nemotron 1.5. It also adds support for thinking budget parameters to control the extent of reasoning.
Change the system prompt from /no_think to /think in prompt.yaml as shown in the previous Nemotron 1.5 example. Then configure the model parameters for optimal reasoning performance:

export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

The 9B model supports both minimum and maximum thinking token limits to control the reasoning phase. You can include these parameters in API requests to the model:
{
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "min_thinking_tokens": 1024,
  "max_thinking_tokens": 8192
}

Thinking budget parameters:
min_thinking_tokens
: Minimum number of reasoning tokens before generating the final answer.
max_thinking_tokens
: Maximum number of reasoning tokens allowed before generating the final answer.
:::{important}
The key differences for the 9B model are the following:

- Requires both min_thinking_tokens and max_thinking_tokens parameters
- Reasoning is available in the model output's reasoning_content field (not wrapped in <think> tags)
- The reasoning_content field is present in the model output but isn't exposed in the generate API response
- No filtering is needed because reasoning is already separated from the final answer
:::
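Because the 9B model requires both limits, it can help to build the request body in one place and check the pair before sending it. The following is a minimal sketch: the helper name `build_thinking_request` is hypothetical, and the check that the minimum does not exceed the maximum is an assumption added for illustration, not documented model behavior. The field names match the request example above.

```python
import json


def build_thinking_request(model, question, min_thinking_tokens, max_thinking_tokens):
    """Build a request body that carries both thinking budget parameters."""
    # Assumed sanity check: a minimum above the maximum cannot be satisfied.
    if min_thinking_tokens < 0 or max_thinking_tokens < min_thinking_tokens:
        raise ValueError("expected 0 <= min_thinking_tokens <= max_thinking_tokens")
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "min_thinking_tokens": min_thinking_tokens,
        "max_thinking_tokens": max_thinking_tokens,
    }


payload = build_thinking_request(
    "nvidia/nvidia-nemotron-nano-9b-v2",
    "What is the capital of France?",
    min_thinking_tokens=1024,
    max_thinking_tokens=8192,
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be used as the body of a request to the model endpoint.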
After you configure reasoning settings in prompt.yaml or environment variables, redeploy your services:
# For prompt changes, rebuild and restart the RAG server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
# For environment variable changes only
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

For Helm deployments with custom prompts or environment variables, refer to Customize Prompts for detailed instructions.
For models that support thinking budget parameters, a max_thinking_tokens value of 8192 is recommended for most use cases. This value provides:
- Sufficient capacity for comprehensive reasoning
- Reasonable response times
- Good balance between quality and latency
:::{tip}
Adjust the thinking budget based on your use case:

- Lower values (1024-4096): Faster responses for simpler questions
- Higher values (8192-16384): More thorough reasoning for complex queries
- Low-effort mode: Use LLM_LOW_EFFORT=true for fast, low-cost reasoning when deep thought is not required
:::
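One way to apply this guidance to Nemotron 3 deployments is to keep a few named profiles and generate the matching `export` lines from them. This is only a sketch: the profile names are hypothetical, and the specific values (2048, 8192, 16384) are examples picked from the ranges above rather than tuned recommendations.

```python
# Hypothetical profile names; budgets are example values from the ranges above.
PROFILES = {
    "fast": {"LLM_ENABLE_THINKING": "true", "LLM_LOW_EFFORT": "true"},
    "simple": {"LLM_ENABLE_THINKING": "true", "LLM_REASONING_BUDGET": "2048"},
    "balanced": {"LLM_ENABLE_THINKING": "true", "LLM_REASONING_BUDGET": "8192"},
    "thorough": {"LLM_ENABLE_THINKING": "true", "LLM_REASONING_BUDGET": "16384"},
}


def export_lines(profile):
    """Return shell export lines for the chosen profile."""
    settings = PROFILES[profile]
    return [f"export {name}={value}" for name, value in settings.items()]


for line in export_lines("balanced"):
    print(line)
```

Keeping the profiles in one place makes it easier to compare latency and answer quality across budgets before settling on a value.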