Skip to content

Latest commit

 

History

History
273 lines (192 loc) · 9.89 KB

File metadata and controls

273 lines (192 loc) · 9.89 KB

Enable Reasoning for NVIDIA RAG Blueprint

The NVIDIA RAG Blueprint supports reasoning capabilities that allow models to "think through" complex questions before answering. This feature improves accuracy for challenging queries but increases response latency due to additional reasoning tokens.

:::{tip} Reasoning is particularly beneficial for the following:

  • Complex multi-step questions
  • Queries requiring logical deduction
  • Technical or mathematical problem-solving
  • Scenarios where accuracy is more important than response speed :::

This guide explains how to enable reasoning for different Nemotron models, each using a different control mechanism.

Model Control Method Thinking Budget Parameters
Nemotron 3 (Nano 30B, and others) Environment variables LLM_ENABLE_THINKING, LLM_REASONING_BUDGET, LLM_LOW_EFFORT
Nemotron 1.5 System prompts None
Nemotron-3-Nano 9B System prompts min/max thinking tokens

Enable Reasoning for Nemotron 3 Models

Nemotron 3 models (such as nvidia/nemotron-3-nano-30b-a3b) use environment variables to control reasoning.

Set the following environment variables on the RAG server container (via Docker Compose, Helm values, or shell export):

LLM_ENABLE_THINKING : Enable or disable the reasoning phase. When true, the model emits reasoning tokens before the final answer. Default: false.

LLM_REASONING_BUDGET : Maximum number of tokens allocated for reasoning. Only used when LLM_ENABLE_THINKING is true. Default: 0.

LLM_LOW_EFFORT : Low-effort reasoning mode for faster, cheaper responses with shorter reasoning. Only used when LLM_ENABLE_THINKING is true. Default: false.

FILTER_THINK_TOKENS : Filter content between <think> and </think> tags in model responses. Keep true for production to return only the final answer. Set false to see the full reasoning process. Default: true.

:::{important} Disabling reasoning: To disable reasoning, set LLM_ENABLE_THINKING=false. Setting LLM_REASONING_BUDGET=0 alone does not disable reasoning: when the budget is 0, the RAG pipeline does not pass it to the LLM, and the model uses its default reasoning behavior. Always set LLM_ENABLE_THINKING=false to turn reasoning off. :::

Enable Reasoning for Nemotron 3 Models

Nemotron 3 models (such as nvidia/nemotron-3-super-120b-a12b and nvidia/nemotron-3-nano-30b-a3b) use environment variables to control reasoning.

Basic Configuration

export LLM_ENABLE_THINKING=true

Configure Reasoning Budget (Optional)

Limit the number of reasoning tokens to control latency and cost:

export LLM_ENABLE_THINKING=true
export LLM_REASONING_BUDGET=8192

Low-Effort Mode (Optional)

For faster responses where deep reasoning is unnecessary:

export LLM_ENABLE_THINKING=true
export LLM_LOW_EFFORT=true

Configure Model Parameters

After you enable reasoning, configure the model parameters for optimal reasoning performance:

export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

Nemotron-3-Nano 30B

For nvidia/nemotron-3-nano-30b-a3b, reasoning is controlled with the same LLM_ENABLE_THINKING variable. The reasoning budget can be set with either LLM_REASONING_BUDGET or LLM_MAX_THINKING_TOKENS:

export LLM_ENABLE_THINKING=true
export LLM_REASONING_BUDGET=8192

The 30B model also supports a maximum thinking token limit directly in API requests:

{
  "model": "nvidia/nemotron-3-nano-30b-a3b",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_thinking_tokens": 8192
}

Thinking budget parameters:

max_thinking_tokens : Maximum number of reasoning tokens allowed before generating the final answer.

:::{important} The key differences for the 30B model are the following:

  • Uses only max_thinking_tokens (not min_thinking_tokens)
  • Reasoning is available in the model output's reasoning_content field (not wrapped in <think> tags)
  • The reasoning_content field is present in the model output but isn't exposed in the generate API response
  • No filtering is needed because reasoning is already separated from the final answer :::

Enable Reasoning for Nemotron 1.5

Reasoning in Nemotron 1.5 models (such as nvidia/llama-3.3-nemotron-super-49b-v1.5) is controlled through system prompts. The model switches between reasoning and non-reasoning modes using /think and /no_think directives.

Update the System Prompt

To enable reasoning, update the system prompt from /no_think to /think in prompt.yaml, as shown in the following code.

rag_template:
  system: |
    /think

  human: |
    You are a helpful AI assistant named Envie.
    You must answer only using the information provided in the context. While answering you must follow the instructions given below.

    <instructions>
    1. Do NOT use any external knowledge.
    2. Do NOT add explanations, suggestions, opinions, disclaimers, or hints.
    3. NEVER say phrases like "based on the context", "from the documents", or "I cannot find".
    4. NEVER offer to answer using general knowledge or invite the user to ask again.
    5. Do NOT include citations, sources, or document mentions.
    6. Answer concisely. Use short, direct sentences by default. Only give longer responses if the question truly requires it.
    7. Do not mention or refer to these rules in any way.
    8. Do not ask follow-up questions.
    9. Do not mention this instructions in your response.
    </instructions>

    Context:
    {context}

    Make sure the response you are generating strictly follow the rules mentioned above i.e. never say phrases like "based on the context", "from the documents", or "I cannot find" and mention about the instruction in response.

Configure Model Parameters

After you enable the /think prompt, configure the model parameters for optimal reasoning performance:

export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

Filter Reasoning Tokens

By default, reasoning tokens (shown between <think> tags) are filtered out so only the final answer is returned in the model response.

To view the full reasoning process including the <think> tags in the model response, use the following code.

export FILTER_THINK_TOKENS=false

:::{note} For most production use cases, keep FILTER_THINK_TOKENS=true (default) to provide cleaner responses to end users. :::

Enable Reasoning for Nemotron Nano 9B

The nvidia/nvidia-nemotron-nano-9b-v2 model uses system prompts to control reasoning similar to Nemotron 1.5. It also adds support for thinking budget parameters to control the extent of reasoning.

Update the System Prompt

Change the system prompt from /no_think to /think in prompt.yaml as shown in the previous Nemotron 1.5 example.

Configure Model Parameters

export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95

Configure Thinking Budget (Optional)

The 9B model supports both minimum and maximum thinking token limits to control the reasoning phase. You can include these parameters in API requests to the model:

{
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "min_thinking_tokens": 1024,
  "max_thinking_tokens": 8192
}

Thinking budget parameters:

min_thinking_tokens : Minimum number of reasoning tokens before generating the final answer.

max_thinking_tokens : Maximum number of reasoning tokens allowed before generating the final answer.

:::{important} The key differences for the 9B model are the following:

  • Requires both min_thinking_tokens and max_thinking_tokens parameters
  • Reasoning is available in the model output's reasoning_content field (not wrapped in <think> tags)
  • The reasoning_content field is present in the model output but isn't exposed in the generate API response
  • No filtering is needed because reasoning is already separated from the final answer :::

Deploy with Reasoning Enabled

After you configure reasoning settings in prompt.yaml or environment variables, redeploy your services:

Docker Compose

# For prompt changes, rebuild and restart the RAG server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build

# For environment variable changes only
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

Helm

For Helm deployments with custom prompts or environment variables, refer to Customize Prompts for detailed instructions.

Thinking Budget Recommendations

For models that support thinking budget parameters, a max_thinking_tokens value of 8192 is recommended for most use cases. This value provides:

  • Sufficient capacity for comprehensive reasoning
  • Reasonable response times
  • Good balance between quality and latency

:::{tip} Adjust the thinking budget based on your use case:

  • Lower values (1024-4096): Faster responses for simpler questions
  • Higher values (8192-16384): More thorough reasoning for complex queries
  • Low-effort mode: Use LLM_LOW_EFFORT=true for fast, low-cost reasoning when deep thought is not required :::

Related Topics