add support for qwen3.5 vl model #693

Merged
RobotSail merged 10 commits into instructlab:main from RobotSail:add-qwen3-vl-support on Mar 6, 2026

Conversation

@RobotSail
Member

@RobotSail RobotSail commented Mar 3, 2026

This PR adds support for the Qwen3.5 VL model.

Summary by CodeRabbit

  • New Features

    • Automatic detection, direct loading, and extraction of text backbones from vision‑language models for text‑only training.
    • Option to prefer local optimized kernels when available.
    • Improved tokenizer/text‑config reconciliation for consistent vocab and token‑ID alignment.
    • Mixed attention handling for models with vision + text components.
  • Bug Fixes

    • Better handling of M‑RoPE models with automatic attention selection and SDPA fallback.
    • Improved validation and clearer errors when multiple module targets are specified.

@coderabbitai

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Collects multiple no-split module classes for FSDP wrap policies, adds VLM detection/loading/extraction utilities, detects M‑RoPE to adjust FlashAttention usage (preferring SDPA when needed), and exposes Model._use_local_mamba_kernels() to attempt swapping in local mamba kernels.

Changes

Cohort / File(s) / Summary

  • FSDP wrap-policy (src/instructlab/training/accelerator.py)
    get_fsdp_config now resolves every name in self.model._no_split_modules via get_module_class_from_name, collects the resolved classes into a set for transformer_layer_cls, and raises ValueError if none resolve.
  • Model loading & attention kernels (src/instructlab/training/model.py)
    Introduces VLM-aware loading paths (detect/extract or direct-load text backbones), refactors FlashAttention selection to consider M‑RoPE and timm vision towers (favoring SDPA or mixed/eager where appropriate), retains the GPT‑OSS kernel logic, and adds Model._use_local_mamba_kernels() to attempt swapping in local mamba kernels with logging and fallbacks.
  • VLM utilities, new (src/instructlab/training/vlm_utils.py)
    New module providing detection/loading/extraction helpers: is_vlm_with_causal_lm, is_vlm_for_direct_loading, load_vlm_for_text_training, extract_causal_lm_from_vlm, _find_text_backbone, has_mrope, needs_sdpa, has_timm_vision_tower, and related helpers for determining attention/kernel compatibility.
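The detection helpers listed above mostly reduce to config inspection. A minimal, hypothetical sketch of the idea behind is_vlm_with_causal_lm (attribute names follow common HF VLM configs; the function name and heuristic here are illustrative, not the module's actual implementation):

```python
from types import SimpleNamespace

def looks_like_vlm_with_text_backbone(config) -> bool:
    # Heuristic: a VLM wrapping a causal LM typically exposes a nested
    # text_config plus a *ForConditionalGeneration architecture name.
    archs = getattr(config, "architectures", None) or []
    has_text_cfg = getattr(config, "text_config", None) is not None
    return has_text_cfg and any(a.endswith("ForConditionalGeneration") for a in archs)

# Stand-ins for configs loaded via AutoConfig.from_pretrained(...)
vlm_cfg = SimpleNamespace(
    architectures=["Qwen3VLForConditionalGeneration"],
    text_config=SimpleNamespace(vocab_size=151936),
)
text_only_cfg = SimpleNamespace(architectures=["LlamaForCausalLM"], text_config=None)
```

A config that fails either test falls through to the standard AutoModelForCausalLM path.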

Sequence Diagram(s)

sequenceDiagram
  participant CLI as Client
  participant Model as ModelLoader
  participant VLM as vlm_utils
  participant HF as HuggingFaceAuto
  participant Kernels as KernelSelector

  CLI->>Model: load_model(model_path, load_kwargs)
  Model->>VLM: is_vlm_with_causal_lm(model_path)?
  alt VLM wraps CausalLM
    VLM-->>Model: True
    Model->>VLM: extract_causal_lm_from_vlm(model_path, load_kwargs)
    VLM-->>Model: PreTrained CausalLM
  else VLM direct-load candidate
    Model->>VLM: is_vlm_for_direct_loading(model_path)?
    alt direct-load
      Model->>VLM: load_vlm_for_text_training(model_path, load_kwargs)
      VLM-->>Model: PreTrained model
    else standard
      Model->>HF: AutoModelForCausalLM.from_pretrained(model_path, **load_kwargs)
      HF-->>Model: PreTrained model
    end
  end
  Model->>VLM: has_mrope(model_path)?
  alt M‑RoPE detected
    Model->>Kernels: disable FlashAttention2 -> use SDPA
  else no M‑RoPE
    Model->>Kernels: select flash/eager per GPU & GPT‑OSS rules
  end
  Kernels->>Model: optionally _use_local_mamba_kernels() to swap kernels
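The same flow expressed as a (hypothetical) Python sketch, with the vlm_utils probes and HF loaders injected as stand-ins so the dispatch order is explicit:

```python
def load_for_text_training(model_path, load_kwargs, probes, loaders):
    # Try the VLM-specific paths first, then fall back to the standard
    # AutoModelForCausalLM route; pick the attention backend afterwards.
    if probes["is_vlm_with_causal_lm"](model_path):
        model = loaders["extract_causal_lm_from_vlm"](model_path, load_kwargs)
    elif probes["is_vlm_for_direct_loading"](model_path):
        model = loaders["load_vlm_for_text_training"](model_path, load_kwargs)
    else:
        model = loaders["auto_causal_lm"](model_path, load_kwargs)
    # M-RoPE models prefer SDPA; everything else may use FlashAttention.
    attn = "sdpa" if probes["has_mrope"](model_path) else "flash_attention_2"
    return model, attn
```

The dict-of-callables shape is only for the sketch; the real code calls the vlm_utils functions directly.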

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through configs, found backbones tucked away,

M‑RoPE made Flash yawn, so SDPA saved the day.
I swapped some kernels, nudged the model free,
Ate a carrot, hummed a tune, and tipped my floppy knee. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 76.47%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title 'add support for qwen3.5 vl model' directly aligns with the PR's main objective of adding support for the Qwen3.5 VL (Vision-Language) model. While the changes are broader than the single model, the title accurately captures the primary driver of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci-failure label Mar 3, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/accelerator.py`:
- Around line 148-157: The code currently raises a ValueError when no classes
are resolved from self.model._no_split_modules; instead, try resolving names
against the underlying HF model (e.g., check getattr(self.model, "model", None)
or getattr(self.model, "hf_model", None) and call get_module_class_from_name
against that if the first pass yields nothing), and if still no layer_classes,
do not raise—instead emit a clear warning (use warnings.warn or the module
logger) that resolution failed and that FSDP will fall back to a safe wrap
policy, then continue with an empty layer_classes so the downstream FSDP wrap
logic uses the safe/default policy. Ensure you reference
get_module_class_from_name, self.model._no_split_modules, and layer_classes in
the change.

In `@src/instructlab/training/model.py`:
- Around line 125-131: The _has_mrope() helper should catch config-load errors
and validate types: wrap AutoConfig.from_pretrained(model_path) in a try/except
and return False on any exception, then retrieve rope_scaling safely by checking
attributes and types (use getattr(config, "rope_scaling", None) and
getattr(getattr(config, "text_config", None), "rope_scaling", None) but ensure
the resulting rope_scaling is a dict or has __contains__ before testing
"mrope_section"); if type checks fail, return False so M-RoPE detection is a
safe, non-fatal fallback.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f02ea6 and a1e6782.

📒 Files selected for processing (2)
  • src/instructlab/training/accelerator.py
  • src/instructlab/training/model.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/instructlab/training/vlm_utils.py (1)

145-157: ⚠️ Potential issue | 🟠 Major

Make has_mrope() a best-effort probe.

This still lets AutoConfig.from_pretrained() failures and non-mapping rope fields abort model init, even though the check only decides between FA2 and SDPA. It should return False unless it can safely confirm mrope_section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 145 - 157, Make
has_mrope() a best-effort probe by catching errors from
AutoConfig.from_pretrained and returning False on exception, and by treating
rope fields safely: after loading config (in has_mrope) wrap the from_pretrained
call in a try/except and return False on failure, then when collecting
candidates check both config and getattr(config, "text_config", None) but only
inspect rope_scaling/rope_parameters if they exist and are mappings (use
isinstance(rope, Mapping) or similar) before looking for "mrope_section"; only
return True when you can safely confirm the key exists, otherwise return False.
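The fix the prompt describes can be sketched as below, with the config loader injected in place of AutoConfig.from_pretrained; the key property is that every failure path degrades to False rather than aborting model init:

```python
from collections.abc import Mapping
from types import SimpleNamespace

def has_mrope_safe(load_config, model_path) -> bool:
    # Best-effort probe: any failure or unexpected shape means "no M-RoPE"
    # rather than an exception, since this only picks an attention backend.
    try:
        config = load_config(model_path)
    except Exception:
        return False
    candidates = [
        getattr(config, "rope_scaling", None),
        getattr(getattr(config, "text_config", None), "rope_scaling", None),
    ]
    # Only mapping-like values are inspected for the "mrope_section" key.
    return any(
        isinstance(rope, Mapping) and "mrope_section" in rope
        for rope in candidates
    )

# Stand-in configs; real code would pass AutoConfig.from_pretrained.
mrope_cfg = SimpleNamespace(
    rope_scaling={"mrope_section": [16, 24, 24]}, text_config=None
)
malformed_cfg = SimpleNamespace(rope_scaling="linear", text_config=None)
```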
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/model.py`:
- Around line 123-133: Replace the hardcoded GPU index 0 with the actual CUDA
device for this process (use torch.cuda.current_device() when querying
capabilities) so the SM check uses the pinned/LOCAL_RANK GPU; update the call
torch.cuda.get_device_capability(0) to
torch.cuda.get_device_capability(torch.cuda.current_device()) (and keep the rest
of the logic that sets self.base_model_args["attn_implementation"] and the
logger warning intact) to avoid selecting flash-attn3 based on the wrong device.

In `@src/instructlab/training/vlm_utils.py`:
- Line 29: The utility currently hardcodes AutoConfig.from_pretrained(...,
trust_remote_code=True); update the three functions is_vlm_with_causal_lm,
extract_causal_lm_from_vlm, and has_mrope to accept an optional parameter
trust_remote_code: bool = False and pass that parameter into their calls to
AutoConfig.from_pretrained(model_path, trust_remote_code=trust_remote_code) (and
propagate it through any internal helper calls), so callers can control whether
remote code is trusted instead of forcing True; ensure default remains False.
- Around line 117-125: Avoid instantiating a full random-weight model: create
the causal wrapper using an empty-weight context (e.g.,
transformers.init_empty_weights) around causal_lm_class(text_config) so no large
tensors are allocated, then attach the real submodules (set text_model.model =
backbone and text_model.lm_head = vlm.lm_head) and copy quantization metadata
from vlm (e.g., text_model.is_loaded_in_4bit, text_model.is_loaded_in_8bit,
text_model.hf_quantizer = getattr(vlm, 'hf_quantizer', None)) so
Model.prepare_peft_model()/prepare_model_for_kbit_training() sees the correct
flags; locate code around AutoConfig.from_pretrained, text_config,
causal_lm_class, and text_model to implement this change.
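The trust_remote_code threading suggested in the comments above can be sketched as follows; classify_and_load and its injected helpers are illustrative stand-ins for the probe/load pair, not the module's actual API:

```python
def classify_and_load(model_path, load_config, load_model,
                      trust_remote_code: bool = False):
    # One trust decision, threaded through both the probe and the load,
    # defaulting to False so remote-code execution is opt-in.
    config = load_config(model_path, trust_remote_code=trust_remote_code)
    is_vlm = getattr(config, "text_config", None) is not None
    model = load_model(model_path, trust_remote_code=trust_remote_code)
    return model, is_vlm
```

Because probe and load share the same flag, classification and instantiation can no longer disagree about whether repo-supplied Python is allowed to run.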

---

Duplicate comments:
In `@src/instructlab/training/vlm_utils.py`:
- Around line 145-157: Make has_mrope() a best-effort probe by catching errors
from AutoConfig.from_pretrained and returning False on exception, and by
treating rope fields safely: after loading config (in has_mrope) wrap the
from_pretrained call in a try/except and return False on failure, then when
collecting candidates check both config and getattr(config, "text_config", None)
but only inspect rope_scaling/rope_parameters if they exist and are mappings
(use isinstance(rope, Mapping) or similar) before looking for "mrope_section";
only return True when you can safely confirm the key exists, otherwise return
False.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 483f2af4-0503-4536-ae08-c08400328ba2

📥 Commits

Reviewing files that changed from the base of the PR and between a1e6782 and 5aa350c.

📒 Files selected for processing (2)
  • src/instructlab/training/model.py
  • src/instructlab/training/vlm_utils.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/instructlab/training/vlm_utils.py (3)

192-218: ⚠️ Potential issue | 🟠 Major

Make has_mrope() a safe fallback.

This runs during Model.__init__, so a config fetch failure or an unexpected non-mapping rope_* value will currently abort model startup just to decide an attention backend. Please catch config-load errors and return False, and only test "mrope_section" on mapping-like values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 192 - 218, The has_mrope
function can crash during Model.__init__ if AutoConfig.from_pretrained fails or
if rope_scaling/rope_parameters are not mapping-like; wrap the config load in a
try/except and return False on any exception, then when iterating candidates
only treat mapping-like values (e.g., isinstance(rope, dict) or
collections.abc.Mapping) before checking for the "mrope_section" key; keep
checks against both top-level config and getattr(config, "text_config", None) as
before and return any("mrope_section" in rope for rope in candidates) only after
filtering to valid mappings.

15-32: ⚠️ Potential issue | 🟠 Major

Don’t force trust_remote_code=True in the VLM probes.

These helpers unconditionally trust remote code during config inspection, but the subsequent model load path does not thread the same trust decision through load_kwargs. That both widens the trust boundary and can make probe/load behavior disagree. Please take trust_remote_code from the caller and apply it consistently across both probe and load paths.

Also applies to: 45-59, 179-179, 206-206

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 15 - 32, The probes
currently call AutoConfig.from_pretrained with trust_remote_code=True
unconditionally (e.g., in is_vlm_with_causal_lm) which mis-aligns with model
load trust decisions; add a trust_remote_code: bool parameter to the probe
functions (including is_vlm_with_causal_lm and the other VLM probe helpers
referenced around lines 45-59, 179, 206), pass that value into
AutoConfig.from_pretrained, and ensure the same trust_remote_code flag is
threaded into any subsequent model-loading/inspection codepaths (e.g., the
load_kwargs or calls that later load the model) so the probe and load behavior
use the same trust decision.

178-186: ⚠️ Potential issue | 🟠 Major

Avoid allocating a second full CausalLM wrapper here.

causal_lm_class(text_config) materializes a fresh random-weight model before model and lm_head are immediately replaced. For multi-B checkpoints this creates a large transient RAM spike, and the replacement wrapper also drops quantization flags that Model.prepare_peft_model() in src/instructlab/training/model.py:286-293 relies on. Build the wrapper with empty/meta weights and copy the quantization metadata from vlm.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 178 - 186, The current
code instantiates a full random-weight CausalLM via causal_lm_class(text_config)
which creates a large transient allocation and loses quantization metadata;
instead construct the CausalLM wrapper with empty/meta weights (e.g., use an
empty init context such as init_empty_weights or the model.config-only
constructor) so no heavy tensors are allocated, then assign text_model.model =
backbone and text_model.lm_head = vlm.lm_head, and copy any
quantization/metadata flags from vlm to text_model (so Model.prepare_peft_model
in src/instructlab/training/model.py can detect them) before returning/using
text_model; locate these symbols: AutoConfig.from_pretrained,
MODEL_FOR_CAUSAL_LM_MAPPING, causal_lm_class, text_model, vlm, and
Model.prepare_peft_model to implement the change.
src/instructlab/training/model.py (1)

120-133: ⚠️ Potential issue | 🟠 Major

Use the process-local CUDA device for the GPT-OSS SM check.

Line 123 still probes device 0. In multi-GPU runs this process may be pinned to a different GPU, so attn_implementation can be chosen from the wrong device capability.

Suggested fix
-                    major, _ = torch.cuda.get_device_capability(0)
+                    device_id = torch.cuda.current_device()
+                    major, _ = torch.cuda.get_device_capability(device_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/model.py` around lines 120 - 133, The SM-level check
currently calls torch.cuda.get_device_capability(0) which always probes device
0; change it to probe the process-local CUDA device by using
torch.cuda.current_device() (e.g., device = torch.cuda.current_device()) and
then call torch.cuda.get_device_capability(device). Update the logic in the
self.is_gpt_oss branch that sets self.base_model_args["attn_implementation"] and
the logger.warning message to use the retrieved capability from the correct
device so the attn_implementation choice and warning reflect the process's
actual GPU.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/model.py`:
- Around line 377-385: The mismatch check uses a truthiness test that skips
valid EOS IDs of 0; change the condition in the reconciliation block to
explicitly check for None: replace the middle clause `and
self.tokenizer.eos_token_id` with `and self.tokenizer.eos_token_id is not None`
so the guarded expression becomes `if text_config.eos_token_id is not None and
self.tokenizer.eos_token_id is not None and text_config.eos_token_id !=
self.tokenizer.eos_token_id:` and keep the warning and assignment that set
`text_config.eos_token_id = self.tokenizer.eos_token_id`.

---

Duplicate comments:
In `@src/instructlab/training/model.py`:
- Around line 120-133: The SM-level check currently calls
torch.cuda.get_device_capability(0) which always probes device 0; change it to
probe the process-local CUDA device by using torch.cuda.current_device() (e.g.,
device = torch.cuda.current_device()) and then call
torch.cuda.get_device_capability(device). Update the logic in the
self.is_gpt_oss branch that sets self.base_model_args["attn_implementation"] and
the logger.warning message to use the retrieved capability from the correct
device so the attn_implementation choice and warning reflect the process's
actual GPU.

In `@src/instructlab/training/vlm_utils.py`:
- Around line 192-218: The has_mrope function can crash during Model.__init__ if
AutoConfig.from_pretrained fails or if rope_scaling/rope_parameters are not
mapping-like; wrap the config load in a try/except and return False on any
exception, then when iterating candidates only treat mapping-like values (e.g.,
isinstance(rope, dict) or collections.abc.Mapping) before checking for the
"mrope_section" key; keep checks against both top-level config and
getattr(config, "text_config", None) as before and return any("mrope_section" in
rope for rope in candidates) only after filtering to valid mappings.
- Around line 15-32: The probes currently call AutoConfig.from_pretrained with
trust_remote_code=True unconditionally (e.g., in is_vlm_with_causal_lm) which
mis-aligns with model load trust decisions; add a trust_remote_code: bool
parameter to the probe functions (including is_vlm_with_causal_lm and the other
VLM probe helpers referenced around lines 45-59, 179, 206), pass that value into
AutoConfig.from_pretrained, and ensure the same trust_remote_code flag is
threaded into any subsequent model-loading/inspection codepaths (e.g., the
load_kwargs or calls that later load the model) so the probe and load behavior
use the same trust decision.
- Around line 178-186: The current code instantiates a full random-weight
CausalLM via causal_lm_class(text_config) which creates a large transient
allocation and loses quantization metadata; instead construct the CausalLM
wrapper with empty/meta weights (e.g., use an empty init context such as
init_empty_weights or the model.config-only constructor) so no heavy tensors are
allocated, then assign text_model.model = backbone and text_model.lm_head =
vlm.lm_head, and copy any quantization/metadata flags from vlm to text_model (so
Model.prepare_peft_model in src/instructlab/training/model.py can detect them)
before returning/using text_model; locate these symbols:
AutoConfig.from_pretrained, MODEL_FOR_CAUSAL_LM_MAPPING, causal_lm_class,
text_model, vlm, and Model.prepare_peft_model to implement the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ec0b7c0-a94e-4319-b818-e090fd278c58

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa350c and 158da76.

📒 Files selected for processing (2)
  • src/instructlab/training/model.py
  • src/instructlab/training/vlm_utils.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
src/instructlab/training/model.py (2)

118-122: ⚠️ Potential issue | 🟠 Major

Use the current CUDA device for the SM check.

In distributed runs this process may be pinned to a nonzero LOCAL_RANK. Querying device 0 here can select vllm-flash-attn3 from the wrong GPU and then fail or fall back unnecessarily on the actual training device.

Suggested fix
-                    major, _ = torch.cuda.get_device_capability(0)
+                    device_id = torch.cuda.current_device()
+                    major, _ = torch.cuda.get_device_capability(device_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/model.py` around lines 118 - 122, The SM check
currently calls torch.cuda.get_device_capability(0) which always queries device
0; update the check to use the actual CUDA device for this process by calling
torch.cuda.current_device() (or an available self.local_rank if present) and
pass that index to torch.cuda.get_device_capability so the is_gpt_oss branch
(and the flash-attn3 major version check) uses the correct GPU device instead of
device 0.

389-392: ⚠️ Potential issue | 🟡 Minor

Handle eos_token_id == 0.

The truthiness check skips valid EOS IDs of 0, so the mismatch fix never runs for those tokenizers.

Suggested fix
-            and self.tokenizer.eos_token_id
+            and self.tokenizer.eos_token_id is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/model.py` around lines 389 - 392, The condition that
checks EOS token IDs uses a truthiness check which skips valid id 0; update the
if to explicitly test for None instead of truthiness: replace "and
self.tokenizer.eos_token_id" with "and self.tokenizer.eos_token_id is not None"
(so the clause reads: if text_config.eos_token_id is not None and
self.tokenizer.eos_token_id is not None and text_config.eos_token_id !=
self.tokenizer.eos_token_id) so mismatches where eos_token_id == 0 are correctly
detected and handled; keep the same variables (text_config.eos_token_id and
self.tokenizer.eos_token_id) to locate the change.
src/instructlab/training/vlm_utils.py (2)

32-32: ⚠️ Potential issue | 🟠 Major

Make trust_remote_code caller-controlled.

These config probes currently force remote-code execution on their own. A model classification check should not opt into repo-supplied Python unless the caller explicitly asked for that. Please add a trust_remote_code parameter (default False) and thread it through these probes and the later VLM load paths instead of hardcoding True.

Also applies to: 59-59, 179-179, 206-206, 237-237, 258-258

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` at line 32, Change the hardcoded
trust_remote_code=True usage to a caller-controlled boolean: add a
trust_remote_code: bool = False parameter to the functions that call
AutoConfig.from_pretrained (and any helper probes) and pass that parameter into
AutoConfig.from_pretrained instead of True; thread the same parameter through
the VLM load paths that later instantiate the model so repository code execution
is only enabled when the caller explicitly sets trust_remote_code=True, and
update all call sites to accept/forward the new parameter.

178-186: ⚠️ Potential issue | 🟠 Major

Preserve the loaded backbone state when extracting the standalone CausalLM.

causal_lm_class(text_config) materializes another full decoder only to overwrite it, which is a large RAM spike for bigger VLMs. The returned wrapper also drops the loaded k-bit/runtime state (is_loaded_in_4bit, is_loaded_in_8bit, hf_quantizer) and keeps a fresh config object instead of the backbone's config, so Model.prepare_peft_model() can miss the k-bit path and later tokenizer fixes land on the wrong config. Using .model / .lm_head directly here is also only safe for a subset of HF CausalLM wrappers. Build the wrapper under an empty/meta init context, transplant the backbone/head through the generic wrapper APIs, and carry over the loaded metadata before returning it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 178 - 186, The current
code creates a full decoder via causal_lm_class(text_config) then overwrites its
internals, which spikes RAM and loses k-bit/runtime metadata; instead
instantiate the wrapper under an empty/meta init (so no weights are
materialized), use the wrapper's official APIs to set/transpose the backbone and
lm_head (replace text_model.model and text_model.lm_head via those generic
transplant APIs) and copy the backbone's config and loaded-state flags
(is_loaded_in_4bit, is_loaded_in_8bit, hf_quantizer) into the new wrapper before
returning it, so Model.prepare_peft_model and tokenizer-fixup see the original
backbone config and quantization state; keep references to
MODEL_FOR_CAUSAL_LM_MAPPING, causal_lm_class, text_model, backbone and
vlm.lm_head to locate the change.
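The transplant pattern described above can be illustrated without torch by skipping the allocating __init__ entirely; in real code the reviewers suggest an empty/meta-weight init context such as accelerate's init_empty_weights, and all class and attribute names here are illustrative:

```python
from types import SimpleNamespace

class TextWrapper:
    """Stand-in for a CausalLM wrapper whose __init__ allocates weights."""
    def __init__(self, config):
        raise RuntimeError("would allocate a full random-weight model")

def extract_text_model(vlm, text_config):
    # Build the wrapper without running the allocating __init__ (the real
    # suggestion uses an empty-weights context), then transplant the
    # already-loaded submodules and copy quantization metadata so k-bit
    # detection in prepare_peft_model still triggers.
    text_model = TextWrapper.__new__(TextWrapper)   # no weights materialized
    text_model.config = text_config
    text_model.model = vlm.language_model           # loaded text backbone
    text_model.lm_head = vlm.lm_head
    for flag in ("is_loaded_in_4bit", "is_loaded_in_8bit", "hf_quantizer"):
        setattr(text_model, flag, getattr(vlm, flag, None))
    return text_model
```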
🧹 Nitpick comments (1)
src/instructlab/training/model.py (1)

589-592: Text-only LoRA on this branch will still target vision projections.

Because projection_layer_names / prepare_peft_model() are suffix-based, keeping the full VLM here means the default/common q_proj / k_proj targets also attach adapters to the vision tower. For a "text-only" path that's a sizable extra memory/trainable-params hit; consider filtering adapter target discovery to text-module prefixes or bypassing this branch when LoRA is enabled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/model.py` around lines 589 - 592, The current branch
keeps the full VLM when is_vlm_with_causal_lm()/is_vlm_for_direct_loading()
returns true, which lets suffix-based projection_layer_names /
prepare_peft_model() attach adapters to vision projections; detect when
LoRA/text-only training is requested and either (a) only extract/load the text
LM (use extract_causal_lm_from_vlm or a new helper) instead of preserving the
full VLM, or (b) restrict projection_layer_names discovery to text-module
prefixes (e.g., "text_encoder", "lm_head", "transformer.text", etc.) before
calling prepare_peft_model; update the conditional around
extract_causal_lm_from_vlm/load_vlm_for_text_training to branch on the
LoRA/text-only flag and apply one of these fixes so adapters are not attached to
the vision tower.
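Option (b) above, restricting suffix-based target discovery to text-module prefixes, could look like the sketch below; the prefix strings are illustrative and depend on the VLM architecture:

```python
def filter_text_targets(named_modules, suffixes, text_prefixes):
    # Keep only adapter targets whose full module path both ends in a
    # target suffix (e.g. q_proj) and lives under a text-module prefix,
    # so the vision tower never receives LoRA adapters.
    return [
        name for name in named_modules
        if name.rsplit(".", 1)[-1] in suffixes
        and any(name.startswith(p) for p in text_prefixes)
    ]
```

In real code, named_modules would come from model.named_modules(), and the surviving names would feed the PEFT target_modules list.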
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/instructlab/training/model.py`:
- Around line 118-122: The SM check currently calls
torch.cuda.get_device_capability(0) which always queries device 0; update the
check to use the actual CUDA device for this process by calling
torch.cuda.current_device() (or an available self.local_rank if present) and
pass that index to torch.cuda.get_device_capability so the is_gpt_oss branch
(and the flash-attn3 major version check) uses the correct GPU device instead of
device 0.
- Around line 389-392: The condition that checks EOS token IDs uses a truthiness
check which skips valid id 0; update the if to explicitly test for None instead
of truthiness: replace "and self.tokenizer.eos_token_id" with "and
self.tokenizer.eos_token_id is not None" (so the clause reads: if
text_config.eos_token_id is not None and self.tokenizer.eos_token_id is not None
and text_config.eos_token_id != self.tokenizer.eos_token_id) so mismatches where
eos_token_id == 0 are correctly detected and handled; keep the same variables
(text_config.eos_token_id and self.tokenizer.eos_token_id) to locate the change.

In `@src/instructlab/training/vlm_utils.py`:
- Line 32: Change the hardcoded trust_remote_code=True usage to a
caller-controlled boolean: add a trust_remote_code: bool = False parameter to
the functions that call AutoConfig.from_pretrained (and any helper probes) and
pass that parameter into AutoConfig.from_pretrained instead of True; thread the
same parameter through the VLM load paths that later instantiate the model so
repository code execution is only enabled when the caller explicitly sets
trust_remote_code=True, and update all call sites to accept/forward the new
parameter.
- Around line 178-186: The current code creates a full decoder via
causal_lm_class(text_config) then overwrites its internals, which spikes RAM and
loses k-bit/runtime metadata; instead instantiate the wrapper under an
empty/meta init (so no weights are materialized), use the wrapper's official
APIs to set/transpose the backbone and lm_head (replace text_model.model and
text_model.lm_head via those generic transplant APIs) and copy the backbone's
config and loaded-state flags (is_loaded_in_4bit, is_loaded_in_8bit,
hf_quantizer) into the new wrapper before returning it, so
Model.prepare_peft_model and tokenizer-fixup see the original backbone config
and quantization state; keep references to MODEL_FOR_CAUSAL_LM_MAPPING,
causal_lm_class, text_model, backbone and vlm.lm_head to locate the change.

---

Nitpick comments:
In `@src/instructlab/training/model.py`:
- Around line 589-592: The current branch keeps the full VLM when
is_vlm_with_causal_lm()/is_vlm_for_direct_loading() returns true, which lets
suffix-based projection_layer_names / prepare_peft_model() attach adapters to
vision projections; detect when LoRA/text-only training is requested and either
(a) only extract/load the text LM (use extract_causal_lm_from_vlm or a new
helper) instead of preserving the full VLM, or (b) restrict
projection_layer_names discovery to text-module prefixes (e.g., "text_encoder",
"lm_head", "transformer.text", etc.) before calling prepare_peft_model; update
the conditional around extract_causal_lm_from_vlm/load_vlm_for_text_training to
branch on the LoRA/text-only flag and apply one of these fixes so adapters are
not attached to the vision tower.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c5cdfe4-e446-4dfc-a4ba-1fa8462fce2a

📥 Commits

Reviewing files that changed from the base of the PR and between 158da76 and 8853104.

📒 Files selected for processing (2)
  • src/instructlab/training/model.py
  • src/instructlab/training/vlm_utils.py

- Fix eos_token_id truthiness check (0 is valid)
- Add isinstance guard for RopeParameters in mrope detection
- Add hasattr fallback for non-dict rope objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
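The first bullet fixes a classic truthiness bug: a token ID of 0 is falsy but perfectly valid. A minimal illustration (`resolve_eos` is a hypothetical helper, not code from the PR):

```python
def resolve_eos(eos_token_id, fallback):
    """Return the configured EOS id, falling back only when it is truly unset.
    Writing `eos_token_id or fallback` would wrongly discard a valid id of 0;
    the explicit `is not None` check does not."""
    return eos_token_id if eos_token_id is not None else fallback
```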

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
src/instructlab/training/model.py (1)

125-129: ⚠️ Potential issue | 🟠 Major

Use the process-local CUDA device for the GPT-OSS capability check.

In distributed runs this process may be pinned to a nonzero GPU, so get_device_capability(0) can choose vllm-flash-attn3 based on the wrong card. Query torch.cuda.current_device() here instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/model.py` around lines 125 - 129, The GPT-OSS GPU
capability check in the is_gpt_oss branch uses
torch.cuda.get_device_capability(0), which can be wrong in
multi-process/distributed runs; change it to query the process-local device via
torch.cuda.current_device() and pass that device index to
torch.cuda.get_device_capability(...) so the flash-attn3 decision in the
is_gpt_oss block uses the correct local GPU.
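The decision can be factored so the capability query is explicit; `pick_attn_impl` and the `major >= 9` threshold are illustrative assumptions, and in real code the tuple would come from the process-local GPU rather than device 0:

```python
def pick_attn_impl(capability):
    """Choose an attention backend from a (major, minor) compute capability.
    In the trainer this tuple should come from
    torch.cuda.get_device_capability(torch.cuda.current_device()),
    never from a hardcoded device index of 0."""
    major, _minor = capability
    return "vllm-flash-attn3" if major >= 9 else "sdpa"
```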
src/instructlab/training/vlm_utils.py (2)

174-185: ⚠️ Potential issue | 🟠 Major

Preserve quantization state when extracting the text LM.

causal_lm_class(text_config) builds a second full wrapper, then Lines 181-182 replace only .model and .lm_head. That drops quantization metadata from the loaded VLM, so Model.prepare_peft_model() in src/instructlab/training/model.py:310-319 will miss the k-bit path for quantized VLMs. Please build the wrapper under an empty/meta-weight context and copy flags like is_loaded_in_4bit, is_loaded_in_8bit, and hf_quantizer from vlm before returning it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 174 - 185, When
constructing the standalone CausalLM wrapper, create the wrapper
(causal_lm_class(text_config)) inside an empty/meta-weight context so it doesn't
allocate real weights, then assign the loaded backbone and lm_head as you do
(text_model.model = backbone; text_model.lm_head = vlm.lm_head) and also copy
quantization metadata from the original vlm onto the new wrapper (e.g.,
text_model.is_loaded_in_4bit = vlm.is_loaded_in_4bit,
text_model.is_loaded_in_8bit = vlm.is_loaded_in_8bit, text_model.hf_quantizer =
vlm.hf_quantizer) so downstream code in Model.prepare_peft_model can detect the
k-bit path correctly before deleting vlm and returning text_model.

15-32: ⚠️ Potential issue | 🟠 Major

Stop forcing trust_remote_code=True in the VLM probes.

These helpers are part of ordinary model classification/loading, but every AutoConfig.from_pretrained() here bypasses the caller’s trust policy. In this PR, src/instructlab/training/model.py routes through them before the actual model load, so config inspection alone now opts into remote code execution. Please thread a trust_remote_code argument from the caller and default it to False.

Also applies to: 45-59, 174-175, 188-202, 223-258

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/vlm_utils.py` around lines 15 - 32, The VLM probe
functions (e.g., is_vlm_with_causal_lm) currently force
AutoConfig.from_pretrained(..., trust_remote_code=True); change their signatures
to accept a trust_remote_code: bool = False parameter, pass that value to all
AutoConfig.from_pretrained calls in this module (including the other probe
helpers referenced), and update the caller in src/instructlab/training/model.py
to thread the caller's trust_remote_code flag into these probes so config
inspection no longer unconditionally opts into remote code execution. Ensure
default remains False and propagate the flag through any internal helper calls
within vlm_utils.py.
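Threading the trust policy through a probe can be sketched with a loader injected for testability; `probe_vlm_config`, `config_loader`, and the `vision_config` check are illustrative stand-ins (`config_loader` plays the role of `AutoConfig.from_pretrained`):

```python
def probe_vlm_config(model_path, trust_remote_code=False, *, config_loader):
    """Inspect a model config under the caller's trust policy.
    Defaults to trust_remote_code=False so config inspection alone
    never opts into remote code execution."""
    config = config_loader(model_path, trust_remote_code=trust_remote_code)
    return hasattr(config, "vision_config")

# A fake loader that records the flag it was called with.
calls = {}
def fake_loader(path, trust_remote_code=False):
    calls["trust_remote_code"] = trust_remote_code
    class Cfg:
        vision_config = object()
    return Cfg()

is_vlm = probe_vlm_config("some/model", config_loader=fake_loader)
```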

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc36d129-e0da-4044-a9b2-c93bd7382ab6

📥 Commits

Reviewing files that changed from the base of the PR and between 8853104 and 06e3622.

📒 Files selected for processing (2)
  • src/instructlab/training/model.py
  • src/instructlab/training/vlm_utils.py

- Fix isort ordering in vlm_utils.py and model.py
- Fix pylint: use 'from torch import nn', mark unused-argument
- Mock needs_sdpa and get_module_class_from_name in unit tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mergify mergify bot added the testing Relates to testing label Mar 6, 2026
RobotSail and others added 3 commits March 6, 2026 06:43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- accelerator.py: fall back to warning + default wrap policy instead of
  ValueError when no _no_split_modules resolve; try underlying HF model
  as secondary target
- model.py: use torch.cuda.current_device() instead of hardcoded 0
- vlm_utils.py: add trust_remote_code param (default False) to all
  config-loading functions; use init_empty_weights for CausalLM shell;
  copy quantization metadata from VLM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
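The accelerator.py change described above can be sketched as a warn-and-fallback resolver; `resolve_no_split_classes` is a stand-in, with `resolver` playing the role of accelerate's `get_module_class_from_name`:

```python
import warnings

def resolve_no_split_classes(model, class_names, resolver):
    """Collect no-split module classes for an FSDP wrap policy.
    On a miss, retry against the underlying HF model (model.model); if
    nothing resolves, warn and return None so the caller falls back to
    the default wrap policy instead of raising ValueError."""
    resolved = set()
    for name in class_names:
        cls = resolver(model, name)
        if cls is None and hasattr(model, "model"):
            cls = resolver(model.model, name)  # secondary target
        if cls is not None:
            resolved.add(cls)
    if not resolved:
        warnings.warn("No _no_split_modules resolved; using default wrap policy")
        return None
    return resolved

# Fake resolver: only knows the class named "Block".
class Block:
    pass

def fake_resolver(model, name):
    return Block if name == "Block" else None
```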
- Remove fabricated claim about C API incompatibility
- Accurately describe the issue as PyTorch/CUDA ABI mismatch
- Broaden exception handling to catch AttributeError

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@Maxusmusti Maxusmusti left a comment


LGTM

@mergify mergify bot added the one-approval label Mar 6, 2026
@RobotSail RobotSail merged commit 0c6614a into instructlab:main Mar 6, 2026
14 of 18 checks passed