Fix BnB quantization in vLLM #53

Draft
ItzikVa wants to merge 1 commit into dev from issue-16

@ItzikVa ItzikVa commented May 21, 2026

BitsAndBytes 4-bit quantization packs weights as uint8 with shape
[total_elements//2, 1], which breaks the existing weight.shape-based
dimension detection in SwitchedLoRALinear.__init__().
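
To make the failure concrete, here is a small arithmetic sketch of the packed shape described above. The helper names `packed_bnb4_shape` and `naive_dims_from_shape` are hypothetical and exist only for illustration; the sketch assumes the usual `[out_features, in_features]` layout for an unquantized linear weight and uses no real tensors.

```python
def packed_bnb4_shape(out_features, in_features):
    """Shape of a 4-bit packed weight: two 4-bit values share one uint8 byte."""
    total_elements = out_features * in_features
    return (total_elements // 2, 1)


def naive_dims_from_shape(shape):
    """Shape-based detection that assumes weight.shape == (out, in)."""
    out_features, in_features = shape
    return out_features, in_features


shape = packed_bnb4_shape(4096, 1024)
print(shape)                         # (2097152, 1)
print(naive_dims_from_shape(shape))  # (2097152, 1), not (4096, 1024)
```

Reading dimensions from the packed shape would size the LoRA buffers for a 2097152 x 1 layer, which is why the shape cannot be trusted once weights are packed.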

Fix:
- Prefer input_size_per_partition / output_size_per_partition attributes
  (always correct, regardless of weight packing format)
- Fall back to weight.shape only for non-parallel layers
- Add dtype guard: if weight dtype is non-floating-point (uint8 for BnB),
  default to bfloat16 for LoRA buffer allocation
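
The three rules above could be sketched roughly as follows. This is not the code from the PR: `resolve_lora_dims` and `resolve_lora_dtype` are hypothetical helper names, and the dtype objects are torch-free stand-ins that only mimic the `is_floating_point` attribute of `torch.dtype`, so the sketch runs without vLLM or PyTorch installed.

```python
from types import SimpleNamespace

# Stand-ins for torch dtypes (assumption: only is_floating_point matters here).
BF16 = SimpleNamespace(name="bfloat16", is_floating_point=True)
FP16 = SimpleNamespace(name="float16", is_floating_point=True)
UINT8 = SimpleNamespace(name="uint8", is_floating_point=False)


def resolve_lora_dims(layer):
    """Return (in_features, out_features) for sizing the LoRA buffers."""
    in_size = getattr(layer, "input_size_per_partition", None)
    out_size = getattr(layer, "output_size_per_partition", None)
    if in_size is not None and out_size is not None:
        # Partition attributes do not depend on how the weight is packed.
        return in_size, out_size
    # Non-parallel layer: fall back to the (out, in) weight shape.
    out_size, in_size = layer.weight.shape
    return in_size, out_size


def resolve_lora_dtype(weight_dtype, fallback=BF16):
    """Keep floating-point dtypes; use the fallback for packed ones like uint8."""
    if getattr(weight_dtype, "is_floating_point", False):
        return weight_dtype
    return fallback


# A parallel layer whose 4096 x 1024 weight is packed as uint8 [2097152, 1].
packed = SimpleNamespace(
    input_size_per_partition=1024,
    output_size_per_partition=4096,
    weight=SimpleNamespace(shape=(2097152, 1), dtype=UINT8),
)
# A plain layer without partition attributes.
plain = SimpleNamespace(weight=SimpleNamespace(shape=(4096, 1024), dtype=FP16))

print(resolve_lora_dims(packed))                       # (1024, 4096)
print(resolve_lora_dims(plain))                        # (1024, 4096)
print(resolve_lora_dtype(packed.weight.dtype).name)    # bfloat16
print(resolve_lora_dtype(plain.weight.dtype).name)     # float16
```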

Also adds vLLM quantization tests (BnB INT4 + FP8) that verify:
- Base model weights are actually quantized
- LoRA/aLoRA weights remain in full precision
- Adapters activate correctly under quantization
- LoRA dimensions are not corrupted by packed weight shapes
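
As an illustration of the first two checks, a dtype audit might look like the sketch below. It is an assumption-heavy stand-in rather than the tests added in this PR: `audit_param_dtypes` is a hypothetical helper, dtypes are plain strings instead of `torch.dtype` values, adapter parameters are assumed to contain "lora" in their names, and the caller is assumed to pass only the linear weights that quantization targets (norms and embeddings are typically left unquantized).

```python
def audit_param_dtypes(
    params,
    quantized_dtypes=("uint8", "float8_e4m3fn"),
    adapter_dtypes=("bfloat16", "float16", "float32"),
):
    """Return a list of problems; an empty list means the layout looks right.

    params maps parameter name -> dtype name (hypothetical input format).
    """
    problems = []
    for name, dtype in params.items():
        # Assumption: LoRA/aLoRA parameters carry "lora" in their names.
        is_adapter = "lora" in name.lower()
        allowed = adapter_dtypes if is_adapter else quantized_dtypes
        if dtype not in allowed:
            kind = "adapter" if is_adapter else "base"
            problems.append(f"{name}: unexpected {kind} dtype {dtype}")
    return problems


good = {
    "q_proj.weight": "uint8",
    "q_proj.lora_A": "bfloat16",
    "q_proj.lora_B": "bfloat16",
}
bad = {
    "q_proj.weight": "bfloat16",  # base weight was not quantized
    "q_proj.lora_A": "uint8",     # adapter lost full precision
}

print(audit_param_dtypes(good))       # []
print(len(audit_param_dtypes(bad)))   # 2
```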

Closes #16