Optimizers
Overview
Axolotl supports every optimizer listed in transformers' OptimizerNames.
As of transformers v4.54.0, these are:
adamw_torch
adamw_torch_fused
adamw_torch_xla
adamw_torch_npu_fused
adamw_apex_fused
adafactor
adamw_anyprecision
adamw_torch_4bit
adamw_torch_8bit
ademamix
sgd
adagrad
adamw_bnb_8bit
adamw_8bit # alias for adamw_bnb_8bit
ademamix_8bit
lion_8bit
lion_32bit
paged_adamw_32bit
paged_adamw_8bit
paged_ademamix_32bit
paged_ademamix_8bit
paged_lion_32bit
paged_lion_8bit
rmsprop
rmsprop_bnb
rmsprop_bnb_8bit
rmsprop_bnb_32bit
galore_adamw
galore_adamw_8bit
galore_adafactor
galore_adamw_layerwise
galore_adamw_8bit_layerwise
galore_adafactor_layerwise
lomo
adalomo
grokadamw
schedule_free_radam
schedule_free_adamw
schedule_free_sgd
apollo_adamw
apollo_adamw_layerwise
stable_adamw
Custom Optimizers
Enable a custom optimizer by passing its name as a string to the optimizer argument. Each optimizer receives the beta and epsilon arguments; some accept additional arguments, which are detailed below.
optimi_adamw
optimizer: optimi_adamw
ao_adamw_4bit
Deprecated: Please use adamw_torch_4bit.
ao_adamw_8bit
Deprecated: Please use adamw_torch_8bit.
ao_adamw_fp8
optimizer: ao_adamw_fp8
adopt_adamw
GitHub: https://github.com/iShohei220/adopt
Paper: https://arxiv.org/abs/2411.02853
optimizer: adopt_adamw
came_pytorch
GitHub: https://github.com/yangluo7/CAME/tree/master
Paper: https://arxiv.org/abs/2307.02047
optimizer: came_pytorch
# optional args (defaults below)
adam_beta1: 0.9
adam_beta2: 0.999
adam_beta3: 0.9999
adam_epsilon: 1e-30
adam_epsilon2: 1e-16
muon
Blog: https://kellerjordan.github.io/posts/muon/
Paper: https://arxiv.org/abs/2502.16982v1
optimizer: muon
dion
Microsoft’s Dion (DIstributed OrthoNormalization) optimizer is a scalable and communication-efficient orthonormalizing optimizer that uses low-rank approximations to reduce gradient communication.
GitHub: https://github.com/microsoft/dion
Paper: https://arxiv.org/pdf/2504.05295
Note: The implementation is written for PyTorch 2.7+ with DTensor.
optimizer: dion
dion_lr: 0.01
dion_momentum: 0.95
lr: 0.00001 # learning rate for embeddings and parameters that fall back to AdamW
q_galore_adamw8bit
Q-GaLore extends GaLore with two extra ideas: an INT4-quantized projection matrix and an adaptive SVD scheduler that skips re-projection when a layer’s gradient subspace stabilizes. Both are wired up in axolotl. The third Q-GaLore trick — INT8 weight wrapping — is not yet implemented and is tracked as a follow-up.
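To make the INT4 projection idea concrete, here is a minimal pure-Python sketch of group-wise min/max quantization. It is an illustration only: the function names are hypothetical, and the asymmetric min/max scheme is an assumption, not necessarily the exact scheme used by Q-GaLore or Axolotl. It does show why a group size such as qgalore_proj_group_size has to divide the quantized dimension evenly.

```python
def quantize_groupwise(values, bits=4, group_size=4):
    """Quantize a flat list of floats group by group (hypothetical helper).

    Each group stores integer codes in [0, 2**bits - 1] plus its own
    scale and minimum, so groups must tile the input exactly.
    """
    if len(values) % group_size != 0:
        raise ValueError("group_size must divide the number of values evenly")
    levels = (1 << bits) - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # A constant group has zero range; fall back to scale 1.0.
        scale = (hi - lo) / levels or 1.0
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate floats from (codes, scale, minimum) groups."""
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


# Example round trip on eight values split into two groups of four.
row = [0.0, 0.5, 1.0, 1.5, -2.0, -1.0, 0.0, 1.0]
packed = quantize_groupwise(row, bits=4, group_size=4)
restored = dequantize_groupwise(packed)
```

The reconstruction error of each value is bounded by half of its group's scale, which is why smaller groups trade extra metadata for better accuracy.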
GitHub: https://github.com/VITA-Group/Q-GaLore
Paper: https://arxiv.org/abs/2407.08296
Install: pip install axolotl[qgalore]
This optimizer is for full fine-tuning. It is incompatible with adapter
(LoRA/QLoRA), load_in_8bit, and load_in_4bit. DeepSpeed is currently gated
off; FSDP requires fsdp_version: 2 with use_orig_params: true.
optimizer: q_galore_adamw8bit
bf16: true
# which parameter substrings get the low-rank projection
# (defaults to ["attn", "mlp"] if unset — matches the reference impl)
optim_target_modules:
- attn
- mlp
# Q-GaLore hyperparameters (defaults shown)
qgalore_rank: 256
qgalore_update_proj_gap: 200 # max steps between SVD refreshes
qgalore_scale: 0.25
qgalore_proj_type: std
qgalore_proj_quant: true # INT-quantize the projection matrix P
qgalore_proj_bits: 4 # bitwidth for P
qgalore_proj_group_size: 256 # must divide P's last dim evenly
qgalore_cos_threshold: 0.4 # skip SVD if P_t is this similar to P_{t-1}
qgalore_gamma_proj: 2 # grow update_proj_gap by this factor when stable
qgalore_queue_size: 5
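The interaction between qgalore_update_proj_gap, qgalore_cos_threshold, qgalore_gamma_proj, and qgalore_queue_size can be sketched roughly as follows. This is a simplified illustration, not Axolotl's or the reference implementation's code: the class name is hypothetical, and the stability rule used here (every similarity in a full window is at or above the threshold, after which the window is cleared) is an assumption made for clarity.

```python
from collections import deque


class AdaptiveProjGap:
    """Hypothetical helper that lengthens the SVD refresh gap when stable."""

    def __init__(self, update_proj_gap=200, cos_threshold=0.4,
                 gamma_proj=2, queue_size=5):
        self.gap = update_proj_gap
        self.cos_threshold = cos_threshold
        self.gamma_proj = gamma_proj
        # Keep only the most recent `queue_size` similarity values.
        self.recent = deque(maxlen=queue_size)

    def should_refresh(self, step):
        """Return True when the projection is due for recomputation."""
        return step % self.gap == 0

    def record(self, cos_sim):
        """Record the similarity between the new and previous projection.

        If the window is full and every entry meets the threshold, the
        subspace is treated as stable and the gap grows by gamma_proj.
        """
        self.recent.append(cos_sim)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and min(self.recent) >= self.cos_threshold:
            self.gap = int(self.gap * self.gamma_proj)
            self.recent.clear()  # assumption: start a fresh window
        return self.gap


# Example: a layer whose projection keeps pointing the same way.
scheduler = AdaptiveProjGap(update_proj_gap=200, queue_size=3)
for similarity in (0.9, 0.8, 0.7):
    current_gap = scheduler.record(similarity)
```

With these settings, a layer whose projection stays similar across refreshes is re-projected less and less often, which is where the savings in SVD cost come from.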