diff --git a/docs/mindformers/docs/source_en/feature/configuration.md b/docs/mindformers/docs/source_en/feature/configuration.md
index 1aacfe4cd386eb67493a60bd51fef6456cf8cc27..6ba5b648a30eec8c78de6846d493918ad01114ea 100644
--- a/docs/mindformers/docs/source_en/feature/configuration.md
+++ b/docs/mindformers/docs/source_en/feature/configuration.md
@@ -136,6 +136,7 @@ Because different model configurations may vary, here are some common model conf
 | model.model_config.softmax_compute_dtype | string | Required | 'float32' | The dtype used to compute the softmax during attention computation. |
 | model.model_config.rotary_dtype | string | Required | 'float32' | Computed dtype for custom rotated position embeddings. |
 | model.model_config.init_method_std | float | Required | 0.02 | The standard deviation of the zero-mean normal for the default initialization method, corresponding to `initializer_range` in HuggingFace. If `init_method` and `output_layer_init_method` are provided, this method is not used. |
+| model.model_config.param_init_std_rules | list[dict] | Optional | None | Custom rules for parameter initialization standard deviation. Each rule contains `target` (regex pattern for parameter name) and `init_method_std` (std value, ≥0), for example: `[{"target": ".*weight", "init_method_std": 0.02}]` |
 | model.model_config.moe_grouped_gemm | bool | Required | False | When there are multiple experts per level, compress multiple local (potentially small) GEMMs in a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance. |
 | model.model_config.num_moe_experts | int | Optional | None | The number of experts to use for the MoE layer, corresponding to `n_routed_experts` in HuggingFace. When set, the MLP is replaced by the MoE layer. Setting this to None disables the MoE. |
 | model.model_config.num_experts_per_tok | int | Required | 2 | The number of experts to route each token to. |
diff --git a/docs/mindformers/docs/source_zh_cn/feature/configuration.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
index 30f0530152c171a1c03376f94b58e44ffe5f5a5e..f70a33629409a6b1aa40cdcd65307eb4354eebca 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/configuration.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
@@ -136,6 +136,7 @@ The Context configuration is mainly used to specify [mindspore.set_context](https://www.mindspore.cn/
 | model.model_config.softmax_compute_dtype | string | Optional | 'float32' | The dtype used to compute the softmax during attention computation. Can be set to `'float32'`, `'float16'`, or `'bfloat16'`. |
 | model.model_config.rotary_dtype | string | Optional | 'float32' | The computation dtype for custom rotary position embeddings. Can be set to `'float32'`, `'float16'`, or `'bfloat16'`. |
 | model.model_config.init_method_std | float | Optional | 0.02 | The standard deviation of the zero-mean normal for the default initialization method, corresponding to `initializer_range` in HuggingFace. If `init_method` and `output_layer_init_method` are provided, this method is not used. |
+| model.model_config.param_init_std_rules | list[dict] | Optional | None | A list of custom rules for the parameter initialization standard deviation. Each rule contains `target` (a regex on the parameter name) and `init_method_std` (the std value, ≥0). Example: `[{"target": ".*weight", "init_method_std": 0.02}]` |
 | model.model_config.moe_grouped_gemm | bool | Optional | False | When there are multiple experts per rank, compress multiple local (potentially small) GEMMs into a single kernel launch to leverage grouped GEMM capabilities for improved utilization and performance. |
 | model.model_config.num_moe_experts | int | Optional | None | The number of experts to use for the MoE layer, corresponding to `n_routed_experts` in HuggingFace. When set, the MLP is replaced by the MoE layer. Setting this to None disables the MoE. |
 | model.model_config.num_experts_per_tok | int | Optional | 2 | The number of experts each token is routed to. |
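
Below is a minimal sketch of how the new `param_init_std_rules` field might appear in a training YAML, based only on the table entries added above. The surrounding key layout follows the `model.model_config` path used in this document; the regex patterns, the second rule, and the fallback behavior for unmatched parameters are assumptions for illustration, not taken from an actual MindFormers configuration.

```yaml
# Illustrative excerpt only: parameter-name regexes and std values are placeholders.
model:
  model_config:
    init_method_std: 0.02            # default std (assumed fallback for parameters no rule matches)
    param_init_std_rules:            # each rule: `target` is a regex on the parameter name,
      - target: '.*weight'           #            `init_method_std` is the std to apply (>= 0)
        init_method_std: 0.02
      - target: '.*output_layer.*'   # hypothetical pattern for an output projection
        init_method_std: 0.006
```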