From 3b5421c060ef2e83474d5b8491a993b59d1b5dc7 Mon Sep 17 00:00:00 2001
From: lanxiang <ls89862947@qq.com>
Date: Fri, 21 Nov 2025 11:14:55 +0800
Subject: [PATCH] Add SlidingWindowAttention and SharedKVCrossAttention
 module documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../docs/source_en/feature/configuration.md   |   5 +
 .../feature/images/sliding_window.png         | Bin 0 -> 9060 bytes
 .../feature/other_training_features.md        |  58 ++++++++++
 .../source_zh_cn/feature/configuration.md     | 107 +++++++++---------
 .../feature/images/sliding_window.png         | Bin 0 -> 9060 bytes
 .../feature/other_training_features.md        |  58 ++++++++++
 6 files changed, 177 insertions(+), 51 deletions(-)
 create mode 100644 docs/mindformers/docs/source_en/feature/images/sliding_window.png
 create mode 100644 docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png

diff --git a/docs/mindformers/docs/source_en/feature/configuration.md b/docs/mindformers/docs/source_en/feature/configuration.md
index 61c145cb4b..13cadd2816 100644
--- a/docs/mindformers/docs/source_en/feature/configuration.md
+++ b/docs/mindformers/docs/source_en/feature/configuration.md
@@ -159,6 +159,11 @@ Because different model configurations may vary, here are some common model conf
 | model.model_config.moe_router_num_groups                  | int             | Optional  | None          | The number of expert groups to use for group-limited routing. Equivalent to `n_group` in HuggingFace.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
 | model.model_config.moe_router_group_topk                  | int             | Optional  | None          | The number of selected groups for group-limited routing. Equivalent to `topk_group` in HuggingFace.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
 | model.model_config.moe_router_topk                        | int             | Optional  | 2             | The number of experts to route each token to. Equivalent to `num_experts_per_tok` in HuggingFace. When used with `moe_router_num_groups` and `moe_router_group_topk`, first group `moe_router_num_groups`, then select `moe_router_group_topk`, and then select `moe_router_topk` experts from `moe_router_group_topk`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| model.model_config.window_size                            | tuple(int, int)       | Optional   | None          | If not `None`, sliding window attention is used. The window size is given by the two integers in the tuple: `window_size[0]` is `pre_tokens` and `window_size[1]` is `next_tokens`. `-1` is a special value meaning "infinite window size". |
+| model.model_config.window_attn_skip_freq                  | int / list(int)       | Optional   | None          | Frequency of full attention layers among sliding window attention (SWA) layers. Accepts either:<br> - An integer N: a (N-1):1 ratio, that is, one full attention layer after every (N-1) SWA layers.<br> - A list defining a custom pattern, e.g. `[1,1,1,1,0,0,0,0]`, where 1 represents SWA and 0 represents full attention. |
+| model.model_config.model_architecture                     | str                   | Optional   | 'decoder_only' | Model structure. Supported values are `decoder_only` and `yoco`. Currently, only `yoco` supports SharedKVCrossAttention. |
+| model.model_config.num_encoder_layers                     | int                   | Optional   | None          | The number of encoder layers. At least one of `num_encoder_layers` and `num_decoder_layers` should be set when using the `yoco` model structure. |
+| model.model_config.num_decoder_layers                     | int                   | Optional   | None          | The number of decoder layers. At least one of `num_encoder_layers` and `num_decoder_layers` should be set when using the `yoco` model structure. |
 
 ### Model Training Configuration
 
diff --git a/docs/mindformers/docs/source_en/feature/images/sliding_window.png b/docs/mindformers/docs/source_en/feature/images/sliding_window.png
new file mode 100644
index 0000000000000000000000000000000000000000..a7f218e487add3ee210ee772637a2aa718b26d2f
GIT binary patch
literal 9060
zcmeHNX;4$?w#DawGTJSJGTEpI$S4s731EZ6RhbbYg5(AvC?JE(5}*|kMFsoH6dBTL
zgD65oWRgGx8WKqWZKfbI7}7u(LL?9~y>nt{rJ7gus@|<vx4ZfWP?hB5>~HVA)?RBL
zZaUZ<la^4D5EB!VKK_rxPGVwf$iUBzjT?Yh!qP{>#Kd;$96$X1sVL9sT*+rXKCw+}
zw!5ING7mx;4a=9TcHN-7Kh^nkLH?fKZKaD|_X}LU`ytip$rD#A$AD;}R;q-{Ix`~V
zYU;!F*P6BVRCKwnm;XlF$w6*SQQ_-p!Prcr3V-T{P&X8DaeC-PTuyDzpzlB^ZJyUN
zK%-yLt2ne#{o32{3FS=Gxiw;9GuxsI9;Ud&4P)@$Bc48EZ<L!t)tzILGK^#`ejARF
zH|mPq8iUMxQh3fWYtroQlRE2S^-Oa`;N|n$koHrZrCK%%s>D-654!J?@+8rdW-rsw
zzI!H5%N2Z*@qRk%;7{~vPXs=4riBcV57(7&jui*4Ej0e{f@TGLM9*g&oqSLs@k!lU
z;Mz{5OTD7IGi{Ej-fB2CbOpH9Pr9034c_`B8VvRb;_2)m;M(%*9xM;=*7;<pvm@c5
zVOQjLK;&nv^<(sbTfcSs3x<6Hk+J=tn3!%Zc;eCJY^I~Ja;s$m9X+>N-i(;Tt7X||
zyneNsw2%{4t6(}>9J5Njbs(}9t8|AeQVq3QQOMAxzi{9#5@p=CYsCIJ@d0ddm<T{8
z(d@}a5EFhYnjZj8(i1SFohzUMz)aQ|%xGcy*T4e0Tde@hsDsW57y?+ZISYcqbw&3B
z$jN#oMDGf$WEe3G!$G*8S%EA7m550Yp5;bg;}F1bW9MHt@Mlh0)Fc@(v6Gr0uT_c2
zr#TOeJrDNil?D~Gi0lw_bo8MbAf0OyO^rmn*hqmee~18bDJ~J}z05_WzT<1A!9>c9
z6Az1M>s&i|?RzkbTI3&Zez}SGOR)A1Sv$9g$ndr~MD{#b_@4}q?G_;&pklSRVAT&H
zQfx#NYsex)e*#PXrwuT@LDv!(il+|>N09_X(EY)QuhD|n6Q=m{Oz9~i0SW$UC<uG}
zf_HoV3icNc7Z;`8&<iW3lW>ZpA7y#097Do!k^iH4CV+##zZz(@|6M}%B)Y}@CMNbg
zY#vBjQMWvj<#7VZzt{^lHCUE>GS2htDw#${*S=aQ4}@dfh<Q1Dm2&xvFHHWrVulMP
zLJN7SYn7U(qkD=@uG9yg$q@@%t-450to};v@tGX|x`F2#l4MoI#4_waEmt0vh&Ips
z7dqZu$XGz-fiVMQYIzJydW8e60D|f7*Z*_!utH9BQ?Q0*G$$iw5J@-?K8xvvNh{H0
zC|9^FRL4T<1F`h?8~=GJSpmurVDMWPlev8_Kqau!?*J;#;1nnZK!*M@tpJthS@>1?
zMP3T{7Mz1n5&(?+g@{3w_W)FZAd!AW#DL1P0#%?0@5`<Zkf(`46}Zw&0rFI`>kHK{
zTQ5MKDn9-H3VaeeUER)C^pQ2z0+i7o&gIwb$(e~Ks&VK7<^g)yPz`f$XBMd4{%|BW
z#|%m5`TyzNGNlszAT^15MdWVKbpD|Txn1-w4H#PW1|Z}9P>O>tuU_RexFAVijmz9U
zEsRlAxM^^Rg`)9}62P$n{3@uRgA!4s<vXKCiz#-XSOg@#CwIBHpc2p3lqlN*%1xkl
ziumPR7Xyl(cC&$g7}SHnOlz85@of(c&AtH}3z&_JWjzHtjPN`y^53`l_a^NtA=|bk
zvG*qEApozyJ`gN|?y@%kb-dPcX+_=u)bS2Nqw&QXfJWn|DiLBu-T*WjheAYf6?uas
zz-U|<71e#&8-PZmNd(koZveO%@nliwESne5XvqE913dM`Y9F+hLuxMXMzqlE%FE#4
zgj+1;VMN4^<*murJ1F5b%dTz32I=2>$vG{uYGgm7zIV!QwGCN#-em)_OOfle3EC!*
z6c@RAL}746WLi&Y5e(~@$$O^tiW)MwsTcm#wZx;^8P&@97!y|R)s#$V_OH{f6Vx(G
zYvfkjujM}GyfS>WF78|CUXGi8_p@G&6Jz)31}@SH<7-D3;`Uc%H1yjDT^3T!3G7v{
zL7iz6GZK*NEQ71Cb(oZqShsB-s4YR4p*UDl<b|jqS&NFmfmau;QYB_vDDgg~zBSVo
z?y8&~@tP(!MvH1phgSylzIaoBtvGJo>{5S`iW@%Jlu8Ry?IUck)ZJDpcyMXvV|1zD
z>e4H?UcL&!U{hYYB&1@<bw0xY&Yvsy?+H|<<(=`2kIE#l{E=;(u(|Ge<57lxNL%I3
zscqEL(4eF_3s~M36eOkA?-LaE;{uBa-BopW^`rF2jo;b(x?QK~+Zy+b>8+#UGK2`}
zIJLnJQt6!i4c@($(yD_fPddCdj`!&pVeCS^CD~j5+2Cy*GBLJo*I7gc)-l;Mne~nJ
zle{fCZYfPYC{%!8UkctkivP(ZzBd5T%*QmbG@1-JrM!2;A1SCq(Y91Jqt+rYicjzo
zBvXc>DS|#HFYC6XtBu#Eiyxt2>C<lVSZKT&<4n&u^VTROus*ZJ_5{`6F)Mo$Ftq{Z
zwr&A24FK795;|}eQR0c>d_;Hffat|FWeGBR`9`77q&V3HoDlAB!w=``<nx(t(aae~
zM21d7)S$s5F^ti|^~#GucL;=E#>^7zZBuQH^~h1E3(&Zrg*o;_MPDF?NWp7R!Y>D^
za1Nsecr6i2zUzasO<COGNd)Hy6IN~%%8edqY))|09wc&4pm>^@yiJMK?9Uzw`ZLI4
zAaxbAbvQ{tDr9eeA9u7jg>Wxn`}#~7lulKNoV#hAA7iF-o;9VK4}s3A$s?=hb=vh{
z#m)vN<`7Lf8cM0k3DO<sAhdqZtoq27scg>yY1q$Af^qMT#b6g)MM4YX_#X1eg2w@R
zpnPADR7s95-zu2nQuyf2sC#QrUhiYW1Bynno+beTyhjy^=d9R<V*68b@|~u_$JMDI
zRRa77Ou^d?{hKxBuh{wy9jEHOXC^$-yoEUsXlZWP%WXihybBCZIyUo*>jRyGwBGe$
z4M{x`cDZWVerK|%%wJ97P6fa>vs|ec$0>Dsxas-O<lT|*ENAmzB2<&zQRz*hJ#WYX
z{@k(1$I;Y#LqQes_6$;+97I2a5bonZ%i7jqu16jvux#@wUXi5v>b{p7$3kX@A5!|C
zLLaC3nm$&!MdN<Qf-A3Ix?97Mm5_oJ;*xrZx<C|fwb*W)ADQV?$OInGW9mc}lP
z4VC*4N{Z7a&f|L94N2~e_nNY=@R3Bb0~C{ccpjN`Rsm^hpr*Dj03ZIMI=^C_ZB;`*
zY89mCDIguwh*P_71zpI<PH4$3I3^W&*At%nGzZfEdcHhr#-cHlrCU967E(L)u|$rR
zvz5EA4gXAz9-JQit2d)~cxuN~0=0}d!7j!3cr%*sD__MD_#L!<4bID9-x0xJvb>IK
zGi#!eO%3cfAoN_G3?3YiN3wMn-vdQ_p}5ZeeqLnI1uI0L#b<FOtNn$^!y*%JDt)^;
zu?=}6nq4}l##*LvrgU`s5&Y4wgf{bIwysT>L9z~jqR)q}a|#B=sZdt`F0n+d{AD}+
zyTlTO^81#%Gg>6EMC*w!+w$-{79|EB>HS%y3RhEqqOCa*H#~%eQ=-W4BsJ?6Yhiiw
zC6Q+%e!VUYLoLlUroMqFgzs7&rRnG-;lqIbAxOJ9M^f^!O2~(XbhBC9+*ntU;;2#R
z$T9yOeUG}S68xvvrG!D$`|Kg^Td2RI`0o=A9W4d|qLF9H9P^Eh7~$5k>S}k<LK<7r
zB{fxIKT1b;Ka^SCp#o}na~N1Nj{i6tjzf-&iSu*QyvHT6D5|ZdN_eOHJnqPy-=81(
zjDP<di1@`Ga+w9HyCJ`3R)0OZdh(6Er^&>Popo6>JwK1-c?>+kumanx3XjaYL}o}@
zdQ{p;7=<>@Q-M6p#`cUy-%?Us?mOV&$Xk$(XRv5rmFSmgV`m*_a$1=R#wEZq8NN$Q
zY{?i$Gna%}j%;#DL2t*pl?9JpmmVT971Xb7R!X#k&J*$cS;e}!W{<mQorqp)cf196
zZfu_Ujn&2n@6L}Ql5#oeiuk3>>RrC<*CB*~OYbi{HaD?U1TRmwvz&YVsNLkZpQZ%9
zk8>4DFg7z$%B-3YAGZB$yg=`<YVhk<T$O8^t?)-<rfUT1G4wJ)Q}3~hQ%hs1@y*TF
zvOMBV<^3Kmfpgjzw>;F+2qDDh=4PvtoepV|^{v#s+m8Xy3Xc|hq&d$=gHH|5m(wgb
z@RtOO_yuO~!)p@s+99)fj5!c$^T+G4te8bZ%Iv8eHgum_@tJZz_GDL)J(2IU4{eht
zzfIhGyZW{JF#R@^%AI3ExgjxopB_|Du}5%&hSyGBOhau}zxL{#Lr%Yj*?$##RL>(E
z12n}m#fR=(jal5^7r-o})a|L#tI}U`GCiOst4)(LXyZgc>(1Zb(udizep`+yl$ID0
z&b}ALZF(-qvxUn8Ir6h{M8zxlxbh7&s8557C~(WAcJ$9WWUbehr}1&TM<kX*isg+~
zaVZ)`m4HnlSmu}t4wV>wV^xTle%>g9Pp>WYYrSub-`*7IoYq%I%Fv7Xr4s|oDtjN%
zcX9hRg5s#eLM16K03(m{XQ@bRQNO0-Dm7<*?m7G}Z?6vD##1GJH(a75^<Ocq!^aUB
z=SFgsx$bjVYsAc{6zN2EMVLh*knqLwnL9Q6Z|yiYyOpkp6z~hHKE>GJy6^42g;~Ql
zS@LG1_F@1c)cPgci+P1X<I#H?Zch(fRd#)LnqdYR98f@d;}RMUZ+s9_(r3v$^ORh(
zf7Zb|vDyXKX0U-!;$9h;EmWNh-#Oe|C0X-evSq4IN4rK+rEgHbSE*s-?P+)uBRv1-
zF@{fR{Pg>B|4P+BxlhbiEplxG2RpA2)%x(*#s^3_Yew;22g@y~nj7Go>YO*w$&BoC
z&+GQ78OXG014j)HfGb%qaOw#sv;$7y%4Q}@T8Wf{6{*s3CT=J*hILtjDGR>27k;)j
zs&ZW8UK3u?#63&)_H_KF*{m1NE#Fnx>tVIIB$9m3KK~a{l{B5J_DS(>(6|bZ1)26f
zb(W=2*>PdEm`HnLHyr6DS2lO{ox#@Zq6ZtXw+TgK49hV_>SMz0TbAvZj2;GsU{S%9
zYUv7Ohw2XrFzb3Y)X30IAY@>~{EsJ_pOg_cTb)l!zOpnv9RJ89LX8FAC-ugskuP1Z
zcy&67cQyfYtQ2Y<QF)tb2Z%xBjw6u_^!rWu0i@PF%Zl%L94cBmgsaiW?8)qbxM!8}
z9t^+K+s~`5Pu=jCFut}S-YYr|WuK~?GO%?UE#2wPfF3C#Cv@1R4bjNzCS@AHT`GKC
zaKtLF&x)@x-goZJ8waQSUv|a}q0|6mUEk-%)!Cr9d$wgF-Bs!nW}QKn+O28plkZw{
z+%2~2U~&iYFCmVuF^67?+XgGKM_+F2tEiNL?avRkFgD-c_0n+X?Cd!Xciz>PkyK+5
zgqDKYG*)A7Fj>kKq{TV%M5)Q_@bHHb%9?w&u&&BXY!T5+f63FE??|`Ms)&>gOzC0P
zusr|{xpUqKZ-`gm9_h_<-cL1X5olF~4ZGKe=dU}U(eGAmNJ@v(4Qwpqb(AdJbudNd
z8fphNC>Ib4lq8q}R|9@-eS6Se9M#n0NBwm}IniffP0`}g?KK^XoBPC@8IbD(^U-6<
z2|5Wuw%C`dN7{a@x7<D3FsxVhIS0oR-2&%t@RXB^c4}vvRH(O1uI=EpzIz(hgVb)J
z;EgF4C#5wmbL5ez+DEW<c7K2t87ocww7tVhH>kXsf3XyQHqOqGujYiXTX;Q}dS43D
z(dHL*vH~Te)8FL&f?9CjKoiUT_66Zt(q^kdbz22V=IF?DtGIMfr04$Fnozr5D5MSl
zB<-K{K({u@*_$0>%$Iux%H41jO9;1&<eN1Dc$n>IDGdcqJ4Saf;A8)b?KGR~HRiKR
zQlUidh3A~RFn-ENGnza7xG%IW$xNawM^RQ*sbc6L+BxfF*PLX%?3nS4e2sgtC+_&9
zZ)1<uM)l5zUN`Vh<XXHLdT*4yI1s8{+XTa7{}IP^*lgW8o?Bpnc4WTV-WXKgZb;<C
zqYt3oTIrXZZK7i;xr?@zr!8*f<iT5zFQU%bR!{skRVw({+n}4wizQ9zM^2QSF7@&W
zz*ywKXPSAJRN_U_3tHB{_e`BY@~Wn7d#IeL{Vi;$BXn@9httfyHrjUppZ?>XDF(3v
zXa}u{!98SRRIJDPyaP$*%xYjhID{rYjEPQr-&dBp83-Tv(Es^Yp)}7D<pi%sMkX!+
R|2-pi{D|G*603h+{cjcu_8|ZO

literal 0
HcmV?d00001

diff --git a/docs/mindformers/docs/source_en/feature/other_training_features.md b/docs/mindformers/docs/source_en/feature/other_training_features.md
index 11ad434633..070ce0a533 100644
--- a/docs/mindformers/docs/source_en/feature/other_training_features.md
+++ b/docs/mindformers/docs/source_en/feature/other_training_features.md
@@ -189,4 +189,62 @@ model_config:
   ...
   use_fused_swiglu: True
   ...
+```
+
+## SlidingWindowAttention
+
+### Overview
+
+SlidingWindowAttention is a sparse attention mechanism. It restricts each token to attending only to other tokens within a local window, which addresses the quadratic growth of computational complexity with sequence length in standard Transformer models. The core idea is to narrow the attention range from global to a fixed window size.
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+When using the SlidingWindowAttention module, you need to configure the `window_size` and `window_attn_skip_freq` items under the `model_config` item in the configuration file.
+
+The type of `window_size` is `Tuple[int, int]`, where `window_size[0]` represents `pre_tokens`, and `window_size[1]` represents `next_tokens`. Both are integers not less than -1, where -1 is a special value representing "infinite window size". The default starting point is the bottom right corner, as shown in the following figure:
+
+![sliding_window](./images/sliding_window.png)
+
+The type of `window_attn_skip_freq` is `Union[int, List[int]]`. It sets the frequency of full attention layers among the sliding window attention (SWA) layers and accepts either of the following:
+- An integer N: represents a (N-1):1 ratio, that is, one full attention layer after every (N-1) SWA layers.
+- A list defining a custom pattern, such as `[1,1,1,1,0,0,0,0]`, where 1 represents SWA.
+
+Example:
+
+```yaml
+model_config:
+  ...
+  window_size: (10, 0)
+  window_attn_skip_freq: 2
+  ...
+```
+
+## SharedKVCrossAttention
+
+### Overview
+
+SharedKVCrossAttention is an attention mechanism that requires only one KV cache: the KV cache generated by the decoder is shared and reused through cross-attention.
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+When using the SharedKVCrossAttention module, you need to configure the `model_architecture`, `num_encoder_layers`, `num_decoder_layers`, and `num_layers` items under the `model_config` item in the configuration file.
+
+`model_architecture` specifies the model structure. Supported values are `decoder_only` and `yoco`. Currently, only `yoco` supports SharedKVCrossAttention.
+
+`num_encoder_layers` is the number of encoder layers, and `num_decoder_layers` is the number of decoder layers. Their sum equals `num_layers`. When `model_architecture` is set to `yoco`, SharedKVCrossAttention is enabled once the encoder layers end, that is, from the first decoder layer onward. For example, if `num_encoder_layers` is set to 1 and `num_decoder_layers` is set to 1, SharedKVCrossAttention is enabled starting from the second layer. If only `num_encoder_layers` is configured, SharedKVCrossAttention is not enabled. If only `num_decoder_layers` is configured, SharedKVCrossAttention is enabled from the first layer onward.
+
+Example:
+
+```yaml
+model_config:
+  ...
+  model_architecture: "yoco"
+  num_layers: 2
+  num_encoder_layers: 1
+  num_decoder_layers: 1
+  ...
 ```
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/feature/configuration.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
index 303b61b100..73f8ce3d19 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/configuration.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
@@ -108,57 +108,62 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/
 
 由于不同的模型配置会有差异，这里介绍 MindSpore Transformers 中模型常用配置。
 
-| 参数                                                        | 数据类型                  | 是否可选 | 默认值        | 取值说明                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
-|-----------------------------------------------------------|-----------------------|------|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| model.model_config.model_type                             | string                | 必选   | None       | 设置模型配置类，模型配置类需要与模型类匹配使用，即模型配置类中应包含所有模型类使用的参数。例如 `qwen3`、`deepseek_v3` 等。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.architectures                          | string / list(string) | 必选   | None       | 设置模型类，构建模型时可以根据模型类对模型进行实例化。例如可设置为 `["Qwen3ForCausalLM"]`、`["DeepseekV3ForCausalLM"]`、`"Qwen3MoeForCausalLM"` 等。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
-| model.model_config.offset                                 | int / list(int)       | 可选   | 0          | 在流水线并行（PP）中，设置每个stage层数的偏移量：当模型层数无法均分时，用于精确分配各阶段的层数。<br><br>**规则1（基础PP）**：当 `pipeline_interleave = 1` 时，`offset` 为长度为 `pipeline_stage` 的列表。<br> - `offset[i]` 表示第 `i` 个阶段在基础层数上**额外增加**的层数。<br> - **约束**：`sum(offset)` 必须等于 `num_layers % pipeline_stage`。<br> - **示例**：当`pipeline_stage=4`、`num_layers=5`时，设 `offset=[0,0,1,0]`，则各阶段层数为：[1, 1, 2, 1]。<br><br>**规则2（启用交错）**：当 `pipeline_interleave > 1` 时，`offset` 为**嵌套列表**，格式为 `offset[interleave_id][stage_id]`。<br> - 外层列表长度 = `pipeline_interleave`，内层列表长度 = `pipeline_stage`。<br> - **约束**：所有内层偏移值之和必须等于 `num_layers % (pipeline_stage * pipeline_interleave)`。<br> - **示例**：当`pipeline_interleave = 2`、`pipeline_stage = 2`、`num_layers = 5`时，设 `offset = [[0,0],[1,0]]`，则表示第二个交错组中的第一个阶段多分配1层。 |
-| model.model_config.vocab_size                             | int                   | 可选   | 128000     | 模型的词表大小。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.hidden_size                            | int                   | 必选   | 0          | Transformer 隐藏层大小。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
-| model.model_config.ffn_hidden_size                        | int                   | 可选   | None       | Transformer 前馈层大小，对应 HuggingFace 中的 `intermediate_size` 。若不配置，默认设置为 `4 * hidden_size`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
-| model.model_config.num_layers                             | int                   | 必选   | 0          | Transformer 层数，对应 HuggingFace 中的 `num_hidden_layers`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
-| model.model_config.max_position_embeddings                | int                   | 可选   | 4096       | 模型可以处理的最大序列长度。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
-| model.model_config.hidden_act                             | string                | 可选   | 'gelu'     | 用于 MLP 中的非线性的激活函数。可选配：`'gelu'`、`'silu'`、`'swiglu'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-| model.model_config.num_attention_heads                    | int                   | 必选   | 0          | Transformer 注意力头数。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
-| model.model_config.num_query_groups                       | int                   | 可选   | None       | 组查询注意力机制的查询组数量，对应 HuggingFace 中的 `num_key_value_heads` 。若不配置，则使用普通注意力机制。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.kv_channels                            | int                   | 可选   | None       | 多头注意力机制中的投影权重维度，对应 HuggingFace 中的 `head_dim`。若不配置，则默认为 `hidden_size // num_attention_heads`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-| model.model_config.layernorm_epsilon                      | float                 | 可选   | 1e-5       | 任何 LayerNorm 操作的 Epsilon 值。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
-| model.model_config.add_bias_linear                        | bool                  | 可选   | True       | 如果开启此项，则将在所有线性层中包含一个偏差项（QKV 投影、core attention 之后以及 MLP 层中的两个）。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
-| model.model_config.tie_word_embeddings                    | bool                  | 可选   | True       | 是否共享输入和输出 embedding 权重。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
-| model.model_config.use_flash_attention                    | bool                  | 可选   | True       | 是否在注意力层中使用 Flash Attention。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
-| model.model_config.use_contiguous_weight_layout_attention | bool                  | 可选   | False      | 确定 Self Attention 的 QKV 线性投影中的权重排列。仅影响 Self Attention 层。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.hidden_dropout                         | float                 | 可选   | 0.1        | Transformer 隐藏状态的 Dropout 概率。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
-| model.model_config.attention_dropout                      | float                 | 可选   | 0.1        | 后注意力层的 Dropout 概率。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
-| model.model_config.position_embedding_type                | string                | 可选   | 'rope'     | 用于注意层的位置嵌入类型。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
-| model.model_config.params_dtype                           | string                | 可选   | 'float32'  | 初始化权重时使用的 dtype。可以配置为 `'float32'`、`'float16'`、`'bfloat16'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
-| model.model_config.compute_dtype                          | string                | 可选   | 'bfloat16' | Linear 层的计算 dtype。可以配置为 `'float32'`、`'float16'`、`'bfloat16'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
-| model.model_config.layernorm_compute_dtype                | string                | 可选   | 'float32'  | LayerNorm 层的计算 dtype。可以配置为 `'float32'`、`'float16'`、`'bfloat16'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
-| model.model_config.softmax_compute_dtype                  | string                | 可选   | 'float32'  | 用于在注意力计算期间计算 softmax 的 dtype。可以配置为 `'float32'`、`'float16'`、`'bfloat16'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.rotary_dtype                           | string                | 可选   | 'float32'  | 自定义旋转位置嵌入的计算 dtype。可以配置为 `'float32'`、`'float16'`、`'bfloat16'`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
-| model.model_config.init_method_std                        | float                 | 可选   | 0.02       | 默认初始化方法的零均值正态的标准偏差，对应 HuggingFace 中的 `initializer_range` 。如果提供了 `init_method` 和 `output_layer_init_method` ，则不使用此方法。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
-| model.model_config.moe_grouped_gemm                       | bool                  | 可选   | False      | 当每个等级有多个专家时，在单次内核启动中压缩多个本地（可能很小）gemm，以利用分组 GEMM 功能来提高利用率和性能。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-| model.model_config.num_moe_experts                        | int                   | 可选   | None       | 用于 MoE 层的专家数量，对应 HuggingFace 中的 `n_routed_experts` 。设置后，将用 MoE 层替换 MLP。设置为 None 则不使用 MoE。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
-| model.model_config.num_experts_per_tok                    | int                   | 可选   | 2          | 每个 token 路由到的专家数量。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
-| model.model_config.moe_ffn_hidden_size                    | int                   | 可选   | None       | MoE 前馈网络隐藏层大小，对应 HuggingFace 中的 `moe_intermediate_size` 。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
-| model.model_config.moe_router_dtype                       | string                | 可选   | 'float32'  | 用于路由和专家输出加权平均的数据类型。对应 HuggingFace 中的 `router_dense_type` 。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-| model.model_config.gated_linear_unit                      | bool                  | 可选   | False      | 对 MLP 中的第一个线性层使用门控线性单元。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
-| model.model_config.norm_topk_prob                         | bool                  | 可选   | True       | 是否使用 top-k 概率进行归一化。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-| model.model_config.moe_router_pre_softmax                 | bool                  | 可选   | False      | 为 MoE 启用 pre-softmax（pre-sigmoid）路由，这意味着 softmax 会在 top-k 选择之前进行。默认情况下，softmax 会在 top-k 选择之后进行。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
-| model.model_config.moe_token_drop_policy                  | string                | 可选   | 'probs'    | 丢弃 token 的策略。可以是 `'probs'` 或 `'position'`。如果是 `'probs'` ，则丢弃概率最低的 token。 如果是 `'position'` ，则丢弃每个批次末尾的 token。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-| model.model_config.moe_router_topk_scaling_factor         | float                 | 可选   | None       | Top-K 路由选择中路由得分的缩放因子，对应 HuggingFace 中的 `routed_scaling_factor` 。仅在启用 `moe_router_pre_softmax` 时有效。默认为 `None`，表示不缩放。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-| model.model_config.moe_aux_loss_coeff                     | float                 | 可选   | 0.0        | 辅助损耗的缩放系数。建议初始值为 `1e-2`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.moe_router_load_balancing_type         | string                | 可选   | 'aux_loss' | 路由器的负载均衡策略。 `'aux_loss'` 对应于 GShard 和 SwitchTransformer 中使用的负载均衡损失；`'seq_aux_loss'` 对应于 DeepSeekV2 和 DeepSeekV3 中使用的负载均衡损失，用于计算每个样本的损失；`'sinkhorn'` 对应于 S-BASE 中使用的均衡算法，`'none'` 表示无负载均衡。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
-| model.model_config.moe_permute_fusion                     | bool                  | 可选   | False      | 是否使用 moe_token_permute 融合算子，默认为 `False`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| model.model_config.moe_router_force_expert_balance        | bool                  | 可选   | False      | 是否在专家路由中使用强制负载均衡。此选项仅用于性能测试，不用于一般用途，默认为 `False`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
-| model.model_config.use_interleaved_weight_layout_mlp      | bool                  | 可选   | True       | 确定 MLP 的 linear_fc1 投影中的权重排列。仅影响 MLP 层。 <br>1. 为 True 时，使用交错排布：`[Gate_weights[0], Hidden_weights[0], Gate_weights[1], Hidden_weights[1], ...]`。<br> 2. 为 False 时，使用连续排布：`[Gate_weights, Hidden_weights]`。<br>注意：这会影响张量内存布局，但不会影响数学等价性。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
-| model.model_config.moe_router_enable_expert_bias          | bool                  | 可选   | False      | 是否在无辅助损失负载均衡策略中，采用动态专家偏差的 TopK 路由。路由决策基于路由得分与专家偏差之和。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
-| model.model_config.enable_expert_relocation               | bool                  | 可选   | False      | 是否启用动态专家迁移功能，以实现 MoE 模型中的负载平衡。启用后，专家将根据其负载历史记录在设备之间动态重新分配，以提高训练效率和负载平衡，默认为 `False`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
-| model.model_config.expert_relocation_initial_iteration    | int                   | 可选   | 20         | 启动专家迁移的初始迭代。专家迁移将在经过这么多次训练迭代后开始。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
-| model.model_config.expert_relocation_freq                 | int                   | 可选   | 50         | 训练迭代中专家迁移的频率。初始迭代后，每 N 次迭代执行一次专家迁移。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
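-| model.model_config.print_expert_load                      | bool                  | Optional | False      | Whether to print expert load information. When enabled, detailed expert load statistics are printed during training. Defaults to `False`. |
-| model.model_config.moe_router_num_groups                  | int                   | Optional | None       | Number of expert groups used for group-limited routing. Equivalent to `n_group` in HuggingFace. |
-| model.model_config.moe_router_group_topk                  | int                   | Optional | None       | Number of groups selected in group-limited routing. Equivalent to `topk_group` in HuggingFace. |
-| model.model_config.moe_router_topk                        | int                   | Optional | 2          | Number of experts each token is routed to. Equivalent to `num_experts_per_tok` in HuggingFace. When used together with `moe_router_num_groups` and `moe_router_group_topk`, the experts are first divided into `moe_router_num_groups` groups, then `moe_router_group_topk` groups are selected, and finally `moe_router_topk` experts are chosen from the selected groups. |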
+| Parameter                                                 | Type                  | Optional | Default       | Description |
+|-----------------------------------------------------------|-----------------------|------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
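+| model.model_config.model_type                             | string                | Required | None          | Sets the model configuration class. The configuration class must match the model class, that is, it must contain every parameter the model class uses. Examples: `qwen3`, `deepseek_v3`. |
+| model.model_config.architectures                          | string / list(string) | Required | None          | Sets the model class, which is used to instantiate the model when it is built. Examples: `["Qwen3ForCausalLM"]`, `["DeepseekV3ForCausalLM"]`, `"Qwen3MoeForCausalLM"`. |
+| model.model_config.offset                                 | int / list(int)       | Optional | 0             | In pipeline parallelism (PP), sets the per-stage layer-count offset. It is used to assign layers to stages precisely when the number of layers cannot be divided evenly.<br><br>**Rule 1 (basic PP)**: when `pipeline_interleave = 1`, `offset` is a list of length `pipeline_stage`.<br> - `offset[i]` is the number of layers **added** to stage `i` on top of the base layer count.<br> - **Constraint**: `sum(offset)` must equal `num_layers % pipeline_stage`.<br> - **Example**: with `pipeline_stage=4` and `num_layers=5`, setting `offset=[0,0,1,0]` gives per-stage layer counts of [1, 1, 2, 1].<br><br>**Rule 2 (interleaving enabled)**: when `pipeline_interleave > 1`, `offset` is a **nested list** of the form `offset[interleave_id][stage_id]`.<br> - Outer list length = `pipeline_interleave`; inner list length = `pipeline_stage`.<br> - **Constraint**: the sum of all inner offset values must equal `num_layers % (pipeline_stage * pipeline_interleave)`.<br> - **Example**: with `pipeline_interleave = 2`, `pipeline_stage = 2`, and `num_layers = 5`, setting `offset = [[0,0],[1,0]]` assigns one extra layer to the first stage of the second interleave group. |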
+| model.model_config.vocab_size                             | int                   | Optional | 128000        | Vocabulary size of the model. |
+| model.model_config.hidden_size                            | int                   | Required | 0             | Transformer hidden size. |
+| model.model_config.ffn_hidden_size                        | int                   | Optional | None          | Transformer feed-forward layer size. Corresponds to `intermediate_size` in HuggingFace. If not set, defaults to `4 * hidden_size`. |
+| model.model_config.num_layers                             | int                   | Required | 0             | Number of Transformer layers. Corresponds to `num_hidden_layers` in HuggingFace. |
+| model.model_config.max_position_embeddings                | int                   | Optional | 4096          | Maximum sequence length the model can process. |
+| model.model_config.hidden_act                             | string                | Optional | 'gelu'        | Nonlinear activation function used in the MLP. Options: `'gelu'`, `'silu'`, `'swiglu'`. |
+| model.model_config.num_attention_heads                    | int                   | Required | 0             | Number of Transformer attention heads. |
+| model.model_config.num_query_groups                       | int                   | Optional | None          | Number of query groups for grouped-query attention. Corresponds to `num_key_value_heads` in HuggingFace. If not set, standard attention is used. |
+| model.model_config.kv_channels                            | int                   | Optional | None          | Projection weight dimension in multi-head attention. Corresponds to `head_dim` in HuggingFace. If not set, defaults to `hidden_size // num_attention_heads`. |
+| model.model_config.layernorm_epsilon                      | float                 | Optional | 1e-5          | Epsilon value used by all LayerNorm operations. |
+| model.model_config.add_bias_linear                        | bool                  | Optional | True          | If enabled, a bias term is included in all linear layers (the QKV projection, the projection after core attention, and the two linear layers in the MLP). |
+| model.model_config.tie_word_embeddings                    | bool                  | Optional | True          | Whether to share the input and output embedding weights. |
+| model.model_config.use_flash_attention                    | bool                  | Optional | True          | Whether to use Flash Attention in attention layers. |
+| model.model_config.use_contiguous_weight_layout_attention | bool                  | Optional | False         | Determines the weight layout of the QKV linear projection in Self Attention. Affects only Self Attention layers. |
+| model.model_config.hidden_dropout                         | float                 | Optional | 0.1           | Dropout probability for Transformer hidden states. |
+| model.model_config.attention_dropout                      | float                 | Optional | 0.1           | Post-attention dropout probability. |
+| model.model_config.position_embedding_type                | string                | Optional | 'rope'        | Type of position embedding used in attention layers. |
+| model.model_config.params_dtype                           | string                | Optional | 'float32'     | dtype used when initializing weights. Can be `'float32'`, `'float16'`, or `'bfloat16'`. |
+| model.model_config.compute_dtype                          | string                | Optional | 'bfloat16'    | Compute dtype for Linear layers. Can be `'float32'`, `'float16'`, or `'bfloat16'`. |
+| model.model_config.layernorm_compute_dtype                | string                | Optional | 'float32'     | Compute dtype for LayerNorm layers. Can be `'float32'`, `'float16'`, or `'bfloat16'`. |
+| model.model_config.softmax_compute_dtype                  | string                | Optional | 'float32'     | dtype used to compute softmax during attention computation. Can be `'float32'`, `'float16'`, or `'bfloat16'`. |
+| model.model_config.rotary_dtype                           | string                | Optional | 'float32'     | Compute dtype for custom rotary position embeddings. Can be `'float32'`, `'float16'`, or `'bfloat16'`. |
+| model.model_config.init_method_std                        | float                 | Optional | 0.02          | Standard deviation of the zero-mean normal distribution used by the default initialization method. Corresponds to `initializer_range` in HuggingFace. Not used if both `init_method` and `output_layer_init_method` are provided. |
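+| model.model_config.moe_grouped_gemm                       | bool                  | Optional | False         | When there are multiple experts per rank, fuses multiple local (possibly small) GEMMs into a single kernel launch, using grouped GEMM to improve utilization and performance. |
+| model.model_config.num_moe_experts                        | int                   | Optional | None          | Number of experts used in MoE layers. Corresponds to `n_routed_experts` in HuggingFace. When set, the MLP is replaced with an MoE layer. Set to None to disable MoE. |
+| model.model_config.num_experts_per_tok                    | int                   | Optional | 2             | Number of experts each token is routed to. |
+| model.model_config.moe_ffn_hidden_size                    | int                   | Optional | None          | Hidden size of the MoE feed-forward network. Corresponds to `moe_intermediate_size` in HuggingFace. |
+| model.model_config.moe_router_dtype                       | string                | Optional | 'float32'     | Data type used for routing and for the weighted average of expert outputs. Corresponds to `router_dense_type` in HuggingFace. |
+| model.model_config.gated_linear_unit                      | bool                  | Optional | False         | Whether to use a gated linear unit for the first linear layer in the MLP. |
+| model.model_config.norm_topk_prob                         | bool                  | Optional | True          | Whether to normalize the top-k probabilities. |
+| model.model_config.moe_router_pre_softmax                 | bool                  | Optional | False         | Enables pre-softmax (pre-sigmoid) routing for MoE, meaning softmax is applied before top-k selection. By default, softmax is applied after top-k selection. |
+| model.model_config.moe_token_drop_policy                  | string                | Optional | 'probs'       | Policy for dropping tokens. Can be `'probs'` or `'position'`. With `'probs'`, the tokens with the lowest probabilities are dropped. With `'position'`, the tokens at the end of each batch are dropped. |
+| model.model_config.moe_router_topk_scaling_factor         | float                 | Optional | None          | Scaling factor for routing scores in top-k routing. Corresponds to `routed_scaling_factor` in HuggingFace. Effective only when `moe_router_pre_softmax` is enabled. Defaults to `None`, meaning no scaling. |
+| model.model_config.moe_aux_loss_coeff                     | float                 | Optional | 0.0           | Scaling coefficient for the auxiliary loss. A suggested starting value is `1e-2`. |
+| model.model_config.moe_router_load_balancing_type         | string                | Optional | 'aux_loss'    | Load-balancing strategy for the router. `'aux_loss'` corresponds to the load-balancing loss used in GShard and SwitchTransformer; `'seq_aux_loss'` corresponds to the load-balancing loss used in DeepSeekV2 and DeepSeekV3, which computes the loss per sample; `'sinkhorn'` corresponds to the balancing algorithm used in S-BASE; `'none'` means no load balancing. |
+| model.model_config.moe_permute_fusion                     | bool                  | Optional | False         | Whether to use the moe_token_permute fused operator. Defaults to `False`. |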
+| model.model_config.moe_router_force_expert_balance        | bool                  | 可选   | False         | 是否在专家路由中使用强制负载均衡。此选项仅用于性能测试，不用于一般用途，默认为 `False`。                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
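+| model.model_config.use_interleaved_weight_layout_mlp      | bool                  | Optional | True          | Determines the weight layout in the MLP's linear_fc1 projection. Affects MLP layers only. <br>1. When True, an interleaved layout is used: `[Gate_weights[0], Hidden_weights[0], Gate_weights[1], Hidden_weights[1], ...]`.<br> 2. When False, a contiguous layout is used: `[Gate_weights, Hidden_weights]`.<br>Note: this changes the tensor memory layout but does not affect mathematical equivalence. |
+| model.model_config.moe_router_enable_expert_bias          | bool                  | Optional | False         | Whether to use top-k routing with dynamic expert bias in the auxiliary-loss-free load-balancing strategy. Routing decisions are based on the sum of the routing scores and the expert bias. |
+| model.model_config.enable_expert_relocation               | bool                  | Optional | False         | Whether to enable dynamic expert relocation for load balancing in MoE models. When enabled, experts are dynamically redistributed across devices based on their load history to improve training efficiency and load balance. Defaults to `False`. |
+| model.model_config.expert_relocation_initial_iteration    | int                   | Optional | 20            | The initial iteration at which expert relocation starts. Expert relocation begins after this many training iterations. |
+| model.model_config.expert_relocation_freq                 | int                   | Optional | 50            | Frequency of expert relocation, in training iterations. After the initial iteration, expert relocation is performed every N iterations. |
+| model.model_config.print_expert_load                      | bool                  | Optional | False         | Whether to print expert load information. When enabled, detailed expert load statistics are printed during training. Defaults to `False`. |
+| model.model_config.moe_router_num_groups                  | int                   | Optional | None          | The number of expert groups to use for group-limited routing. Equivalent to `n_group` in HuggingFace. |
+| model.model_config.moe_router_group_topk                  | int                   | Optional | None          | The number of selected groups for group-limited routing. Equivalent to `topk_group` in HuggingFace. |
+| model.model_config.moe_router_topk                        | int                   | Optional | 2             | The number of experts each token is routed to. Equivalent to `num_experts_per_tok` in HuggingFace. When used together with `moe_router_num_groups` and `moe_router_group_topk`, the experts are first divided into `moe_router_num_groups` groups, then `moe_router_group_topk` groups are selected, and finally `moe_router_topk` experts are chosen from the selected groups. |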
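+| model.model_config.window_size                            | tuple(int, int)       | Optional | None          | If not `None`, sliding window attention is used. The window size is given by the two numbers in the tuple: `window_size[0]` is `pre_tokens` and `window_size[1]` is `next_tokens`. -1 is a special value meaning "infinite window size". |
+| model.model_config.window_attn_skip_freq                  | int / list(int)       | Optional | None          | Frequency of full attention layers among the sliding window attention (SWA) layers. An integer N denotes an (N-1):1 ratio, i.e. one full attention layer after every (N-1) SWA layers. A list defines a custom pattern, e.g. `[1,1,1,1,0,0,0]`, where 1 denotes SWA. |
+| model.model_config.model_architecture                     | str                   | Optional | 'decoder_only' | The model architecture. Valid values are `decoder_only` and `yoco`; currently only `yoco` supports SharedKVCrossAttention. |
+| model.model_config.num_encoder_layers                     | int                   | Optional | None          | Number of encoder layers. When the model architecture is set to `yoco`, at least one of the number of encoder layers and the number of decoder layers must be set. |
+| model.model_config.num_decoder_layers                     | int                   | Optional | None          | Number of decoder layers. When the model architecture is set to `yoco`, at least one of the number of encoder layers and the number of decoder layers must be set. |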
 
 ### 模型训练配置
 
diff --git a/docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png b/docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png
new file mode 100644
index 0000000000000000000000000000000000000000..a7f218e487add3ee210ee772637a2aa718b26d2f
GIT binary patch
literal 9060
zcmeHNX;4$?w#DawGTJSJGTEpI$S4s731EZ6RhbbYg5(AvC?JE(5}*|kMFsoH6dBTL
zgD65oWRgGx8WKqWZKfbI7}7u(LL?9~y>nt{rJ7gus@|<vx4ZfWP?hB5>~HVA)?RBL
zZaUZ<la^4D5EB!VKK_rxPGVwf$iUBzjT?Yh!qP{>#Kd;$96$X1sVL9sT*+rXKCw+}
zw!5ING7mx;4a=9TcHN-7Kh^nkLH?fKZKaD|_X}LU`ytip$rD#A$AD;}R;q-{Ix`~V
zYU;!F*P6BVRCKwnm;XlF$w6*SQQ_-p!Prcr3V-T{P&X8DaeC-PTuyDzpzlB^ZJyUN
zK%-yLt2ne#{o32{3FS=Gxiw;9GuxsI9;Ud&4P)@$Bc48EZ<L!t)tzILGK^#`ejARF
zH|mPq8iUMxQh3fWYtroQlRE2S^-Oa`;N|n$koHrZrCK%%s>D-654!J?@+8rdW-rsw
zzI!H5%N2Z*@qRk%;7{~vPXs=4riBcV57(7&jui*4Ej0e{f@TGLM9*g&oqSLs@k!lU
z;Mz{5OTD7IGi{Ej-fB2CbOpH9Pr9034c_`B8VvRb;_2)m;M(%*9xM;=*7;<pvm@c5
zVOQjLK;&nv^<(sbTfcSs3x<6Hk+J=tn3!%Zc;eCJY^I~Ja;s$m9X+>N-i(;Tt7X||
zyneNsw2%{4t6(}>9J5Njbs(}9t8|AeQVq3QQOMAxzi{9#5@p=CYsCIJ@d0ddm<T{8
z(d@}a5EFhYnjZj8(i1SFohzUMz)aQ|%xGcy*T4e0Tde@hsDsW57y?+ZISYcqbw&3B
z$jN#oMDGf$WEe3G!$G*8S%EA7m550Yp5;bg;}F1bW9MHt@Mlh0)Fc@(v6Gr0uT_c2
zr#TOeJrDNil?D~Gi0lw_bo8MbAf0OyO^rmn*hqmee~18bDJ~J}z05_WzT<1A!9>c9
z6Az1M>s&i|?RzkbTI3&Zez}SGOR)A1Sv$9g$ndr~MD{#b_@4}q?G_;&pklSRVAT&H
zQfx#NYsex)e*#PXrwuT@LDv!(il+|>N09_X(EY)QuhD|n6Q=m{Oz9~i0SW$UC<uG}
zf_HoV3icNc7Z;`8&<iW3lW>ZpA7y#097Do!k^iH4CV+##zZz(@|6M}%B)Y}@CMNbg
zY#vBjQMWvj<#7VZzt{^lHCUE>GS2htDw#${*S=aQ4}@dfh<Q1Dm2&xvFHHWrVulMP
zLJN7SYn7U(qkD=@uG9yg$q@@%t-450to};v@tGX|x`F2#l4MoI#4_waEmt0vh&Ips
z7dqZu$XGz-fiVMQYIzJydW8e60D|f7*Z*_!utH9BQ?Q0*G$$iw5J@-?K8xvvNh{H0
zC|9^FRL4T<1F`h?8~=GJSpmurVDMWPlev8_Kqau!?*J;#;1nnZK!*M@tpJthS@>1?
zMP3T{7Mz1n5&(?+g@{3w_W)FZAd!AW#DL1P0#%?0@5`<Zkf(`46}Zw&0rFI`>kHK{
zTQ5MKDn9-H3VaeeUER)C^pQ2z0+i7o&gIwb$(e~Ks&VK7<^g)yPz`f$XBMd4{%|BW
z#|%m5`TyzNGNlszAT^15MdWVKbpD|Txn1-w4H#PW1|Z}9P>O>tuU_RexFAVijmz9U
zEsRlAxM^^Rg`)9}62P$n{3@uRgA!4s<vXKCiz#-XSOg@#CwIBHpc2p3lqlN*%1xkl
ziumPR7Xyl(cC&$g7}SHnOlz85@of(c&AtH}3z&_JWjzHtjPN`y^53`l_a^NtA=|bk
zvG*qEApozyJ`gN|?y@%kb-dPcX+_=u)bS2Nqw&QXfJWn|DiLBu-T*WjheAYf6?uas
zz-U|<71e#&8-PZmNd(koZveO%@nliwESne5XvqE913dM`Y9F+hLuxMXMzqlE%FE#4
zgj+1;VMN4^<*murJ1F5b%dTz32I=2>$vG{uYGgm7zIV!QwGCN#-em)_OOfle3EC!*
z6c@RAL}746WLi&Y5e(~@$$O^tiW)MwsTcm#wZx;^8P&@97!y|R)s#$V_OH{f6Vx(G
zYvfkjujM}GyfS>WF78|CUXGi8_p@G&6Jz)31}@SH<7-D3;`Uc%H1yjDT^3T!3G7v{
zL7iz6GZK*NEQ71Cb(oZqShsB-s4YR4p*UDl<b|jqS&NFmfmau;QYB_vDDgg~zBSVo
z?y8&~@tP(!MvH1phgSylzIaoBtvGJo>{5S`iW@%Jlu8Ry?IUck)ZJDpcyMXvV|1zD
z>e4H?UcL&!U{hYYB&1@<bw0xY&Yvsy?+H|<<(=`2kIE#l{E=;(u(|Ge<57lxNL%I3
zscqEL(4eF_3s~M36eOkA?-LaE;{uBa-BopW^`rF2jo;b(x?QK~+Zy+b>8+#UGK2`}
zIJLnJQt6!i4c@($(yD_fPddCdj`!&pVeCS^CD~j5+2Cy*GBLJo*I7gc)-l;Mne~nJ
zle{fCZYfPYC{%!8UkctkivP(ZzBd5T%*QmbG@1-JrM!2;A1SCq(Y91Jqt+rYicjzo
zBvXc>DS|#HFYC6XtBu#Eiyxt2>C<lVSZKT&<4n&u^VTROus*ZJ_5{`6F)Mo$Ftq{Z
zwr&A24FK795;|}eQR0c>d_;Hffat|FWeGBR`9`77q&V3HoDlAB!w=``<nx(t(aae~
zM21d7)S$s5F^ti|^~#GucL;=E#>^7zZBuQH^~h1E3(&Zrg*o;_MPDF?NWp7R!Y>D^
za1Nsecr6i2zUzasO<COGNd)Hy6IN~%%8edqY))|09wc&4pm>^@yiJMK?9Uzw`ZLI4
zAaxbAbvQ{tDr9eeA9u7jg>Wxn`}#~7lulKNoV#hAA7iF-o;9VK4}s3A$s?=hb=vh{
z#m)vN<`7Lf8cM0k3DO<sAhdqZtoq27scg>yY1q$Af^qMT#b6g)MM4YX_#X1eg2w@R
zpnPADR7s95-zu2nQuyf2sC#QrUhiYW1Bynno+beTyhjy^=d9R<V*68b@|~u_$JMDI
zRRa77Ou^d?{hKxBuh{wy9jEHOXC^$-yoEUsXlZWP%WXihybBCZIyUo*>jRyGwBGe$
z4M{x`cDZWVerK|%%wJ97P6fa>vs|ec$0>Dsxas-O<lT|*ENAmzB2<&zQRz*hJ#WYX
z{@k(1$I;Y#LqQes_6$;+97I2a5bonZ%i7jqu16jvux#@wUXi5v>b{p7$3kX@A5!|C
zLLaC3nm$&!MdN<Qf-A3Ix?97Mm5_oJ;*xrZx<C|fwb*W)ADQV?$OInGW9mc}lP
z4VC*4N{Z7a&f|L94N2~e_nNY=@R3Bb0~C{ccpjN`Rsm^hpr*Dj03ZIMI=^C_ZB;`*
zY89mCDIguwh*P_71zpI<PH4$3I3^W&*At%nGzZfEdcHhr#-cHlrCU967E(L)u|$rR
zvz5EA4gXAz9-JQit2d)~cxuN~0=0}d!7j!3cr%*sD__MD_#L!<4bID9-x0xJvb>IK
zGi#!eO%3cfAoN_G3?3YiN3wMn-vdQ_p}5ZeeqLnI1uI0L#b<FOtNn$^!y*%JDt)^;
zu?=}6nq4}l##*LvrgU`s5&Y4wgf{bIwysT>L9z~jqR)q}a|#B=sZdt`F0n+d{AD}+
zyTlTO^81#%Gg>6EMC*w!+w$-{79|EB>HS%y3RhEqqOCa*H#~%eQ=-W4BsJ?6Yhiiw
zC6Q+%e!VUYLoLlUroMqFgzs7&rRnG-;lqIbAxOJ9M^f^!O2~(XbhBC9+*ntU;;2#R
z$T9yOeUG}S68xvvrG!D$`|Kg^Td2RI`0o=A9W4d|qLF9H9P^Eh7~$5k>S}k<LK<7r
zB{fxIKT1b;Ka^SCp#o}na~N1Nj{i6tjzf-&iSu*QyvHT6D5|ZdN_eOHJnqPy-=81(
zjDP<di1@`Ga+w9HyCJ`3R)0OZdh(6Er^&>Popo6>JwK1-c?>+kumanx3XjaYL}o}@
zdQ{p;7=<>@Q-M6p#`cUy-%?Us?mOV&$Xk$(XRv5rmFSmgV`m*_a$1=R#wEZq8NN$Q
zY{?i$Gna%}j%;#DL2t*pl?9JpmmVT971Xb7R!X#k&J*$cS;e}!W{<mQorqp)cf196
zZfu_Ujn&2n@6L}Ql5#oeiuk3>>RrC<*CB*~OYbi{HaD?U1TRmwvz&YVsNLkZpQZ%9
zk8>4DFg7z$%B-3YAGZB$yg=`<YVhk<T$O8^t?)-<rfUT1G4wJ)Q}3~hQ%hs1@y*TF
zvOMBV<^3Kmfpgjzw>;F+2qDDh=4PvtoepV|^{v#s+m8Xy3Xc|hq&d$=gHH|5m(wgb
z@RtOO_yuO~!)p@s+99)fj5!c$^T+G4te8bZ%Iv8eHgum_@tJZz_GDL)J(2IU4{eht
zzfIhGyZW{JF#R@^%AI3ExgjxopB_|Du}5%&hSyGBOhau}zxL{#Lr%Yj*?$##RL>(E
z12n}m#fR=(jal5^7r-o})a|L#tI}U`GCiOst4)(LXyZgc>(1Zb(udizep`+yl$ID0
z&b}ALZF(-qvxUn8Ir6h{M8zxlxbh7&s8557C~(WAcJ$9WWUbehr}1&TM<kX*isg+~
zaVZ)`m4HnlSmu}t4wV>wV^xTle%>g9Pp>WYYrSub-`*7IoYq%I%Fv7Xr4s|oDtjN%
zcX9hRg5s#eLM16K03(m{XQ@bRQNO0-Dm7<*?m7G}Z?6vD##1GJH(a75^<Ocq!^aUB
z=SFgsx$bjVYsAc{6zN2EMVLh*knqLwnL9Q6Z|yiYyOpkp6z~hHKE>GJy6^42g;~Ql
zS@LG1_F@1c)cPgci+P1X<I#H?Zch(fRd#)LnqdYR98f@d;}RMUZ+s9_(r3v$^ORh(
zf7Zb|vDyXKX0U-!;$9h;EmWNh-#Oe|C0X-evSq4IN4rK+rEgHbSE*s-?P+)uBRv1-
zF@{fR{Pg>B|4P+BxlhbiEplxG2RpA2)%x(*#s^3_Yew;22g@y~nj7Go>YO*w$&BoC
z&+GQ78OXG014j)HfGb%qaOw#sv;$7y%4Q}@T8Wf{6{*s3CT=J*hILtjDGR>27k;)j
zs&ZW8UK3u?#63&)_H_KF*{m1NE#Fnx>tVIIB$9m3KK~a{l{B5J_DS(>(6|bZ1)26f
zb(W=2*>PdEm`HnLHyr6DS2lO{ox#@Zq6ZtXw+TgK49hV_>SMz0TbAvZj2;GsU{S%9
zYUv7Ohw2XrFzb3Y)X30IAY@>~{EsJ_pOg_cTb)l!zOpnv9RJ89LX8FAC-ugskuP1Z
zcy&67cQyfYtQ2Y<QF)tb2Z%xBjw6u_^!rWu0i@PF%Zl%L94cBmgsaiW?8)qbxM!8}
z9t^+K+s~`5Pu=jCFut}S-YYr|WuK~?GO%?UE#2wPfF3C#Cv@1R4bjNzCS@AHT`GKC
zaKtLF&x)@x-goZJ8waQSUv|a}q0|6mUEk-%)!Cr9d$wgF-Bs!nW}QKn+O28plkZw{
z+%2~2U~&iYFCmVuF^67?+XgGKM_+F2tEiNL?avRkFgD-c_0n+X?Cd!Xciz>PkyK+5
zgqDKYG*)A7Fj>kKq{TV%M5)Q_@bHHb%9?w&u&&BXY!T5+f63FE??|`Ms)&>gOzC0P
zusr|{xpUqKZ-`gm9_h_<-cL1X5olF~4ZGKe=dU}U(eGAmNJ@v(4Qwpqb(AdJbudNd
z8fphNC>Ib4lq8q}R|9@-eS6Se9M#n0NBwm}IniffP0`}g?KK^XoBPC@8IbD(^U-6<
z2|5Wuw%C`dN7{a@x7<D3FsxVhIS0oR-2&%t@RXB^c4}vvRH(O1uI=EpzIz(hgVb)J
z;EgF4C#5wmbL5ez+DEW<c7K2t87ocww7tVhH>kXsf3XyQHqOqGujYiXTX;Q}dS43D
z(dHL*vH~Te)8FL&f?9CjKoiUT_66Zt(q^kdbz22V=IF?DtGIMfr04$Fnozr5D5MSl
zB<-K{K({u@*_$0>%$Iux%H41jO9;1&<eN1Dc$n>IDGdcqJ4Saf;A8)b?KGR~HRiKR
zQlUidh3A~RFn-ENGnza7xG%IW$xNawM^RQ*sbc6L+BxfF*PLX%?3nS4e2sgtC+_&9
zZ)1<uM)l5zUN`Vh<XXHLdT*4yI1s8{+XTa7{}IP^*lgW8o?Bpnc4WTV-WXKgZb;<C
zqYt3oTIrXZZK7i;xr?@zr!8*f<iT5zFQU%bR!{skRVw({+n}4wizQ9zM^2QSF7@&W
zz*ywKXPSAJRN_U_3tHB{_e`BY@~Wn7d#IeL{Vi;$BXn@9httfyHrjUppZ?>XDF(3v
zXa}u{!98SRRIJDPyaP$*%xYjhID{rYjEPQr-&dBp83-Tv(Es^Yp)}7D<pi%sMkX!+
R|2-pi{D|G*603h+{cjcu_8|ZO

literal 0
HcmV?d00001

diff --git a/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
index a2b9a21fd1..09cd62fa75 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
@@ -190,3 +190,61 @@ model_config:
   use_fused_swiglu: True
   ...
 ```
+
+## SlidingWindowAttention
+
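+### Overview
+
+SlidingWindowAttention is a sparse attention mechanism that restricts each token to attend only to other tokens within a local window, which addresses the quadratic growth of computational complexity with sequence length in standard Transformer models. Its core idea is to shrink the attention range from global to a fixed window size.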
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+To use the SlidingWindowAttention module, configure the `window_size` and `window_attn_skip_freq` items under `model_config` in the configuration file.
+
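+`window_size` has type `Tuple[int, int]`, where `window_size[0]` is `pre_tokens` and `window_size[1]` is `next_tokens`. Both are integers no smaller than -1; -1 is a special value meaning "infinite window size". By default the window is anchored at the bottom-right corner, as shown in the figure below: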
+
+![sliding_window](./images/sliding_window.png)
+
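+`window_attn_skip_freq` has type `Union[int, List[int]]` and sets the frequency of full attention layers among the sliding window attention (SWA) layers. It accepts either of the following:
+- An integer N: denotes an (N-1):1 ratio, i.e. one full attention layer after every (N-1) SWA layers.
+- A list that defines a custom pattern, e.g. `[1,1,1,1,0,0,0]`, where 1 denotes SWA.
+
+Configuration example: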
+
+```yaml
+model_config:
+  ...
+  window_size: (10, 0)
+  window_attn_skip_freq: 2
+  ...
+```
+
+## SharedKVCrossAttention
+
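+### Overview
+
+SharedKVCrossAttention is an attention mechanism that caches KV only once and, through cross-attention, reuses the shared KV cache produced by the self-decoder.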
+
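+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+To use the SharedKVCrossAttention module, configure the `model_architecture`, `num_encoder_layers`, `num_decoder_layers`, and `num_layers` items under `model_config` in the configuration file.
+
+`model_architecture` specifies the model architecture. Valid values are `decoder_only` and `yoco`; currently only `yoco` supports SharedKVCrossAttention.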
+
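+`num_encoder_layers` is the number of encoder layers and `num_decoder_layers` is the number of decoder layers; their sum equals `num_layers`. When `model_architecture` is set to `yoco`, SharedKVCrossAttention is enabled once the encoder layers end, that is, starting from the first decoder layer. For example, with `num_encoder_layers` set to 1 and `num_decoder_layers` set to 1, SharedKVCrossAttention is enabled starting from the second layer. If only `num_encoder_layers` is configured, SharedKVCrossAttention is not enabled. If only `num_decoder_layers` is configured, SharedKVCrossAttention is enabled starting from the first layer.
+
+Configuration example: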
+
+```yaml
+model_config:
+  ...
+  model_architecture: "yoco"
+  num_layers: 2
+  num_encoder_layers: 1
+  num_decoder_layers: 1
+  ...
+```
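
Note on the SharedKVCrossAttention section above: the layer-splitting rule it describes can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not MindSpore Transformers code. The helper name `shared_kv_cross_attention_layers` is hypothetical, and treating an unset layer count as 0 is an assumption inferred from the "only one of the two is configured" cases in the text.

```python
from typing import List, Optional


def shared_kv_cross_attention_layers(
    num_layers: int,
    num_encoder_layers: Optional[int] = None,
    num_decoder_layers: Optional[int] = None,
) -> List[bool]:
    """Return one flag per layer: True where SharedKVCrossAttention is enabled.

    Hypothetical helper that mirrors the documented `yoco` rule only.
    """
    if num_encoder_layers is None and num_decoder_layers is None:
        raise ValueError("yoco needs num_encoder_layers or num_decoder_layers")
    # Assumption: a count that is not configured is treated as 0.
    encoder = num_encoder_layers or 0
    decoder = num_decoder_layers or 0
    if encoder + decoder != num_layers:
        raise ValueError("encoder and decoder layers must sum to num_layers")
    # Encoder layers come first; the shared-KV cross attention starts
    # at the first decoder layer.
    return [False] * encoder + [True] * decoder


# The configuration example from the documentation: 1 encoder + 1 decoder layer.
print(shared_kv_cross_attention_layers(2, 1, 1))  # [False, True]
```

With only `num_encoder_layers=2` the result contains no enabled layer, and with only `num_decoder_layers=2` every layer is enabled, which matches the two edge cases described in the text.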
-- 
Gitee
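
Note on the SlidingWindowAttention section of this patch: the `window_size` and `window_attn_skip_freq` semantics can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not MindSpore Transformers code. The helper names `sliding_window_mask` and `swa_layer_pattern` are hypothetical, the window bounds are assumed to be inclusive, layers are assumed to be numbered from 1, and a 0 in a list pattern is assumed to mean a full attention layer.

```python
from typing import List, Tuple, Union


def sliding_window_mask(seq_len: int, window_size: Tuple[int, int]) -> List[List[int]]:
    """Build a seq_len x seq_len mask; 1 means query i may attend to key j."""
    pre_tokens, next_tokens = window_size
    if pre_tokens < -1 or next_tokens < -1:
        raise ValueError("window sizes must be integers >= -1")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # -1 is the special "infinite window" value: no limit on that side.
            # Assumption: the bounds are inclusive.
            left_ok = pre_tokens == -1 or i - j <= pre_tokens
            right_ok = next_tokens == -1 or j - i <= next_tokens
            row.append(1 if left_ok and right_ok else 0)
        mask.append(row)
    return mask


def swa_layer_pattern(num_layers: int, skip_freq: Union[int, List[int]]) -> List[int]:
    """Return one flag per layer: 1 for an SWA layer, 0 for full attention."""
    if isinstance(skip_freq, int):
        if skip_freq < 1:
            raise ValueError("an integer skip frequency must be >= 1")
        # Assumption: with 1-based layer numbers, every N-th layer uses
        # full attention, giving (N-1) SWA layers per full attention layer.
        return [0 if layer % skip_freq == 0 else 1 for layer in range(1, num_layers + 1)]
    if len(skip_freq) < num_layers:
        raise ValueError("a list pattern needs one entry per layer")
    return [1 if flag else 0 for flag in skip_freq[:num_layers]]


# One previous token, no future tokens, on a 4-token sequence.
for row in sliding_window_mask(4, (1, 0)):
    print(row)
# Integer form with N = 2: SWA and full attention layers alternate.
print(swa_layer_pattern(4, 2))  # [1, 0, 1, 0]
```

Setting `window_size` to `(-1, 0)` in this sketch yields the ordinary causal (lower-triangular) mask, since the left side of the window is unbounded.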