# MoLE

Official code of "Mixture of Lookup Experts".

MoLE is a novel edge-friendly LLM architecture. With the same number of activated parameters, MoLE achieves:

+ latency and memory overhead comparable to dense models
+ performance on par with Mixture-of-Experts (MoE) models
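
As a rough illustration of the idea (this is *not* the code in `modeling_mole_rep.py`), the sketch below shows one way a re-parameterized lookup-expert layer can be expressed in PyTorch: each expert's output for every vocabulary token is precomputed into a lookup table (LUT), so inference reduces to an embedding-style gather keyed by the input token ids plus a router-weighted sum. All class and attribute names here (`LookupExpertLayer`, `lut`, `router`) are illustrative assumptions.

```python3
import torch
import torch.nn as nn


class LookupExpertLayer(nn.Module):
    """Minimal, hypothetical sketch of a re-parameterized lookup-expert layer."""

    def __init__(self, vocab_size: int, hidden_size: int, num_experts: int):
        super().__init__()
        # One precomputed row per (token id, expert); at deployment these LUTs
        # can be kept in CPU memory (or on disk) and fetched on demand.
        self.lut = nn.Embedding(vocab_size, num_experts * hidden_size)
        self.router = nn.Linear(hidden_size, num_experts)
        self.num_experts = num_experts
        self.hidden_size = hidden_size

    def forward(self, hidden_states: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Routing weights come from the current hidden states, as in a standard MoE.
        weights = torch.softmax(self.router(hidden_states), dim=-1)      # (B, T, E)
        # The "experts" are just table lookups keyed by the input token ids.
        expert_out = self.lut(input_ids)                                  # (B, T, E*H)
        expert_out = expert_out.view(*input_ids.shape, self.num_experts, self.hidden_size)
        # Weighted sum over experts.
        return torch.einsum("bte,bteh->bth", weights, expert_out)


# Toy usage
layer = LookupExpertLayer(vocab_size=100, hidden_size=16, num_experts=4)
ids = torch.randint(0, 100, (1, 5))
h = torch.randn(1, 5, 16)
out = layer(h, ids)                                                       # (1, 5, 16)
```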

## Environment

+ torch 2.0.1
+ transformers 4.38.2

## Pretraining

Please refer to the `pretrain` folder.

## Models

#### Dense Models
+ modeling_dense.py

#### MoE Models
+ modeling_moe.py

#### MoLE Models
+ modeling_mole.py (for training)
+ modeling_mole_rep.py (for inference)

## Checkpoints

All of these models are trained on a 100B-token subset of the [Pile](https://github.com/EleutherAI/the-pile) dataset. For the MoLE models, we only provide the checkpoints before re-parameterization (i.e., for the training phase). Re-parameterization can be performed using the script provided below.

| Models | # Activated Params | URL |
| ----- | ----- | --- |
| Dense | 160M | 🤗 [JieShibo/Dense-160M](https://huggingface.co/JieShibo/Dense-160M) |
| MoE-10E | 160M | 🤗 [JieShibo/MoE-160M-10E](https://huggingface.co/JieShibo/MoE-160M-10E) |
| MoLE-4E | 160M | 🤗 [JieShibo/MoLE-160M-4E](https://huggingface.co/JieShibo/MoLE-160M-4E) |
| MoE-34E | 160M | 🤗 [JieShibo/MoE-160M-34E](https://huggingface.co/JieShibo/MoE-160M-34E) |
| MoLE-16E | 160M | 🤗 [JieShibo/MoLE-160M-16E](https://huggingface.co/JieShibo/MoLE-160M-16E) |
| Dense | 410M | 🤗 [JieShibo/Dense-410M](https://huggingface.co/JieShibo/Dense-410M) |
| MoE-10E | 410M | 🤗 [JieShibo/MoE-410M-10E](https://huggingface.co/JieShibo/MoE-410M-10E) |
| MoLE-4E | 410M | 🤗 [JieShibo/MoLE-410M-4E](https://huggingface.co/JieShibo/MoLE-410M-4E) |

## Reparameterize MoLE for Inference

```bash
python reparameterize.py --from_path --to_path
```

## Inference

```python3
from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

model = MoleForCausalLM.from_pretrained(model_path, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
inputs = tokenizer("Hello, I am", return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_length=10)
print(tokenizer.decode(tokens[0]))
```

Note that because offloading the LUTs to disk requires file-system support, the demo above still keeps the LUTs in GPU memory. Alternatively, you can try the following demo, which offloads the LUTs to CPU memory. This demo has not been specially optimized, so it may be somewhat inefficient.

```python3
from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

# Load on CPU, then move everything except the LUTs (moe_table) to the GPU.
model = MoleForCausalLM.from_pretrained(model_path, device_map='cpu')
model.model.embed_tokens.cuda()
model.model.layers.cuda()
model.model.norm.cuda()
model.lm_head.cuda()
model.model._buffers["causal_mask"] = model.model._buffers["causal_mask"].cuda()
# Keep the LUTs in pinned CPU memory for faster host-to-device transfers.
model.model.moe_table.weight.data = model.model.moe_table.weight.data.pin_memory()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
inputs = tokenizer("Hello, I am", return_tensors="pt").to('cuda')
tokens = model.generate(**inputs, max_length=10)
print(tokenizer.decode(tokens[0]))
```

## Citation

```
@article{jie2025mole,
  title={Mixture of Lookup Experts},
  author={Jie, Shibo and Tang, Yehui and Han, Kai and Li, Yitong and Tang, Duyu and Deng, Zhi-Hong and Wang, Yunhe},
  journal={arXiv preprint arXiv:2503.15798},
  year={2025}
}
```