# Math-LLaVA

**Repository Path**: hawksilent/Math-LLaVA

## Basic Information

- **Project Name**: Math-LLaVA
- **Description**: Code for Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-11
- **Last Updated**: 2025-11-11

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Math-LLaVA

This repository contains the code, data, and model for the paper "Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models".

[Paper](http://arxiv.org/abs/2406.17294v2), [Image Dataset](https://huggingface.co/datasets/Zhiqiang007/MathV360K/tree/main), [Model](https://huggingface.co/Zhiqiang007/Math-LLaVA/tree/main)

![ex1](pipeline.png)

## Latest News 🔥

* [2024-06-26] We released the [Math-LLaVA checkpoints](https://huggingface.co/Zhiqiang007/Math-LLaVA/tree/main). The Math-LLaVA-13B model achieves **46.6%** on MathVista testmini, **38.3%** on MMMU, and **15.69%** on MATH-V.
* [2024-06-25] Released the [paper](http://arxiv.org/abs/2406.17294v2), [code](https://github.com/HZQ950419/Math-LLaVA), and [MathV360K dataset](https://huggingface.co/datasets/Zhiqiang007/MathV360K/tree/main).

## Install Packages

```
cd Math-LLaVA
conda create -n math_llava python=3.10 -y
conda activate math_llava
pip install -e .
```

## Enable DeepSpeed and Flash-Attention

```
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Data Preparation

`train_samples_all_tuning.json` contains the question-answer pair annotations used for fine-tuning. Download the [image dataset](https://huggingface.co/datasets/Zhiqiang007/MathV360K/tree/main) and place the data in the repository root (or another directory of your choice).

Data structure:

```
├── data_images/
│   ├── TabMWP/images/
│   ├── IconQA/images/
│   ├── ...
├── train_samples_all_tuning.json
```

## Run Full Fine-tuning

```
sh finetune_task.sh
```

## MathVista Evaluation

Download and unzip the MathVista images with the following commands:

```
cd ./evaluation_mathvista/mathvista_data
wget https://huggingface.co/datasets/AI4Math/MathVista/resolve/main/images.zip
unzip images.zip
```

Generate responses on the testmini subset:

```
cd evaluation_mathvista
python response.py --output_dir ./mathvista_outputs --output_file responses.json --model_path your/model/path --model_base None
```

Next, extract the short answer text for score calculation with ChatGPT; this requires an [OpenAI API key](https://platform.openai.com/account/api-keys).
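The extraction script calls the OpenAI API, so the key has to be available to it. Exactly how the key is supplied depends on `extract_answer.py`; a common convention (an assumption here, not confirmed by this README) is to export it as an environment variable first:

```
# Assumption: the extraction script picks up the key from OPENAI_API_KEY
export OPENAI_API_KEY="your-api-key"
```

Then run the extraction: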
```
python extract_answer.py --output_file responses.json
```

Calculate the final score:

```
python calculate_score.py --output_file responses.json --score_file responses_score.json
```

## MMMU Evaluation

Generate the responses:

```
cd eval_mmmu
python mmmu_response.py --output_path mmmu_eval_output.json --model_path your/model/path
```

Calculate the score:

```
python mmmu_only_eval.py --output_path mmmu_eval_output.json --answer_path ./answer_dict_val.json
```

## Results on MathVista

Accuracy scores on the testmini subset (FQA: figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: textbook question answering, VQA: visual question answering):

| Model | ALL | FQA | GPS | MWP | TQA | VQA |
|-----------------------|--------|--------|--------|--------|--------|--------|
| miniGPT4-7B | **23.1** | **18.6** | **26.0** | **13.4** | **30.4** | **30.2** |
| InstructBLIP-7B | **25.3** | **23.1** | **20.7** | **18.3** | **32.3** | **35.2** |
| LLaVA-13B | **26.1** | **26.8** | **29.3** | **16.1** | **32.3** | **26.3** |
| SPHINX-V1-13B | **27.5** | **23.4** | **23.1** | **21.5** | **39.9** | **34.1** |
| LLaVA-1.5-13B | **27.6** | - | - | - | - | - |
| OmniLMM-12B | **34.9** | **45.0** | **17.8** | **26.9** | **44.9** | **39.1** |
| Math-LLaVA-13B | **46.6** | **37.2** | **57.7** | **56.5** | **51.3** | **33.5** |

## Results on MMMU

Accuracy scores on the validation set:

| Model | ALL |
|-----------------------|--------|
| miniGPT4-7B | **26.8** |
| mPLUG-Owl-7B | **32.7** |
| InstructBLIP-7B | **32.9** |
| SPHINX-13B | **32.9** |
| LLaVA-1.5-13B | **36.4** |
| Math-LLaVA-13B | **38.3** |

## Results on MATH-V

We also evaluate on [MATH-V](https://github.com/mathvision-cuhk/MATH-V), a more challenging dataset:

| Model | ALL |
|-----------------------|--------|
| Qwen-VL-Plus | **10.72** |
| LLaVA-1.5-13B | **11.12** |
| ShareGPT4V-13B | **11.88** |
| InternLM-XComposer2-VL | **14.54** |
| Math-LLaVA-13B | **15.69** |

## Acknowledgement

The project is built on top of the amazing [LLaVA](https://github.com/haotian-liu/LLaVA) repository, as well as [MathVista](https://github.com/lupantech/MathVista) and [MMMU](https://github.com/MMMU-Benchmark/MMMU). Thanks for their contributions!

If you find our code and dataset helpful to your research, please consider citing us with this BibTeX:

```bibtex
@misc{shihu2024mathllava,
      title={Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models},
      author={Wenhao Shi and Zhiqiang Hu and Yi Bin and Junhua Liu and Yang Yang and See-Kiong Ng and Lidong Bing and Roy Ka-Wei Lee},
      year={2024},
      eprint={2406.17294},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
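If you want to evaluate the released checkpoint rather than your own fine-tuned model, one way to fetch it locally is sketched below; it assumes the `huggingface_hub` CLI is installed, and `./Math-LLaVA-13B` is only an illustrative target directory to pass to `--model_path`:

```
# Download the released checkpoint from the model repo linked above
pip install -U "huggingface_hub[cli]"
huggingface-cli download Zhiqiang007/Math-LLaVA --local-dir ./Math-LLaVA-13B
```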