# generate-it

**Repository Path**: jiguo_li/generate-it

## Basic Information

- **Project Name**: generate-it
- **Description**: A collection of models for image<->text generation in ACM MM 2021.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-02-16
- **Last Updated**: 2022-07-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Bi-directional Image and Text Generation

## UMT-BITG ([image & text generator](it-generator/README.md))

[Unifying Multimodal Transformer for Bi-directional Image and Text Generation](https://arxiv.org/abs/2110.09753), \
Yupan Huang, Bei Liu, Yutong Lu, in ACM MM 2021 (Industrial Track).

## UMT-DBITG ([diverse image & text generator](diverse-it-generator/README.md))

[A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation](https://arxiv.org/abs/2110.09756), \
Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu, in ACM MM 2021 (Video and Demo Track).

The poster and slides are available in the `assets` folder on [OneDrive](https://mail2sysueducn-my.sharepoint.com/:f:/g/personal/huangyp28_mail2_sysu_edu_cn/EkCFDwd2bQpKtYwyBi3A8ukBHWyNMQ_Tkw9ZeQhYOTMTBA?e=xsMWPO).

## Data & Pre-trained Models

Download the preprocessed data and our pre-trained models from [OneDrive](https://mail2sysueducn-my.sharepoint.com/:f:/g/personal/huangyp28_mail2_sysu_edu_cn/EkCFDwd2bQpKtYwyBi3A8ukBHWyNMQ_Tkw9ZeQhYOTMTBA?e=xsMWPO).
We suggest following our data structure, which is consistent with the paths in `config.py`. You may need to modify the `root_path` in `config.py`.

In addition, please follow the instructions below to prepare the remaining data:

* Download the grid features provided by X-LXMERT to `data/grid_features`, or follow the [feature extraction](https://github.com/allenai/x-lxmert/blob/master/feature_extraction/README.md) instructions to extract them yourself.
  ```
  wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_train_grid8.h5 -P data/grid_features
  wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_valid_grid8.h5 -P data/grid_features
  wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_test_grid8.h5 -P data/grid_features
  ```
* For text-to-image evaluation on the MSCOCO dataset, the real images are needed to calculate the FID metric. For UMT-DBITG, we use the MSCOCO Karpathy split, which is included in the OneDrive folder (`images/imgs_karpathy`). For UMT-BITG, please download the [MSCOCO validation set](http://images.cocodataset.org/zips/val2014.zip) to `images/coco_val2014`.

## Citation

If you find our papers or code useful, please cite:

```
@inproceedings{huang2021unifying,
  author    = {Yupan Huang and Bei Liu and Yutong Lu},
  title     = {Unifying Multimodal Transformer for Bi-directional Image and Text Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

@inproceedings{huang2021diverse,
  author    = {Yupan Huang and Bei Liu and Jianlong Fu and Yutong Lu},
  title     = {A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}
```

## Acknowledgement

Our code is mainly based on [LaBERT](https://github.com/bearcatt/LaBERT) and [X-LXMERT](https://github.com/allenai/x-lxmert). Our text-to-image generation evaluation code is mainly based on [CLIP](https://github.com/openai/CLIP), [pytorch-fid](https://github.com/mseitzer/pytorch-fid/tree/802da3963113b5b5f8154e0e27580ee4c97460ab) and [inception_score](https://github.com/openai/improved-gan/blob/master/inception_score/README.md). We sincerely thank them for their contributions!

Feel free to open an issue or email me if you need help using this code. Any feedback is welcome!
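Before training or evaluating, it can help to confirm that the files from the data-preparation steps in the Data & Pre-trained Models section are actually in place. The following is a minimal bash sketch, not part of this repository: `check_layout` is a hypothetical helper, and it assumes that `data/` and `images/` live under the same root directory that `root_path` in `config.py` points to. It only checks the paths named in this README (the three X-LXMERT grid-feature files plus the image folder used for FID by the chosen model), not the rest of the OneDrive data.

```shell
# Hypothetical helper, not part of this repository. Source it in bash, then run:
#   check_layout ROOT [bitg|dbitg]
# Assumption: data/ and images/ sit under the same ROOT that `root_path`
# in config.py points to.
check_layout() {
  local root="${1:-.}"
  local model="${2:-bitg}"
  local images
  case "$model" in
    bitg)  images="images/coco_val2014" ;;   # MSCOCO val2014 images (UMT-BITG FID)
    dbitg) images="images/imgs_karpathy" ;;  # Karpathy-split images (UMT-DBITG FID)
    *)
      echo "unknown model: $model (expected bitg or dbitg)" >&2
      return 2
      ;;
  esac

  local missing=0
  local path
  for path in \
    data/grid_features/maskrcnn_train_grid8.h5 \
    data/grid_features/maskrcnn_valid_grid8.h5 \
    data/grid_features/maskrcnn_test_grid8.h5 \
    "$images"; do
    if [ -e "$root/$path" ]; then
      echo "ok      $path"
    else
      echo "missing $path"
      missing=$((missing + 1))
    fi
  done

  # Return status 0 only when every expected path exists.
  [ "$missing" -eq 0 ]
}
```

For example, `check_layout /path/to/root dbitg` prints one `ok` or `missing` line per path and returns a non-zero status if anything is absent, so it could guard the start of a training or evaluation script.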