# im2txt

**Repository Path**: csdn_ai_project/im2txt

## Basic Information

- **Project Name**: im2txt
- **Description**: 吴黄子桑 程思邈 陈炜 李会娟
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 3
- **Created**: 2018-04-23
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

### Introduction

This neural image-captioning system is based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A Faster R-CNN model extracts visual features from the image, and an LSTM recurrent neural network decodes these features into a sentence. A soft attention mechanism is incorporated to improve caption quality. The project is implemented using the TensorFlow library and currently supports training of the RNN part only.

### Prerequisites

* **TensorFlow** ([instructions](https://www.tensorflow.org/install/))
* **NumPy** ([instructions](https://scipy.org/install.html))
* **OpenCV** ([instructions](https://pypi.python.org/pypi/opencv-python))
* **Natural Language Toolkit (NLTK)** ([instructions](http://www.nltk.org/install.html))
* **Pandas** ([instructions](https://scipy.org/install.html))
* **Matplotlib** ([instructions](https://scipy.org/install.html))

### Usage

* **Tips:**

  1. Delete all `__pycache__` folders under the current directory:

     ```shell
     find . -name '__pycache__' -type d -exec rm -rf {} \;
     ```

* **Dataset Preparation:**

  1. Download the faster_rcnn_resnet checkpoint:

     ```shell
     cd data
     wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
     tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz
     ```

  2. Export a frozen graph from the checkpoint:

     ```shell
     cd ../code/
     export PYTHONPATH=$PYTHONPATH:./object_detection/
     python ./object_detection/export_inference_graph.py \
         --input_type image_tensor \
         --pipeline_config_path ../data/faster_rcnn_resnet50_coco_2018_01_28/pipeline.config \
         --trained_checkpoint_prefix ../data/faster_rcnn_resnet50_coco_2018_01_28/model.ckpt \
         --output_directory ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs
     cp ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs/frozen_inference_graph.pb ../data/frozen_faster_rcnn.pb
     ```

  3. Skip this step if you have already downloaded the COCO dataset; otherwise run the following command to fetch it:

     ```shell
     OUTPUT_DIR="../data/coco"
     sh ./dataset/download_mscoco.sh "${OUTPUT_DIR}"
     ```

  4. Extract a feature for each region proposal (100 × 2048 per image); a rough sketch of this extraction step is shown after this list. For COCO, run:

     ```shell
     DATASET_DIR="/home/zisang/Documents/code/data/mscoco/raw-data"
     OUTPUT_DIR="/home/zisang/im2txt/data/coco"
     python ./dataset/build_data.py \
         --graph_path="../data/frozen_faster_rcnn.pb" \
         --dataset "coco" \
         --train_image_dir="${DATASET_DIR}/train2014" \
         --val_image_dir="${DATASET_DIR}/val2014" \
         --train_captions_file="${DATASET_DIR}/annotations/captions_train2014.json" \
         --val_captions_file="${DATASET_DIR}/annotations/captions_val2014.json" \
         --output_dir="${OUTPUT_DIR}" \
         --word_counts_output_file="${OUTPUT_DIR}/word_counts.txt"
     ```

     For Flickr8k:

     ```shell
     DATASET_DIR="/home/zisang/Documents/code/data/Flicker8k"
     OUTPUT_DIR="/home/zisang/im2txt/data/flickr8k"
     python ./dataset/build_data.py \
         --graph_path "../data/frozen_faster_rcnn.pb" \
         --dataset "flickr8k" \
         --min_word_count 2 \
         --image_dir "${DATASET_DIR}/Flicker8k_Dataset/" \
         --text_path "${DATASET_DIR}/" \
         --output_dir "${OUTPUT_DIR}" \
         --train_shards 32 \
         --num_threads 8
     ```
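For orientation, the extraction performed by `build_data.py` amounts to running each image through the frozen detection graph and reading back a pooled feature per region proposal. Below is a minimal sketch under stated assumptions: `image_tensor:0` is the standard input name for graphs exported by the TensorFlow Object Detection API, but `REGION_FEATURE_TENSOR` and `example.jpg` are hypothetical placeholders; inspect the exported graph to find the actual feature tensor name.

```python
# Minimal sketch: run one image through the frozen Faster R-CNN graph.
# The region-feature tensor name below is a placeholder, NOT the real name;
# inspect the exported graph for the tensor holding 100 x 2048 features.
import cv2
import numpy as np
import tensorflow as tf

GRAPH_PATH = '../data/frozen_faster_rcnn.pb'
REGION_FEATURE_TENSOR = 'SecondStageBoxClassifierFeatures:0'  # hypothetical

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = cv2.imread('example.jpg')               # BGR, uint8
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # detector expects RGB
    features = sess.run(
        graph.get_tensor_by_name(REGION_FEATURE_TENSOR),
        feed_dict={'image_tensor:0': image[np.newaxis]})
    print(features.shape)  # expected on the order of (100, 2048) per image
```

Depending on the tensor chosen, the returned array may carry an extra batch dimension, e.g. `(1, 100, 2048)`.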
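Both `train.py` and `eval.py` below take an `--attention` flag. As a reminder of what the soft attention from the Introduction computes, here is an illustrative TensorFlow 1.x sketch of Bahdanau-style soft attention over the 100 × 2048 region features; the function name and `dim_attn` are made up for illustration and do not refer to the project's actual code.

```python
import tensorflow as tf

def soft_attention(features, hidden, dim_attn=512):
    """Illustrative soft attention. features: [B, 100, 2048],
    hidden: [B, lstm_size] -> context: [B, 2048], alpha: [B, 100]."""
    # Project region features and the LSTM hidden state into a shared space.
    feat_proj = tf.layers.dense(features, dim_attn, use_bias=False)  # [B, 100, A]
    hid_proj = tf.layers.dense(hidden, dim_attn, use_bias=False)     # [B, A]
    # One scalar score per region, then normalize into attention weights.
    scores = tf.layers.dense(
        tf.tanh(feat_proj + hid_proj[:, tf.newaxis, :]), 1)          # [B, 100, 1]
    alpha = tf.nn.softmax(scores, axis=1)
    # Context vector: attention-weighted average of the region features.
    context = tf.reduce_sum(alpha * features, axis=1)                # [B, 2048]
    return context, tf.squeeze(alpha, -1)
```

At every decoding step the LSTM conditions on `context`, a weighted average of the region features, so the model can attend to different image regions for each generated word.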
* **Training:** First make sure you are in the `code` folder, then set the parameters in `config.py` and run a command like this (the shard count in the input pattern must match the `--train_shards` value used when building the data, hence `00032` here):

  ```shell
  python train.py --input_file_pattern='../data/flickr8k/train-?????-of-00032' \
      --number_of_steps=10000 \
      --attention='bias' \
      --optimizer='Adam' \
      --train_dir='../output/model'
  ```

  To monitor the progress of training, run:

  ```shell
  tensorboard --logdir='../output/model'
  ```

* **Evaluation:** To evaluate a trained model on the Flickr8k data, run a command like this:

  ```shell
  python eval.py --input_file_pattern='../data/flickr8k/val-?????-of-00008' \
      --checkpoint_dir='../output/model' \
      --attention='bias' \
      --eval_dir='../output/eval' \
      --min_global_step=10 \
      --num_eval_examples=32 \
      --vocab_file="../data/flickr8k/word_counts.txt" \
      --beam_size=3 \
      --save_eval_result_as_image \
      --eval_result_dir='../val/results/' \
      --val_raw_image_dir='/home/zisang/Documents/code/data/Flicker8k/Flicker8k_Dataset'
  ```

  The results are printed to stdout and stored in `eval_dir` as TensorFlow summaries. Captions are decoded with beam search (`--beam_size=3`); an illustrative sketch of the procedure appears after the Results section.

* **Inference:** A web interface was built using [Flask](http://flask.pocoo.org/). You can use the trained model to generate captions for any JPEG image!

  1 - Install Flask:

  ```
  pip install Flask
  ```

  2 - Export the frozen LSTM graph:

  ```shell
  python export.py --model_folder='../output/model' \
      --output_path='../data/frozen_lstm.pb' \
      --attention='bias'
  ```

  3 - Run the Flask server:

  ```
  python server.py --mode ours \
      --vocab_path ../data/flickr8k/word_counts.txt
  ```

  or run the following to see our results:

  ```
  python server.py --mode att-nic \
      --faster_rcnn_model_file='../data/frozen_faster_rcnn.pb' \
      --lstm_model_file='../data/frozen_lstm.pb' \
      --vocab_file="../data/flickr8k/word_counts.txt"
  ```

  4 - Open the picture test interface at http://127.0.0.1:5000

  5 - Log in at http://127.0.0.1:5000/admin to see more information (Username: admin, Password: 0000)

### Results

This model was trained solely on the COCO train2014 data. It achieves the following scores on the COCO val2014 data (with `beam_size=3`):

* **BLEU-1 = 15.8%**
* **BLEU-2 = 4.9%**
* **BLEU-3 = 1.0%**
* **BLEU-4 = 0%**
* **METEOR = 4.4%**
* **Rouge = 10.1%**
* **CIDEr = 2.5%**
* **Perplexity = 6.4**

Compared with Show, Attend and Tell, which achieved the following performance:

* **BLEU-1 = 70.3%**
* **BLEU-2 = 53.6%**
* **BLEU-3 = 39.8%**
* **BLEU-4 = 29.5%**

there is still a long way to go.
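As referenced in the Evaluation section, captions are decoded with beam search. The following framework-agnostic sketch shows the procedure for a beam size of 3; `step_fn`, `start_id`, and `end_id` are hypothetical stand-ins for the model's per-step log-probabilities and the vocabulary's start/end markers, and the project's `eval.py` may differ (e.g., by length-normalizing beam scores).

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=20):
    """Illustrative beam search. step_fn(tokens) returns a log-prob
    vector over the vocabulary for the next token given the prefix."""
    beams = [([start_id], 0.0)]  # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_fn(tokens)
            # Expand each beam with its beam_size most likely next tokens.
            for w in np.argsort(log_probs)[-beam_size:]:
                candidates.append((tokens + [int(w)], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            # Completed captions leave the beam; keep the best unfinished ones.
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    # No length normalization here; a real decoder often adds it.
    return max(finished + beams, key=lambda c: c[1])[0]
```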
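The BLEU numbers above can be reproduced at corpus level with NLTK, which is already listed in the prerequisites; a minimal sketch follows. The tokenized captions are made-up examples, and the project's `eval.py` may use a different tokenization or the COCO caption evaluation toolkit, so treat this only as a reference for how BLEU-n is computed.

```python
# Illustrative corpus-level BLEU with NLTK; the captions are made up.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One entry per image: several reference captions vs. one generated caption.
references = [
    [['a', 'dog', 'runs', 'on', 'the', 'beach'],
     ['a', 'dog', 'is', 'running', 'along', 'the', 'shore']],
]
hypotheses = [['a', 'dog', 'on', 'the', 'beach']]

smooth = SmoothingFunction().method1  # avoid zero scores on short corpora
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)    # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print('BLEU-%d = %.1f%%' % (n, 100 * score))
```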
### References

* [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
* [The original implementation in Theano](https://github.com/kelvinxu/arctic-captions)
* [An earlier implementation in TensorFlow](https://github.com/DeepRNN/image_captioning)
* [TensorFlow models im2txt](https://github.com/tensorflow/models/tree/master/research/im2txt)