# Grounded Video Description

### [ActivityNet Entities Object Localization (Grounding) Challenge](http://activity-net.org/challenges/2020/tasks/guest_anet_eol.html) joins the official [ActivityNet Challenge](http://activity-net.org/challenges/2020/challenge.html) as a guest task in 2020! See [here](https://github.com/facebookresearch/ActivityNet-Entities#activitynet-entities-object-localization-challenge-2020) for how to participate.

This repo hosts the source code for our paper [Grounded Video Description](https://arxiv.org/pdf/1812.06587.pdf). It supports the [ActivityNet-Entities](https://github.com/facebookresearch/ActivityNet-Entities) dataset. We also have code that supports the [Flickr30k-Entities](https://github.com/BryanPlummer/flickr30k_entities) dataset, hosted on the [flickr_branch](https://github.com/facebookresearch/grounded-video-description/tree/flickr_branch) branch.

*Teaser and qualitative results figures.* Note: [42] indicates [Masked Transformer](https://github.com/LuoweiZhou/densecap).

## Quick Start

### Preparations
Follow instructions 1 to 3 in the [Requirements](#req) section to install the required packages.

### Download everything
Simply run the following command to download all the data and pre-trained models (216GB in total):
```
bash tools/download_all.sh
```

### Starter code
Run the following eval code to test whether your environment is set up:
```
python main.py --batch_size 100 --cuda --num_workers 6 --max_epoch 50 --inference_only \
 --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 \
 --seq_length 20 --language_eval --eval_obj_grounding --obj_interact
```

(Optional) Single-GPU training code as a double check:
```
python main.py --batch_size 20 --cuda --checkpoint_path save/gvd_starter --id gvd_starter --language_eval
```
You can now skip to the [Training and Validation](#train) section!

## <a name='req'></a> Requirements (Recommended)
1) Clone the repo recursively:
```
git clone --recursive git@github.com:facebookresearch/grounded-video-description.git
```
Make sure the submodules [densevid_eval](https://github.com/LuoweiZhou/densevid_eval_spice) and [coco-caption](https://github.com/tylin/coco-caption) are both included.

2) Install CUDA 9.0 and CUDNN v7.1. Later versions should be fine, but may require updating the conda env file (e.g., for PyTorch).

3) Install [Miniconda](https://conda.io/miniconda.html) (either Miniconda2 or 3, version 4.6+). We recommend using a conda [environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to install the required packages, including Python 3.7 or 2.7, [PyTorch 1.1.0](https://pytorch.org/get-started/locally/), etc.:
```
MINICONDA_ROOT=[to your Miniconda root directory]
conda env create -f cfgs/conda_env_gvd_py3.yml --prefix $MINICONDA_ROOT/envs/gvd_pytorch1.1
conda activate gvd_pytorch1.1
```
Note that there have been some [breaking changes](https://github.com/pytorch/pytorch/releases/tag/v1.2.0) since PyTorch 1.2 (e.g., bitwise not on torch.bool/torch.uint8 and masked\_fill\_). This code base could potentially work with PyTorch 1.2+ with the corresponding changes made. Replace `cfgs/conda_env_gvd_py3.yml` with `cfgs/conda_env_gvd.yml` for Python 2.7.
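
A quick sanity check of the environment (a minimal sketch, assuming the `gvd_pytorch1.1` env from step 3 is active; the exact version string depends on your install):
```
# Should print a 1.1.x version string and True on a machine with a working CUDA setup.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```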
4) (Optional) If you choose not to use `download_all.sh`, be sure to install Java and download Stanford CoreNLP for SPICE (see [here](https://github.com/tylin/coco-caption)). Also, download the reference [file](https://github.com/jiasenlu/coco-caption/blob/master/annotations/caption_flickr30k.json) and place it under `coco-caption/annotations`. Download [Stanford CoreNLP 3.9.1](https://stanfordnlp.github.io/CoreNLP/history.html) for grounding evaluation and place the uncompressed folder under the `tools` directory.

## Data Preparation
Updates on 04/15/2020: Feature files for the **hidden** test set, used in the ANet-Entities Object Localization Challenge 2020, are available to download ([region features](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/fc6_feat_100rois_hidden_test.tar.gz) and [frame-wise features](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/rgb_motion_1d_hidden_test.tar.gz)). Make sure you move the additional `*.npy` files over to your `fc6_feat_100rois` and `rgb_motion_1d` folders, respectively. The following files have been updated to include the **hidden** test set or video IDs: `anet_detection_vg_fc6_feat_100rois.h5`, `anet_entities_prep.tar.gz`, and `anet_entities_captions.tar.gz`.

Download the preprocessed annotation files from [here](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/anet_entities_prep.tar.gz), uncompress them, and place them under `data/anet`. Alternatively, you can reproduce them all using the data from the ActivityNet-Entities [repo](https://github.com/facebookresearch/ActivityNet-Entities) and the preprocessing script `prepro_dic_anet.py` under `prepro`. Then, download the ground-truth caption annotations (under our val/test splits) from [here](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/anet_entities_captions.tar.gz) and place them under `data/anet` as well.

The region features and detections are available for download ([feature](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/fc6_feat_100rois.tar.gz) and [detection](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/anet_detection_vg_fc6_feat_100rois.h5)). The region feature file should be decompressed and placed under your feature directory, which we refer to as `feature_root` in the code. The H5 region detection (proposal) file is referred to as `proposal_h5` in the code. To extract features for a customized dataset (or, for the brave, for ANet-Entities as well), refer to the feature extraction tool [here](https://github.com/LuoweiZhou/detectron-vlp).

The frame-wise appearance (suffix `_resnet.npy`) and motion (suffix `_bn.npy`) feature files are available [here](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/rgb_motion_1d.tar.gz). We refer to this directory as `seg_feature_root`.

Other auxiliary files, such as the weights from the Detectron fc7 layer, are available [here](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/detectron_weights.tar.gz). Uncompress and place them under the `data` directory.
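
If you skipped `download_all.sh`, the downloads above can be fetched manually. The sketch below shows one possible layout; the extraction targets (e.g., placing everything under `data/anet`) are illustrative assumptions, so adjust them to wherever you point `feature_root`, `seg_feature_root`, and `proposal_h5` in the config:
```
# Manual download sketch (skip if you used tools/download_all.sh).
# Target directories below are assumptions; match them to your config paths.
BASE=https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities
mkdir -p data/anet

# Preprocessed annotations and ground-truth captions -> data/anet
wget $BASE/anet_entities_prep.tar.gz     && tar -xzf anet_entities_prep.tar.gz     -C data/anet
wget $BASE/anet_entities_captions.tar.gz && tar -xzf anet_entities_captions.tar.gz -C data/anet

# Region features (feature_root) and region proposals (proposal_h5)
wget $BASE/fc6_feat_100rois.tar.gz && tar -xzf fc6_feat_100rois.tar.gz -C data/anet
wget -P data/anet $BASE/anet_detection_vg_fc6_feat_100rois.h5

# Frame-wise appearance/motion features (seg_feature_root)
wget $BASE/rgb_motion_1d.tar.gz && tar -xzf rgb_motion_1d.tar.gz -C data/anet

# Detectron weights and other auxiliary files -> data
wget $BASE/detectron_weights.tar.gz && tar -xzf detectron_weights.tar.gz -C data
```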
## <a name='train'></a> Training and Validation
Modify the config file `cfgs/anet_res101_vg_feat_10x100prop.yml` with the correct dataset and feature paths (or set them up through symlinks). Link `tools/anet_entities` to your ANet-Entities dataset root location. Create new directories `log` and `results` under the root directory to save log and result files.

An example command for running an 8-GPU data-parallel job:

For supervised models (with self-attention):
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
 --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
 --language_eval --w_att2 $w_att2 --w_grd $w_grd --w_cls $w_cls --obj_interact | tee log/$ID
```

For unsupervised models (without self-attention):
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
 --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
 --language_eval | tee log/$ID
```

Arguments: `batch_size=240`, `w_att2=0.05`, `w_grd=0`, `w_cls=0.1`; `ID` indicates the model name.

(Optional) Remove `--mGPUs` to run in single-GPU mode.

### Pre-trained Models
The pre-trained models can be downloaded from [here (1.5GB)](https://dl.fbaipublicfiles.com/ActivityNet-Entities/ActivityNet-Entities/pre-trained-models.tar.gz). Make sure you uncompress the file under the `save` directory (create one under the root directory if it does not exist).

## Inference and Testing
For supervised models (`ID=anet-sup-0.05-0-0.1-run1`):

(standard inference: language evaluation and localization evaluation on generated sentences)
```
python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
 --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
 --val_split $val_split --densecap_references $dc_references --densecap_verbose --seq_length 20 \
 --language_eval --eval_obj_grounding --obj_interact \
 | tee log/eval-$val_split-$ID-beam$beam_size-standard-inference
```

(GT inference: localization evaluation on GT sentences)
```
python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
 --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
 --val_split $val_split --seq_length 40 --eval_obj_grounding_gt --obj_interact \
 --grd_reference $grd_reference | tee log/eval-$val_split-$ID-beam$beam_size-gt-inference
```

For unsupervised models (`ID=anet-unsup-0-0-0-run1`), simply remove the `--obj_interact` option.

Arguments: `dc_references='./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json'`, `grd_reference='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json'`, `val_split='validation'`.
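
For example, the shell variables for standard inference on the validation split could be set as follows (a sketch based on the arguments above; `beam_size` only appears in the log file name here, and its value is an assumed placeholder):
```
# Example variable settings for validation-split inference (beam_size is assumed).
ID=anet-sup-0.05-0-0.1-run1
val_split='validation'
dc_references='./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json'
grd_reference='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json'
beam_size=1
```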
If you want to evaluate on the test splits, set `val_split` to `'testing'` or `'hidden_test'`, and set `dc_references` (look for `anet_entities_test_1.json` and `anet_entities_test_2.json`; this only supports `'testing'`) and `grd_reference` (the skeleton files `*testing*.json` and `*hidden_test*.json`) accordingly. Then, submit the object localization output files under `results` to the [eval server](https://competitions.codalab.org/competitions/20537). Note that this eval server is for general purposes; the evaluation servers designed for the CVPR'20 challenge are instead [here](https://github.com/facebookresearch/ActivityNet-Entities#evaluation-servers).

You need at least 9GB of free GPU memory for the evaluation.

## Reference
Please acknowledge the following paper if you use the code:
```
@inproceedings{zhou2019grounded,
  title={Grounded Video Description},
  author={Zhou, Luowei and Kalantidis, Yannis and Chen, Xinlei and Corso, Jason J and Rohrbach, Marcus},
  booktitle={CVPR},
  year={2019}
}
```

## Acknowledgement
We thank Jiasen Lu for his [Neural Baby Talk](https://github.com/jiasenlu/NeuralBabyTalk) repo. We thank Chih-Yao Ma for his helpful discussions.

## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the [Neural Baby Talk](https://github.com/jiasenlu/NeuralBabyTalk) project.