# Audio-MAE

This repo hosts the code and models of "[Masked Autoencoders that Listen](http://arxiv.org/abs/2207.06405)" [NeurIPS 2022 [bib](https://github.com/facebookresearch/AudioMAE#citation)].

### Demo Examples

[Music](https://www.dropbox.com/s/96v5et19521hlau/Fig6_b.mp4?dl=0), [Speech](https://www.dropbox.com/s/tyzjc9sk6wch1zk/Fig6_a.mp4?dl=0), [Event Sound](https://www.dropbox.com/s/rgmqgulnl1l9mu2/Fig6_c.mp4?dl=0)

### 1. Installation
- This repo follows the [MAE repo](https://github.com/facebookresearch/mae); installation and preparation follow that repo.
- Copy files and patch the timm package by running `bash timm_patch.sh` (please change the path to your own timm package path). We use timm==0.3.2, for which a [fix](https://github.com/rwightman/pytorch-image-models/issues/420#issuecomment-776459842) is needed to work with PyTorch 1.8.1+.
- Please see [mae_env.yml](./mae_env.yml) for all the dependencies.
- You may also download the conda-packed [conda env](https://drive.google.com/file/d/1ECVmVyscVqmhI7OQa0nghIsWVaZhZx3q/view?usp=sharing), untar it, and then activate it:
```
source path_to_env/bin/activate
```

### 2. Prepare data
Please download AudioSet [here](https://research.google.com/audioset/). Due to copyright we cannot release the data. The data annotation JSON parsed and used in this work is available [here](https://drive.google.com/file/d/1cAiaL69HFm1zSW4hqFQpdhNfHiVKBFNA/view?usp=share_link); the format follows the one in [AST](https://github.com/YuanGongND/ast). Please be sure to modify the paths in the scripts to reflect your own setup.

### 3. Pretraining on AudioSet-2M
For the brave ones who want to pre-train on AudioSet-2M, please use pretrain_audioset2M.sh:
```
bash pretrain_audioset2M.sh
```

### 4. Fine-tuning on AudioSet-2M and AudioSet-20K
To fine-tune from an AudioSet-pretrained model, please use your own pretrained model from the previous step, or download our pre-trained [ckpt](https://drive.google.com/file/d/1ni_DV4dRf7GxM8k-Eirx71WP9Gg89wwu/view?usp=share_link) and put it under ./ckpt/. Then run submit_ft_mask_bal.sh:
```
bash submit_ft_mask_bal.sh 2e-4 0.2 0.2 ./ckpt/pretrained.pth
```
This performs weighted distributed sampling on the unbalanced AudioSet to fine-tune the model with class-balanced data for 100 epochs. The resulting mAP on AudioSet should be around 47.3. We provide our finetuned checkpoint [here](https://drive.google.com/file/d/18EsFOyZYvBYHkJ7_n7JFFWbj6crz01gq/view?usp=share_link).
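The class-balanced sampling itself is handled inside the provided scripts. For readers who want the gist, here is a minimal single-process PyTorch sketch of weighted sampling for a multi-label dataset with multi-hot targets; the toy data and the inverse-frequency weighting below are illustrative assumptions, not the repo's exact implementation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data: 1,000 clips, 10 classes, multi-hot targets (AudioSet itself has 527 classes).
num_clips, num_classes = 1000, 10
targets = (torch.rand(num_clips, num_classes) > 0.9).float()
targets[targets.sum(dim=1) == 0, 0] = 1.0   # make sure every clip has at least one label
features = torch.randn(num_clips, 128)      # stand-in for spectrogram features

# One weight per clip: sum of the inverse frequencies of its labels,
# so clips containing rare classes are drawn more often.
class_freq = targets.sum(dim=0)
class_weight = 1.0 / (class_freq + 1e-6)
sample_weight = (targets * class_weight).sum(dim=1)

sampler = WeightedRandomSampler(sample_weight, num_samples=num_clips, replacement=True)
loader = DataLoader(TensorDataset(features, targets), batch_size=32, sampler=sampler)

for spectrograms, labels in loader:
    pass  # a fine-tuning step would consume each (roughly class-balanced) batch here
```

The actual scripts apply this idea in a distributed, multi-GPU setting; the sketch above only covers the single-process case.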
An example log of finetuning is as follows:
```
[07:10:32.717347] log_dir: /checkpoint/berniehuang/experiments/419909
[07:10:36.394431] Epoch: [99] [ 0/781] eta: 0:47:51 lr: 0.000001 loss: 0.0066 (0.0066) time: 3.6761 data: 1.6724 max mem: 2606
[07:12:24.728503] Epoch: [99] [500/781] eta: 0:01:02 lr: 0.000001 loss: 0.0116 (0.0128) time: 0.2130 data: 0.0002 max mem: 2606
[07:13:24.602830] Epoch: [99] [780/781] eta: 0:00:00 lr: 0.000001 loss: 0.0122 (0.0128) time: 0.1837 data: 0.0003 max mem: 2606
[07:13:24.853957] Epoch: [99] Total time: 0:02:52 (0.2204 s / it)
[07:13:25.085416] Averaged stats: lr: 0.000001 loss: 0.0122 (0.0126)
[07:13:28.343364] Test: [ 0/79] eta: 0:02:01 time: 1.5353 data: 1.5029 max mem: 2606
[07:13:30.942012] Test: [78/79] eta: 0:00:00 time: 0.0206 data: 0.0001 max mem: 2606
[07:13:31.180169] Test: Total time: 0:00:04 (0.0554 s / it)
[07:13:42.547896] mAP: 0.472873
[07:13:42.552120] mAP of the network on the 19148 test images: 0.4728
[07:13:42.552198] Max mAP: 0.473
[07:13:42.566228] Training time 5:16:14
submitit INFO (2022-04-22 07:13:43,404) - Job completed successfully
```

You can also try fine-tuning on AudioSet-20K for 60 epochs with
```
sbatch ft_as.sh 1e-3 ./ckpt/pretrained.pth
```
The log.txt will look like:
```
{"train_lr": 2.1997867184321786e-06, "train_loss": 0.01310475811136991, "test_mAP": 0.36981118189071294, "epoch": 56, "n_parameters": 85659407}
{"train_lr": 1.6171788925401227e-06, "train_loss": 0.01304934614071496, "test_mAP": 0.37001905352752995, "epoch": 57, "n_parameters": 85659407}
{"train_lr": 1.2277041313086816e-06, "train_loss": 0.013038477757025324, "test_mAP": 0.36998449127640076, "epoch": 58, "n_parameters": 85659407}
{"train_lr": 1.0325878664284776e-06, "train_loss": 0.012981618695671238, "test_mAP": 0.36999196624276054, "epoch": 59, "n_parameters": 85659407}
```
The performance on AudioSet-20K is around 37.0 mAP.

### 5. Inference
To run inference with a finetuned model, put your finetuned model under ./ckpt, or download our finetuned [ckpt](https://drive.google.com/file/d/18EsFOyZYvBYHkJ7_n7JFFWbj6crz01gq/view?usp=share_link). Then:
```
bash inf.sh ckpt/finetuned.pth
```
This should give you 47.3 mAP on AudioSet. An example log is as follows:
```
[18:22:12.877430] number of params (M): 85.66
[18:22:12.877460] base lr: 2.00e-03
[18:22:12.877479] actual lr: 1.25e-04
[18:22:12.877495] accumulate grad iterations: 1
[18:22:12.877511] effective batch size: 16
[18:22:12.898235] criterion = BCEWithLogitsLoss()
[18:22:14.068845] Test: [ 0/1197] eta: 0:23:19 time: 1.1690 data: 1.0901 max mem: 1035
[18:22:55.447027] Test: [ 300/1197] eta: 0:02:06 time: 0.1402 data: 0.0001 max mem: 1046
[18:23:37.699615] Test: [ 600/1197] eta: 0:01:24 time: 0.1411 data: 0.0001 max mem: 1061
[18:24:20.110863] Test: [ 900/1197] eta: 0:00:41 time: 0.1417 data: 0.0001 max mem: 1075
[18:25:02.194206] Test: [1196/1197] eta: 0:00:00 time: 0.1526 data: 0.0001 max mem: 1090
[18:25:02.321579] Test: Total time: 0:02:49 (0.1415 s / it)
[18:25:11.997641] mAP: 0.472873
[18:25:12.004128] Accuracy of the network on the 19148 test images: 0.4729
```
Per-class APs can be found in ./aps.txt, and per-example results are saved in inf_output.npy.
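For reference, the mAP reported above is the mean of per-class average precision over AudioSet's 527 classes. Below is a minimal scikit-learn sketch of that computation on toy arrays; the array names are placeholders, and the exact layout of inf_output.npy is not documented here, so this is not a drop-in loader for that file.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
num_clips, num_classes = 2000, 527                                    # AudioSet eval has 527 classes
targets = (rng.random((num_clips, num_classes)) > 0.98).astype(int)   # toy multi-hot labels
scores = rng.random((num_clips, num_classes))                         # stand-in for model outputs

# Per-class average precision, skipping classes with no positives in this toy subset.
aps = [
    average_precision_score(targets[:, c], scores[:, c])
    for c in range(num_classes)
    if targets[:, c].sum() > 0
]
print(f"mAP over {len(aps)} classes: {np.mean(aps):.4f}")
```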
### Checkpoints
1. ViT-B, AS-2M [pretrained](https://drive.google.com/file/d/1ni_DV4dRf7GxM8k-Eirx71WP9Gg89wwu/view?usp=share_link)
2. ViT-B, AS-2M pretrained+[finetuned](https://drive.google.com/file/d/18EsFOyZYvBYHkJ7_n7JFFWbj6crz01gq/view?usp=share_link)

### Updates
- [x] Code and Model Release
- [x] Provide conda-pack envs
- [ ] Notebook demos for reconstruction (legal blocked)
- [ ] Additional exps

### Citation
```
@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}
```

### Contact
Please contact Bernie Huang (berniehuang@meta.com) if you have any questions. Thank you.

### Reference
The codebase is based on the awesome [MAE](https://github.com/facebookresearch/mae) and [AST](https://github.com/YuanGongND/ast) repos.

### License
This project is under the CC-BY 4.0 license. See [LICENSE](LICENSE) for details.