# ml-acn-embed **Repository Path**: mirrors_apple/ml-acn-embed ## Basic Information - **Project Name**: ml-acn-embed - **Description**: Acoustic Neighbor Embeddings - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-12-06 - **Last Updated**: 2026-05-23 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Acoustic Neighbor Embeddings Official source code, documentation, and other files for training models and replicating the experiments in the paper, [_A Theoretical Framework for Acoustic Neighbor Embeddings_](https://arxiv.org/abs/2412.02164). Pretrained models with an accessible end-user Python interface are also provided. ## Quick start for end users of pretrained embedders Automatically install the software and essential dependencies: ```commandline pip install git+https://github.com/apple/ml-acn-embed@main ``` Download and extract pretrained models (total 705 MB): ```commandline wget https://ml-site.cdn-apple.com/models/ml-acn-embed/model.tgz -O - | tar xz ``` Compute audio embedding for a segment in an audio file: ```commandline acn_embed_audio model/embedder-64 --wav model/examples/librivox-adrift-in-new-york.wav \ --no-dither --start-ms 950 --end-ms 1410 [[-0.455 -0.2826 -0.4556 -0.1951 0.0635 -0.2371 -0.1635 0.1702 -0.3562 -0.451 0.2396 0.4862 0.3348 0.5978 -0.4025 -0.2422 -0.1461 -0.6631 -0.1315 -0.1655 0.5091 0.5982 0.4661 -0.0462 -1.1064 0.1496 0.321 0.0633 0.3954 -0.0344 0.2964 -0.1347 0.9364 0.3259 -0.6774 0.0106 0.2444 -0.1617 0.4076 0.0614 0.9511 -0.1825 -0.3518 0.7029 0.0263 -0.0147 0.1475 0.0644 -0.5739 0.4216 -0.304 -0.1987 0.0066 -0.1506 0.0399 -0.9484 -0.1181 -0.2064 -0.1856 -0.4535 0.7452 0.1771 0.2255 0.2512]] ``` Compute text embedding for the phone sequence `[N UW1 Y AO1 R K]` (the full list of supported phones is [here](src/acn_embed/setup/non-sil-phones.json), following the [CMU dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)): ```commandline acn_embed_phones model/embedder-64 --pron "N UW1 Y AO1 R K" [[-0.4033 -0.312 -0.4508 -0.2395 0.0018 -0.1536 -0.1881 0.1909 -0.3706 -0.408 0.2131 0.4712 0.3475 0.6508 -0.4154 -0.2339 -0.1123 -0.663 -0.1397 -0.0339 0.5486 0.6744 0.4474 -0.0982 -1.1024 0.135 0.3345 0.0854 0.4062 -0.0168 0.2736 -0.09 0.9067 0.3249 -0.6679 -0.0083 0.2754 -0.1247 0.4664 0.139 0.9605 -0.1718 -0.425 0.781 0.0105 0.0487 0.2395 0.0724 -0.5494 0.3845 -0.3735 -0.196 -0.009 -0.2369 0.0323 -0.9137 -0.0118 -0.1925 -0.2527 -0.3857 0.8111 0.2302 0.2463 0.2595]] ``` Compute text embedding for the grapheme sequence `NEW YORK` (the space character " " is considered a grapheme; the full list of supported graphemes is [here](src/acn_embed/embed/model/g_embedder/graphemes.json)): ```commandline acn_embed_graphemes model/embedder-64 --orth "NEW YORK" [[-4.0224e-01 -3.0963e-01 -4.8059e-01 -2.6145e-01 3.7101e-04 -1.3895e-01 -2.3018e-01 1.8403e-01 -3.5109e-01 -4.1349e-01 2.4302e-01 4.8039e-01 3.7329e-01 6.2355e-01 -4.2596e-01 -2.1117e-01 -9.3265e-02 -6.4991e-01 -1.3159e-01 -2.4362e-02 5.3998e-01 6.0438e-01 4.5777e-01 -7.0070e-02 -1.0676e+00 1.2307e-01 3.2640e-01 8.6234e-02 4.2766e-01 -3.1238e-02 2.8564e-01 -1.0514e-01 9.0636e-01 3.1695e-01 -6.5648e-01 3.2521e-02 2.7298e-01 -1.5310e-01 4.5591e-01 1.8780e-01 9.9554e-01 -1.4303e-01 -3.9783e-01 7.8327e-01 5.9864e-03 5.9943e-02 2.1847e-01 5.8644e-02 -5.1259e-01 3.6053e-01 -3.2967e-01 -2.0133e-01 -1.0037e-02 -2.4647e-01 3.3688e-02 -9.0557e-01 -3.3233e-02 -2.0612e-01 -2.2083e-01 -3.8905e-01 7.9650e-01 2.3235e-01 2.6399e-01 2.5961e-01]] ``` Notice that the three vectors above are all similar, because they represent similar sounds. Run the above tools with the `--help` option for further documentation. To compute many embeddings in bulk, see docs [here](doc/misc/batch_embed.md). ### Phonetic nearest neighbor search An interactive demo is included for doing nearest-neighbor searches using grapheme embeddings to find phonetically-similar words. In addition to installing the software and pretrained models above, download the vocabulary (total tarball size 866 MB): ```commandline wget https://ml-site.cdn-apple.com/models/ml-acn-embed/wakeword.tgz -O - | tar xz ``` Run the tool and type in any arbitrary string to find phonetically-similar entries in a vocabulary of 195k (controllable by `--lm-score-thres`) words: ```commandline acn_nnsearch --embeddings wakeword/embeddings-3-gram.pruned.1e-7.pt \ --lm-score-thres -14 --model model/embedder-64 \ --num-results 3 --strings wakeword/str2score.3-gram.pruned.1e-7.pt Reading strings... Reading embeddings... Loaded 194550 strings after pruning QUERY>> create dist=0.612 CRATE dist=0.618 CRETE dist=0.865 CREEDE QUERY>> wreck a nice dist=1.052 RECOGNISE dist=1.085 RECOGNIZE dist=1.088 RECOGNISED QUERY>> I scream dist=0.561 ICE CREAM dist=1.543 EXTREME dist=1.787 A STREAM ``` ## Table of contents 1. [Set up environment and prepare data](doc/setup.md) 1. Train an acoustic frontend 1. [Train and test HMM acoustic model](doc/fe_am/hmm.md) 1. [Train DNN-HMM acoustic model](doc/fe_am/train_cd_dnn.md) 1. [Train monophone DNN-HMM acoustic model](doc/fe_am/train_mono_dnn.md) 1. [Test DNN-HMM acoustic models](doc/fe_am/test_dnn.md) 1. Train embedders for acoustic neighbor embeddings 1. [Prepare data](doc/embed/dataprep.md) 1. [Train embedders](doc/embed/train.md) 1. Experiments 1. [Word classification](doc/exp/wordclassify.md) 1. [OOV recovery](doc/exp/oovrecovery.md) 1. [Dialect clustering](doc/exp/dialect.md) 1. [Wake-up word confusion](doc/exp/wakeword.md) 1. Miscellaneous 1. [Standalone forced-alignment tool](doc/misc/forcealign.md) 2. [Batch computation of embeddings](doc/misc/batch_embed.md) 1. [List of all downloadables](doc/downloads.md) ## BibTeX reference If you find this work useful, please consider citing it as follows: ``` @misc{acn-embed, title={A Theoretical Framework for Acoustic Neighbor Embeddings}, author={Woojay Jeon}, year={2024}, eprint={2412.02164}, archivePrefix={arXiv}, url={https://arxiv.org/abs/2412.02164}, } ```