# bayon

**Repository Path**: mirrors_hankcs/bayon

## Basic Information

- **Project Name**: bayon
- **Description**: a simple and fast clustering tool
- **Primary Language**: Unknown
- **License**: LGPL-2.1
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-08
- **Last Updated**: 2026-05-23

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[Tutorial in Japanese](https://github.com/fujimizu/bayon/wiki/Tutorial_Japanese)

[Tutorial in English](https://github.com/fujimizu/bayon/wiki/Tutorial_English)

## Overview ##

**Bayon** is a simple and fast hard-clustering tool.
**Bayon** supports Repeated Bisection clustering and K-means clustering.

## Install ##

```
% ./configure
% make
% sudo make install
```

## Usage ##

### Clustering input data ###

```
% bayon -n num [options] file
% bayon -l limit [options] file
   -n, --number=num      the number of clusters
   -l, --limit=lim       limit value of cluster bisection
   -p, --point           output similarity points
   -c, --clvector=file   save the vectors of cluster centroids
   --clvector-size=num   max size of output vectors of
                         cluster centroids (default: 50)
   --method=method       clustering method(rb, kmeans), default:rb
   --seed=seed           set a seed for random number generator
```

### Get similar clusters for each input document ###

```
% bayon -C file [options] file
   -C, --classify=file   target vectors
   --inv-keys=num        max size of the keys of each vector to be
                         looked up in inverted index (default: 20)
   --inv-size=num        max size of the inverted index of each key
                         (default: 100)
   --classify-size=num   max size of output similar groups
                         (default: 20)
```

### Common options ###

```
   --vector-size=num     max size of each input vector
   --idf                 apply idf to input vectors
   -h, --help            show help messages
   -v, --version         show the version and exit
```

## Example ##

* clustering (number\_of\_output\_clusters = 100)

```
% bayon -n 100 input.tsv > cluster.tsv
```

* clustering (save vectors of cluster centroids)

```
% bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv
```

* classification (get similar clusters for input documents)

```
% bayon -C centroid.tsv input.tsv > classify.tsv
```

## Format of Input Data ##

### List of the vectors of input documents for clustering and classification ###

```
document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
```

* document\_id : string
* key : string
* value : double

### List of the vectors of cluster centroids ###

```
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
```

* cluster\_id : string
* key : string
* value : double

## Format of Output Data ##

### List of clusters (output of clustering) ###

```
cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...
```

* cluster\_id : integer (>= 1)
* document\_id : string

### List of clusters with similarity values between documents and clusters (when clustering with the --point option) ###

```
cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...
```

* cluster\_id : integer (>= 1)
* document\_id : string
* point : double

### List of the vectors of cluster centroids (when clustering with the --clvector option) ###

```
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
```

* cluster\_id : integer (>= 1)
* key : string
* value : double

### List of similar clusters for each input document ###

```
document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...
```

* document\_id : string
* cluster\_id : string
* point : double

## Requirements ##

* C++ compiler with STL (Standard Template Library)

### Recommended ###

* [google-sparsehash](http://code.google.com/p/google-sparsehash/)
    * If google-sparsehash is not installed, this clustering tool uses "\_\_gnu\_cxx::hash\_map" or "std::map"

## License ##

GPL2 (GNU General Public License Version 2)

## Author ##

Mizuki Fujisawa <[fujisawa@bayon.cc](mailto:fujisawa@bayon.cc)>
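
## Appendix: Scripting the Data Formats ##

The tab-separated layouts listed under "Format of Input Data" and "Format of Output Data" are simple enough to produce and consume from a short script. The following Python sketch is not part of bayon. The function names `format_vectors` and `parse_clusters` are hypothetical helpers, and the sketch assumes that fields are separated by a single tab with no surrounding spaces (the spaces around `\t` in the format listings are read here as being for legibility only).

```python
# Illustrative helpers only; these names are not provided by bayon.

def format_vectors(docs):
    """Serialize {document_id: {key: value}} into the input layout:
    document_id \t key \t value \t key \t value ... (one document per line).
    Assumes ids and keys contain no tab or newline characters.
    """
    lines = []
    for doc_id, features in docs.items():
        fields = [doc_id]
        for key, value in features.items():
            fields.append(key)
            fields.append(repr(float(value)))  # values are doubles
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


def parse_clusters(text, with_points=False):
    """Parse clustering output into a dict keyed by integer cluster_id.

    with_points=False: {cluster_id: [document_id, ...]}
    with_points=True:  {cluster_id: [(document_id, point), ...]}
                       (the layout written when --point is used)
    """
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        cluster_id = int(fields[0])
        members = fields[1:]
        if with_points:
            # Members arrive as alternating document_id / point pairs.
            members = [
                (members[i], float(members[i + 1]))
                for i in range(0, len(members) - 1, 2)
            ]
        clusters[cluster_id] = members
    return clusters


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {"doc1": {"apple": 1, "banana": 2.5}, "doc2": {"apple": 3}}
    print(format_vectors(sample), end="")
    print(parse_clusters("1\tdoc1\tdoc2\n"))
```

A typical workflow would write the string returned by `format_vectors` to a file such as `input.tsv`, run one of the commands from the Example section, and pass the contents of `cluster.tsv` to `parse_clusters`.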