# HSP

**Repository Path**: coderxiaoluo/HSP

## Basic Information

- **Project Name**: HSP
- **Description**:  Hamming distance Similarity search in Postgresql
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2023-10-05
- **Last Updated**: 2023-10-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# HSP

HSP(Hamming simalirity Search in Postgres) is a extension which can support r-neighbor search  and knn(k-nearest neighbor) search in postgres for hamming vecotor efficiently.

## Installation

Compile ans install the extension (tested on Postgres 12.5)

```shell
git clone https://github.com/lss602726449/HSP.git
cd HSP 
make && make install 
cd dependecy/hstore
make && make install
cd ../..
```

Load the extension in postgres

```sql
CREATE EXTENSION IF NOT EXISTS hsp;
CREATE EXTENSION IF NOT EXISTS hstore;
```

## Getting Started

Create a table with 64 dimensions hamming vector (keep the primary key id for index and simalrity search)

```
CREATE TABLE test(id SERIAL PRIMARY KEY, val hmcode(64)));
```

Insert values (hmcode use multiple uint8 separated by comma to represent high-dimensional hamming vectors)

```sql
INSERT INTO test(val) VALUES ('{0,0,0,0,0,0,0,0}'), ('{1,0,0,0,0,0,0,0}'), ('{1,1,0,0,0,0,0,0}');
```

Brute-force search

```sql
SELECT * FROM test WHERE hamming_distance('{0,0,0,0,0,0,0,0}', val)<=1; -- r-neighbor search 
SELECT * FROM test ORDER BY hamming_distance('{0,0,0,0,0,0,0,0}', val) limit 2; -- knn search
```

## Similarity Search

For dataset size larger than 100 thousand, brute-force algorithm may suffer a high time overhead. Serveral algorithm have been proposed to solve the r-neighbor search problem and knn search problem. In this extension, We use the GPH algorithm to better solve the r-neighbor search in postgres, due to its efficiency. GPH algorithm obtains a more tight filting condition and enable flexible threshold allocation. It further uses dynamic programming algorithm to reduce the number of candidate sets. For knn search, multi-index (MIH) algorithm is modified to accelerate the search. For small datasets, it is not recommended to use index and similarity search algorithms.

### Indexing

Multiple hash tables are build as index shared by GPH and MIH. In order to make the dimension of each partition about 23, we set m = 3 for 64 dimension vector, and m=5 for  128. Trigger is also added in the UDF of *create_index* to keep the consistency of vector and index. (There must be at least one vector in the table)

```sql
SELECT set_m(3);
SELECT create_index('test', 'val');
```

### R-neighbor Search

*search_thres* takes query vector, tablename, clomnname, r as input, which searchs in table for all vectors hamming distance not greater than r. The result is exact.

```sql
SELECT * FROM search_thres('{0,0,0,0,0,0,0,0}', NULL::test, 'val', 2);
```

### KNN Search

*search_knn_mih*  takes query vector, tablename, clomnname, k as input, which searchs in table for k vectors which close to query vector. *set_pv_num* set the num of candidates to be post-verificated, which determines the search quality and search time. The result is not exact, we evaluate the search quality by recall. In our experiment, for sift1M_64 dataset, k= 10,   pv_num = 1000, recall rate of  0.936 can be achieved.

```sql
SELECT set_pv_num(1000);
SELECT *, hamming_distance(val, '{0,0,0,0,0,0,0,0}') AS dis  FROM search_knn_mih('{0,0,0,0,0,0,0,0}', NULL::test, 'val', 10);
```

## Example With SIFT1M_64

load the hamming vector dataset with a table named test_hsp in *data/sift1m.sql*

```shell
createdb test
psql -f data/sift1m.sql -d test
```

Create index and execute search

```sql
SELECT set_m(3);
SELECT create_index('test_hsp', 'val');

SELECT * FROM search_thres('{0,0,0,0,0,0,0,0}', NULL::test_hsp, 'val', 2);
SELECT set_pv_num(1000);
SELECT *, hamming_distance(val, '{0,0,0,0,0,0,0,0}') AS dis  FROM search_knn_mih('{0,0,0,0,0,0,0,0}', NULL::test_hsp, 'val', 10);
```

## Thanks

Thanks to:

* [GPH: Similarity Search in Hamming Space](https://ieeexplore.ieee.org/document/8509234)
* [Pigeonring: A Principle for FasterThresholded Similarity Search](http://www.vldb.org/pvldb/vol12/p28-qin.pdf)
* [Fast Exact Search in Hamming Spacewith Multi-Index Hashing](https://arxiv.org/pdf/1307.2982.pdf)
* [pgvector](https://github.com/pgvector/pgvector)