# bert-as-service
► Jina 101: First Thing to Learn About Jina (English • 日本語 • français • Deutsch • Русский язык • 中文) • ► From BERT-as-Service to X-as-Service: learn how to use Jina to extract feature vectors using any deep learning representation
Using the BERT model as a sentence encoding service, i.e. mapping a variable-length sentence to a fixed-length vector.
Highlights • What is it • Install • Getting Started • API • Tutorials • FAQ • Benchmark • Blog
     
| Model | Details |
|-------|---------|
| BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters |






##### **Q:** Why does the server need two ports?
**A:** One port is for pushing text data into the server, and the other is for publishing the encoded results to the client(s). In this way, we get rid of back-chatter, meaning that at every level recipients never talk back to senders. The overall message flow is strictly one-way, as depicted in the figure above. Eliminating back-chatter is essential to real scalability; it is what allows `BertClient` to be used asynchronously.
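As a minimal sketch (assuming the pip-installed `bert-serving-client` package and the default port numbers; if you copied `client.py` directly, import `BertClient` from there instead):
```python
from bert_serving.client import BertClient

# requests are pushed to `port`; encoded results come back on `port_out`.
# These must match the -port / -port_out values passed to the server.
bc = BertClient(ip='localhost', port=5555, port_out=5556)
vec = bc.encode(['First do it', 'then do it right', 'then do it better'])
```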
##### **Q:** Do I need TensorFlow on the client side?
**A:** No. Think of `BertClient` as a general feature extractor whose output can be fed to *any* ML model, e.g. `scikit-learn`, `pytorch`, `tensorflow`. The only file the client needs is [`client.py`](service/client.py). Copy this file to your project and import it, then you are ready to go.
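For instance, here is a hedged sketch of feeding the encoded vectors into scikit-learn (the sentences and labels are hypothetical):
```python
from bert_serving.client import BertClient
from sklearn.linear_model import LogisticRegression

bc = BertClient()
sentences = ['the weather is great today', 'it is raining heavily']
labels = [1, 0]  # hypothetical labels aligned with `sentences`

X = bc.encode(sentences)  # a (num_sentences, 768) array for a BERT-Base model
clf = LogisticRegression().fit(X, labels)
```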
##### **Q:** Can I use the multilingual BERT model provided by Google?
**A:** Yes.
##### **Q:** Can I use my own fine-tuned BERT model?
**A:** Yes. In fact, this is recommended. Make sure you have the following three items in `model_dir` (a quick sanity check is sketched after this list):
                             
- A TensorFlow checkpoint (`bert_model.ckpt`) containing the pre-trained weights (which is actually 3 files).
- A vocab file (`vocab.txt`) to map WordPiece to word id.
- A config file (`bert_config.json`) which specifies the hyperparameters of the model.
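The following is an illustrative sanity check, not part of the library; `model_dir` is a hypothetical path:
```python
import os
from glob import glob

model_dir = '/path/to/your/model_dir'  # hypothetical path

# the checkpoint itself is stored as several files sharing the bert_model.ckpt prefix
assert glob(os.path.join(model_dir, 'bert_model.ckpt*')), 'missing TensorFlow checkpoint'
assert os.path.isfile(os.path.join(model_dir, 'vocab.txt')), 'missing vocab.txt'
assert os.path.isfile(os.path.join(model_dir, 'bert_config.json')), 'missing bert_config.json'
```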
##### **Q:** Can I run it in Python 2?
**A:** Server side no, client side yes. This is based on the consideration that Python 2.x might still be a major piece in some tech stacks. Migrating the whole downstream stack to Python 3 to support `bert-as-service` can take quite some effort. On the other hand, setting up `BertServer` is just a one-time thing, which can even be [run in a docker container](#run-bert-service-on-nvidia-docker). To ease the integration, we support Python 2 on the client side so that you can directly use `BertClient` as a part of your Python 2 project, whereas the server side should always be hosted with Python 3.
##### **Q:** Do I need to do segmentation for Chinese?
**A:** No. If you are using [the pretrained Chinese BERT released by Google](https://github.com/google-research/bert#pre-trained-models), you don't need word segmentation, as this Chinese BERT is a character-based model. It won't recognize words/phrases even if you intentionally add spaces in between. To see this more clearly, here is what the BERT model actually receives after tokenization:
```python
bc.encode(['hey you', 'whats up?', '你好么?', '我 还 可以'])
```
```
tokens: [CLS] hey you [SEP]
input_ids: 101 13153 8357 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] what ##s up ? [SEP]
input_ids: 101 9100 8118 8644 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 你 好 么 ? [SEP]
input_ids: 101 872 1962 720 8043 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 我 还 可 以 [SEP]
input_ids: 101 2769 6820 1377 809 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
```
That means the word embedding is actually the character embedding for Chinese-BERT.
##### **Q:** Why is my (English) word tokenized to `##something`?
**A:** Because your word is out-of-vocabulary (OOV). The tokenizer from Google uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
For example:
```python
input = "unaffable"
tokenizer_output = ["un", "##aff", "##able"]
```
##### **Q:** Can I use my own tokenizer?
**A:** Yes. If you have already tokenized the sentences on your own, simply call `encode` with a `List[List[str]]` as input and turn on `is_tokenized`, i.e. `bc.encode(texts, is_tokenized=True)`.
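For example (a minimal sketch; the tokens are purely illustrative and `bc` is an existing `BertClient`):
```python
texts = [['hello', 'world', '!'], ['thanks', 'for', 'reading']]
vec = bc.encode(texts, is_tokenized=True)  # one fixed-length vector per pre-tokenized sentence
```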
##### **Q:** I encounter `zmq.error.ZMQError: Operation cannot be accomplished in current state` when using `BertClient`, what should I do?
**A:** This is often due to the misuse of `BertClient` in a multi-thread/process environment. Note that you can't reuse one `BertClient` among multiple threads/processes; you have to make a separate instance for each thread/process. For example, the following won't work at all:
```python
# BAD example
bc = BertClient()
# in Proc1/Thread1 scope:
bc.encode(lst_str)
# in Proc2/Thread2 scope:
bc.encode(lst_str)
```
Instead, please do:
```python
# in Proc1/Thread1 scope:
bc1 = BertClient()
bc1.encode(lst_str)
# in Proc2/Thread2 scope:
bc2 = BertClient()
bc2.encode(lst_str)
```
##### **Q:** After running the server, I have several garbage `tmpXXXX` folders. How can I change this behavior?
**A:** These folders are used by ZeroMQ to store sockets. You can choose a different location by setting the environment variable `ZEROMQ_SOCK_TMP_DIR`:
`export ZEROMQ_SOCK_TMP_DIR=/tmp/`
##### **Q:** The cosine similarity of two sentence vectors is unreasonably high (e.g. always > 0.8), what's wrong?
**A:** A decent representation for a downstream task doesn't mean that it will be meaningful in terms of cosine distance, since cosine distance operates in a linear space where all dimensions are weighted equally. If you want to use cosine distance anyway, then please focus on the rank, not the absolute value. Namely, do not use:
```
if cosine(A, B) > 0.9, then A and B are similar
```
Please consider the following instead:
```
if cosine(A, B) > cosine(A, C), then A is more similar to B than C.
```
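As a hedged sketch using NumPy directly (the sentences are hypothetical and `bc` is an existing `BertClient`):
```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

A, B, C = bc.encode(['the cat sat on the mat',
                     'a cat was sitting on a mat',
                     'stock prices fell sharply today'])

# compare by rank: the absolute similarity values themselves are not meaningful
print(cosine(A, B) > cosine(A, C))
```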
The graph below illustrates the pairwise similarity of 3000 Chinese sentences randomly sampled from the web (char. length < 25). We compute cosine similarity based on the sentence vectors and [Rouge-L](https://en.wikipedia.org/wiki/ROUGE_(metric)) based on the raw text. The diagonal (self-correlation) is removed for the sake of clarity. As one can see, there is some positive correlation between these two metrics.

| `max_seq_len` | 1 GPU (seqs/s) | 2 GPU (seqs/s) | 4 GPU (seqs/s) |
|---------------|----------------|----------------|----------------|
| 20            | 903   | 1774  | 3254  |
| 40            | 473   | 919   | 1687  |
| 80            | 231   | 435   | 768   |
| 160           | 119   | 237   | 464   |
| 320           | 54    | 108   | 212   |
#### Speed wrt. `client_batch_size`
`client_batch_size` is the number of sequences sent from a client when invoking `encode()`. For performance reasons, please consider encoding sequences in batches rather than encoding them one by one.
For example, do:
```python
# prepare your sentences in advance
bc = BertClient()
my_sentences = [s for s in my_corpus.iter()]
# encode them all in one shot
vec = bc.encode(my_sentences)
```
DON'T:
```python
bc = BertClient()
vec = []
for s in my_corpus.iter():
    vec.append(bc.encode(s))
```
It's even worse if you put `BertClient()` inside the loop. Don't do that.
| `client_batch_size` | 1 GPU (seqs/s) | 2 GPU (seqs/s) | 4 GPU (seqs/s) |
|---------------------|----------------|----------------|----------------|
| 1                   | 75    | 74    | 72    |
| 4                   | 206   | 205   | 201   |
| 8                   | 274   | 270   | 267   |
| 16                  | 332   | 329   | 330   |
| 64                  | 365   | 365   | 365   |
| 256                 | 382   | 383   | 383   |
| 512                 | 432   | 766   | 762   |
| 1024                | 459   | 862   | 1517  |
| 2048                | 473   | 917   | 1681  |
| 4096                | 481   | 943   | 1809  |
#### Speed wrt. `num_client`
`num_client` represents the number of clients connected to the server concurrently.
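A minimal sketch of how such a concurrent workload could be generated (purely illustrative; note that each process creates its own `BertClient`, as required by the FAQ above):
```python
from multiprocessing import Pool
from bert_serving.client import BertClient

def encode_batch(batch):
    # one BertClient per process; never share a client across processes
    bc = BertClient()
    return bc.encode(batch)

if __name__ == '__main__':
    batches = [['hello world'] * 16 for _ in range(4)]  # hypothetical workload
    with Pool(processes=4) as pool:
        results = pool.map(encode_batch, batches)
```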
| `num_client` | 1 GPU (seqs/s) | 2 GPU (seqs/s) | 4 GPU (seqs/s) |
|--------------|----------------|----------------|----------------|
| 1            | 473   | 919   | 1759  |
| 2            | 261   | 512   | 1028  |
| 4            | 133   | 267   | 533   |
| 8            | 67    | 136   | 270   |
| 16           | 34    | 68    | 136   |
| 32           | 17    | 34    | 68    |
As one can observe, 1 client with 1 GPU gives 381 seqs/s, 2 clients with 2 GPUs give 402 seqs/s, and 4 clients with 4 GPUs give 413 seqs/s. This shows the efficiency of our parallel pipeline and job scheduling: the service can leverage GPU time more exhaustively as concurrent requests increase.
#### Speed wrt. `max_batch_size`
`max_batch_size` is a parameter on the server side, which controls the maximum number of samples per batch per worker. If an incoming batch from a client is larger than `max_batch_size`, the server will split it into small batches so that each of them is no larger than `max_batch_size` before sending it to workers.
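An illustrative sketch of this splitting behavior (not the server's actual code):
```python
def split_batch(samples, max_batch_size):
    # yield consecutive chunks of at most max_batch_size samples each
    for i in range(0, len(samples), max_batch_size):
        yield samples[i:i + max_batch_size]

# e.g. a 70-sample request with max_batch_size=32 becomes chunks of 32, 32 and 6
chunks = list(split_batch(list(range(70)), 32))
assert [len(c) for c in chunks] == [32, 32, 6]
```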
| `max_batch_size` | 1 GPU (seqs/s) | 2 GPU (seqs/s) | 4 GPU (seqs/s) |
|------------------|----------------|----------------|----------------|
| 32               | 450   | 887   | 1726  |
| 64               | 459   | 897   | 1759  |
| 128              | 473   | 931   | 1816  |
| 256              | 473   | 919   | 1688  |
| 512              | 464   | 866   | 1483  |
#### Speed wrt. `pooling_layer`
`pooling_layer` determines the encoding layer that pooling operates on. For example, in a 12-layer BERT model, `-1` represents the layer closest to the output and `-12` the layer closest to the embedding layer. As one can observe below, the depth of the pooling layer affects the speed.
| `pooling_layer` | 1 GPU (seqs/s) | 2 GPU (seqs/s) | 4 GPU (seqs/s) |
|-----------------|----------------|----------------|----------------|
| [-1]            | 438   | 844   | 1568  |
| [-2]            | 475   | 916   | 1686  |
| [-3]            | 516   | 995   | 1823  |
| [-4]            | 569   | 1076  | 1986  |
| [-5]            | 633   | 1193  | 2184  |
| [-6]            | 711   | 1340  | 2430  |
| [-7]            | 820   | 1528  | 2729  |
| [-8]            | 945   | 1772  | 3104  |
| [-9]            | 1128  | 2047  | 3622  |
| [-10]           | 1392  | 2542  | 4241  |
| [-11]           | 1523  | 2737  | 4752  |
| [-12]           | 1568  | 2985  | 5303  |
#### Speed wrt. `-fp16` and `-xla`
`bert-as-service` supports two additional optimizations: half-precision and XLA, which can be turned on by adding `-fp16` and `-xla` to `bert-serving-start`, respectively. To enable these two options, you have to meet the following requirements:
- your GPU supports FP16 instructions;
- your TensorFlow is self-compiled with XLA and `-march=native`;
- your CUDA and cuDNN are not too old.
On Tesla V100 with `tensorflow=1.13.0-rc0` it gives: