### WARNING
This code was written in 2019, when I was not yet very familiar with the Transformer model, so don't trust it too much. I am not actively maintaining this code; if you find bugs and want to fix them, please open a pull request.

# Transformer
My own implementation of the Transformer model (Attention Is All You Need - Google Brain, 2017)

![model](image/model.png)

## 1. Implementations

### 1.1 Positional Encoding

![model](image/positional_encoding.jpg)

```python
class PositionalEncoding(nn.Module):
    """
    compute sinusoid encoding.
    """

    def __init__(self, d_model, max_len, device):
        """
        constructor of sinusoid encoding class

        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()

        # same size as the input matrix (so it can be added to the input)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradients

        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent the word's position

        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # '_2i' holds the even indices of d_model (e.g. d_model = 50, '_2i' = [0, 2, ..., 48])
        # "step=2" gives 2 * i directly

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute positional encoding to consider positional information of words

    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]

        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]

        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will be added to tok_emb : [128, 30, 512]
```
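A quick sanity check of the shapes (this snippet is illustrative, not part of the repo; the vocabulary size and tensor sizes are made up): the module returns `[seq_len, d_model]`, which broadcasts when added to a batch of token embeddings.

```python
import torch
import torch.nn as nn

# assume PositionalEncoding from above is defined in this module
pe = PositionalEncoding(d_model=512, max_len=512, device=torch.device("cpu"))

tokens = torch.randint(0, 1000, (128, 30))    # [batch_size, seq_len]
tok_emb = nn.Embedding(1000, 512)(tokens)     # [128, 30, 512]

pos_enc = pe(tokens)                          # [30, 512]
x = tok_emb + pos_enc                         # broadcasts to [128, 30, 512]
print(pos_enc.shape, x.shape)
```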

### 1.2 Multi-Head Attention

![model](image/multi_head_attention.jpg)

```python
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

        # 2. split tensor by number of heads
        q, k, v = self.split(q), self.split(k), self.split(v)

        # 3. do scale dot product to compute similarity
        out, attention = self.attention(q, k, v, mask=mask)

        # 4. concat and pass to linear layer
        out = self.concat(out)
        out = self.w_concat(out)

        # 5. visualize attention map
        # TODO : we should implement visualization

        return out

    def split(self, tensor):
        """
        split tensor by number of heads

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()

        d_tensor = d_model // self.n_head
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        # it is similar to group convolution (split by number of heads)

        return tensor

    def concat(self, tensor):
        """
        inverse function of self.split(tensor : torch.Tensor)

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()
        d_model = head * d_tensor

        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
```
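A quick shape check I find useful (not part of the repo; it assumes `ScaleDotProductAttention` from the next section and `import math` are available): `split` and `concat` are exact inverses, and the attention output keeps the input shape.

```python
import torch

mha = MultiHeadAttention(d_model=512, n_head=8)

x = torch.randn(2, 30, 512)               # [batch_size, length, d_model]
heads = mha.split(x)                       # [2, 8, 30, 64]
assert torch.equal(mha.concat(heads), x)   # concat exactly inverts split

out = mha(q=x, k=x, v=x)                   # self-attention, no mask
print(heads.shape, out.shape)              # [2, 8, 30, 64], [2, 30, 512]
```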

### 1.3 Scale Dot Product Attention

![model](image/scale_dot_product_attention.jpg)

```python
class ScaleDotProductAttention(nn.Module):
    """
    compute scale dot product attention

    Query : given sentence that we focus on (decoder)
    Key : every sentence to check relationship with Query (encoder)
    Value : every sentence, same as Key (encoder)
    """

    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None, e=1e-12):
        # input is a 4-dimensional tensor
        # [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)                   # transpose
        score = (q @ k_t) / math.sqrt(d_tensor)   # scaled dot product

        # 2. apply masking (optional)
        if mask is not None:
            score = score.masked_fill(mask == 0, -10000)

        # 3. pass through softmax to map scores into the [0, 1] range
        score = self.softmax(score)

        # 4. multiply with Value
        v = score @ v

        return v, score
```
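The masks passed to `masked_fill` are not shown in this README. Below is my own sketch (not the repo's exact code; `pad_idx` is a hypothetical padding index) of the two mask shapes the attention expects: a padding mask over source keys and a lower-triangular "no peek" mask for decoder self-attention.

```python
import torch

pad_idx = 1                                               # hypothetical padding index
src = torch.tensor([[5, 7, 2, pad_idx]])                  # [batch_size, src_len]

# padding mask: 1 where the key token is real, 0 where it is padding
src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)     # [batch, 1, 1, src_len]

# causal (no-peek) mask for decoder self-attention
trg_len = 4
no_peek = torch.tril(torch.ones(trg_len, trg_len)).bool() # [trg_len, trg_len]

attention = ScaleDotProductAttention()
q = k = v = torch.randn(1, 8, 4, 64)                      # [batch, head, length, d_tensor]
out, score = attention(q, k, v, mask=no_peek)             # masked positions get -10000 before softmax
print(out.shape, score.shape)
```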

### 1.4 Layer Norm

![model](image/layer_norm.jpg)

```python
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-12):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, unbiased=False, keepdim=True)
        # '-1' means last dimension.

        out = (x - mean) / torch.sqrt(var + self.eps)
        out = self.gamma * out + self.beta
        return out
```
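An optional sanity check (mine, not part of the repo): with matching `eps`, this hand-rolled LayerNorm should agree with `torch.nn.LayerNorm` at initialization, since both use the biased variance and start with gamma = 1, beta = 0.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 30, 512)
custom = LayerNorm(d_model=512, eps=1e-5)
builtin = nn.LayerNorm(512, eps=1e-5)

print(torch.allclose(custom(x), builtin(x), atol=1e-6))   # expected: True
```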

### 1.5 Positionwise Feed Forward

![model](image/positionwise_feed_forward.jpg)

```python
class PositionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x
```
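Because `nn.Linear` operates on the last dimension, the same two-layer network is applied independently at every position, which is what "position-wise" means here. A small illustration (not from the repo; dropout is disabled to make the check deterministic):

```python
import torch

ffn = PositionwiseFeedForward(d_model=512, hidden=2048, drop_prob=0.0)
ffn.eval()                                   # make the forward pass deterministic

x = torch.randn(2, 30, 512)
out = ffn(x)                                 # [2, 30, 512]

# feeding a single position alone gives the same result as feeding the whole sequence
single = ffn(x[:, :1, :])
print(out.shape, torch.allclose(out[:, :1, :], single))
```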

### 1.6 Encoder & Decoder Structure

![model](image/enc_dec.jpg)

```python
class EncoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x, src_mask):
        # 1. compute self attention
        _x = x
        x = self.attention(q=x, k=x, v=x, mask=src_mask)

        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)

        # 3. positionwise feed forward network
        _x = x
        x = self.ffn(x)

        # 4. add and norm
        x = self.dropout2(x)
        x = self.norm2(x + _x)
        return x
```
```python
class Encoder(nn.Module):

    def __init__(self, enc_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        max_len=max_len,
                                        vocab_size=enc_voc_size,
                                        drop_prob=drop_prob,
                                        device=device)

        self.layers = nn.ModuleList([EncoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

    def forward(self, x, src_mask):
        x = self.emb(x)

        for layer in self.layers:
            x = layer(x, src_mask)

        return x
```
```python
class DecoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.enc_dec_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm3 = LayerNorm(d_model=d_model)
        self.dropout3 = nn.Dropout(p=drop_prob)

    def forward(self, dec, enc, trg_mask, src_mask):
        # 1. compute self attention
        _x = dec
        x = self.self_attention(q=dec, k=dec, v=dec, mask=trg_mask)

        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)

        if enc is not None:
            # 3. compute encoder - decoder attention
            _x = x
            x = self.enc_dec_attention(q=x, k=enc, v=enc, mask=src_mask)

            # 4. add and norm
            x = self.dropout2(x)
            x = self.norm2(x + _x)

        # 5. positionwise feed forward network
        _x = x
        x = self.ffn(x)

        # 6. add and norm
        x = self.dropout3(x)
        x = self.norm3(x + _x)
        return x
```
```python
class Decoder(nn.Module):

    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        drop_prob=drop_prob,
                                        max_len=max_len,
                                        vocab_size=dec_voc_size,
                                        device=device)

        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

        self.linear = nn.Linear(d_model, dec_voc_size)

    def forward(self, trg, src, trg_mask, src_mask):
        trg = self.emb(trg)

        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)

        # pass to LM head
        output = self.linear(trg)
        return output
```
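The Encoder and Decoder above reference a `TransformerEmbedding` module and the `src_mask`/`trg_mask` tensors that this README does not show. The sketch below is my own reconstruction of that glue code under the same conventions (padding index assumed to be 1); treat it as an outline rather than the exact source in the repo.

```python
import torch
import torch.nn as nn


class TransformerEmbedding(nn.Module):
    """token embedding + sinusoidal positional encoding + dropout (reconstruction)"""

    def __init__(self, vocab_size, d_model, max_len, drop_prob, device):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=1)  # padding_idx assumed
        self.pos_emb = PositionalEncoding(d_model, max_len, device)
        self.drop_out = nn.Dropout(p=drop_prob)

    def forward(self, x):
        tok_emb = self.tok_emb(x)          # [batch, len, d_model]
        pos_emb = self.pos_emb(x)          # [len, d_model], broadcasts over the batch
        return self.drop_out(tok_emb + pos_emb)


class Transformer(nn.Module):
    """encoder-decoder wrapper that also builds the attention masks (reconstruction)"""

    def __init__(self, src_pad_idx, trg_pad_idx, enc_voc_size, dec_voc_size,
                 d_model, n_head, max_len, ffn_hidden, n_layers, drop_prob, device):
        super().__init__()
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        self.encoder = Encoder(enc_voc_size, max_len, d_model, ffn_hidden,
                               n_head, n_layers, drop_prob, device)
        self.decoder = Decoder(dec_voc_size, max_len, d_model, ffn_hidden,
                               n_head, n_layers, drop_prob, device)

    def make_src_mask(self, src):
        # 1 for real tokens, 0 for padding: [batch, 1, 1, src_len]
        return (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

    def make_trg_mask(self, trg):
        # padding mask AND lower-triangular "no peek" mask: [batch, 1, trg_len, trg_len]
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(3)
        trg_len = trg.size(1)
        no_peek = torch.tril(torch.ones(trg_len, trg_len, device=self.device)).bool()
        return trg_pad_mask & no_peek

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        return self.decoder(trg, enc_src, trg_mask, src_mask)
```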

## 2. Experiments

I use the Multi30K dataset to train and evaluate the model.
You can find details about the dataset [here](https://arxiv.org/abs/1605.00459).
I follow the original paper's parameter settings (shown below).
![conf](image/transformer-model-size.jpg)

### 2.1 Model Specification

* total parameters = 55,207,087
* model size = 215.7MB
* lr scheduling : ReduceLROnPlateau (wired up as sketched after the configuration list)

#### 2.1.1 configuration

* batch_size = 128
* max_len = 256
* d_model = 512
* n_layers = 6
* n_heads = 8
* ffn_hidden = 2048
* drop_prob = 0.1
* init_lr = 0.1
* factor = 0.9
* patience = 10
* warmup = 100
* adam_eps = 5e-9
* epoch = 1000
* clip = 1
* weight_decay = 5e-4
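For reference, this is how the optimizer and scheduler settings above map onto PyTorch. It is a sketch under the listed configuration, not the repo's training script; `model` is assumed to be the Transformer built earlier.

```python
import torch
from torch import nn, optim

init_lr, factor, patience = 0.1, 0.9, 10
adam_eps, weight_decay, clip = 5e-9, 5e-4, 1.0

optimizer = optim.Adam(model.parameters(), lr=init_lr,
                       weight_decay=weight_decay, eps=adam_eps)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=factor,
                                                 patience=patience)

# inside the training loop:
#   loss.backward()
#   nn.utils.clip_grad_norm_(model.parameters(), clip)   # clip = 1
#   optimizer.step()
# once per epoch, after computing the validation loss:
#   scheduler.step(valid_loss)
```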

### 2.2 Training Result

![image](saved/transformer-base/train_result.jpg)

* Minimum Training Loss = 2.852672759656864
* Minimum Validation Loss = 3.2048025131225586

| Model | Dataset | BLEU Score |
|:---:|:---:|:---:|
| Original Paper's | WMT14 EN-DE | 25.8 |
| My Implementation | Multi30K EN-DE | 26.4 |

## 3. Reference

- [Attention is All You Need, 2017 - Google](https://arxiv.org/abs/1706.03762)
- [The Illustrated Transformer - Jay Alammar](http://jalammar.github.io/illustrated-transformer/)
- [Data & Optimization Code Reference - Bentrevett](https://github.com/bentrevett/pytorch-seq2seq/)

## 4. Licence

Copyright 2019 Hyunwoong Ko.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.