language: bn
tags: bert, bengali, bengali-lm, bangla
license: mit
datasets: common_crawl, wikipedia, oscar

Bangla BERT Base

It has been a long journey, and here is our Bangla-Bert! It is now available in the Hugging Face model hub.

Bangla-Bert-Base is a pretrained language model for Bengali, trained with masked language modeling as described in BERT and its GitHub repository.

Pretrain Corpus Details

The corpus was downloaded from two main sources:

  • Bengali Common Crawl corpus (OSCAR)
  • Bengali Wikipedia dump

After downloading these corpora, we preprocessed them into the BERT input format: one sentence per line, with a blank line separating documents.

sentence 1
sentence 2

sentence 1
sentence 2
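
As a concrete illustration, here is a minimal preprocessing sketch. Splitting sentences on the Bengali danda (।) is a simplifying assumption; the actual pipeline may well have used a proper Bengali sentence tokenizer.

# Minimal sketch: turn raw Bengali documents into BERT's pretraining input
# format (one sentence per line, blank line between documents). Splitting on
# the danda (।) is an assumption, not necessarily the script we actually ran.
def to_bert_format(documents):
    lines = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split("।") if s.strip()]
        lines.extend(s + "।" for s in sentences)
        lines.append("")  # blank line marks a document boundary
    return "\n".join(lines)

docs = ["আমি বাংলায় গান গাই। আমি বাংলার গান গাই।"]
print(to_bert_format(docs))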

Building Vocab

We used the BNLP package to train a Bengali SentencePiece model with a vocabulary size of 102025, then converted the output vocab file into the BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and in the Hugging Face model hub.
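
BNLP wraps Google's sentencepiece library, so a roughly equivalent vocab training run can be sketched with sentencepiece directly. The corpus path and every option other than the vocab size are assumptions:

import sentencepiece as spm

# Sketch only: "corpus.txt" is a hypothetical path to the one-sentence-per-line
# pretraining text; options besides vocab_size are assumed, not the exact
# settings used for Bangla-Bert.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bn_spm",   # writes bn_spm.model and bn_spm.vocab
    vocab_size=102025,       # matches the vocab size reported above
)

The resulting bn_spm.vocab file still has to be rewritten into BERT's one-token-per-line vocab.txt format, as noted above.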

Training Details

  • Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
  • The currently released model follows the bert-base-uncased architecture (12 layers, 768 hidden units, 12 attention heads, 110M parameters); a config sketch follows this list
  • Total training steps: 1 million
  • The model was trained on a single Google Cloud GPU
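
For reference, the released architecture can be expressed as a transformers config. This is an illustration only, not the original TensorFlow training setup:

from transformers import BertConfig, BertForPreTraining

# The bert-base architecture named above, with Bangla-Bert's vocab size.
# Note that the larger vocabulary makes the embedding matrix (and hence the
# total parameter count) bigger than stock bert-base's 110M.
config = BertConfig(
    vocab_size=102025,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertForPreTraining(config)  # MLM + next-sentence-prediction heads
print(model.num_parameters())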

Evaluation Results

LM Evaluation Results

After training for 1 million steps, here are the evaluation results:

global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
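
The perplexity line above is simply the exponential of the evaluation loss:

import math

print(math.exp(2.2406516))  # 9.393331287442784, matching the value above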

Downstream Task Evaluation Results

  • Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to Nick Doiron for providing evaluation results for the classification task. He used the Bengali Classification Benchmark datasets. Compared to Nick's Bengali ELECTRA and to multilingual BERT, Bangla BERT Base achieves state-of-the-art results. Here is the evaluation script.

Model             Sentiment Analysis  Hate Speech Task  News Topic Task  Average
mBERT             68.15               52.32             72.27            64.25
Bengali Electra   69.19               44.84             82.33            65.45
Bangla BERT Base  70.37               71.83             89.19            77.13

We also evaluated Bangla-BERT-Base on the WikiANN Bengali NER dataset, alongside three other benchmark models (mBERT, XLM-R, and Indic-BERT). After training each model for 5 epochs, Bangla-BERT-Base placed third, with mBERT first and XLM-R second.

Base Pre-trained Model  F1 Score  Accuracy
mBERT-uncased           97.11     97.68
XLM-R                   96.22     97.03
Indic-BERT              92.66     94.74
Bangla-BERT-Base        95.57     97.49

All four models were fine-tuned with the transformers token-classification notebook. You can find all model evaluation results here.
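
A condensed sketch of that fine-tuning setup is below, assuming the WikiANN Bengali split available via the datasets library ("wikiann", "bn") and standard Trainer defaults; apart from the 5 epochs mentioned above, all hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

ds = load_dataset("wikiann", "bn")
labels = ds["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "sagorsarker/bangla-bert-base", num_labels=len(labels))

def tokenize_and_align(batch):
    # Tokenize pre-split words; copy each word's tag to its first subword and
    # mark special tokens / later subwords with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, lab = None, []
        for w in enc.word_ids(batch_index=i):
            lab.append(-100 if w is None or w == prev else tags[w])
            prev = w
        all_labels.append(lab)
    enc["labels"] = all_labels
    return enc

tokenized = ds.map(tokenize_and_align, batched=True)

args = TrainingArguments("bn-ner",
                         num_train_epochs=5,               # 5 epochs, as above
                         per_device_train_batch_size=16)   # batch size assumed
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()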

You can also check the paper list below; these works used this model on their datasets.

NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.

Limitations and Biases

How to Use

Bangla BERT Tokenizer

from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']

MASK Generation

You can use this model directly with a pipeline for masked language modeling:

from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
  print(pred)

# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}

Author

Sagor Sarker

Reference

Citation

If you find this model helpful, please cite:

@misc{Sagor_2020,
  title  = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
  author = {Sagor Sarker},
  year   = {2020},
  url    = {https://github.com/sagorbrur/bangla-bert}
}
