# KoBERT

**Repository Path**: vwangyanwei/KoBERT

## Basic Information

- **Project Name**: KoBERT
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-04
- **Last Updated**: 2025-12-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# KoBERT

* [KoBERT](#kobert)
  * [Korean BERT pre-trained cased (KoBERT)](#korean-bert-pre-trained-cased-kobert)
    * [Why?](#why)
    * [Training Environment](#training-environment)
    * [Requirements](#requirements)
    * [How to install](#how-to-install)
  * [How to use](#how-to-use)
    * [PyTorch](#pytorch)
    * [ONNX](#onnx)
    * [MXNet-Gluon](#mxnet-gluon)
    * [Tokenizer](#tokenizer)
  * [Task Fine-tuning](#task-fine-tuning)
    * [Naver Sentiment Analysis](#naver-sentiment-analysis)
    * [Korean Named Entity Recognizer Built with KoBERT and CRF](#korean-named-entity-recognizer-built-with-kobert-and-crf)
    * [Korean Sentence BERT](#korean-sentence-bert)
  * [Release](#release)
  * [Contacts](#contacts)
  * [License](#license)

---

## Korean BERT pre-trained cased (KoBERT)

### Why?

* Google's [BERT base multilingual cased](https://github.com/google-research/bert/blob/master/multilingual.md) has limited performance on Korean

### Training Environment

* Architecture

```python
predefined_args = {
    'attention_cell': 'multi_head',
    'num_layers': 12,
    'units': 768,
    'hidden_size': 3072,
    'max_length': 512,
    'num_heads': 12,
    'scaled': True,
    'dropout': 0.1,
    'use_residual': True,
    'embed_size': 768,
    'embed_dropout': 0.1,
    'token_type_vocab_size': 2,
    'word_embed': None,
}
```

* Training set

| Data             | Sentences | Words |
| ---------------- | --------- | ----- |
| Korean Wikipedia | 5M        | 54M   |

* Training environment
  * V100 GPU x 32, Horovod (with InfiniBand)

![TensorBoard log, 2019-04-29](imgs/2019-04-29_TensorBoard.png)

* Vocabulary
  * Size: 8,002
  * SentencePiece tokenizer trained on Korean Wikipedia
  * Fewer parameters (92M < 110M)

### Requirements

* see [requirements.txt](https://github.com/SKTBrain/KoBERT/blob/master/requirements.txt)

### How to install

* Install KoBERT as a Python package

  ```sh
  pip install git+https://git@github.com/SKTBrain/KoBERT.git@master
  ```

* If you want to modify the source code, clone this repository

  ```sh
  git clone https://github.com/SKTBrain/KoBERT.git
  cd KoBERT
  pip install -r requirements.txt
  ```

---

## How to use

### PyTorch

*If you prefer the Hugging Face transformers API, see [here](kobert_hf).*

```python
>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=)
```

`model` is returned in `eval()` mode by default, so call `model.train()` to switch it to training mode before using it for training.

* Naver Sentiment Analysis Fine-Tuning with PyTorch
  * In Colab, enabling the GPU hardware accelerator under [Runtime] - [Change runtime type] is recommended.
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SKTBrain/KoBERT/blob/master/scripts/NSMC/naver_review_classifications_pytorch_kobert.ipynb)

### ONNX

```python
>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids': np.array(input_ids),
...                             'token_type_ids': np.array(token_type_ids),
...                             'input_mask': np.array(input_mask),
...                             'position_ids': np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452,  0.24282141,  0.25895312, ..., -0.48613444,
        -0.07305173,  0.07560554],
       [-0.24783179,  0.24200465,  0.25520486, ..., -0.4877185 ,
        -0.0727044 ,  0.07536091],
       [-0.24721591,  0.24196623,  0.2560626 , ..., -0.48743123,
        -0.07326943,  0.07650235]], dtype=float32)
```

_Thanks to [soeque1](https://github.com/soeque1) for help with the ONNX conversion._

### MXNet-Gluon

```python
>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248  0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523  0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032  0.07650219]]
```

* Naver Sentiment Analysis Fine-Tuning with MXNet
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SKTBrain/KoBERT/blob/master/scripts/NSMC/naver_review_classifications_gluon_kobert.ipynb)

### Tokenizer

* Pretrained [SentencePiece](https://github.com/google/sentencepiece) tokenizer

```python
>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer_path
>>> tok_path = get_tokenizer_path()
>>> sp = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']
```

---

## Task Fine-tuning

### Naver Sentiment Analysis

* Dataset :

| Model                                                                                               | Accuracy                                                        |
| --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| [BERT base multilingual cased](https://github.com/google-research/bert/blob/master/multilingual.md) | 0.875                                                           |
| KoBERT                                                                                              | **[0.901](logs/bert_naver_small_512_news_simple_20190624.txt)** |
| [KoGPT2](https://github.com/SKT-AI/KoGPT2)                                                          | 0.899                                                           |

### Korean Named Entity Recognizer Built with KoBERT and CRF

```text
문장을 입력하세요: SKTBrain에서 KoBERT 모델을 공개해준 덕분에 BERT-CRF 기반 객체명인식기를 쉽게 개발할 수 있었다.
len: 40, input_token:['[CLS]', '▁SK', 'T', 'B', 'ra', 'in', '에서', '▁K', 'o', 'B', 'ER', 'T', '▁모델', '을', '▁공개', '해', '준', '▁덕분에', '▁B', 'ER', 'T', '-', 'C', 'R', 'F', '▁기반', '▁', '객', '체', '명', '인', '식', '기를', '▁쉽게', '▁개발', '할', '▁수', '▁있었다', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] 에서 모델을 공개해준 덕분에 기반 객체명인식기를 쉽게 개발할 수 있었다.[SEP]
```

### Korean Sentence BERT

|Model|Cosine Pearson|Cosine Spearman|Euclidean Pearson|Euclidean Spearman|Manhattan Pearson|Manhattan Spearman|Dot Pearson|Dot Spearman|
|:------------------------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
|NLI|65.05|68.48|68.81|68.18|68.90|68.20|65.22|66.81|
|STS|**80.42**|**79.64**|**77.93**|77.43|**77.92**|77.44|**76.56**|**75.83**|
|STS + NLI|78.81|78.47|77.68|**77.78**|77.71|**77.83**|75.75|75.22|

---

## Release

* v0.2.4
  * Large files are now downloaded from the Hugging Face Hub
* v0.2.3
  * Added support for `onnx 1.8.0`
* v0.2.2
  * Bug fix: `No module named 'kobert.utils'`
* v0.2.1
  * Fixed import statements
* v0.2
  * Large files are now downloaded from `aws s3`
  * Renamed functions
* v0.1.2
  * Fixed compatibility with the transformers library
  * Fixed the index of the pad token
* v0.1.1
  * Merged the vocabulary and the tokenizer
* v0.1
  * Initial model release

## Contacts

Please file `KoBERT`-related issues [here](https://github.com/SKTBrain/KoBERT/issues).

## License

`KoBERT` is released under the `Apache-2.0` license. Please comply with the license terms when using the model and code. The full license text is available in the `LICENSE` file.
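
## Appendix: Parameter Count Check

The 92M figure quoted under Training Environment can be roughly reproduced from the architecture settings and the vocabulary size. The sketch below is only an estimate: it assumes the standard BERT encoder layout (learned position embeddings, LayerNorm after the embeddings and after each sub-layer, biases on every dense layer, and a pooler), and it leaves out any pre-training heads. The helper names `dense`, `layer_norm`, and `count_parameters` are made up for illustration and are not part of the `kobert` package.

```python
# Values copied from `predefined_args` and the vocabulary size in this README.
VOCAB_SIZE = 8002
UNITS = 768          # 'units' / 'embed_size'
FFN_HIDDEN = 3072    # 'hidden_size'
NUM_LAYERS = 12      # 'num_layers'
MAX_LENGTH = 512     # 'max_length'
TOKEN_TYPES = 2      # 'token_type_vocab_size'


def dense(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Gain plus bias of a LayerNorm."""
    return 2 * dim


def count_parameters():
    # Assumption: standard BERT layout, without pre-training heads.
    embeddings = (VOCAB_SIZE + MAX_LENGTH + TOKEN_TYPES) * UNITS + layer_norm(UNITS)
    # Query, key, value and output projections, followed by LayerNorm.
    attention = 4 * dense(UNITS, UNITS) + layer_norm(UNITS)
    # Two-layer feed-forward network, followed by LayerNorm.
    ffn = dense(UNITS, FFN_HIDDEN) + dense(FFN_HIDDEN, UNITS) + layer_norm(UNITS)
    pooler = dense(UNITS, UNITS)
    return embeddings + NUM_LAYERS * (attention + ffn) + pooler


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the word embeddings account for only about 6M parameters (8,002 x 768), so the small vocabulary is the main reason the total stays near 92M.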