# GilBERTo: An Italian pretrained language model based on RoBERTa

**GilBERTo** is an **Italian** pretrained language model based on the [Facebook RoBERTa architecture](https://arxiv.org/abs/1907.11692) and the [CamemBERT](https://www.researchgate.net/publication/337183733_CamemBERT_a_Tasty_French_Language_Model) text tokenization approach.

The model was trained with the subword masking technique for 100k steps on ~71 GB of **Italian text** containing 11,250,012,896 words ([OSCAR](https://traces1.inria.fr/oscar/): **O***pen* **S***uper-large* **C***rawled* **A***LMAnaCH* *co***R***pus*). We used a vocabulary of 32k BPE subwords, generated with the [SentencePiece](https://github.com/google/sentencepiece) tokenizer.
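As a rough illustration of how such a subword vocabulary can be built (a minimal sketch, not the authors' exact command: the corpus file name and every setting other than the 32k BPE vocabulary size are assumptions):

```python
import sentencepiece as spm

# Train a 32k BPE subword vocabulary on a plain-text corpus dump.
# "oscar_it.txt" is a placeholder for the pre-processed Italian OSCAR text.
spm.SentencePieceTrainer.train(
    input="oscar_it.txt",
    model_prefix="gilberto_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# Load the trained model and split a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="gilberto_bpe.model")
print(sp.encode("Io sono italiano e mi chiamo GilBERTo!", out_type=str))
```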
GilBERTo was evaluated on different downstream tasks, comparing it with [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other (non-BERT-based) models. More specifically, the comparison was carried out on the following tasks:

* **P**art-**o**f-**S**peech tagging
* **N**amed **E**ntity **R**ecognition

## Download

**GilBERTo** is available through both the [huggingface/transformers](https://github.com/huggingface/transformers) and [pytorch/fairseq](https://github.com/pytorch/fairseq) libraries.

Model | Library | Download
---|:---:|:---:
`GilBERTo-uncased-from-camembert` | *pytorch/fairseq* | [GilBERTo-uncased-fairseq.v1.zip](https://drive.google.com/uc?id=13kk8EfnJ2wVYLehPSjoskG2swU0Ab3Ch&export=download)
`GilBERTo-uncased-from-camembert` | *huggingface/transformers* | [GilBERTo-uncased-transformers.v1.zip](https://drive.google.com/uc?id=1hokQynDBnI361rJc4UBSgtP56ZZKRWf2&export=download)

## Results

We are in the drafting phase of the paper that will include all the details (*coming soon*). To the best of our knowledge, downstream task applications are limited by the lack of datasets available for Italian. **We strongly encourage everyone to contribute to the repository in order to improve the Italian NLP SOTA.** We will be happy to provide support. We currently selected the following tasks based on what we found in the Italian state of the art:

### PoS Tagging

The PoS task was evaluated using the accuracy metric on two different Italian datasets: [Italian ParTUT](https://universaldependencies.org/treebanks/it_partut/index.html) and [Italian ISDT](https://universaldependencies.org/treebanks/it_isdt/index.html). We also compared the results with the [**UDPipe** and **UDify**](https://arxiv.org/pdf/1904.02099.pdf) models.

Model | Italian ParTUT | Italian ISDT
:---:|:---:|:---:
UDPipe | 98.4 | 98.4
UDify | 98.2 | 98.5
mBERT | 98.0 | 98.5
GilBERTo | **98.8** | **98.6**

### Named Entity Recognition

The NER task was evaluated using the [WikiNER Italian dataset](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500), already used by the [spaCy pretrained model for Italian](https://spacy.io/models/it), which achieves `F1: 86.40; Precision: 86.73; Recall: 86.08`.

Model | F1 | Precision | Recall
:---:|:---:|:---:|:---:
mBERT | 92.2 | 92.1 | 92.3
GilBERTo | **92.7** | **92.7** | **92.8**

## How to use

You can use **GilBERTo** with the latest version of the [huggingface/transformers](https://github.com/huggingface/transformers) or [pytorch/fairseq](https://github.com/pytorch/fairseq) Python libraries.

### huggingface/transformers

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModel.from_pretrained("idb-ita/gilberto-uncased-from-camembert")

# Encode a sentence into subword ids (special tokens are added automatically)
input_ids = torch.tensor(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!")).unsqueeze(0)
#>> tensor([[5, 755, 181, 1413, 25, 155, 12513, 14397, 16247, 31976, 6]])

# Inspect the corresponding SentencePiece subword tokens
token_list = tokenizer.convert_ids_to_tokens(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!"))
#>> ['<s>', '▁io', '▁sono', '▁italiano', '▁e', '▁mi', '▁chiamo', '▁gil', 'berto', '!', '</s>']
```

### pytorch/fairseq

```
$ pip install fairseq
```

```python
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# Load GilBERTo with the pytorch/fairseq library
gilberto_model = FairseqRobertaModel.from_pretrained('path/to/checkpoints_folder', bpe='sentencepiece')

# Mask prediction: fill the <mask> token with the top 3 candidates
gilberto_model.fill_mask('Buongiorno mi <mask> Gilberto!', topk=3)
# Outputs
# [('Buongiorno mi chiamo Gilberto!', 0.5044017434120178, ' chiamo'),
#  ('Buongiorno mi presento Gilberto!', 0.05189879611134529, ' presento'),
#  ('Buongiorno mi sento Gilberto!', 0.022937586531043053, ' sento')]

# Other examples
# Input:  `È più facile per un italiano gesticolare senza <mask> che parlare senza gesticolare.`
# Output: `È più facile per un italiano gesticolare senza parlare che parlare senza gesticolare.`
#
# Input:  `Agli italiani piace pasta, <mask> e mandolino`
# Output: `Agli italiani piace pasta, pizza e mandolino`
#
# Input:  `Chi dice che il denaro non fa la <mask>, oltre a essere antipatico, è pure fesso.`
# Output: `Chi dice che il denaro non fa la felicità, oltre a essere antipatico, è pure fesso.`
#
# Input:  `Era un uomo così antipatico che dopo la sua <mask> i parenti chiesero il bis`
# Output: `Era un uomo così antipatico che dopo la sua morte i parenti chiesero il bis`
```
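Continuing the fairseq example above, the loaded model can also be used as a sentence encoder (a short sketch; the example sentence is arbitrary and the printed shape depends on the checkpoint's hidden size):

```python
# Reuse `gilberto_model` loaded in the previous snippet.
gilberto_model.eval()  # disable dropout for deterministic features

# Encode a sentence into subword ids and extract the final-layer features.
tokens = gilberto_model.encode('Io sono italiano e mi chiamo GilBERTo!')
features = gilberto_model.extract_features(tokens)
print(features.shape)  # (1, sequence_length, hidden_size)
```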
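The PoS tagging and NER results reported earlier come from fine-tuning the pretrained model on labelled data. The sketch below is only a hypothetical illustration of that setup with transformers, not the authors' training recipe: the label set, example sentence, and dummy labels are invented, and a real run would iterate over the ParTUT/ISDT or WikiNER training splits with an optimizer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical tag set (placeholder; use the dataset's actual labels).
labels = ["O", "PER", "LOC", "ORG", "MISC"]

tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModelForTokenClassification.from_pretrained(
    "idb-ita/gilberto-uncased-from-camembert",
    num_labels=len(labels),  # adds a randomly initialised token-classification head
)

# A single toy training step on one sentence with dummy (all-"O") labels.
input_ids = tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!", return_tensors="pt")
dummy_labels = torch.zeros_like(input_ids)
outputs = model(input_ids, labels=dummy_labels)
outputs.loss.backward()  # plug in an optimizer step here for real fine-tuning
```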
## Contacts

**Giulio Ravasio**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/GiulioRavasio) | [Github](https://github.com/giuliorav)

**Leonardo Di Perna**: [Linkedin](https://www.linkedin.com/in/leonardo-di-perna/) | [Twitter](https://twitter.com/Leodipe94) | [Github](https://github.com/LeoDeep)

## References

* [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
* [SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing](https://www.aclweb.org/anthology/D18-2012/)
* [Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures](https://hal.inria.fr/hal-02148693)
* [CamemBERT: a Tasty French Language Model](https://www.researchgate.net/publication/337183733_CamemBERT_a_Tasty_French_Language_Model)
* [Learning multilingual named entity recognition from Wikipedia](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500)
* [75 Languages, 1 Model: Parsing Universal Dependencies Universally](https://arxiv.org/abs/1904.02099)