# RNN-POS-Tagger-TLE

**Repository Path**: greitzmann/RNN-POS-Tagger-TLE

## Basic Information

- **Project Name**: RNN-POS-Tagger-TLE
- **Description**: Implement RNNs by PyTorch for automatic POS tagging
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-01-20
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Part-of-speech tagging for Treebank of Learner English corpora with Recurrent Neural Networks

## Motivation

> Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [Wikipedia](https://en.wikipedia.org/wiki/Part-of-speech_tagging)

POS tagging is a building block for many NLP/NLU tasks, such as Named Entity Recognition (NER) and Abstract Meaning Representation (AMR). In this project, I want to explore state-of-the-art Recurrent Neural Network (RNN) based models for POS tagging. The following are the candidate models:

- Long Short-Term Memory (LSTM)
- Bidirectional LSTM (BI-LSTM)
- LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF)
- Bidirectional LSTM with a CRF layer (BI-LSTM-CRF)

I will apply the above models to two tasks:

1. Continuous POS tagging with RNNs
2. POS resemblance between learners with different native language backgrounds

(**Update 2018/04/18: task 2 is added**)

(**Update 2018/04/14: the BI-LSTM is added**)

(**Update 2018/04/12: the basic LSTM and task 1 are added**)

## Dataset

> UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated both in its original and error corrected forms.
> The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly to a training set of 4,124 sentences, development set of 500 sentences and a test set of 500 sentences. Further information is available at [esltreebank.org](http://esltreebank.org).

Citation: (Berzak et al., 2016; Yannakoudakis et al., 2011)

### Data Loader

I've built a data loader for this dataset. To use it, you first need to install the [CoNLL-U Parser](https://github.com/EmilStenstrom/conllu) built by [Emil Stenström](https://github.com/EmilStenstrom). The following example shows how to use `data_loader`:

```python
import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

# Each split comes in two versions: the learner's original text and the error-corrected text.
train_meta, train_meta_corrected, \
    dev_meta, dev_meta_corrected, \
    test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
    dev_data, dev_data_corrected, \
    test_data, test_data_corrected = data_list
```

### Metadata

- doc_id: filename (also learner ID) of the original xml file
- sent: raw text of the sentence written by the learner, including the error correction tags
- native_language: native language of the learner
- age_range: age range of the learner
- score: exam score of the learner

Some observations:

- "native_language" enables us to design tasks related to native language identification.
- "age_range" enables us to identify the learner's age based on his/her writing style.
- "score" can help us group learners into categories, such as Beginner, Intermediate, Expert, Fluent, Proficient. It enables us to discover the writing style and common mistakes of different groups of learners.

```python
train_meta.head()
```

| id | doc_id | sent | errors | native_language | age_range | score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | doc2664 | `I was <ns type="S"><i>shoked</i><c>shocked</c>...` | `{'S': 2, 'RV': 1}` | Russian | 21-25 | 21.0 |
| 2 | doc648 | `I am very sorry to say it was definitely not a...` | `{'RT': 1, 'MT': 1}` | French | 26-30 | 38.0 |
| 3 | doc1081 | `Of course, I became aware of her feelings sinc...` | `{'AGQ': 1}` | Spanish | 16-20 | 36.0 |
| 4 | doc724 | `I also suggest that more plays and films shoul...` | `{'RV': 1, 'FV': 1}` | Japanese | 21-25 | 33.0 |
| 5 | doc567 | `Although my parents were very happy <ns type="...` | `{'FD': 1, 'RJ': 1, 'RT': 1, 'MT': 1}` | Spanish | 31-40 | 34.0 |
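
The `errors` column appears to count the error types marked by `<ns type="...">` tags in `sent`. As a rough illustration of that relationship, the sketch below counts opening tags with a regular expression. It is not taken from `data_loader`: the helper name `count_error_types` and the example sentence are made up, and the closing `</ns>` tag is assumed from the markup visible in the table.

```python
import re
from collections import Counter

# Matches the opening tag of an error annotation, e.g. <ns type="S">.
ERROR_TAG = re.compile(r'<ns type="([^"]+)">')


def count_error_types(sent):
    """Count how often each error type occurs in an annotated sentence."""
    return dict(Counter(ERROR_TAG.findall(sent)))


# Made-up sentence in the same markup style (not taken from the corpus).
example = ('I was <ns type="S"><i>shoked</i><c>shocked</c></ns> because '
           'I <ns type="RV"><i>did</i><c>made</c></ns> a '
           '<ns type="S"><i>mistaik</i><c>mistake</c></ns>.')
counts = count_error_types(example)
```

Counting only opening tags keeps the sketch simple and still works if annotations are nested.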

### Sentence Format

In this project, we will only use "form" (words) and "upostag" (part-of-speech tags).

```python
train_data[0]
```

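| id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc | meta_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | I | _ | PRON | PRP | None | 3 | nsubj | None | None | 1 |
| 2 | was | _ | VERB | VBD | None | 3 | cop | None | None | 1 |
| 3 | shoked | _ | ADJ | JJ | None | 0 | root | None | None | 1 |
| 4 | because | _ | SCONJ | IN | None | 8 | mark | None | None | 1 |
| 5 | I | _ | PRON | PRP | None | 8 | nsubj | None | None | 1 |
| 6 | had | _ | AUX | VBD | None | 8 | aux | None | None | 1 |
| 7 | alredy | _ | ADV | RB | None | 8 | advmod | None | None | 1 |
| 8 | spoken | _ | VERB | VBN | None | 3 | advcl | None | None | 1 |
| 9 | with | _ | ADP | IN | None | 10 | case | None | None | 1 |
| 10 | them | _ | PRON | PRP | None | 8 | nmod | None | None | 1 |
| 11 | and | _ | CONJ | CC | None | 8 | cc | None | None | 1 |
| 12 | I | _ | PRON | PRP | None | 14 | nsubj | None | None | 1 |
| 13 | had | _ | AUX | VBD | None | 14 | aux | None | None | 1 |
| 14 | taken | _ | VERB | VBN | None | 8 | conj | None | None | 1 |
| 15 | two | _ | NUM | CD | None | 16 | nummod | None | None | 1 |
| 16 | autographs | _ | NOUN | NNS | None | 14 | dobj | None | None | 1 |
| 17 | . | _ | PUNCT | . | None | 3 | punct | None | None | 1 |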
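
To make the "form"/"upostag" selection concrete, here is a minimal sketch that pulls (word, tag) pairs out of raw CoNLL-U text using only the standard library. It is a stand-in for illustration, not the project's `data_loader` and not the CoNLL-U Parser API: the function name `read_conllu_sentences` and the `# sent_id` comment line in the sample are assumptions, while the three token rows reuse values from the table above.

```python
def read_conllu_sentences(text):
    """Yield one list of (form, upostag) pairs per sentence."""
    sentence = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not stripped.startswith("#"):
            columns = stripped.split("\t")
            # Keep plain token rows only; skip multiword ranges (1-2) and empty nodes (1.1).
            if columns[0].isdigit():
                # Column 2 is the word form, column 4 is the universal POS tag.
                sentence.append((columns[1], columns[3]))
    if sentence:
        yield sentence


sample = "\n".join([
    "# sent_id = example",  # hypothetical comment line
    "\t".join(["1", "I", "_", "PRON", "PRP", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "was", "_", "VERB", "VBD", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "shoked", "_", "ADJ", "JJ", "_", "0", "root", "_", "_"]),
    "",
])
pairs = list(read_conllu_sentences(sample))
```

Each yielded sentence is a list of pairs, which maps directly onto the input sequence (words) and target sequence (tags) of a sequence tagger.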

## RNN Models

In this project, we mainly use [PyTorch](http://pytorch.org/) to implement the RNN models. The following models are implemented so far:

### Long Short-Term Memory (LSTM)

> Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory)

The following is the high-level architecture of the LSTM model:

![Task1_LSTM](figures/task1-w2v-lstm.png)

### Bidirectional LSTM (BI-LSTM)

The BI-LSTM model is derived from the Bidirectional RNN (BRNN) (Schuster and Paliwal, 1997).

> The principle of BRNN is to split the neurons of a regular RNN into two directions, one for positive time direction (forward states), and another for negative time direction (backward states). Those two states’ output are not connected to inputs of the opposite direction states. By using two time directions, input information from the past and future of the current time frame can be used unlike standard RNN which requires the delays for including future information. [Wikipedia](https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks)

The BI-LSTM is based on the BRNN but replaces the RNN units with LSTM units. The following is the high-level architecture of the BI-LSTM model:

![Task1_BILSTM](figures/task1-w2v-bi-lstm.png)

## Task 1: Continuous POS tagging with RNNs

### Architecture

In this task, a POS tagger was trained on all the train data (4,124 sentences), validated on the dev data (500 sentences), and tested on the test data (500 sentences).
The following is the architecture:

![Task1 Architecture](figures/task1-arch.png)

### Word Features

We use the pre-trained [Word2Vec model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) built from the Google News corpus (3 million 300-dimension English word vectors). Although it might not be the best choice (e.g. the Google News corpus might not be representative of English learner text), it is still a legitimate choice:

1. It saves me from building a large dictionary that covers all words in the UD English-ESL/TLE corpus.
2. It saves the time and computing resources needed to build large, sparse unigram vectors for words, and I don't need to worry about dimension reduction for now.
3. A 300-dim w2v vector is small enough for this task, and the dimension is fixed, so the vector can be fed directly into a neural network.
4. It's free and available on Google Drive :).

### Experiments

#### Performance

The dataset was divided into train, dev, and test sets. We used the train and dev sets to observe the fluctuation of accuracy and loss during the training process of 1000 epochs. There are 17 different POS tags in this experiment. A prediction is counted as correct only if it is the same as the actual POS tag. The RNNs are optimized with Stochastic Gradient Descent (SGD) using different learning rates (lr). The loss function is Cross Entropy Loss.

The following is the best performance after 100 epochs:

- lr = 0.5

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
| ------------- |:-------------|:-------------|:-------------:|
| LSTM | 89.28% | 83.90% | 83.31% |
| BI-LSTM | 93.25% | 88.00% | 88.00% |

- lr = 0.1

| Model | Train Accuracy | Dev Accuracy | Test Accuracy |
| ------------- |:-------------|:-------------|:-------------:|
| LSTM | 73.77% | 71.86% | 70.9% |
| BI-LSTM | 78.37% | 76.17% | 75.62% |

The BI-LSTM model consistently performs better than the LSTM model and achieves 88.00% test accuracy (lr=0.5).
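
The setup described above (300-dimensional word vectors in, 17 tags out, SGD, cross-entropy loss, accuracy as the share of exactly matching tags) could be sketched in PyTorch roughly as follows. This is not the repository's code: the names `BiLSTMTagger`, `train_step` and `token_accuracy`, the hidden size of 64, and the random stand-in inputs are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

NUM_TAGS = 17     # number of POS tags in this experiment
EMBED_DIM = 300   # dimension of the pre-trained Word2Vec vectors
HIDDEN_DIM = 64   # assumption: the hidden size is not stated in this README


class BiLSTMTagger(nn.Module):
    """Maps a sequence of word vectors to per-token tag scores."""

    def __init__(self, embed_dim=EMBED_DIM, hidden_dim=HIDDEN_DIM,
                 num_tags=NUM_TAGS, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=bidirectional)
        # A bidirectional LSTM concatenates forward and backward states.
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.to_tags = nn.Linear(out_dim, num_tags)

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(word_vectors)
        return self.to_tags(hidden)  # (batch, seq_len, num_tags)


def train_step(model, optimizer, loss_fn, word_vectors, gold_tags):
    """Run one SGD update on one batch and return the loss value."""
    model.train()
    optimizer.zero_grad()
    scores = model(word_vectors)
    # CrossEntropyLoss expects (N, num_tags) scores and (N,) targets.
    loss = loss_fn(scores.reshape(-1, scores.size(-1)), gold_tags.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


def token_accuracy(model, word_vectors, gold_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    model.eval()
    with torch.no_grad():
        predicted = model(word_vectors).argmax(dim=-1)
    return (predicted == gold_tags).float().mean().item()


torch.manual_seed(0)
model = BiLSTMTagger()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.CrossEntropyLoss()

# Random stand-ins for one 17-token sentence; real inputs would be Word2Vec vectors.
sentence = torch.randn(1, 17, EMBED_DIM)
tags = torch.randint(0, NUM_TAGS, (1, 17))

losses = [train_step(model, optimizer, loss_fn, sentence, tags) for _ in range(5)]
accuracy = token_accuracy(model, sentence, tags)
```

Passing `bidirectional=False` gives the plain LSTM variant with the same interface, which makes the two models easy to compare under identical training settings.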
#### Parameter Tuning

The following are the train/dev accuracy and loss over 100 epochs:

- lr = 0.5

![Task1_Accu_lr0.5](figures/lstm_lr-0.5_accu_comparison.png)

![Task1_Loss_lr0.5](figures/lstm_lr-0.5_loss_comparison.png)

- lr = 0.1

![Task1_Accu_lr0.1](figures/lstm_lr-0.1_accu_comparison.png)

![Task1_Loss_lr0.1](figures/lstm_lr-0.1_loss_comparison.png)

According to the figures above, neither the LSTM nor the BI-LSTM shows apparent overfitting. The BI-LSTM learned faster and reached higher accuracy than the LSTM model.

## Task 2: POS resemblance between learners with different native language background

In this task, I would like to discover the POS resemblance between learners with different native language backgrounds. The basic hypothesis is that a person's writing style in English is subconsciously influenced by the grammar of his/her native language. For example, the basic sentence structure in English is (Subject+Verb+Object), but in Japanese it is (Subject+Object+Verb). Moreover, some languages do not have strict rules about the grammatical order of words, but they have abundant morphemes to construct sentences.

In the following experiments, we use the train data in the dataset. Here are some stats of the train data regarding the learners' native language background.
```python
import data_loader
import pandas as pd

# Load only the train split (original and error-corrected versions).
meta_list, data_list = data_loader.load_data(load_train=True, load_dev=False, load_test=False)
train_meta, train_meta_corrected = meta_list
train_data, train_data_corrected = data_list
```

```python
languages = train_meta["native_language"].unique()
print("# of Sentence: {}".format(len(train_meta)))

# Number of sentences per native language
print("Sentence distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]))
stats_df = pd.DataFrame(stats, columns=["# of sentences"], index=languages)
print(stats_df)

# Number of distinct authors (doc_id) per native language
print("Author distribution:")
stats = []
for language in languages:
    stats.append(len(train_meta[train_meta["native_language"]==language]["doc_id"].unique()))
stats_df = pd.DataFrame(stats, columns=["# of authors"], index=languages)
print(stats_df)

# Summary statistics of the exam score per native language
stats = []
print("Exam score stats:")
for language in languages:
    stats.append(train_meta[train_meta["native_language"]==language]["score"].describe()[['count', 'mean', 'std', 'max', 'min']])
stats_df = pd.DataFrame(stats, index=languages)
print(stats_df)
```

`# of Sentence: 4124`

Sentence distribution:

| | # of sentences |
| --- | --- |
| Russian | 427 |
| French | 401 |
| Spanish | 428 |
| Japanese | 407 |
| Chinese | 414 |
| Turkish | 404 |
| Portuguese | 407 |
| Korean | 413 |
| German | 400 |
| Italian | 423 |

Author distribution:

| | # of authors |
| --- | --- |
| Russian | 81 |
| French | 131 |
| Spanish | 175 |
| Japanese | 81 |
| Chinese | 66 |
| Turkish | 73 |
| Portuguese | 68 |
| Korean | 84 |
| German | 69 |
| Italian | 76 |

Exam score stats:

| | count | mean | std | max | min |
| --- | --- | --- | --- | --- | --- |
| Russian | 427.0 | 26.288056 | 6.179166 | 40.0 | 9.0 |
| French | 401.0 | 27.630923 | 4.666738 | 40.0 | 17.0 |
| Spanish | 428.0 | 26.789720 | 5.349402 | 40.0 | 11.0 |
| Japanese | 407.0 | 27.547912 | 5.040432 | 39.0 | 15.0 |
| Chinese | 414.0 | 26.268116 | 6.210832 | 40.0 | 14.0 |
| Turkish | 404.0 | 27.834158 | 5.494389 | 39.0 | 7.0 |
| Portuguese | 407.0 | 27.791155 | 4.963723 | 39.0 | 11.0 |
| Korean | 413.0 | 25.980630 | 6.019355 | 40.0 | 12.0 |
| German | 400.0 | 27.725000 | 5.880546 | 40.0 | 13.0 |
| Italian | 423.0 | 28.699764 | 4.388392 | 38.0 | 20.0 |

We train BI-LSTM models (500 epochs, SGD learning rate=0.5)
on the sentences of each native language group separately, and then test the tagging accuracy on the sentences of the other groups. For example, we train a POS tagger on sentences written by learners with a Japanese native language background, and use the tagger to tag sentences written by learners with other native language backgrounds. The following are the results of POS tagging accuracy.

![Task2_Stats](figures/task2-stats.png)

The diagonal numbers show how well the models fit their training data. Although the results show that some models learned faster than others, so far there is no significant evidence that any pair of languages is more or less similar to each other in terms of POS resemblance. However, under the same experiment settings, we still learned something from the results:

- Models trained on learners with Chinese, Portuguese, Korean and German native language backgrounds learn faster and perform better in POS tagging.
- For some pairs of languages, there is a larger difference between (train on language A -> test on language B) and (train on language B -> test on language A).

## References

1. Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., ... & Katz, B. (2016). Universal dependencies for learner English. arXiv preprint arXiv:1605.04278.
2. Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011, June). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 180-189). Association for Computational Linguistics.
3. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.