# language-difficulty-control

**Repository Path**: alibaba/language-difficulty-control

## Basic Information

- **Project Name**: language-difficulty-control
- **License**: Apache-2.0
- **Default Branch**: main
- **Created**: 2026-01-27
- **Last Updated**: 2026-01-30

## README

# Controlling Language Difficulty in Dialogues with Linguistic Features

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2509.14545-b31b1b.svg)](https://arxiv.org/abs/2509.14545)

> **Dilaprix**: A metric for quantifying and regulating language difficulty in dialogues using linguistic features.

## 📌 Introduction

The **Dialogue Language Proficiency Index (Dilaprix)** is a composite metric that evaluates the linguistic complexity of dialogue utterances based on three categories of features:

- **Readability features** (e.g., Flesch-Kincaid Grade Level)
- **Syntactic features** (e.g., syntactic tree depth)
- **Lexical features** (e.g., simple word ratio)

Dilaprix enables fine-grained control over language difficulty, which is useful for educational dialogue systems, accessibility tools, and language learning applications.

### 📊 Example: Utterances vs. Dilaprix Scores

| Utterance | Dilaprix |
|----------|--------|
| *Thank you for coming, Lily. Do you like meat?* | 0.08 |
| *Thank you for coming, Lily. I appreciate your help in the kitchen. To start with, do you like meat?* | 0.30 |
| *Thank you for coming, Lily. I appreciate your help in the kitchen. To better understand your preferences, may I ask: do you like meat?* | 0.55 |
| *Ah, excellent, Lily, for you to grace us with your presence in the kitchen. Now, to delve into a gastronomical inquiry: do you have an affinity for meat?* | 0.81 |

> 🔍 **Lower Dilaprix = simpler language**; **Higher Dilaprix = more complex language**

---

## 🧠 Linguistic Features

Dilaprix integrates the following 11 features:

### Readability

- **Flesch Reading Ease** ($F_R$): Higher = easier to read.
- **Flesch-Kincaid Grade Level** ($F_G$): US grade level estimate.
- **Gunning Fog Index** ($G_F$): Based on sentence length and complex words (≥3 syllables).
- **Coleman-Liau Index** ($C_L$): Uses character counts instead of syllables.

### Syntax

- **Tree Depth** ($T_D$): Maximum depth of the syntactic parse trees.
- **Leaf Node Count** ($L_N$): Maximum number of leaf nodes in any sentence.
- **Non-terminal Diversity** ($N_D$): Number of unique non-terminal tags in the parse trees.
- **Subtree Complexity** ($S_C$): Maximum number of sub-trees per sentence.
- **Utterance Length** ($U_L$): Total number of tokens.

### Lexicon

- **Simple Word Ratio** ($S_W$): Proportion of words in a simple vocabulary list.
- **Intermediate Word Ratio** ($I_W$): Proportion of words in an intermediate vocabulary list.

---

## 📐 Dilaprix Formula

The final score is computed as:

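A form consistent with the definitions that follow is an equal-weight average of percentile-normalized features, with the inversely related features flipped. The averaging step and the symbol $x_i$ (the raw value of feature $i$) are assumptions made for this sketch; neither is stated in this section.

$$
\text{Dilaprix} = \frac{1}{|\mathcal{X}|} \sum_{i \in \mathcal{X}} s_i,
\qquad
s_i =
\begin{cases}
1 - \text{clamp}\left(\dfrac{x_i - \alpha_i}{\beta_i - \alpha_i},\, 0,\, 1\right), & i \in \mathcal{X}' \\
\text{clamp}\left(\dfrac{x_i - \alpha_i}{\beta_i - \alpha_i},\, 0,\, 1\right), & \text{otherwise}
\end{cases}
$$
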
Where:

- $\mathcal{X} = \{F_R, F_G, G_F, C_L, T_D, L_N, N_D, S_C, U_L, S_W, I_W\}$: the full set of features
- $\mathcal{X}' = \{F_R, S_W, I_W\}$: features **inversely** related to difficulty
- $\alpha_i$, $\beta_i$: 5th and 95th percentiles of feature $i$ in a textbook dialogue corpus (used for robust normalization)
- $\text{clamp}(v, 0, 1)$: ensures the output stays in $[0, 1]$

---

## 🚀 Get Started

### Installation

```
cd language_difficulty_control
pip install -e .
```

### Usage

```python
from language_difficulty_control import LinguisticAnalyzer

# Build the analyzer once and reuse it across utterances.
analyzer = LinguisticAnalyzer()

# Calling the analyzer returns a mapping of features; the composite
# score is stored under the "dilaprix" key.
features = analyzer("Hello! How are you today?")
dilaprix = features["dilaprix"]
print(f"Dilaprix: {dilaprix:.2f}")
```

### Output

```
Dilaprix: 0.06
```

### Language Proficiency Controlled Dialogue Prompt Example

```text
[flesch_reading_ease] for the Flesch-Kincaid Reading Ease;
[flesch_kincaid_grade_level] for Flesch-Kincaid Grade Level;
[gunning_fog] for the Gunning Fog Index;
[coleman_liau] for the Coleman Liau Index;
[tree_depth] The max Depth of the Constituency Parsing Trees of the sentences in your response;
[leaf_node_count] The max number of leaf nodes of the Constituency Parsing Trees of the sentences in your response;
[non_terminal_diversity] The max number of unique tags of the Constituency Parsing Trees of the sentences in your response;
[subtree_complexity] The max number of sub-trees of the Constituency Parsing Trees of the sentences in your response;
[utterance_length] the number of words in your response;
[simple_words_ratio] the ratio of simple words in your response;
[intermediate_words_ratio] the ratio of simple and intermediate words in your response.

You are given a context and dialogue tasks, and are asked to play a role to continue the following conversation naturally.

[DIALOGUE TASKS]
1. Ask Anna if she can play the piano
2. Ask Anna if she can ride a bike

[CURRENT DIALOGUE TASK]
2. Ask Anna if she can ride a bike

[CONTEXT]
Ming: Hi Anna, can you play the piano?
Anna: Yes, I can.

Your reply should consist of two parts:
1. First part should respond to the user kindly based on the context;
2. Second part should carry out the [CURRENT DIALOGUE TASK].

Additionally, your response should abide by the following linguistic features:
[flesch_reading_ease] 86.42
[flesch_kincaid_grade_level] 3.07
[gunning_fog] 3.0
[coleman_liau] 2.99
[tree_depth] 9
[leaf_node_count] 10
[non_terminal_diversity] 14
[subtree_complexity] 22
[utterance_length] 18
[simple_words_ratio] 0.8
[intermediate_words_ratio] 1.0
```

## 📚 Citation

```
@misc{xu2025controllinglanguagedifficultydialogues,
  title={Controlling Language Difficulty in Dialogues with Linguistic Features},
  author={Shuyao Xu and Wenguang Wang and Handong Gao and Wei Kang and Long Qin and Weizhi Wang},
  year={2025},
  eprint={2509.14545},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2509.14545},
}
```

## 📄 License

This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details.
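
## 🧪 Sketch: Percentile Normalization

To make the normalization described in the Dilaprix formula section concrete, the sketch below scales each feature between its 5th and 95th percentile, clamps the result to $[0, 1]$, flips the features that are inversely related to difficulty, and averages the results. It is a minimal illustration, not the package's implementation: the equal-weight average is an assumption, the function names are hypothetical, and the percentile bounds are invented placeholders rather than values from the textbook dialogue corpus. Feature names follow the keys used in the prompt example.

```python
from typing import Dict, Tuple

# Features that decrease as difficulty increases (the set X' in the formula section).
INVERSE_FEATURES = {
    "flesch_reading_ease",
    "simple_words_ratio",
    "intermediate_words_ratio",
}


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def normalize(name: str, value: float, bounds: Dict[str, Tuple[float, float]]) -> float:
    """Scale one feature by its (5th, 95th) percentile bounds and clamp to [0, 1]."""
    alpha, beta = bounds[name]
    scaled = clamp((value - alpha) / (beta - alpha))
    # Inverse features are flipped so that a higher score always means harder text.
    return 1.0 - scaled if name in INVERSE_FEATURES else scaled


def dilaprix_sketch(features: Dict[str, float], bounds: Dict[str, Tuple[float, float]]) -> float:
    """Equal-weight average of the normalized features (an assumed aggregation)."""
    scores = [normalize(name, value, bounds) for name, value in features.items()]
    return sum(scores) / len(scores)


# Hypothetical percentile bounds, invented for illustration only.
BOUNDS = {
    "flesch_reading_ease": (40.0, 100.0),
    "tree_depth": (4.0, 14.0),
    "simple_words_ratio": (0.5, 1.0),
}

# A three-feature subset keeps the example short; the full metric uses 11 features.
features = {
    "flesch_reading_ease": 85.0,
    "tree_depth": 9.0,
    "simple_words_ratio": 0.8,
}

score = dilaprix_sketch(features, BOUNDS)
print(f"Sketch score: {score:.2f}")
```

Using percentile bounds rather than the corpus minimum and maximum keeps a single outlier utterance from compressing every other score, and the clamp absorbs values that fall outside the bounds.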