Research Article - (2023) Volume 2, Issue 4
LC-Score: Reference-less Estimation of Text Comprehension Difficulty
Research Article; J Electrical Electron Eng, 2023; Volume 2 | Issue 4 | 490; DOI: 10.33140/JEEE.02.04.13
Paul Tardy, Charlotte Roze and Paul Poupet*
Greater Paris Metropolitan Region, France
*Corresponding Author : Paul Poupet, Greater Paris Metropolitan Region
Submitted: 2023, Oct 09; Accepted: 2023, Oct 30; Published: 2023, Nov 21
Citation: Tardy, P., Roze, C., Poupet, P. (2023). LC-Score: Reference-less Estimation of Text Comprehension Difficulty. J Electrical Electron Eng, 2(4), 490-495.
Abstract
Being able to read and understand written text is critical in a digital era. However, studies show that a large fraction of the population experiences comprehension issues. In this context, further accessibility initiatives are required to improve text comprehension for the audience. However, writers are hardly assisted or encouraged to produce easy-to-understand content. Moreover, Automatic Text Simplification (ATS) model development suffers from the lack of a metric to accurately estimate comprehension difficulty. We present LC-SCORE, a simple approach to training a reference-less text comprehension metric for any French text, i.e. predicting how easy to understand a given text is on a [0,100] scale. Our objective with this scale is to quantitatively capture the extent to which a text conforms to the Langage Clair (LC, Clear Language) guidelines, a French initiative closely related to English Plain Language. We explore two approaches: (i) using linguistically motivated indicators to train statistical models, and (ii) neural learning directly from text, leveraging pre-trained language models. We introduce a simple proxy task that frames comprehension difficulty training as a classification task. To evaluate our models, we run two distinct human annotation experiments, and find that both approaches (indicator-based and neural) outperform commonly used readability and comprehension metrics such as FKGL and SAMSA.
1. Introduction
The ability to understand text is essential for a wide range of daily tasks. It enables individuals to stay informed, understand administrative forms, and have full, unimpeded access to social and medical care. Studies show that a large fraction of the population experiences comprehension issues in their daily life: almost half of the OECD population shows difficulties with reading and understanding written information [1,2].
Such difficulties have a major impact on people’s lives. In France, for example, the National Statistics Institute (INSEE, 2012) reports that one person out of four has already abandoned an administrative procedure deemed too complicated to follow.
In order to improve written text accessibility, initiatives such as Plain Language or Langage Clair (LC, Clear Language) define writing guidelines for producing clearer texts. Moreover, comprehension is making its way into international standards and norms, but still lacks concrete solutions and measurable objectives [3,4].
With the rise of deep-learning approaches in natural language processing, and their recent successes in a wide variety of tasks (transcription, translation, summarization, question answering), Automatic Text Simplification is an interesting candidate for accessibility improvements at scale. However, system performance is difficult to measure due to the limitations of current automatic metrics [5].
We hypothesize that the development of better text comprehension metrics could provide Automatic Text Simplification researchers with a way of validating their models, while also giving content editors measurable objectives for writing clearer texts.
In this context, we focus our work on developing models for reference-less text comprehension evaluation, i.e. a scoring function s: text → [0,100] for French texts, reflecting how clearly written a text is.
In this paper, we present the following contributions:
• We introduce a simple approach to address comprehension evaluation as a classification task.
• We introduce a set of linguistically motivated lexical, syntactic and structural indicators.
• We train both indicator-based models and text-based neural models.
• We evaluate our models with two human annotation experiments: one using crowd-sourced human judgement, the other using expert ratings.
2. Related Work
Defining what makes a text difficult to understand is a complex task in itself. Multiple approaches have been explored, such as studying the age at which children acquire complex syntactic constructions in French, or relying on standardized foreign-language levels such as the Common European Framework of Reference (CEFR), ranging from A1 to C2, which has been used to study the difficulty of French as a Foreign Language [6,7].
In order to improve text clarity, some organizations have produced writing guidelines, i.e. suggestions of good practices for writing clear texts, such as Plain Language; in French, guidelines for adapting texts to increase their readability and comprehension have also been published [8]. More closely related to our work, a readability formula for French as a foreign language has been introduced [24].
Automatic Text Simplification aims at generating simpler versions of source texts. In the literature, such models are usually evaluated using automatic metrics: standard language levels and writing guidelines are hardly suitable for evaluating simplification models, since they would require expert judgement. Automatic evaluation instead mostly relies on readability metrics such as FKGL, SMOG and the Gunning Fog Index. Such metrics were designed with English in mind but can be used on French in practice. On the other hand, SAMSA, a semantic metric, is currently not implemented for French, as discussed in Section 3.1 [10-12].
Other approaches include learning regression and classification models, or using pre-trained language models [13,14]. However, automatic metrics have been found to remain unsuitable for evaluating progress in Automatic Text Simplification [5].
3. Methods
3.1 Baseline Metrics
In order to evaluate our work with respect to the literature, we take the following existing readability metrics as baselines: FKGL, SMOG and Gunning Fog.
The SAMSA metric takes semantics into consideration. Even though it would be theoretically possible to adapt this metric for French, it is not yet implemented. We tried adapting the existing implementation from EASSE, based on CoreNLP, but it failed due to the lack of a French lemmatization model [15-17].
3.2 Evaluating Text Comprehension Difficulty as a Classification Task
Training a model to predict comprehension difficulty would require a text corpus annotated with comprehension scores. However, to the best of our knowledge, there is no such corpus targeting the general audience and of sufficient size to envision model training. In this context, we rely on a simpler proxy task: a classification between simple and complex texts. Defining what makes a text simple or complex is itself difficult. In order to bypass this question, we use pairs of content sources in which one is roughly a simplified version of the other:
Encyclopedia articles from French Wikipedia (complex) and its simpler alternative, Vikidia (simple), designed for 8-to-13-year-old readers. We only take the introduction paragraph into consideration, as it is a concise and synthetic presentation of the article. Articles are aligned, i.e. the corpus consists of (simple, complex) pairs.
International radio journal transcriptions, with the France Culture international press review (complex) and the RFI Journal en Français Facile (simple), aimed at French speakers who do not use the language on a daily basis. Articles cover similar subjects (international news) but are not strictly aligned, i.e. there is no (complex, simple) pair for a given article.
| Corpus | #T | #W/#T | #W/#S |
|---|---|---|---|
| Wikipedia | 25812 | 144 | 26.0 |
| Vikidia | 25812 | 80 | 18.9 |
| France Culture | 1402 | 1106 | 28.8 |
| Journal en Français Facile | 1555 | 1494 | 19.0 |
Table 1: Comprehension classification datasets: number of texts per corpus (#T), average number of words per text (#W/#T) and average number of words per sentence (#W/#S).
We report statistics for this new corpus in Table 1.
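For concreteness, the following is a minimal sketch, not the authors' actual preprocessing code, of how the Table 1 statistics (#T, #W/#T, #W/#S) could be computed. The blank French spaCy pipeline with a sentencizer is an assumption, since the exact tokenization used is not specified.

```python
# Minimal sketch of corpus statistics computation (assumed tooling: spaCy).
import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("sentencizer")

def corpus_stats(texts):
    docs = list(nlp.pipe(texts))
    n_texts = len(docs)
    n_words = sum(sum(1 for tok in d if not tok.is_punct and not tok.is_space)
                  for d in docs)
    n_sents = sum(sum(1 for _ in d.sents) for d in docs)
    return {"#T": n_texts, "#W/#T": n_words / n_texts, "#W/#S": n_words / n_sents}

print(corpus_stats(["Ceci est un texte. Il contient deux phrases."]))
```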
3.3 Linguistic Indicators
Drawing on work on Langage Clair, we introduce a set of complexity indicators. Indicators range from lexical difficulties (e.g. a word difficulty score) to syntactic difficulties and structural properties such as sentence parse tree height. Indicators are detailed below.
Indicators are detected by our own rule-based implementation on top of a spaCy pipeline, using both dependency and constituency parsing, with fr_dep_news_trf and benepar respectively.
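As an illustration, here is a minimal sketch of such a parsing setup. The model names (fr_dep_news_trf, benepar_fr2) are the publicly released ones and are assumed here; they are not necessarily the exact versions used by the authors.

```python
# Minimal sketch: French dependency parsing (spaCy) plus constituency
# parsing (benepar). Requires the models to be installed, e.g.
# `python -m spacy download fr_dep_news_trf` and benepar.download("benepar_fr2").
import spacy
import benepar  # registers the "benepar" spaCy component

nlp = spacy.load("fr_dep_news_trf")
nlp.add_pipe("benepar", config={"model": "benepar_fr2"})

def dependency_height(token):
    """Height of the dependency subtree rooted at `token`."""
    return 1 + max((dependency_height(c) for c in token.children), default=0)

doc = nlp("La phrase que nous analysons contient une relative.")
for sent in doc.sents:
    print("dependency tree height:", dependency_height(sent.root))
    print("constituency parse:", sent._.parse_string)  # provided by benepar
```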
Lexical Indicators (5): These are indicators of difficulties at the word level. We use a word difficulty score based on word frequencies in corpora of different difficulty levels: elementary school textbooks of various grades from Manulex, and French as a Foreign Language textbooks of various CEFR (Common European Framework of Reference for Languages) levels from FLELex [18]. Lexical indicators also include abbreviations, acronyms, named entities and numerical expressions.
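As an illustration of a frequency-based word difficulty score, the sketch below uses tiny hypothetical lookup tables standing in for Manulex and FLELex; the actual resources and the authors' exact aggregation formula are not reproduced here.

```python
# Hypothetical lookups: earliest school grade (Manulex-style) and earliest
# CEFR level (FLELex-style) at which a word is observed.
GRADE_OF_WORD = {"chat": 1, "maison": 1, "paradoxe": 5}
CEFR_OF_WORD = {"chat": "A1", "maison": "A1", "paradoxe": "C1"}
CEFR_RANK = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
MAX_GRADE, MAX_CEFR = 5, 6

def word_difficulty(word: str) -> float:
    """Difficulty in [0, 1]; out-of-vocabulary words are treated as hardest."""
    grade = GRADE_OF_WORD.get(word.lower(), MAX_GRADE)
    cefr = CEFR_RANK[CEFR_OF_WORD.get(word.lower(), "C2")]
    return 0.5 * grade / MAX_GRADE + 0.5 * cefr / MAX_CEFR

print(word_difficulty("maison"), word_difficulty("paradoxe"))
```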
Sentence Length Indicators (3): We measure sentence lengths with the average number of words per sentence, and with dependency and constituency parse tree heights.
Syntactic Indicators (17): Several syntactic difficulties are identified at the sentence level, related to sentence structure: coordinate clauses, relative clauses, adverbial clauses, participle clauses, cleft structures, interpolated clauses, appositive phrases, enumerations, etc. Information about verb forms is also detected: non-finite clauses, passive voice, complex verbal tenses, and conditional mood. Negation marks, complex noun phrases and text spans between brackets are also included in the syntactic indicators.

Structure Indicators (3): Two indicators are related to the presence of connectives and their potential complexity, estimated from syntactic information (e.g. clause position for conjunction connectives, sentence-initial position for adverbial connectives) and information from a French connectives lexicon [19]. A third indicator counts temporal breaks (i.e. tense changes) within text paragraphs.
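The sketch below illustrates how two of the syntactic indicators above (relative clauses and passive voice) can be detected from a dependency parse, assuming Universal Dependencies labels as produced by fr_dep_news_trf. These are illustrative rules, not the authors' rule set.

```python
# Two example syntactic indicators over a spaCy Doc parsed with a UD French
# model (e.g. the pipeline sketched earlier).
def count_relative_clauses(doc):
    # Relative clauses are attached with the UD relation "acl:relcl".
    return sum(1 for tok in doc if tok.dep_ == "acl:relcl")

def count_passive_clauses(doc):
    # A verb is counted as passive when it governs a passive auxiliary.
    return sum(1 for tok in doc
               if any(child.dep_ == "aux:pass" for child in tok.children))
```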
We train models using scikit-learn: two linear models (Linear SVC and Ridge) for a fairer comparison to linear readability metrics, and two non-linear models (Random Forest and Multi-Layer Perceptron).
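A minimal sketch of this training step is given below, with random placeholder data standing in for the indicator matrix. The model list mirrors the one above (Ridge is implemented as a classifier here for simplicity) and the hyper-parameters are illustrative, not the authors' settings.

```python
# Indicator-based classification sketch: X holds one row of indicator values
# per text (placeholder random data here), y is 1 = simple, 0 = complex.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((1000, 28))           # placeholder: 28 indicator values per text
y = rng.integers(0, 2, size=1000)    # placeholder labels

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.10, random_state=0)  # 10% held-out validation set

models = {
    "Linear SVC": make_pipeline(StandardScaler(), LinearSVC()),
    "Ridge": make_pipeline(StandardScaler(), RidgeClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "valid acc:", model.score(X_valid, y_valid))
```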
3.4 Neural Methods based on Text
Even though indicator-based approaches rely on linguistic motivations, they cannot learn deeper relationships throughout the text, such as the subject, the context and the semantics, which might carry essential information for inferring comprehension difficulty. This is why we compare indicator-based methods with deep-learning approaches relying directly on the text.
We use two French pre-trained language models, BARThez and CamemBERT, fine-tuned with either a classification (C) or a regression (R) objective [20,21].
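The following is a minimal sketch, not the authors' exact training setup, of fine-tuning CamemBERT on the simple-vs-complex proxy task with a classification objective and turning its output probability into a [0,100] score (score = 100 × P(simple)); the model name and hyper-parameters are publicly available defaults, assumed for illustration.

```python
# Classification-objective fine-tuning sketch with HuggingFace transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=2)  # label 1 = simple, 0 = complex

# ... fine-tune `model` on the (text, label) pairs from section 3.2,
# e.g. with transformers.Trainer, before using it for scoring ...

def lc_style_score(text: str) -> float:
    """Score in [0, 100]: 100 x P(simple) under the fine-tuned classifier."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return 100.0 * torch.softmax(logits, dim=-1)[0, 1].item()

print(lc_style_score("Le chat dort sur le canapé."))
```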
4. Comprehension Difficulty Annotation
We ran two human annotation experiments in two different contexts: the first one using Mechanical Turk, a crowd-sourcing platform, to collect annotations from general-audience French speakers (4.1); the second based on the feedback of Langage Clair experts in our team (4.2).
4.1 Crowd-Sourced Human Annotation
In order to get the most reliable annotations, we use a Best-Worst Scaling (BWS) technique, following the recommendation to use a comparison task instead of direct assessment (i.e. directly assigning a score to a given text) [22]. More specifically, BWS presents k (typically k = 4) examples simultaneously and asks the annotator to select the best one and the worst one with respect to the dimension of interest (text comprehension difficulty in our context).
When annotating texts of up to 200 words, preliminary experiments showed that comparing k = 4 simultaneous texts was too long and tedious. We therefore reduce to k = 3.
The annotation campaign covers T = 48 news articles (up to 200 words each). Each text appears in e = 12 different examples of k = 3 texts. Each example is annotated by a = 3 separate annotators, drawn from a pool of 26 annotators in total. We end up with a total of E = (T × e)/k = 192 examples and E × a = 576 annotations, i.e. for any three texts {Ta, Tb, Tc} the annotation task consists in submitting an ordered set, e.g. Tc > Ta > Tb.
Each text Ti is assigned an annotation score score(i) = %best(i) − %worst(i), where %best(i) (resp. %worst(i)) is the frequency at which Ti was judged the best (resp. worst) text of the 3.
In order to measure the reliability of an annotation experiment, a common practice is to measure inter-annotator agreement. However, in a BWS process, each annotator is presented with a different set of examples, which makes the concept of annotator agreement less relevant. Moreover, disagreement is even beneficial for producing accurate annotations: for two items A and B of similar difficulty, we can expect half of the annotators to rate A > B and the other half B > A. From this apparent disagreement emerges diversity that actually reinforces score accuracy. For this reason, BWS is instead evaluated with reproducibility metrics such as Split-Half Reliability (SHR). SHR is the correlation between two randomly sampled halves of the annotations. In practice, we average SHR over 1000 iterations to rule out randomness.
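A minimal sketch of the BWS scoring and SHR computation is given below. The annotation data structure and the use of Spearman correlation between the two halves are assumptions, not the authors' exact scripts.

```python
# BWS scoring and Split-Half Reliability sketch. `annotations` is assumed to
# be a list of (best_id, worst_id, shown_ids) tuples, one per submitted example.
import random
from collections import Counter
from scipy.stats import spearmanr

def bws_scores(annotations):
    best, worst, shown = Counter(), Counter(), Counter()
    for b, w, ids in annotations:
        best[b] += 1
        worst[w] += 1
        for t in ids:
            shown[t] += 1
    # score(i) = %best(i) - %worst(i), as defined above
    return {t: 100.0 * (best[t] - worst[t]) / shown[t] for t in shown}

def split_half_reliability(annotations, iterations=1000):
    correlations = []
    for _ in range(iterations):
        shuffled = random.sample(annotations, len(annotations))
        half = len(shuffled) // 2
        s1, s2 = bws_scores(shuffled[:half]), bws_scores(shuffled[half:])
        common = sorted(set(s1) & set(s2))
        correlations.append(spearmanr([s1[t] for t in common],
                                      [s2[t] for t in common]).correlation)
    return sum(correlations) / len(correlations)
```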
4.2 Expert Annotation
In addition to the crowd-sourced corpus, our team built a small corpus of 74 texts annotated with difficulty scores. We selected 37 texts originating from news articles, literature, and customer-support mails. In addition, we produced 37 manually simplified versions following the Langage Clair methodology. Each of the 74 resulting texts was then scored on a [0,100] scale by 4 LC experts from our team.
To make sure we obtained good-quality annotations, we measure annotator agreement with the Intraclass Correlation Coefficient [23]. ICC2 ranges from 0 (no agreement) to 1 (perfect agreement).
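A minimal sketch of this agreement check is shown below, using pingouin's ICC implementation on a small hypothetical long-format table of ratings (one row per text/rater pair).

```python
# ICC2 agreement sketch on hypothetical expert ratings.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "text":  ["t1", "t1", "t1", "t1", "t2", "t2", "t2", "t2"],
    "rater": ["r1", "r2", "r3", "r4", "r1", "r2", "r3", "r4"],
    "score": [80, 75, 85, 78, 30, 25, 35, 32],
})
icc = pg.intraclass_corr(data=ratings, targets="text",
                         raters="rater", ratings="score")
# ICC2: two-way random effects, absolute agreement, single rater.
print(icc.loc[icc["Type"] == "ICC2", "ICC"].item())
```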
5. Results
5.1 Annotation Results
Text length metrics and reliability measures for both annotation experiments are reported in Table 2.
Good reliability from MTurk and Expert. Even though our annotation experiments are very different in terms of annotators and process, both show high reliability, achieving an SHR correlation of 64.7 (MTurk) and an Intraclass Correlation Coefficient of 74.6 (Experts) respectively.
Filtering MTurk workers does not increase reliability. A common practice with crowd-sourced annotation is to filter out users who show the lowest agreement. Even though we discussed in 4.1 that agreement is not the most relevant metric for BWS annotation, we challenge this hypothesis by calculating a worker agreement rate based on how often a given worker submits the same result as another worker. We then suppose that workers with the lowest agreement rate might add noise to the experiment, so we might want to exclude them. However, results showed the opposite: filtering out workers does not increase reliability in terms of SHR, regardless of the individual agreement rates. This observation is in line with the hypothesis that annotator disagreement is expected and beneficial in a BWS annotation experiment.
| | MTurk | Expert |
|---|---|---|
| #T | 48 | 37 / 37 |
| #W/#T | 183 | 190 / 209 |
| #W/#S | 25 | 28 / 13 |
| #Annotators | 26 | 4 |
| Type | BWS | RS |
| Reliability Measure | SHR | ICC2 |
| Reliability | 64.7 | 74.6 |
Table 2: Human annotation experiments. Corpora are reported with the number of texts (#T), average words per text (#W/#T) and average words per sentence (#W/#S). Since the Expert corpus is aligned, metrics are reported for both sides (original / simplified). The experiments use two different annotation processes: (i) Best-Worst Scaling (BWS), evaluated in terms of Split-Half Reliability (SHR), and (ii) a [0,100] Rating Scale (RS, 100 is best), evaluated with the Intraclass Correlation Coefficient (ICC2).
| Model | Valid acc% | MTurk ρ | Expert ρ |
|---|---|---|---|
| SMOG | - | -18.68 | -73.09 |
| Gunning Fog | - | -12.59 | -82.14 |
| FKGL | - | -19.66 | -77.54 |
| Linear SVC | 73.07 | 20.94 | 69.37 |
| Ridge | - | 27.58 | 86.44 |
| MLP | 75.31 | 32.56 | 85.73 |
| Random Forest | 77.20 | 34.42 | 88.09 |
| BARThez | 79.64 | 23.16 | 58.41 |
| CamemBERT (R) | 91.01 | 28.35 | 75.85 |
| CamemBERT (C) | 90.15 | 18.44 | 84.73 |
Table 3: Spearman correlations (ρ) of scoring models with human judgement, along with validation accuracy. (C) and (R) indicate classification and regression training objectives respectively.
5.2 Scoring Results
First, we evaluate model performance with respect to their own training by measuring accuracy on a validation set: a 10% held-out subset of the training set. Validation accuracy is used to select the best hyper-parameters and training iterations for each model.
Models are then evaluated against human annotations from MTurk and Expert using Spearman rank correlation (ρ).
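A minimal sketch of this evaluation step, assuming model and human scores are stored as per-text dictionaries:

```python
# Spearman correlation between model scores and human annotation scores.
from scipy.stats import spearmanr

def correlation_with_humans(model_scores: dict, human_scores: dict) -> float:
    texts = sorted(set(model_scores) & set(human_scores))
    rho, _ = spearmanr([model_scores[t] for t in texts],
                       [human_scores[t] for t in texts])
    return 100.0 * rho  # reported on a 0-100 scale, as in Table 3
```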
Results are reported in Table 3. Our approaches show better correlations with human judgement than readability metrics. Models trained from indicators achieve the highest correlations, with Random Forest being the best on both evaluation sets, MTurk and Expert.
It is also interesting that even simple linear statistical models based on our indicators outperform readability metrics, arguing in favor of this indicator set. In particular, the Ridge regression model outperforms FKGL by 14.76 and 10.55 correlation points on MTurk and Expert respectively.
Readability metrics seem complementary, in that FKGL achieves a better correlation on the MTurk evaluation while Gunning Fog does on Expert.
Similarly, we observe notable differences between CamemBERT training objectives, with regression (R) being better on MTurk and classification (C) on Expert.
6. Discussion
Results show a large improvement in correlation with human judgement in favor of our approaches over existing readability metrics. Moreover, indicator-based methods outperform neural models fine-tuned from pre-trained language models. Neural models' results are nonetheless promising and could be improved with longer training and a training objective adapted to produce evenly distributed scores.
In addition to outperforming neural models, indicator-based models are far cheaper to train and run since they do not require a GPU. Being indicator-based also makes them easier to interpret and more predictable than neural models, which might deliver a better user experience. We observed that the neural models we trained tend to produce very polarized output probabilities, i.e. either very close to 0 or to 1. This is not a problem for quantitative evaluation of the resulting score, but the objective should probably be adapted to output evenly distributed scores in order to be more intuitive.
7. Conclusion
Developing methods to accurately measure written text comprehension difficulty is a key challenge: it would help better assess the quality of Automatic Text Simplification models, and provide editors with a tool to produce texts that are simpler to understand.
We explored multiple approaches for training a reference-less metric based on a simple classification task. Our systems rely either on linguistic indicators or learn directly from the text.
To evaluate our models, we ran two human annotation experiments. The first involves crowd-sourced workers, asked to compare texts based on their comprehension difficulty using Best-Worst Scaling with k = 3. In the second experiment, texts are simplified and then rated on a [0,100] scale by experts from our team.
Both neural and indicator-based methods show promising results and largely outperform other broadly used readability metrics, on both crowd-sourced and expert human annotations. Even simple linear models largely outperform readability metrics, which adds evidence against using the latter to estimate text comprehension difficulty [24-27].
As further research, we suggest exploring multi-lingual neural training. This would have the obvious benefit of overcoming the language restriction of our work, while also sharing learning across languages and unifying comprehension difficulty estimation across languages.
Lay Summary
Nowadays, most services use the Internet as their primary way of communicating. Therefore, being able to read and understand texts is really important. But a lot of people have difficulties reading and understanding so it is not simple for them to access information or complete administrative procedures. We introduce a method to calculate a difficulty score for French texts. A score of 0 means that the text is really difficult to understand, whereas a score of 100 means it is really clear. We suggest that developing such a score is a first step toward helping people write easier texts. We gathered two categories of texts: some that we consider easy to understand and others that we consider difficult to understand. Then, we trained models to predict whether a text is categorized as “easy” or not. After training, we use the predictions as our scoring method: the score corresponds to the probability (multiplied by 100) that a text is categorized as easy by the model.
We explored two kinds of models. For the first one, we count different kinds of linguistic difficulties and give these counts to the model, which predicts the difficulty. The second kind is deep neural networks that have already been trained on French. We specialize them in predicting difficulty from the text by providing examples of texts and their difficulties.
To measure how relevant our models are, we asked people on the Internet as well as experts to give their opinions on texts: they were given texts and asked to determine how difficult they were. We found that people's judgements agreed more with our method's scores than with other existing scoring methods.
References
- OECD. (2013). OECD Skills Outlook 2013: First Results from the Survey of Adult Skills.
- Štajner, S. (2021). Automatic text simplification for social good: Progress and challenges. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2637-2652.
- ISO 24495. 2023. Plain language – Part 1: Governing principles and guidelines. Standard, International Organization for Standardization.
- WCAG. (2018). Web Content Accessibility Guidelines (WCAG) 2.1 – 3.1 Understanding. Standard, W3C.
- Alva-Manchego, F., Scarton, C., & Specia, L. (2021). The (un) suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4), 861-889.
- Canut, E. (2014). Acquisition des constructions syntaxiques complexes chez l’enfant français entre 2 et 6 ans. In SHS Web of Conferences (Vol. 8, pp. 1437-1452). EDP Sciences.
- Wilkens, R., Alfter, D., Wang, X., Pintard, A., Tack, A., Yancey, K. P., & François, T. (2022, June). Fabra: French aggregator-based readability assessment toolkit. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 1217-1233).
- Gala, N., Todirascu, A., Javourey-Drevet, L., Bernhard, D., Wilkens, R., & Meyer, J. P. (2020). Recommandations pour des transformations de textes français afin d'améliorer leur lisibilité et leur compréhension (Doctoral dissertation, ANR).
- François, T., Gala, N., Watrin, P., & Fairon, C. (2014, May). FLELex: a graded lexical resource for French foreign learners. In International conference on Language Resources and Evaluation (LREC 2014).
- Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel.
- Mc Laughlin, G. H. (1969). SMOG grading-a new readability formula. Journal of reading, 12(8), 639-646.
- Gunning, R. (1952). The technique of clear writing.
- Sulem, E., Abend, O., & Rappoport, A. (2018). Semantic structural evaluation for text simplification. In NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics. Human Language Technologies - Proceedings of the Conference, volume 1, pages 685–696.
- Martin, L., Humeau, S., Mazaré, P. E., Bordes, A., de La Clergerie, É. V., & Sagot, B. (2018). Reference-less quality estimation of text simplification systems.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). Bertscore: Evaluating text generation with bert.
- Alva-Manchego, F., Martin, L., Scarton, C., & Specia, L. (2019). EASSE: Easier automatic sentence simplification evaluation.
- Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations (pp. 55-60).
- Lété, B., Sprenger-Charolles, L., & Colé, P. (2004). MANULEX: A grade-level lexical database from French elementary school readers. Behavior Research Methods, Instruments, & Computers, 36(1), 156-166.
- Roze, C., Danlos, L., & Muller, P. (2012). LEXCONN: a French lexicon of discourse connectives. Discours. Revue de linguistique, psycholinguistique et informatique. A journal of linguistics, psycholinguistics and computational linguistics, (10).
- Eddine, M. K., Tixier, A. J. P., & Vazirgiannis, M. (2020). Barthez: a skilled pretrained french sequence-to-sequence model.
- Martin, L., Muller, B., Suárez, P. J. O., Dupont, Y., Romary, L., de La Clergerie, É. V., & Sagot, B. (2020). CamemBERT: a tasty French language model.
- Kiritchenko, S., & Mohammad, S. M. (2017). Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation.
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2), 420.
- François, T., & Fairon, C. (2012, July). An “AI readability” formula for French as a foreign language. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 466-477).
- Jonas, N. (2012). For the most recent generations, adult difficulties decrease in writing, but increase in calculation. Insee premiere. (1426).
- Wallonie-Bruxelles, F., & Leys, M. (2017). Write to be read: How to write administrative texts that are easy to understand? Wallonia-Brussels Federation.
- PLAIN. (2023). Federal Plain Language Guidelines. Standard, Plain Language Action and Information Network (PLAIN).
Copyright: ©2023 Paul Poupet, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.