The Lexical Measures function calculates various lexical diversity metrics from a given text. Each measure captures a different aspect of the text's complexity, diversity, or other linguistic features. The formulas and their purposes are explained below:
The total number of characters in the text and the proportion of characters to the total number of words.
Count: \( \text{char_count} \)
Proportion: \( \frac{\text{char_count}}{\text{word_count}} \) if word count is not zero.
The total number of words in the text.
Count: \( \text{word_count} \)
The total number of sentences in the text.
Count: \( \text{len(sentence_lengths)} \)
The count and proportion of function words in the text.
Count: \( \text{len(total_function_words)} \)
Proportion: \( \frac{\text{len(total_function_words)}}{\text{word_count}} \) if word count is not zero.
The count and proportion of content words in the text.
Count: \( \text{len(total_content_words)} \)
Proportion: \( \frac{\text{len(total_content_words)}}{\text{word_count}} \) if word count is not zero.
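The function-word and content-word counts above can be sketched as follows. This is a minimal illustration, not the function's actual implementation: the `FUNCTION_WORDS` set and the `function_content_counts` helper are hypothetical, and a real analysis would use a fuller lexicon or part-of-speech tags.

```python
# Assumption: a tiny illustrative function-word list, not the one the function actually uses.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "was", "it"}

def function_content_counts(tokens):
    """Split tokens into function and content words; report counts and proportions."""
    total_function_words = [t for t in tokens if t.lower() in FUNCTION_WORDS]
    total_content_words = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    word_count = len(tokens)
    # Proportions are only defined when the text has at least one word.
    function_prop = len(total_function_words) / word_count if word_count else 0.0
    content_prop = len(total_content_words) / word_count if word_count else 0.0
    return {
        "function_count": len(total_function_words),
        "function_proportion": function_prop,
        "content_count": len(total_content_words),
        "content_proportion": content_prop,
    }

result = function_content_counts("The cat sat on the mat".split())
print(result)
```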
The average number of words per sentence in the text.
Count: \( \frac{\text{sum(sentence_lengths)}}{\text{len(sentence_lengths)}} \) if there are sentences.
The average word length, i.e. the total number of characters divided by the total number of words in the text.
Count: \( \frac{\text{char_count}}{\text{word_count}} \)
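The basic counts above (characters, words, sentences, words per sentence, characters per word) might be computed roughly as below. The tokenization rules are assumptions made for this sketch: sentences end at `.`, `!` or `?`, words are runs of letters and apostrophes, and only characters inside words are counted. The helper name `basic_counts` is hypothetical.

```python
import re

WORD_PATTERN = r"[A-Za-z']+"  # assumption: a word is a run of letters/apostrophes

def basic_counts(text):
    """Return simple surface counts for a text under the stated tokenization assumptions."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sentence_lengths = [len(re.findall(WORD_PATTERN, s)) for s in sentences]
    words = re.findall(WORD_PATTERN, text)
    word_count = len(words)
    char_count = sum(len(w) for w in words)  # characters inside words only
    return {
        "char_count": char_count,
        "word_count": word_count,
        "sentence_count": len(sentence_lengths),
        # Guard against empty input before dividing.
        "avg_sentence_length": sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0.0,
        "avg_word_length": char_count / word_count if word_count else 0.0,
    }

counts = basic_counts("The cat sat. The dog ran away!")
print(counts)
```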
The idea density, i.e. the number of propositional ideas per word. The calculation method is not detailed here.
Count: Calculated value based on the text's propositional content.
The Type-Token Ratio (TTR), i.e. the ratio of unique words (types) to the total number of words (tokens) in the text, multiplied by 100 to express it as a percentage.
Count: \( \frac{\text{len(unique_types)}}{\text{total_tokens}} \times 100 \)
The Corrected Type-Token Ratio (CTTR) adjusts the traditional TTR for text length, making it more comparable across texts of different sizes.
Formula: \( \text{CTTR} = \frac{\text{unique types}}{\sqrt{2 \times \text{total tokens}}} \)
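As a small worked example of the two formulas above, the following sketch computes the percentage TTR and the CTTR from a list of tokens. The function name `ttr_and_cttr` is hypothetical, and tokens are assumed to be already normalized (for example lowercased).

```python
import math

def ttr_and_cttr(tokens):
    """Return (TTR as a percentage, CTTR) for a list of tokens."""
    total_tokens = len(tokens)
    if total_tokens == 0:
        return 0.0, 0.0  # both ratios are undefined for empty input; 0.0 is a convention
    unique_types = set(tokens)
    ttr = len(unique_types) / total_tokens * 100
    cttr = len(unique_types) / math.sqrt(2 * total_tokens)
    return ttr, cttr

# 8 tokens, 6 types: TTR = 6/8 * 100, CTTR = 6 / sqrt(16)
ttr, cttr = ttr_and_cttr("the cat sat on the mat the end".split())
print(ttr, cttr)
```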
A double-logarithmic measure of lexical diversity, in which the logarithm is applied twice to both the type count and the token count.
Count: \( \frac{\log(\log(\text{len(unique_types)}))}{\log(\log(\text{total_tokens}))} \)
A measure that is sensitive to text length and lexical richness.
Count: Calculated value based on Maas's formula.
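Maas's formula is not spelled out here. One commonly cited form is \( a^2 = \frac{\log(N) - \log(V)}{(\log(N))^2} \), where \( N \) is the number of tokens and \( V \) the number of types; lower values indicate greater diversity. The sketch below assumes that form and natural logarithms, which may differ from what the function actually computes. The name `maas_a2` is hypothetical.

```python
import math

def maas_a2(n_types, n_tokens):
    """Maas's a^2 under the assumed form (log N - log V) / (log N)^2, natural logs."""
    if n_types < 1 or n_tokens < 2:
        raise ValueError("need at least 1 type and 2 tokens")
    log_n = math.log(n_tokens)
    return (log_n - math.log(n_types)) / log_n ** 2

value = maas_a2(10, 100)
print(value)
```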
MSTTR divides the text into segments of equal size (e.g., 100 tokens) and calculates the Type-Token Ratio (TTR) for each segment. TTR is the ratio of unique words (types) to total words (tokens) in a segment. MSTTR is the average of these TTR values across all segments. It provides a more stable measure of lexical diversity by mitigating the length effect on TTR.
Formula: \( \text{MSTTR} = \frac{\sum (\frac{\text{unique types in segment}}{\text{total tokens in segment}})}{\text{number of segments}} \)
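The segment-and-average procedure described above can be sketched as follows. This version assumes that a trailing segment shorter than `segment_size` is discarded, which is one common choice; the function name `msttr` and that handling of the remainder are assumptions of this example.

```python
def msttr(tokens, segment_size=100):
    """Mean Segmental TTR: average the TTR of consecutive full-size segments."""
    n_segments = len(tokens) // segment_size  # assumption: drop the partial last segment
    if n_segments == 0:
        return 0.0  # text shorter than one segment; 0.0 is a convention here
    ttrs = []
    for i in range(n_segments):
        segment = tokens[i * segment_size:(i + 1) * segment_size]
        ttrs.append(len(set(segment)) / len(segment))
    return sum(ttrs) / n_segments

# Two full segments of 4 tokens (TTR 0.5 and 0.75); the leftover "x" is ignored.
score = msttr(["a", "b", "a", "a", "c", "d", "e", "e", "x"], segment_size=4)
print(score)
```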
Herdan's C, also known as the logarithmic TTR, is the ratio of the logarithm of unique types to the logarithm of total tokens. It offers another perspective on lexical richness by considering the logarithmic relationship between unique words and total words.
Count: \( \frac{\log(\text{len(unique_types)})}{\log(\text{total_tokens})} \)
Like Herdan's C, Summer's S is a logarithmic measure, but it applies the logarithm twice, offering a different scaling and sensitivity to lexical diversity.
Formula: \( \text{Summer's S} = \frac{\log(\log(\text{unique types}))}{\log(\log(\text{total tokens}))} \)
Dugast's U is another logarithmic measure. It divides the squared logarithm of the token count by the difference between the logarithms of the token count and the type count, so it is undefined when every token is unique.
Formula: \( \text{Dugast's U} = \frac{(\log(\text{total tokens}))^2}{\log(\text{total tokens}) - \log(\text{unique types})} \)
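The three logarithmic formulas above (Herdan's C, Summer's S, Dugast's U) can be compared side by side with a short sketch. Natural logarithms are an assumption: Herdan's C does not depend on the base, but Summer's S and Dugast's U do. The function name `log_measures` is hypothetical.

```python
import math

def log_measures(n_types, n_tokens):
    """Return Herdan's C, Summer's S and Dugast's U from type and token counts."""
    if n_types < 2 or n_tokens < 3:
        # log(log(1)) is undefined, and log(log(2)) makes S's denominator negative.
        raise ValueError("need at least 2 types and 3 tokens")
    log_v, log_n = math.log(n_types), math.log(n_tokens)
    herdan_c = log_v / log_n
    summer_s = math.log(log_v) / math.log(log_n)
    # U is undefined when every token is unique (log N - log V = 0).
    dugast_u = log_n ** 2 / (log_n - log_v) if n_types < n_tokens else math.inf
    return {"herdan_c": herdan_c, "summer_s": summer_s, "dugast_u": dugast_u}

measures = log_measures(10, 100)
print(measures)
```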
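Yule's K and Yule's I are measures of lexical diversity that rely on the distribution of word frequencies. Yule's K grows as words are repeated more often, so higher values indicate lower diversity, while Yule's I provides the inverse perspective, with higher values indicating greater diversity.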
Yule's K Formula: \( K = 10^4 \times \left( \frac{\sum{f_i^2 \times V_i}}{N^2} - \frac{1}{N} \right) \)
Yule's I Formula: \( I = \frac{V^2}{\sum{(f_i^2 \times V_i)} - V} \)
where \( f_i \) is the i-th distinct frequency value, \( V_i \) is the number of unique words that occur exactly \( f_i \) times in the text, \( N \) is the total number of words, and \( V \) is the number of unique words.
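The sum in both Yule formulas runs over the frequency spectrum, i.e. over each distinct frequency value and the number of unique words that have it. A minimal sketch, with the hypothetical function name `yule_k_and_i`:

```python
import math
from collections import Counter

def yule_k_and_i(tokens):
    """Return (Yule's K, Yule's I) computed from the frequency spectrum of the tokens."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    word_freqs = Counter(tokens)              # word -> frequency
    v = len(word_freqs)                       # number of unique words
    spectrum = Counter(word_freqs.values())   # frequency value -> number of words with it
    m2 = sum(f * f * v_f for f, v_f in spectrum.items())
    k = 1e4 * (m2 / n ** 2 - 1 / n)
    # If every word occurs exactly once, the denominator of I is zero.
    i = v * v / (m2 - v) if m2 != v else math.inf
    return k, i

# Frequencies: a=3, b=2, c=1, so the sum of f^2 * V_f is 9 + 4 + 1 = 14.
k, i = yule_k_and_i("a a a b b c".split())
print(k, i)
```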
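Simpson's D calculates the probability that two tokens selected at random without replacement are the same word, so lower values indicate greater diversity. It accounts for the abundance and rarity of words.
Formula: \( D = \frac{\sum (f_i \times (f_i - 1))}{N \times (N - 1)} \)
where \( f_i \) here is the frequency of the i-th unique word and \( N \) is the total number of words.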
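The formula above sums over unique words, with \( f_i \) the frequency of each one. As written, it gives the chance that two tokens drawn without replacement are the same word. A minimal sketch, with the hypothetical function name `simpsons_d`:

```python
from collections import Counter

def simpsons_d(tokens):
    """Probability that two tokens drawn without replacement are the same word."""
    n = len(tokens)
    if n < 2:
        return 0.0  # fewer than two tokens: no pair can be drawn; 0.0 is a convention
    same_pairs = sum(f * (f - 1) for f in Counter(tokens).values())
    return same_pairs / (n * (n - 1))

# Frequencies a=3, b=2, c=1 give (6 + 2 + 0) / (6 * 5).
d = simpsons_d("a a a b b c".split())
print(d)
```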
The Measure of Textual Lexical Diversity (MTLD) calculates the average length of sequential word strings in a text that maintain a specified level of lexical diversity, i.e. whose TTR stays above a chosen threshold. Each such string counts as one factor. MTLD is less sensitive to text length than TTR.
MTLD Formula: \( \text{MTLD} = \frac{\text{total number of tokens}}{\text{number of factors}} \)
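The factor counting behind the MTLD formula can be sketched as below. Several details are assumptions drawn from commonly described versions of the procedure rather than from this document: a TTR threshold of 0.72, closing a factor when the running TTR is at or below the threshold, crediting the leftover tokens with a partial factor, and averaging a forward and a backward pass. The function names are hypothetical.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """Total tokens divided by the number of factors, for a single pass over the text."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1.0            # running TTR dropped to the threshold: close a factor
            types, count = set(), 0
    if count > 0:
        # Partial factor for the remainder, proportional to how far its TTR has fallen.
        remainder_ttr = len(types) / count
        factors += (1.0 - remainder_ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld(tokens, threshold=0.72):
    """Average of a forward and a backward pass (assumed, as in common descriptions)."""
    if not tokens:
        return 0.0
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(list(reversed(tokens)), threshold)
    return (forward + backward) / 2

print(mtld(["a", "b", "a", "a"]))
```

A text in which no word repeats never completes a factor, so this sketch returns infinity for it; an implementation may prefer a different convention.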
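The Hypergeometric Distribution Diversity (HDD) score is a probabilistic measure of lexical diversity based on the chance that each word appears at least once in a random sample of tokens drawn from the text.
HDD Formula: \( \text{HDD} = \sum_i \left(1 - \frac{\binom{f_i}{0} \binom{N - f_i}{d}}{\binom{N}{d}} \right) \)
where \( f_i \) is the frequency of the i-th unique word, \( N \) is the total number of words, and \( d \) is the sample size.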
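Each term of the HDD sum is the probability that a given word appears at least once in a random sample of \( d \) tokens drawn without replacement. The sketch below follows the formula as written; the default sample size of 42 is an assumption taken from common usage, and some formulations additionally divide each term by the sample size. The function name `hdd` is hypothetical.

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """Sum over unique words of P(word appears at least once in a random sample)."""
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sample size")
    total = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability that the sample contains none of this word.
        p_absent = comb(n - freq, sample_size) / comb(n, sample_size)
        total += 1.0 - p_absent
    return total

# With a sample of 2 tokens from "a a b c": 5/6 for "a", 1/2 each for "b" and "c".
score = hdd("a a b c".split(), sample_size=2)
print(score)
```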
Each of these measures offers a unique lens through which to view the text's lexical characteristics, enabling a comprehensive analysis of its linguistic diversity and complexity.