Class QuantitativeLinguistics
java.lang.Object
org.episteme.social.linguistics.quantitative.QuantitativeLinguistics
Implements fundamental laws of quantitative linguistics.
Provides scientific metrics for statistical language analysis.
- Since:
- 1.0
- Author:
- Silvere Martin-Michiellot, Gemini AI (Google DeepMind)
Method Summary
Modifier and Type, Method, Description
- static double calculateEntropy(Map<String, Long> wordFrequencies): Calculates the Shannon Entropy of a text based on word frequencies.
- static double calculateTTR(long vocabularySize, long totalTokens): Calculates the TTR (Type-Token Ratio).
- static double heapsLaw(long totalTokens, double K, double beta): Heaps' Law: Describes the number of distinct words (vocabulary size) in a document as a function of its length.
- static double menzerathAltmannLaw(double x, double a, double b, double c): Menzerath-Altmann Law: The more components a linguistic construct has, the smaller the components are. y = a * x^b * e^(cx)
- static double zipfLaw(int rank, double exponent, double constant): Zipf's Law: The frequency of any word is inversely proportional to its rank in the frequency table. f(r) = C / r^s
Method Details
zipfLaw
public static double zipfLaw(int rank, double exponent, double constant)
Zipf's Law: The frequency of any word is inversely proportional to its rank in the frequency table.
f(r) = C / r^s
- Parameters:
- rank - (r) The rank of the word (1-indexed).
- exponent - (s) The Zipfian exponent (usually close to 1.0).
- constant - (C) The normalizing constant.
- Returns:
- The theoretical frequency.
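As an illustration of the formula f(r) = C / r^s, the following is a minimal standalone sketch, not the library's source. The class `ZipfSketch`, the helper `zipf`, the rank check, and the sample values C = 1000 and s = 1.0 are all assumptions made for the example.

```java
// Illustrative sketch only: a local re-implementation of f(r) = C / r^s.
// ZipfSketch and zipf are hypothetical names, not part of the documented class.
final class ZipfSketch {

    // Mirrors the documented parameter order: rank (r), exponent (s), constant (C).
    static double zipf(int rank, double exponent, double constant) {
        if (rank < 1) {
            // Assumed guard: the documentation states that ranks are 1-indexed.
            throw new IllegalArgumentException("rank must be >= 1");
        }
        return constant / Math.pow(rank, exponent);
    }

    public static void main(String[] args) {
        // With s = 1.0, the predicted frequency at rank 2 is half that at rank 1.
        for (int rank = 1; rank <= 4; rank++) {
            System.out.println("rank " + rank + " -> " + zipf(rank, 1.0, 1000.0));
        }
    }
}
```

With these sample values, rank 1 maps to 1000 and rank 2 to 500, which is the inverse proportionality the law describes.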
heapsLaw
public static double heapsLaw(long totalTokens, double K, double beta)
Heaps' Law: Describes the number of distinct words (vocabulary size) in a document as a function of its length.
V = K * N^beta
- Parameters:
- totalTokens - (N) total number of tokens in the corpus.
- K - empirically determined constant (typically 10-100).
- beta - empirically determined exponent (typically 0.4-0.6).
- Returns:
- theoretical vocabulary size (V).
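The relation V = K * N^beta can be sketched as follows. This is a standalone illustration rather than the library's implementation; the class `HeapsSketch`, the helper `heaps`, and the sample values K = 10 and beta = 0.5 are assumptions chosen from within the typical ranges quoted above.

```java
// Illustrative sketch only: a local re-implementation of V = K * N^beta.
// HeapsSketch and heaps are hypothetical names, not part of the documented class.
final class HeapsSketch {

    // Mirrors the documented parameter order: totalTokens (N), K, beta.
    static double heaps(long totalTokens, double k, double beta) {
        if (totalTokens < 0) {
            // Assumed guard: a corpus cannot have a negative token count.
            throw new IllegalArgumentException("totalTokens must be >= 0");
        }
        return k * Math.pow(totalTokens, beta);
    }

    public static void main(String[] args) {
        // Vocabulary grows sublinearly: 100x more tokens gives 10x more types when beta = 0.5.
        long[] corpusSizes = {10_000L, 1_000_000L};
        for (long n : corpusSizes) {
            System.out.println(n + " tokens -> about " + heaps(n, 10.0, 0.5) + " distinct words");
        }
    }
}
```

With K = 10 and beta = 0.5, a corpus of one million tokens is predicted to contain roughly 10,000 distinct words.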
menzerathAltmannLaw
public static double menzerathAltmannLaw(double x, double a, double b, double c)
Menzerath-Altmann Law: The more components a linguistic construct has, the smaller the components are.
y = a * x^b * e^(cx)
- Parameters:
- x - number of components (e.g., syllables in a word).
- a - scaling coefficient of the formula.
- b - exponent of the power term x^b.
- c - coefficient of the exponential term e^(cx).
- Returns:
- (y) length of components (e.g., average phonemes in a syllable).
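A minimal sketch of y = a * x^b * e^(cx) is shown below. It is a standalone illustration, not the library's source; the class `MenzerathSketch`, the helper `menzerathAltmann`, and the parameter values a = 3.0, b = -0.5, c = 0.0 are hypothetical and chosen only so that the component length shrinks as the construct grows.

```java
// Illustrative sketch only: a local re-implementation of y = a * x^b * e^(cx).
// MenzerathSketch and menzerathAltmann are hypothetical names.
final class MenzerathSketch {

    // Mirrors the documented parameter order: x, a, b, c.
    static double menzerathAltmann(double x, double a, double b, double c) {
        if (x <= 0) {
            // Assumed guard: a construct needs at least some components.
            throw new IllegalArgumentException("x must be > 0");
        }
        return a * Math.pow(x, b) * Math.exp(c * x);
    }

    public static void main(String[] args) {
        // A negative b makes the predicted component length fall as x increases.
        for (int syllables = 1; syllables <= 4; syllables++) {
            double phonemesPerSyllable = menzerathAltmann(syllables, 3.0, -0.5, 0.0);
            System.out.println(syllables + " syllables -> " + phonemesPerSyllable);
        }
    }
}
```

With c = 0 the exponential term equals 1 and the formula reduces to a pure power law, which is a common simplification when fitting the three-parameter form is not needed.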
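calculateEntropy
public static double calculateEntropy(Map<String, Long> wordFrequencies)
Calculates the Shannon Entropy of a text based on word frequencies.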
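One way to compute Shannon entropy from a word-frequency map is sketched below. The source does not state the logarithm base or how an empty map is handled, so this example assumes base 2 (entropy in bits) and returns 0.0 for an empty text. The class `EntropySketch` and the helper `entropyBits` are hypothetical names, not the library's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: H = -sum(p_i * log2(p_i)) over relative word frequencies.
// Assumptions: log base 2, and 0.0 for an empty map.
final class EntropySketch {

    static double entropyBits(Map<String, Long> wordFrequencies) {
        long total = 0;
        for (long count : wordFrequencies.values()) {
            if (count < 0) {
                throw new IllegalArgumentException("counts must be >= 0");
            }
            total += count;
        }
        if (total == 0) {
            return 0.0; // assumed convention for an empty text
        }
        double entropy = 0.0;
        for (long count : wordFrequencies.values()) {
            if (count == 0) {
                continue; // a word that never occurs contributes nothing
            }
            double p = (double) count / total;
            entropy -= p * (Math.log(p) / Math.log(2.0));
        }
        return entropy;
    }

    public static void main(String[] args) {
        Map<String, Long> frequencies = new LinkedHashMap<>();
        frequencies.put("the", 2L);
        frequencies.put("cat", 1L);
        frequencies.put("sat", 1L);
        System.out.println(entropyBits(frequencies));
    }
}
```

For the sample map the probabilities are 0.5, 0.25 and 0.25, so the entropy is about 1.5 bits. A text consisting of a single repeated word has entropy 0.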
calculateTTR
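public static double calculateTTR(long vocabularySize, long totalTokens)
Calculates the Type-Token Ratio (TTR), a simple measure of lexical diversity: the number of distinct word types (vocabularySize) divided by the total number of tokens (totalTokens).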
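The ratio can be sketched as follows, including one way to obtain the two counts from a tokenized text. This is a standalone illustration, not the library's source; the class `TtrSketch`, the helper `ttr`, and the choice to return 0.0 when there are no tokens are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: TTR = distinct types / total tokens.
// TtrSketch and ttr are hypothetical names, not part of the documented class.
final class TtrSketch {

    static double ttr(long vocabularySize, long totalTokens) {
        if (totalTokens <= 0) {
            return 0.0; // assumed convention: avoid dividing by zero for an empty text
        }
        return (double) vocabularySize / totalTokens;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        Set<String> types = new HashSet<>(tokens); // duplicates collapse, leaving distinct types
        System.out.println(ttr(types.size(), tokens.size()));
    }
}
```

The sample sentence has 5 distinct types among 6 tokens, giving a ratio of 5/6. Because TTR tends to fall as a text gets longer, values are best compared across texts of similar length.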