
Released

Journal Article

Generalized entropies and the similarity of texts

MPS-Authors

Altmann, Eduardo G.
Max Planck Institute for the Physics of Complex Systems, Max Planck Society

Dias, Laércio
Max Planck Institute for the Physics of Complex Systems, Max Planck Society

Gerlach, Martin
Max Planck Institute for the Physics of Complex Systems, Max Planck Society

Fulltext (public)
There are no public fulltexts stored in PuRe
Supplementary Material (public)
There is no public supplementary material available
Citation

Altmann, E. G., Dias, L., & Gerlach, M. (2017). Generalized entropies and the similarity of texts. Journal of Statistical Mechanics: Theory and Experiment, 2017: 014002. doi:10.1088/1742-5468/aa53f5.


Cite as: http://hdl.handle.net/21.11116/0000-0001-2997-7
Abstract
We show how generalized Gibbs-Shannon entropies can provide new insights into the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this holds not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) to the different generalized entropies, and to estimate the size of the databases needed for a reliable estimation of the divergences. We test our results on large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
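The quantities described in the abstract can be illustrated with a minimal sketch. The snippet below assumes a Tsallis-type generalization of order α (which recovers the Shannon entropy as α → 1) and the corresponding generalized Jensen-Shannon divergence built from it; the paper's exact definitions and estimators may differ, and the function names and the toy texts are illustrative only.

```python
import math
from collections import Counter

def generalized_entropy(probs, alpha):
    """Tsallis-type entropy of order alpha over a list of probabilities.
    For alpha -> 1 this reduces to the Shannon entropy (natural log)."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)

def generalized_jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence between two word-frequency
    distributions given as dicts {word: probability}."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = [(p.get(w, 0.0) + q.get(w, 0.0)) / 2.0 for w in vocab]
    return (generalized_entropy(m, alpha)
            - 0.5 * generalized_entropy(list(p.values()), alpha)
            - 0.5 * generalized_entropy(list(q.values()), alpha))

def word_freqs(text):
    """Empirical word-frequency distribution of a text."""
    counts = Counter(text.lower().split())
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}
```

For identical texts the divergence vanishes, and varying α shifts the weight given to high- versus low-frequency words, which is the mechanism the abstract exploits to identify which frequency range dominates each generalized entropy.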