SCL Seminar by Alexandru Nicolin

The SCL seminar of the Center for the Study of Complex Systems will be held on Thursday, 10 November 2016 at 14:00 in the library reading room "Dr. Dragan Popović" of the Institute of Physics Belgrade. The talk entitled

"Computer-based statistical description of the Romanian language"

will be given by Dr. Alexandru Nicolin ("Horia Hulubei" National Institute for Physics and Nuclear Engineering, Bucharest, Romania).

Abstract of the talk:

Motivated by the advent of security solutions that rely on voice biometrics, we will revisit, by means of extensive computer-based investigations, the concept of phonetic balance for Romanian utterances and the distribution of Romanian words. We will show that the standard distribution of phonemes offers only a partial description of the phonetics of the language and that more detailed statistical indicators are needed. To this end, we will introduce a simple indicator that measures vowel-consonant (or consonant-vowel) sequences and analyze the distribution of consonant clusters for Romanian words. Our results will show that the distribution of consonant clusters is scale-free-like (akin to the distribution of words and phrases in large texts) and that large clusters of vowels or consonants are infrequent. This, in turn, indicates that utterances consisting of words which are statistically unrepresentative with respect to these indicators are good candidates for benchmarking the efficiency of voice biometrics solutions. For the distribution of Romanian words and word clusters, we will show the validity of Zipf's law using a Romanian text corpus of roughly 5 million words. Finally, we will argue that these statistical analyses of text corpora belong to the general field of Big Data, for which there are numerous funding opportunities within Horizon 2020.
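
As a rough illustration of the kind of statistics mentioned in the abstract, the following Python sketch counts the lengths of consonant clusters in words and builds a rank-frequency table of the sort used to check Zipf's law (word frequency roughly proportional to 1/rank^s, with s close to 1). It is not the speaker's method: the function names, the tokenizer, and the vowel set are assumptions made for this example, and the sketch works on written letters rather than on phonemes, so semivowels and digraphs are not treated separately.

```python
import re
from collections import Counter

# Assumption: orthographic Romanian vowels only; letters that act as
# semivowels in some words are still counted as vowels here.
VOWELS = set("aăâeiîou")


def tokenize(text):
    """Split text into lowercase words made of Unicode letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def consonant_cluster_lengths(word):
    """Return the lengths of maximal runs of consonant letters in a word."""
    lengths = []
    run = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def cluster_length_histogram(words):
    """Count how often each consonant-cluster length occurs across words."""
    hist = Counter()
    for w in words:
        hist.update(consonant_cluster_lengths(w))
    return hist


def rank_frequency(words):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(w.lower() for w in words)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]
```

On a large corpus, plotting frequency against rank on log-log axes gives an approximately straight line when Zipf's law holds, and the same kind of plot applied to the cluster-length histogram is one way to look for scale-free-like behaviour.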