Parameterizing Human Speech Generation


dc.contributor.author Perepichka, Nazariy
dc.date.accessioned 2020-01-29T11:28:20Z
dc.date.available 2020-01-29T11:28:20Z
dc.date.issued 2020
dc.identifier.citation Perepichka, Nazariy. Parameterizing Human Speech Generation : Master Thesis : manuscript rights / Nazariy Perepichka ; Supervisor Diego Saez-Trumper ; Ukrainian Catholic University, Department of Computer Sciences. – Lviv : [s.n.], 2020. – 35 p. : ill. uk
dc.identifier.uri http://er.ucu.edu.ua/handle/1/1908
dc.language.iso en uk
dc.subject Text-to-speech uk
dc.subject Speech synthesis uk
dc.subject Voice parameterizing en
dc.subject Emotion recognition
dc.subject Sentiment classification
dc.title Parameterizing Human Speech Generation uk
dc.type Preprint uk
dc.status Published for the first time uk
dc.description.abstract Nowadays, the synthesis of human images and videos is arguably one of the most popular topics in the data science community. The synthesis of human speech is less trendy but closely related to it. Since the publication of the WaveNet paper by Google researchers in 2016, the state of the art has shifted from parametric and concatenative systems to deep learning models. Most work in the area focuses on improving the intelligibility and naturalness of the speech. However, almost every significant study also mentions ways to generate speech with the voices of different speakers. Usually, such an enhancement requires re-training the model to generate audio in the voice of a speaker who was not present in the training set. Additionally, studies focused on highly modular speech generation are rare. Therefore, there is room left for research on ways to add new parameters for other aspects of speech, such as sentiment, prosody, and melody. In this work, we aimed to implement a competitive text-to-speech solution that can specify the speaker without model re-training and to explore possibilities for adding emotions to the generated speech. Our approach generates good-quality speech with a mean opinion score of 3.78 (out of 5) and can mimic a speaker's voice in real time, a significant improvement over the baseline, which obtains only 2.08. On top of that, we researched sentiment representation possibilities and built an emotion classifier that performs on the level of current state-of-the-art solutions, with an accuracy of more than eighty percent. en

