Parameterizing Human Speech Generation

dc.contributor.author Perepichka, Nazariy
dc.date.accessioned 2020-01-29T11:28:20Z
dc.date.available 2020-01-29T11:28:20Z
dc.date.issued 2020
dc.identifier.citation Perepichka, Nazariy. Parameterizing Human Speech Generation : Master Thesis : manuscript rights / Nazariy Perepichka ; Supervisor Diego Saez-Trumper ; Ukrainian Catholic University, Department of Computer Sciences. – Lviv : [s.n.], 2020. – 35 p. : ill. uk
dc.identifier.uri http://er.ucu.edu.ua/handle/1/1908
dc.language.iso en uk
dc.subject Text-to-speech uk
dc.subject Speech synthesis uk
dc.subject Voice parameterizing en
dc.subject Emotion recognition
dc.subject Sentiment classification
dc.title Parameterizing Human Speech Generation uk
dc.type Preprint uk
dc.status Published for the first time uk
dc.description.abstract Nowadays, the synthesis of human images and videos is arguably one of the most popular topics in the Data Science community. The synthesis of human speech is less trendy but closely tied to that topic. Since the publication of the WaveNet paper by Google researchers in 2016, the state of the art has shifted from parametric and concatenative systems to deep learning models. Most work in the area focuses on improving the intelligibility and naturalness of the generated speech. However, almost every significant study also mentions ways to generate speech with the voices of different speakers. Usually, such an enhancement requires re-training the model to generate audio with the voice of a speaker that was not present in the training set. Additionally, studies focused on highly modular speech generation are rare. Therefore, there is room left for research on ways to add new parameters for other aspects of speech, such as sentiment, prosody, and melody. In this work, we aimed to implement a competitive text-to-speech solution with the ability to specify the speaker without model re-training and to explore possibilities for adding emotions to the generated speech. Our approach generates good-quality speech with a mean opinion score of 3.78 (out of 5) and the ability to mimic a speaker's voice in real time, a significant improvement over the baseline, which obtains only 2.08. On top of that, we researched sentiment representation possibilities and built an emotion classifier that performs on the level of current state-of-the-art solutions, with an accuracy of more than eighty percent. uk

