Matching Red Links with Wikidata Items

dc.contributor.author Liubonko, Kateryna
dc.date.accessioned 2020-02-25T15:09:18Z
dc.date.available 2020-02-25T15:09:18Z
dc.date.issued 2020
dc.identifier.citation Liubonko, Kateryna. Matching Red Links with Wikidata Items : Master Thesis : manuscript rights / Kateryna Liubonko ; Supervisor Diego Sáez-Trumper ; Ukrainian Catholic University, Department of Computer Sciences. – Lviv : [s.n.], 2020. – 44 p. : ill. uk
dc.identifier.uri http://er.ucu.edu.ua/handle/1/2051
dc.language.iso en uk
dc.subject Word embeddings uk
dc.subject Graph embeddings uk
dc.subject Cross-lingual embedding similarity model uk
dc.title Matching Red Links with Wikidata Items uk
dc.type Preprint uk
dc.status Published for the first time uk
dc.description.abstracten This work tackles the problem of matching Wikipedia red links with existing articles. Links in Wikipedia pages are considered red when they lead to nonexistent articles. Articles that correspond to such red links may exist in other Wikipedia editions. In our work, we propose a way to match red links in one Wikipedia edition to existing pages in another edition. We solve this task in the context of Ukrainian red links and existing English pages. We created a dataset of the 3,171 most frequent Ukrainian red links and a dataset of 2,957,927 pairs of red links and the most probable candidates for the corresponding pages in English Wikipedia. This dataset is publicly released. We define the task as a Named Entity Linking problem: red links are named entities, and we link Ukrainian red links to English Wikipedia pages. In this work we provide a thorough analysis of the data and define the conceptual characteristics that can be exploited in entity resolution. These characteristics are graph properties (connections with the pages where red links occur and connections with the pages that occur on the same pages as red links) and word properties (title names). The BabelNet knowledge base was applied to this task. We evaluated its performance in terms of F1 score (29%) and regarded it as a baseline for our approach. To improve the results, we introduced several similarity metrics based on the red link characteristics mentioned above. Combined in a linear model, they resulted in an F1 score of 85%, which is our best result. In our thesis we also discuss bottlenecks and limitations of the current approach and outline ideas for future improvements. To the best of our knowledge, we are the first to state the problem and propose a solution for red links in the Ukrainian Wikipedia edition. All the code for this project is publicly released on GitHub. uk
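
The idea of combining a word-based and a graph-based similarity in a linear model can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the weights, the placeholder identifiers (Q1, Q2, ...) and the choice of string ratio and Jaccard overlap as similarity metrics are all assumptions made for illustration, and the red link title is assumed to be already transliterated into Latin script.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Word property: case-insensitive string similarity of two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Graph property: overlap of two neighbour sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def linear_score(features, weights, bias=0.0):
    """Weighted sum of similarity features (weights are hypothetical here)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def best_candidate(red_link, candidates, weights):
    """Return (score, title) of the highest-scoring candidate page."""
    scored = []
    for cand in candidates:
        features = (
            title_similarity(red_link["title"], cand["title"]),
            jaccard(red_link["neighbours"], cand["neighbours"]),
        )
        scored.append((linear_score(features, weights), cand["title"]))
    return max(scored)


# Hypothetical data: neighbours are placeholder identifiers of pages that
# link to, or co-occur with, the red link or the candidate page.
red_link = {"title": "lviv opera", "neighbours": {"Q1", "Q2", "Q3"}}
candidates = [
    {"title": "Lviv Opera", "neighbours": {"Q1", "Q2"}},
    {"title": "Vienna Opera", "neighbours": {"Q9"}},
]
score, title = best_candidate(red_link, candidates, weights=(0.5, 0.5))
print(title, round(score, 3))
```

In practice the weights would be fitted on labelled pairs of red links and pages rather than set by hand, and the predicted matches would be compared with the gold matches to obtain an F1 score.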

