Semi-automatic ontological alignment of digitized books parallel corpora


In this paper, we present a method for integrating general ontology management with the alignment of a paraphrase corpus of digitized books, compiled from a bilingual parallel corpus. We show that the method can improve ontology development and consistency checking when semantic parsing and machine translation are added to the general knowledge management process. Additionally, we argue that focusing on one's favorite books adds an element of gamification to the knowledge management process. A new formalism for semantic-parsing ontological alignments is introduced, and its use for ontology development and consistency checking is discussed. We show that existing general ontologies require many more axioms than are currently available in order to explain the unaligned content of books. A proactive learning approach is suggested as part of the solution for improving the development of ontology predicates and axioms. The WordNet, FrameNet, and SUMO ontologies are used as the starting knowledge base for the semantic alignment method applied to the paraphrase corpus.

Article in English.

Semi-automatic alignment of a parallel corpus of digitized books using ontologies


This article presents a method for managing a general ontology using paraphrase corpora obtained from fiction books. The method can improve further ontology extension and logical consistency checking. Its functionality rests on two key technologies: semantic text analysis and machine translation. An important aspect of the method is the use of game elements in managing general ontologies, which follows from the ontology management process being closely tied to works of fiction. The article presents a new ontology alignment formalism. The results showed that existing general ontologies must be supplemented with considerably more axioms than they currently contain in order to explain the semantic equivalence of unaligned paraphrases. In addition, the article proposes a proactive learning method that improves the ontology development process. The WordNet, FrameNet, and SUMO ontologies are used as initial knowledge bases to improve the semantic alignment method.

Keywords (translated from Lithuanian): text alignment, ontology development and use, automatic machine translation, natural language processing algorithms.

Keywords: ontological alignment of corpora, alignment of digitized books, machine translation, natural language processing

How to Cite
Laukaitis, A., & Laukaitytė, N. (2021). Semi-automatic ontological alignment of digitized books parallel corpora. Mokslas – Lietuvos Ateitis / Science – Future of Lithuania, 13.
Published in Issue
Jul 2, 2021

This work is licensed under a Creative Commons Attribution 4.0 International License.

