Multi-Task Learning for Knowledge Graph Completion with Pre-trained Language Models
As research on utilizing human knowledge in natural language processing has attracted considerable attention in recent years, knowledge graph (KG) completion has come into the spotlight. Recently, a new knowledge graph completion method using a pre-trained language model, KG-BERT, was presented and showed high performance. However, its scores in ranking metrics such as Hits@k are still behind state-of-the-art models. We claim that there are two main reasons: 1) failure to sufficiently learn relational information in knowledge graphs, and 2) difficulty in picking out the correct answer from lexically similar candidates. In this paper, we propose an effective multi-task learning method to overcome the limitations of previous work. By combining relation prediction and relevance ranking tasks with our target link prediction, the proposed model can learn more relational properties in KGs and perform properly even when lexical similarity occurs. Experimental results show that we not only largely improve ranking performance compared to KG-BERT but also achieve state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.
1 Introduction
A Knowledge Graph (KG) is a graph-structured knowledge base in which real-world knowledge is represented as triples (h, r, t): (head entity, relation, tail entity), meaning that h and t have a relationship r. The entities and the relation in a triple are denoted as nodes and an edge of the graph, respectively. In recent years, Natural Language Processing (NLP) has benefited from utilizing KGs in various applications such as language modeling (Peters et al., 2019; Liu et al., 2019a), question answering (Zhang et al., 2019; Huang et al., 2019), and machine reading (Yang and Mitchell, 2017). Since there has been an increasing demand for high-quality knowledge, the reliability of KGs has also become important. Therefore, knowledge graph completion (a.k.a. link prediction), which identifies whether a triple in a KG is valid or not, has been actively investigated.
Several studies on knowledge graph completion have been conducted (Bordes et al., 2013; Trouillon et al., 2016; Sun et al., 2019; Dettmers et al., 2018). They presented methods to model the connectivity patterns between entities in a KG, and score functions to define the validity of a triple. However, these methods only consider the graph structure and relational information of the existing KG. Thus, they cannot predict well on triples that contain less frequent entities. Recently, to address this sparseness problem of previous models, Yao et al. (2019) proposed KG-BERT, a knowledge graph completion method that uses entity descriptions and pre-trained language models. Even though KG-BERT significantly improved mean ranks using prior linguistic information from BERT (Devlin et al., 2018), its results in other ranking metrics such as MRR and Hits@k are still behind the state-of-the-art models.
We claim that there are two major reasons for this problem. First, KG-BERT misses much of the relational information in KGs. While previous state-of-the-art methods aimed to model relational properties in graphs, KG-BERT only uses a binary cross-entropy loss to predict whether a triple is valid or invalid for the link prediction task. Second, KG-BERT has difficulty in picking out the answer entity among lexically similar candidates. For example, given the head entity and relation (take a breather, derivationally related form, ?) and the correct tail entity “breathing time”, KG-BERT gives the top scores to “snorkel breather” and “breath” because they share the lexical unit “breath”. This problem leads to lower performance in MRR and Hits@k.
In this paper, we propose an effective multi-task learning method to overcome these problems. We devise a multi-task framework by adding two tasks (relation prediction and relevance ranking) to link prediction, our target task. In relation prediction, the model is trained to predict the relationship between two given entities, which helps the model learn more relational properties. In relevance ranking, the model is trained with a margin ranking loss to create a gap between the valid triple and lexically similar candidates. We evaluate the proposed method on two popular datasets, WN18RR and FB15k-237, and experimental results show that our method improves ranking performance by a large margin compared to KG-BERT. Notably, our method achieves state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.
2 Proposed Method
In this section, we propose a multi-task learning method for knowledge graph completion. As shown in Figure 1, we follow the multi-task learning framework of MT-DNN (Liu et al., 2019b) and use the pre-trained BERT model as a shared layer. We combine three tasks: link prediction, relation prediction, and relevance ranking. Each task has its own classification layer $W \in \mathbb{R}^{K \times H}$, where K is the number of labels and H is the hidden size of BERT. Following Devlin et al. (2018), every input sequence starts with a [CLS] token, and the [SEP] token is used as a separator.
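To make the shared-encoder setup more concrete, the following is a minimal sketch of one way the three tasks could be interleaved during training. It assumes an MT-DNN-style schedule in which mini-batches from all tasks are pooled and shuffled, so that each step updates the shared encoder and a single task head. The function names `make_batches` and `multitask_schedule` and the toy data are hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple


def make_batches(examples: Sequence, batch_size: int) -> List[list]:
    """Split one task's examples into consecutive mini-batches."""
    return [list(examples[i:i + batch_size])
            for i in range(0, len(examples), batch_size)]


def multitask_schedule(task_data: Dict[str, Sequence],
                       batch_size: int,
                       seed: int = 0) -> List[Tuple[str, list]]:
    """Pool the mini-batches of every task and shuffle them (assumed schedule)."""
    rng = random.Random(seed)
    pooled = [(task, batch)
              for task, examples in task_data.items()
              for batch in make_batches(examples, batch_size)]
    rng.shuffle(pooled)
    return pooled


# Toy stand-ins for the LP, RP, and RR training examples.
toy = {"LP": list(range(5)), "RP": list(range(4)), "RR": list(range(3))}
schedule = multitask_schedule(toy, batch_size=2)
for task, batch in schedule:
    # A real loop would encode `batch` with the shared BERT layer and
    # apply only the classification layer (and loss) that belongs to `task`.
    print(task, batch)
```

Each step in such a schedule touches exactly one task-specific layer, while the shared parameters receive gradients from all three tasks over the course of an epoch.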
Link Prediction (LP): We define link prediction in the same way as KG-BERT (Yao et al., 2019), and this is our main target task. Given a training set S, the input x is a text sequence of (h, r, t). Each entity is represented by its name and description; e.g., for the triple (plant tissue, hypernym, plant structure), the input sequence is as follows:
[CLS] plant tissue, the tissue of a plant [SEP] hypernym [SEP] plant structure, any part of a plant or fungus [SEP]
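As an illustration, the input format above can be assembled from a triple and a table of entity descriptions. This is a minimal sketch that builds the sequence as a plain string; in practice a BERT tokenizer would normally insert the special tokens itself. The helper names `entity_text` and `lp_sequence` and the `descriptions` dictionary are hypothetical.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def entity_text(name: str, descriptions: Dict[str, str]) -> str:
    """Represent an entity as 'name, description'."""
    return f"{name}, {descriptions[name]}"


def lp_sequence(triple: Triple, descriptions: Dict[str, str]) -> str:
    """Build the link prediction input: head [SEP] relation [SEP] tail."""
    h, r, t = triple
    return (f"[CLS] {entity_text(h, descriptions)} [SEP] {r} "
            f"[SEP] {entity_text(t, descriptions)} [SEP]")


descriptions = {
    "plant tissue": "the tissue of a plant",
    "plant structure": "any part of a plant or fungus",
}
example = lp_sequence(("plant tissue", "hypernym", "plant structure"), descriptions)
print(example)
```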
Relation Prediction (RP): The model learns to classify the relation between two entities. The input consists of the head and tail entity sequences, e.g., “[CLS] plant tissue, the tissue of a plant [SEP] plant structure, any part of a plant or fungus [SEP]”, and the model is trained to predict the relation hypernym. The classification layer for relation prediction is $W_{RP} \in \mathbb{R}^{R \times H}$, where R is the number of relations, and we minimize a cross-entropy loss.
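The relation prediction head can be sketched as a linear layer over the [CLS] representation followed by a softmax cross-entropy. The example below is a toy illustration in plain Python with R = 3 relations and hidden size H = 2; the weights, the [CLS] vector, and the function names are made-up assumptions, and a real model would take the [CLS] vector from BERT.

```python
import math
from typing import List


def relation_logits(W_rp: List[List[float]], cls_vec: List[float]) -> List[float]:
    """One logit per relation: each row of W_RP dotted with the [CLS] vector."""
    return [sum(w * c for w, c in zip(row, cls_vec)) for row in W_rp]


def cross_entropy(logits: List[float], gold: int) -> float:
    """Negative log-probability of the gold relation under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[gold]


# Toy head: R = 3 relations, H = 2 hidden dimensions.
W_rp = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
cls_vec = [2.0, 0.5]
logits = relation_logits(W_rp, cls_vec)

loss_if_gold_is_0 = cross_entropy(logits, gold=0)  # highest logit, small loss
loss_if_gold_is_2 = cross_entropy(logits, gold=2)  # lowest logit, large loss
print(logits, loss_if_gold_is_0, loss_if_gold_is_2)
```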
Relevance Ranking (RR): The objective of relevance ranking is to make valid triples receive higher scores than invalid triples. We use a margin ranking loss to enforce a larger gap between valid and invalid triples. The input is the same as in link prediction, and the classification layer for relevance ranking is $W_{RR} \in \mathbb{R}^{1 \times H}$.
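A margin ranking loss of the usual hinge form can be sketched as follows. This is an assumption about the general shape of the objective rather than the exact formula used here: each valid triple is paired with one invalid triple, and the loss is zero once the valid score exceeds the invalid score by at least the margin. The function name and the scores are illustrative, and the default margin of 0.1 mirrors the value reported in the experimental settings.

```python
from typing import Sequence


def margin_ranking_loss(pos_scores: Sequence[float],
                        neg_scores: Sequence[float],
                        margin: float = 0.1) -> float:
    """Mean hinge loss max(0, margin - (s_pos - s_neg)) over paired scores."""
    if not pos_scores or len(pos_scores) != len(neg_scores):
        raise ValueError("expected two equally sized, non-empty score lists")
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Valid triple already well ahead of the invalid one: no penalty.
print(margin_ranking_loss([0.9], [0.2]))
# Lexically similar negative scored above the valid triple: penalized.
print(margin_ranking_loss([0.3], [0.6]))
```

Unlike a binary cross-entropy on individual triples, this objective only depends on score differences, which is what pushes a valid triple above lexically similar but invalid candidates.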
3 Experiments
Datasets We evaluated the proposed multi-task learning method on two benchmark datasets, WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015). Each dataset consists of a set of triples in the form (h, r, t). WN18RR is a subset of WordNet, a lexical database of English. Thus, entities in WN18RR are words or short phrases, and there are 11 relations between them, such as hypernym and similar to. FB15k-237 is a subset of Freebase (Bollacker et al., 2008), a large-scale graph database of general human knowledge. FB15k-237 has more general entities, such as Lincoln and Monaco, and its relations are longer and more complex than those of WN18RR. We used the same entity descriptions as Yao et al. (2019): synset definitions from WordNet for WN18RR and descriptions from Xie et al. (2016) for FB15k-237. Table 1 summarizes our datasets.
Baselines We mainly compare our method with KG-BERT (Yao et al., 2019), and also provide a comparison with several outstanding models: TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and RotatE (Sun et al., 2019).
Experimental Settings We used pre-trained BERT-base as the shared layer and fine-tuned it in the multi-task setup for 3 epochs. We used a mini-batch size of 32 and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5. For relevance ranking, we tuned the margin λ on the validation set; λ = 0.1 gave the best results.
Evaluation Settings We evaluate our method on link prediction, where the model predicts the head entity given (?, r, t) and the tail entity given (h, r, ?). To compare with prior work, we follow the evaluation protocol and filtered setting of Bordes et al. (2013). Let E be the entity set and T be the set of all triples in the train, valid, and test splits. Then, the set of test candidates U for predicting h in a given triple (h, r, t) is
$U = \{(h, r, t)\} \cup \{(h', r, t) \mid h' \in E \wedge (h', r, t) \notin T\}$
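The filtered ranking protocol and the reported metrics (MR, MRR, Hits@k) can be sketched as follows for head prediction. This is a minimal illustration under stated assumptions: candidates that form another known triple are skipped, ties are resolved optimistically (only strictly higher scores push the rank down), and the scoring function and toy data are hypothetical stand-ins for a trained model.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def filtered_head_rank(test: Triple,
                       entities: Iterable[str],
                       known: Set[Triple],
                       score: Callable[[Triple], float]) -> int:
    """Rank of the true head among candidates that are not other known triples."""
    _, r, t = test
    target = score(test)
    rank = 1
    for e in entities:
        cand = (e, r, t)
        if cand == test or cand in known:
            continue  # filtered setting: skip other valid triples
        if score(cand) > target:
            rank += 1
    return rank


def ranking_metrics(ranks: List[int], ks: Tuple[int, ...] = (1, 3, 10)) -> Dict[str, float]:
    """Mean Rank, Mean Reciprocal Rank, and Hits@k as fractions in [0, 1]."""
    n = len(ranks)
    out = {"MR": sum(ranks) / n, "MRR": sum(1.0 / r for r in ranks) / n}
    for k in ks:
        out[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return out


# Toy example: "b" scores higher than "a" but (b, r, t) is a known triple,
# so only "c" outranks the true head "a".
toy_scores = {"a": 0.5, "b": 0.9, "c": 0.7, "d": 0.1}
known = {("a", "r", "t"), ("b", "r", "t")}
rank = filtered_head_rank(("a", "r", "t"), toy_scores, known,
                          lambda triple: toy_scores[triple[0]])
metrics = ranking_metrics([1, 2, 4])
print(rank, metrics)
```

Tail prediction follows the same pattern with the candidate entity substituted for t instead of h.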
3.1 Main Results
Table 2 demonstrates how the proposed method improves performance over the baseline model on link prediction. The results show that multi-task learning with two tasks, (LP + RP) and (LP + RR), improves over the baseline by a large margin while maintaining low MR scores. When the model is trained on three tasks (LP + RP + RR), we gain significant improvements, especially in Hits@1 and Hits@3, with 10.8 and 14.0, respectively. Table 3 shows an example of results on WN18RR. We observe that our model ranks the correct answer “breathing time” first among lexically similar words, while KG-BERT places “snorkel breather” and “breath” in the top ranks. More examples are presented in Appendix A.
In the FB15k-237 benchmark, the task becomes more challenging because the number of relations increases to 237, whereas WN18RR contains only 11 relations. Thus, joint training with Relation Prediction (RP) was more effective on FB15k-237: the model outperformed the baseline by 7, 2.5, 2.5, 2.9, and 2 absolute points on MR, MRR, Hits@1, Hits@3, and Hits@10, respectively. When the Relevance Ranking (RR) task is added and the model is trained on all three tasks, it achieves further improvements in all metrics, with 13, 3, 2.8, 3.8, and 3.1 points, respectively.
A comparison with previous models is presented in Table 4. Our model achieved state-of-the-art performance in MR and Hits@10 on WN18RR. On the FB15k-237 dataset, the performance of our model is lower than that of several models in Hits@10. Since FB15k-237 has more relations and a more complex graph structure than WN18RR, we conjecture that pre-trained language models cannot capture the complex structural information in knowledge graphs. Despite that, we achieved the best MR score on FB15k-237.
4 Related Work
A common approach to knowledge graph completion is learning vector embeddings of the entities and relations in a KG (Bordes et al., 2013; Yang et al., 2014; Trouillon et al., 2016; Sun et al., 2019; Dettmers et al., 2018). The most widely used method is TransE (Bordes et al., 2013), which models relations as translations in a low-dimensional vector space. Dettmers et al. (2018) and Nguyen et al. (2018) proposed embedding models using a convolutional neural network. Recent research has shown that relations in a complex vector space can infer connectivity patterns: symmetry/antisymmetry, inversion, and composition (Sun et al., 2019). On the other hand, Yao et al. (2019) proposed KG-BERT, which uses pre-trained language models (PLMs) with entity descriptions. It can capture the contextualized meaning of entities and significantly improves mean ranks thanks to the rich linguistic information in the PLM.
Multi-task learning has gained popularity over the past decade across various natural language processing tasks (Collobert and Weston, 2008; Luong et al., 2015; Hashimoto et al., 2017; Liu et al., 2019b). It aims to regularize deep learning models against overfitting by sharing parameters across different tasks while training them jointly. With the advent of powerful PLMs such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), multi-task learning is applied by sharing the pre-trained parameters of these models while training different tasks simultaneously.
5 Conclusion and Future Work
We propose an effective multi-task learning method for knowledge graph completion by combining relation prediction and relevance ranking tasks with link prediction. Experimental results demonstrate that our method outperforms previous strong baselines, and we largely improve MRR and Hits@k compared to the previous KG-BERT model.
In the future, we plan to investigate how to combine pre-trained language models and graph embedding methods to fully utilize the prior linguistic information of pre-trained models and graph structural information.
This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).
In example 1, the entity breathing time appears only once in the training set. Thus, methods that use only graph structure information, such as TransE and RotatE, cannot predict well on the given triple. Our model provides the correct answer, while KG-BERT gives the top scores to snorkel breather and breath because of their lexical similarity through breath. In example 2, the entity piece of music has many relationships with other entities; thus, most models show low performance on that example. Lastly, example 3 shows how the pre-trained language model (PLM) improves Mean Rank significantly. KG-BERT and our model give a high score to the answer programme using prior linguistic information from the PLM, whereas TransE and RotatE perform extremely poorly on it.