Multi-Task Learning for Knowledge Graph Completion with Pre-trained Language Models
Abstract
As research on utilizing human knowledge in natural language processing has attracted considerable attention in recent years, knowledge graph (KG) completion has come into the spotlight. Recently, a new knowledge graph completion method using a pre-trained language model, such as KG-BERT, was presented and showed high performance. However, its scores in ranking metrics such as Hits@k are still behind state-of-the-art models. We claim that there are two main reasons: 1) failure to sufficiently learn relational information in knowledge graphs, and 2) difficulty in picking out the correct answer from lexically similar candidates. In this paper, we propose an effective multi-task learning method to overcome the limitations of previous works. By combining relation prediction and relevance ranking tasks with our target link prediction task, the proposed model can learn more relational properties in KGs and perform properly even when lexical similarity occurs. Experimental results show that we not only largely improve ranking performance compared to KG-BERT but also achieve state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.
1 Introduction
A Knowledge Graph (KG) is a graph-structured knowledge base, where real-world knowledge is represented in the form of a triple (h, r, t): (head entity, relation, tail entity), meaning that h and t have a relationship r. The entities and the relation in a triple are denoted as nodes and an edge of the graph, respectively. In recent years, Natural Language Processing (NLP) has benefited from utilizing KGs in various applications such as language modeling (Peters et al., 2019; Liu et al., 2019a), question answering (Zhang et al., 2019; Huang et al., 2019), and machine reading (Yang and Mitchell, 2017). Since there has been an increasing demand for high-quality knowledge, the reliability of KGs has also become important. Therefore, knowledge graph completion (a.k.a. link prediction), which identifies whether a triple in a KG is valid or not, has been actively investigated.
Several studies on knowledge graph completion have been conducted (Bordes et al., 2013; Trouillon et al., 2016; Sun et al., 2019; Dettmers et al., 2018). They presented methods to model the connectivity patterns between entities in a KG, and score functions to define the validity of a triple. However, these methods only consider the graph structure and relational information available in the existing KG. Thus, they cannot predict well on triples that contain infrequent entities. Recently, addressing this sparseness problem of previous models, Yao et al. (2019) proposed a method called KG-BERT for knowledge graph completion, using entity descriptions and pre-trained language models. Even though KG-BERT significantly improved mean ranks using the prior linguistic information in BERT (Devlin et al., 2018), its results in other ranking metrics such as MRR and Hits@k are still behind the state-of-the-art models.
We claim that there are two major reasons for this problem. First, KG-BERT misses a lot of the relational information in KGs. While previous state-of-the-art methods aimed to model relational properties in graphs, KG-BERT only uses a binary cross-entropy loss to predict valid or invalid triples for the link prediction task. Second, KG-BERT has difficulty picking out the answer entity among lexically similar candidates. For example, given the head entity and relation (take a breather, derivationally related form, ) and the correct tail entity “breathing time”, KG-BERT ranks “snorkel breather” and “breath” at the top because of their lexical similarity to “breath”. This problem leads to lower performance in MRR and Hits@k.
In this paper, we propose an effective multi-task learning method to overcome these problems. We devise a multi-task framework by adding two tasks, relation prediction and relevance ranking, to link prediction, our target task. In relation prediction, the model is trained to predict the relationship between two given entities, which helps the model learn more relational properties. In relevance ranking, the model is trained with a margin ranking loss to create a gap between the valid triple and lexically similar candidates. We evaluate the proposed method on two popular datasets, WN18RR and FB15k-237, and experimental results show that our method improves ranking performance by a large margin compared to KG-BERT. Notably, our method achieves state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.
2 Proposed Method
In this section, we propose a multi-task learning method for knowledge graph completion. As shown in Figure 1, we follow the multi-task learning framework of MT-DNN (Liu et al., 2019b) and use the pre-trained BERT model as a shared layer. We combine three tasks: link prediction, relation prediction, and relevance ranking. Each task has a classification layer W ∈ R^{K×H}, where K is the number of labels and H is the hidden size of BERT. Following Devlin et al. (2018), every input sequence has a [CLS] token at the head of the sequence, and the [SEP] token is used as a separator.
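As a rough illustration, the shared-encoder setup can be sketched in PyTorch with the Hugging Face transformers library as follows; the class and head names are our own illustrative choices under these assumptions, not the authors' released code.

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskKGModel(nn.Module):
    """Minimal sketch: one shared BERT encoder, one classification head per task."""
    def __init__(self, num_relations, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # shared layer
        hidden = self.bert.config.hidden_size              # H
        self.lp_head = nn.Linear(hidden, 2)                # link prediction: valid / invalid
        self.rp_head = nn.Linear(hidden, num_relations)    # relation prediction: R labels
        self.rr_head = nn.Linear(hidden, 1)                # relevance ranking: scalar score

    def forward(self, input_ids, attention_mask, task):
        # [CLS] representation from the shared encoder
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output
        if task == "LP":
            return self.lp_head(cls)
        if task == "RP":
            return self.rp_head(cls)
        return self.rr_head(cls).squeeze(-1)                # "RR"
```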
Link Prediction (LP): We define link prediction in the same way as KG-BERT (Yao et al., 2019), and this is our main target task. Given a training set S, the input x is the text sequence of (h, r, t). Each entity is represented by its entity name and description; e.g., for the triple (plant tissue, hypernym, plant structure), the input sequence is as follows:
[CLS] plant tissue, the tissue of a plant [SEP] hypernym [SEP] plant structure, any part of a plant or fungus [SEP]
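A hedged sketch of how such an input can be assembled for a BERT tokenizer is shown below; the helper name and exact formatting are illustrative assumptions, not the exact preprocessing of KG-BERT.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def triple_to_text(head, head_desc, relation, tail, tail_desc):
    """Render a triple as a '[CLS] ... [SEP] ... [SEP] ... [SEP]' sequence (illustrative)."""
    return (f"[CLS] {head}, {head_desc} [SEP] "
            f"{relation} [SEP] "
            f"{tail}, {tail_desc} [SEP]")

text = triple_to_text("plant tissue", "the tissue of a plant",
                      "hypernym",
                      "plant structure", "any part of a plant or fungus")
# Special tokens are already written out above, so we do not add them again.
enc = tokenizer(text, add_special_tokens=False, return_tensors="pt")
```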
Relation Prediction (RP): The model learns to classify the relation between two entities. The input is the head and tail entity sequences, e.g., “[CLS] plant tissue, the tissue of a plant [SEP] plant structure, any part of a plant or fungus [SEP]”, and the model is trained to predict the relation hypernym. The classification layer for relation prediction is W_RP ∈ R^{R×H}, where R is the number of relations, and we minimize a cross-entropy loss.
Relevance Ranking (RR): The objective of relevance ranking is to keep the scores of valid triples higher than those of invalid triples. We use a margin ranking loss to enforce a larger gap between valid and invalid triples. The input is the same as in link prediction, and the classification layer for relevance ranking is W_RR ∈ R^{1×H}.
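Written out, the relevance ranking objective is a margin ranking loss of the standard form (our reconstruction from the surrounding description, with x a valid triple and x' an invalid, lexically similar candidate):

$\mathcal{L}_{RR} = \max\left(0,\ \lambda - h(x) + h(x')\right)$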
where h(x) is the output of the model and λ is the margin. At training time, we use mini-batch stochastic gradient descent. We first compose mini-batches for each task, D_LP, D_RP, and D_RR, and then combine all the data as D = D_LP ∪ D_RP ∪ D_RR. At each training step, a mini-batch is randomly selected from D, and the task corresponding to that batch is trained.
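The mixed-batch procedure can be sketched as follows; the loader and field names are illustrative assumptions built on the model sketch above, not the authors' implementation.

```python
import random
import torch

def train_multitask(model, loaders, optimizer, margin=0.1, epochs=3, device="cuda"):
    """Sketch of multi-task training over D = D_LP ∪ D_RP ∪ D_RR (illustrative)."""
    ce = torch.nn.CrossEntropyLoss()
    rank = torch.nn.MarginRankingLoss(margin=margin)       # margin λ
    model.to(device)
    for _ in range(epochs):
        # Collect mini-batches from every task and shuffle them into one stream.
        batches = [(task, b) for task, loader in loaders.items() for b in loader]
        random.shuffle(batches)
        for task, batch in batches:                         # one task per step
            optimizer.zero_grad()
            if task == "RR":
                pos = model(batch["pos_ids"].to(device), batch["pos_mask"].to(device), "RR")
                neg = model(batch["neg_ids"].to(device), batch["neg_mask"].to(device), "RR")
                target = torch.ones_like(pos)               # valid triple should score higher
                loss = rank(pos, neg, target)
            else:                                           # "LP" or "RP" use cross-entropy
                logits = model(batch["input_ids"].to(device),
                               batch["attention_mask"].to(device), task)
                loss = ce(logits, batch["labels"].to(device))
            loss.backward()
            optimizer.step()
```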
3 Experiments
Datasets: We evaluated the proposed multi-task learning method on two benchmark datasets, WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015). Each dataset consists of a set of triples in the form (h, r, t). WN18RR is a subset of WordNet, a lexical database of English. Thus, entities in WN18RR are words or short phrases, and there are 11 relations between two words, such as hypernym and similar to. FB15k-237 is a subset of Freebase (Bollacker et al., 2008), a large-scale graph database of general human knowledge. FB15k-237 has more general entities, such as Lincoln and Monaco, and its relations are longer and more complex than those of WN18RR. We used the same entity descriptions as Yao et al. (2019): synset definitions from WordNet for WN18RR and descriptions from Xie et al. (2016) for FB15k-237. Table 1 summarizes our datasets.
Baselines: We mainly compare our method with KG-BERT (Yao et al., 2019), and also provide a comparison with several strong models: TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and RotatE (Sun et al., 2019).
Experimental Settings: We used the pre-trained BERT-base model as the shared layer and fine-tuned it over the multi-task setup for 3 epochs. We used a mini-batch size of 32 and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5. For relevance ranking, we tuned the margin λ on the validation set, and λ = 0.1 gave the best results.
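For reference, these reported settings can be collected in one place; the dictionary below only summarizes the values stated above, and the keys (and the uncased checkpoint name) are illustrative assumptions.

```python
config = {
    "pretrained_model": "bert-base-uncased",  # BERT-base as the shared layer (variant assumed)
    "epochs": 3,
    "batch_size": 32,
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "margin_lambda": 0.1,                     # selected on the validation set
}
```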
Evaluation Settings: We evaluate our method on link prediction, where the model predicts the head entity given ( , r, t) and the tail entity given (h, r, ). To compare with prior work, we follow the evaluation protocol and filtered setting of Bordes et al. (2013). Let E be the entity set and T the set of all triples in the train, validation, and test sets. Then, the set of test candidates U for predicting h in a given triple (h, r, t) is
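Under the filtered setting, this candidate set takes the standard form below (our reconstruction from the surrounding definitions): every entity that does not already form a known triple with (r, t), plus the correct head h itself.

$U = \{\, e \in E \mid (e, r, t) \notin T \,\} \cup \{h\}$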
3.1 Main Results
Table 2 demonstrates how the proposed method improves performance over the baseline model on link prediction. The results show that multi-task learning with two tasks, (LP + RP) and (LP + RR), improves over the baseline by a large margin while maintaining low MR scores. When the model is trained on three tasks (LP + RP + RR), we obtain significant improvements, especially in Hits@1 and Hits@3, with gains of 10.8 and 14.0, respectively. Table 3 shows an example of the results on WN18RR. We observe that our model ranks the correct answer “breathing time” first among lexically similar words, while KG-BERT ranks “snorkel breather” and “breath” at the top. More examples are presented in Appendix A.
On the FB15k-237 benchmark, the task becomes more challenging as the number of relations increases to 237, whereas WN18RR contains only 11 relations. Thus, joint training with Relation Prediction (RP) was more effective on FB15k-237, where the model outperformed the baseline by 7, 2.5, 2.5, 2.9, and 2 absolute points on MR, MRR, Hits@1, Hits@3, and Hits@10, respectively. When the Relevance Ranking (RR) task is added and the model is trained with the three different tasks, it achieves further improvements in all metrics, with gains of 13, 3, 2.8, 3.8, and 3.1 points, respectively.
A comparison with previous models is presented in Table 4. Our model achieved state-of-the-art performance in MR and Hits@10 on WN18RR. On the FB15k-237 dataset, the performance of our model is lower than that of several models in Hits@10. Since FB15k-237 has more relations and a more complex graph structure than WN18RR, we conjecture that pre-trained language models cannot fully capture the complex structural information in knowledge graphs. Despite this, we achieved the best MR score on FB15k-237.
4 Related Work
A common approach to knowledge graph completion is to learn vector embeddings of the entities and relationships in a KG (Bordes et al., 2013; Yang et al., 2014; Trouillon et al., 2016; Sun et al., 2019; Dettmers et al., 2018). The most widely used method is TransE (Bordes et al., 2013), which models relationships as translations in a low-dimensional vector space. Dettmers et al. (2018) and Nguyen et al. (2018) proposed embedding models using convolutional neural networks. Recent research has shown that relations in complex vector space can infer connectivity patterns such as symmetry/antisymmetry, inversion, and composition (Sun et al., 2019). In contrast, Yao et al. (2019) proposed KG-BERT, which uses pre-trained language models (PLMs) with entity descriptions. It can capture the contextualized meaning of entities and significantly improve mean ranks with the rich linguistic information in PLMs.
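For reference, TransE's well-known score function treats the relation embedding r as a translation from the head embedding h to the tail embedding t, so a triple is considered more plausible the smaller the translation residual:

$f_r(h, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$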
Multi-task learning has gained popularity over the past decade in natural language processing across various tasks (Collobert and Weston, 2008; Luong et al., 2015; Hashimoto et al., 2017; Liu et al., 2019b). It aims to regularize deep learning models against overfitting by sharing parameters across different tasks while jointly training them. With the advent of powerful PLMs such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), multi-task learning is applied by sharing the pre-trained parameters of these models when training different tasks simultaneously.
5 Conclusion and Future Work
We propose an effective multi-task learning method for knowledge graph completion that combines relation prediction and relevance ranking tasks with link prediction. Experimental results demonstrate that our method outperforms previous strong baselines and largely improves MRR and Hits@k compared to the previous KG-BERT model.
In the future, we plan to investigate how to combine pre-trained language models and graph embedding methods to fully utilize the prior linguistic information of pre-trained models and graph structural information.
Acknowledgements
This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques).
Appendix A
In example 1, the entity breathing time appears only once in the training set. Thus, methods using only graph structure information, such as TransE and RotatE, cannot predict well on the given triple. Our model provides the correct answer, while KG-BERT ranks snorkel breather and breath at the top due to their lexical similarity to breath. In example 2, the entity piece of music has many relationships with other entities; thus, most models show low performance on that example. Lastly, example 3 shows how the pre-trained language model (PLM) improves Mean Rank significantly. KG-BERT and our model give a high score to the answer programme using the prior linguistic information from the PLM, whereas the results of TransE and RotatE are extremely low.