Continual Entity Alignment for Growing Knowledge Graphs

Audrey Wang
8 min read · Jul 23, 2022


Entity Alignment is a crucial technique for knowledge fusion when we want to integrate information from multiple sources. It aims to identify entities in different knowledge graphs that refer to the same real-world object (i.e., finding the alignment).

Entity alignment across two knowledge graphs

Over the years, research on entity alignment has focused on the setting of ‘static’ knowledge graphs, where the sets of entities, relations, and triples do not grow over time. But this setting neglects many real-world difficulties such as alignment incompleteness, KG growth, and alignment growth. In our work at ISWC 2022, we consider the way knowledge graphs grow and how conventional entity alignment methods can be adapted to that growth.

A New Scenario and Task

Growing Knowledge Graphs

Many real-world knowledge graphs are constantly growing: new data is added to the graph as new entities and relations emerge. For example, new mobile games go online on Google Play and the App Store every day. If we model their game atlases as graphs, those graphs keep expanding.

Number of games released every week on Google Play (left) and the App Store (right). [Source: Internal Data]

Here, we propose a new knowledge graph scenario called Growing Knowledge Graphs.

A growing knowledge graph G is a sequence of consecutive snapshots {G⁰, G¹, …, Gᵗ}, where each snapshot is a static graph and Gᵗ⁻¹ ⊆ Gᵗ.

It’s worth mentioning that in our work, we only consider the emergence of new entities and ignore the emergence of new relation types, because the relation schema of a knowledge graph rarely changes.
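To make the setting concrete, here is a minimal sketch of representing a growing KG as snapshots of triples. The names (`Snapshot`, `is_growth_of`) are our own illustration, not from the paper.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]  # (head entity, relation, tail entity)

@dataclass(frozen=True)
class Snapshot:
    """One static snapshot Gᵗ of a growing knowledge graph."""
    triples: frozenset[Triple]

    @property
    def entities(self) -> set[str]:
        return {h for h, _, _ in self.triples} | {t for _, _, t in self.triples}

    @property
    def relations(self) -> set[str]:
        return {r for _, r, _ in self.triples}

def is_growth_of(new: Snapshot, old: Snapshot) -> bool:
    """Check Gᵗ⁻¹ ⊆ Gᵗ and, as assumed in this work, that no new relation types appear."""
    return old.triples <= new.triples and new.relations <= old.relations
```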

Continual Entity Alignment

Then, how do we conduct entity alignment for growing knowledge graphs? Under this new scenario, alignment is no longer a one-time task that needs no future update once done. As new entities emerge, the topological structure of the graphs changes and the inherent information is enriched. Previously predicted alignment results are disturbed and need to be revisited or even corrected. Besides, the arrival of new entities brings new potential alignments to be discovered. All these challenges call for a more suitable entity alignment task, which we call Continual Entity Alignment. A brief definition of continual entity alignment is below.

Given two growing knowledge graphs and a set of pre-aligned entity pairs as anchors, continual entity alignment is the task of discovering the evolving potential alignments over time.

You may note that in this definition, no new pre-known alignments emerge over time. We exclude them on purpose, considering the cost of manual labeling for producing pre-aligned pairs. Human labeling is usually costly, especially for giant and rapidly changing graphs. Undoubtedly, though, with new pre-aligned pairs as new anchors, entity alignment performance would improve to some extent.

To tackle this new task, we propose a continual entity alignment framework named ContEA. The following part shows its details.

ContEA: A Framework For Continual Entity Alignment

A general and intuitive pipeline for continual learning is to train a base model on initial data and later fine-tune it on new data. This pattern can be seen in many areas, such as transfer learning and the use of pre-trained language models (PLMs). Based on this pipeline, we design ContEA as a flow of modules that first trains an alignment model from scratch and then fine-tunes the learned model to adapt to new data and graph structures.

Overview of ContEA: A Framework for Continual Entity Alignment

ContEA consists of two modules:

  1. Subgraph-based Entity Alignment Module. This module is performed at the initial time t=0. A GNN encoder is trained to encode the entities and relations of the graphs into low-dimensional vectors. We then retrieve trustworthy alignments from the vector space using a bidirectional nearest neighbor search and take them as prediction results.
  2. Embedding and Alignment Update Module. This module is performed at later times t>0. It inherits the model learned at the previous time step and fine-tunes it to adapt to new data and graph structures. This changes the encoded vector space and yields new alignment predictions. We integrate the newly predicted alignments with older predictions using a heuristic method rather than blindly replacing the old ones.

Subgraph-based Entity Alignment Module

We use the static entity alignment model Dual-AMN as the original encoder. Dual-AMN has two advantages for our purposes: 1. it is simple and efficient; 2. it has two network layers that capture different aspects of information. Roughly speaking, Dual-AMN consists of an inner-graph network (Aggregator₁) that captures structural information within a single graph and a cross-graph network (Aggregator₂) that captures alignment information across the two graphs. The alignment learning objective Lₐ of Dual-AMN is a LogSumExp loss that pulls similar entity pairs close together and pushes dissimilar pairs apart.

To obtain better alignments, we design a new learning objective called the reconstruction loss, Lᵣ. A typical assumption in embedding-based entity alignment is that two entities are similar if their neighborhood subgraphs are similar. Motivated by this, the reconstruction loss minimizes the distance between an entity's embedding and the embedding of its neighborhood subgraph.

The overall learning objective of our encoder is:

L = Lₐ + α·Lᵣ, where α is a weight that measures the relative importance of Lᵣ.
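As a rough sketch (not the authors' actual implementation), the reconstruction loss can be computed by pulling each entity's embedding toward the embedding of its one-hop neighborhood, here approximated by the neighbors' mean, and the overall objective adds it to the alignment loss. The function names, the use of MSE as the distance, and the adjacency-matrix input are our own assumptions.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(entity_emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """Sketch of Lᵣ: distance between each entity and its neighborhood-subgraph
    embedding, approximated as the mean of its one-hop neighbors.
    entity_emb: (N, d) entity embeddings; adj: (N, N) 0/1 adjacency matrix."""
    degree = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid division by zero
    neighbor_mean = (adj @ entity_emb) / degree          # (N, d) neighborhood embedding
    return F.mse_loss(entity_emb, neighbor_mean)         # pull entity toward its context

def total_loss(alignment_loss: torch.Tensor,
               entity_emb: torch.Tensor,
               adj: torch.Tensor,
               alpha: float = 0.1) -> torch.Tensor:
    """L = Lₐ + α·Lᵣ, with α weighting the reconstruction term."""
    return alignment_loss + alpha * reconstruction_loss(entity_emb, adj)
```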

After the entities and relations are encoded, we retrieve potential alignments from the vector space with a bidirectional nearest neighbor search strategy. Specifically, for each entity in one graph, we look for its nearest neighbor (measured by CSLS similarity) among candidate entities in the other graph. Unlike existing works that treat only the test set as the candidate pool, we account for the existence of unknown dangling entities (Zequn Sun et al.) and take all entities in both graphs as candidates. If two entities from different graphs are mutually nearest neighbors, they constitute a predicted pair. We compare the predicted alignments with the gold test set at time t=0 and report precision (P), recall (R), and F1 as evaluation metrics.
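Below is a minimal sketch of the bidirectional (mutual) nearest neighbor retrieval. For brevity it uses plain cosine similarity as a stand-in for CSLS, and all variable names are our own.

```python
import numpy as np

def mutual_nearest_neighbors(emb1: np.ndarray, emb2: np.ndarray) -> list[tuple[int, int]]:
    """Return index pairs (i, j) such that entity i in KG1 and entity j in KG2
    are each other's nearest neighbor. emb1: (N1, d), emb2: (N2, d).
    Cosine similarity is used here for simplicity; ContEA uses CSLS."""
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = emb1 @ emb2.T                      # (N1, N2) similarity matrix
    nn_1to2 = sim.argmax(axis=1)             # best KG2 match for each KG1 entity
    nn_2to1 = sim.argmax(axis=0)             # best KG1 match for each KG2 entity
    # Keep only mutually nearest pairs as trustworthy predictions.
    return [(i, int(j)) for i, j in enumerate(nn_1to2) if nn_2to1[j] == i]
```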

Process of Subgraph-based Entity Alignment Module

Embedding and Alignment Update Module

When new data arrives at time t>0, the embedding and alignment update module takes effect. The first challenge is how to integrate new entities. Randomly initializing the embeddings of new entities could harm the previously optimized embedding space and cause representation inconsistency. Thanks to the design of the reconstruction loss, which encourages entities to stay close to their one-hop context, we can instead initialize the embeddings of new entities (still learnable during fine-tuning) as the average of their neighbors' embeddings.
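A small sketch of this initialization step, with hypothetical variable names, might look like this:

```python
import torch

def init_new_entity_embeddings(old_emb: torch.Tensor,
                               new_entity_neighbors: dict[int, list[int]]) -> torch.nn.Parameter:
    """Initialize each new entity as the mean of its already-embedded neighbors.
    old_emb: (N_old, d) existing entity embeddings; new_entity_neighbors maps a
    new entity id to the ids of its neighbors that already have embeddings."""
    dim = old_emb.shape[1]
    new_rows = []
    for _, neighbors in sorted(new_entity_neighbors.items()):
        known = [n for n in neighbors if n < old_emb.shape[0]]
        if known:
            new_rows.append(old_emb[known].mean(dim=0))
        else:
            new_rows.append(torch.randn(dim) * 0.01)  # rare case: no known neighbor
    # Concatenate old and new rows; everything stays learnable during fine-tuning.
    return torch.nn.Parameter(torch.cat([old_emb.detach(), torch.stack(new_rows)], dim=0))
```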

This module inherits the previously learned model parameters and the embeddings of existing entities and relations. After integrating the new entities, we fine-tune the encoder with two sets of training data.

  1. Selected trustworthy alignments (STA). We choose the top-m trustworthy alignments with the highest similarity scores and add them to the training data as a form of data augmentation. We do this because no new pre-known alignments are available for training, and the new data expands the graphs into many regions that the pre-known anchors from the initial time cannot cover.
  2. Affected pre-known alignments (APA). Prior research has concluded that potential entity alignments are more likely to occur near anchors. Thus we replay the pre-known alignments from t=0 whose anchors are involved in new triples. This also helps alleviate catastrophic forgetting. A small sketch of assembling both training sets follows this list.
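Here is a rough, illustrative sketch of how the two training sets could be assembled. The data structures, the shared entity-id space, and the function name are our assumptions, not the paper's exact code.

```python
def select_training_pairs(predicted: list[tuple[int, int, float]],
                          anchors: list[tuple[int, int]],
                          new_triples: list[tuple[int, str, int]],
                          m: int) -> tuple[list, list]:
    """STA: top-m predicted pairs (e1, e2, similarity) with the highest scores.
    APA: pre-known anchor pairs whose entities appear in newly added triples."""
    sta = [(e1, e2) for e1, e2, _ in
           sorted(predicted, key=lambda p: p[2], reverse=True)[:m]]
    affected_entities = {h for h, _, _ in new_triples} | {t for _, _, t in new_triples}
    apa = [(e1, e2) for e1, e2 in anchors
           if e1 in affected_entities or e2 in affected_entities]
    return sta, apa
```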

When fine-tuning the encoder, we freeze the inner-graph network Aggregator₁ and keep the cross-graph network Aggregator₂ learnable. For a single KG, the arrival of new data does not change the neighbor aggregation pattern, since the KG's schema stays consistent (no new relations or entity domains). But the two KGs grow independently and asymmetrically in the proposed scenario, so it is necessary to fine-tune the matching network to make adjustments and new discoveries.
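Assuming a PyTorch-style encoder with two sub-modules (the attribute names `aggregator1` and `aggregator2` are our own, and this is purely illustrative), freezing Aggregator₁ while fine-tuning Aggregator₂ could look like this:

```python
import torch

def prepare_for_finetuning(encoder: torch.nn.Module) -> list[torch.nn.Parameter]:
    """Freeze the inner-graph aggregator, keep the cross-graph aggregator trainable,
    and return the parameters to pass to the fine-tuning optimizer."""
    for p in encoder.aggregator1.parameters():   # inner-graph network: frozen
        p.requires_grad = False
    for p in encoder.aggregator2.parameters():   # cross-graph network: trainable
        p.requires_grad = True
    return [p for p in encoder.parameters() if p.requires_grad]

# Hypothetical usage: optimizer = torch.optim.Adam(prepare_for_finetuning(encoder), lr=1e-4)
```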

Finally, the learning objective of the embedding and alignment update module is:

L = Lₐ(APA) + α·Lᵣ + β·Lₐ(STA), where α and β are both weights.

Process of Embedding and Alignment Update Module

After the new potential alignments are predicted, we take a heuristic strategy to update the total set of predicted alignments. We keep any new trustworthy alignment between two new entities. For new alignments that conflict with previous trustworthy ones (i.e., an entity is aligned with different entities), we keep the alignment with the higher similarity score. As the KGs grow, the set of trustworthy entity alignments accumulates.
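A simple sketch of this update heuristic, with names of our own choosing, could be as follows; for brevity it only resolves conflicts on the KG1 side.

```python
def update_alignments(kept: dict[int, tuple[int, float]],
                      new_predictions: list[tuple[int, int, float]]) -> dict[int, tuple[int, float]]:
    """kept maps a KG1 entity to its (KG2 entity, similarity). New predictions
    without conflicts are added; conflicting ones keep the higher-similarity pair."""
    for e1, e2, sim in new_predictions:
        if e1 not in kept or sim > kept[e1][1]:
            kept[e1] = (e2, sim)   # add the new pair or overwrite a weaker old pair
    return kept
```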

Performances of ContEA

We construct three new datasets based on DBP15K to support this setting; they can be found in our GitHub repo. Each dataset contains six snapshots of two growing cross-lingual knowledge graphs. As mentioned above, we use the metrics P, R, and F1 to measure a model's ability to discover potential alignments.

Ability to Discover Potential Alignments

ContEA outperforms a list of static entity alignment models (MTransE, GCN-Align, AlignE, AliNet, KEGCN, and Dual-AMN), which require retraining whenever new data arrives, as well as three inductive entity alignment models (DINGAL-O, MEAN⁺, and LAN⁺). Here is the comparison on one of the constructed datasets, DBP_ZH-EN.

Results of entity alignment on DBP_ZH-EN. NA stands for Not Applicable

The three rows below ContEA report the performance of its variants, each of which discards a certain component of ContEA.

Efficiency

Efficiency is another important dimension for evaluating continual learning, which emphasizes achieving satisfactory performance at a lower time cost. The comparison of training/fine-tuning time costs is shown below.

Time cost comparison of ContEA and retraining baselines

The leftmost column, in the lightest green, shows the time cost of ContEA, which is much lower than that of the other models.

Sensitivity to Hyperparameters α and β

Performance of ContEA with different values of α and β

ContEA is very sensitive to the value of α, which controls the importance of the reconstruction loss relative to the alignment loss. In our reported results, we set α=0.1, which gives the best performance. For β, we find that it does not cause much fluctuation in performance, and we set it to 0.1 as well.

More detailed experiments can be found in the proceedings version of our paper, which will be published soon!

Special thanks to my collaborators on this paper: Yuanning Cui and Zequn Sun!
