Incorporating Continual Learning methods into Continual Knowledge Graph Embedding

Audrey Wang
6 min read · Jun 8, 2022

In recent years, inductive learning has been one of the trending research topics in knowledge graph embedding. The central concern of inductive knowledge graph embedding is to efficiently generate effective embeddings for new entities. Most research papers addressing this topic (such as Hamaguchi et al., LAN, GraIL, and INDIGO) design inductive models powered by graph neural networks (GNNs). The strategy is intuitive: GNNs have become a technical keystone of deep graph learning, and they have an inherent ability to transfer pre-learned knowledge from one graph to another homogeneous graph.

Continual learning aims to learn new tasks continually and adaptively while preserving, or even reinforcing, knowledge learned on old tasks. If we regard learning embeddings on the existing data of a knowledge graph as the old task, and learning embeddings on newly emerged data as the new task, it is easy to see that the aim of inductive knowledge graph embedding falls within what continual learning seeks to achieve. Surprisingly, even though continual learning has been a notable research field for a long time and has made considerable advances, few inductive knowledge graph embedding works incorporate continual learning techniques. A recent paper, Continual Learning of Knowledge Graph Embeddings, may be the first work to explore this combination. It summarizes several typical continual learning methods and applies their ideas to continual knowledge graph embedding. In the following, I’d like to introduce a common taxonomy of continual learning methods and strategies for combining them with knowledge graph embedding.

Categories of Continual Learning Methods

Different continual learning methods can be classified into three categories: architectural modification methods, regularization-based methods, and replay methods. The table below summarizes their ideas and one or two representative methods.

Categories of Continual Learning Methods

When a previously trained model needs to learn a new task without retraining from scratch, architectural modification methods usually create new parameter space so that the new task can be learned by new parameters. Regularization-based methods fine-tune previously learned parameters so the model generalizes to new tasks without forgetting old ones. Replay methods jointly train on the new task and a subset of the old tasks (or their data). Once we understand the ideas behind these continual learning methods, we can migrate them to continual knowledge graph embedding.

Applying Continual Learning Methods to Knowledge Graph Embedding

1.1 Architectural Modification: PNN

Progressive Neural Network (PNN) is a method that adds copies of the existing layers of a multi-layered neural network for each new learning task. When a new task begins, we freeze the existing weights and enable lateral connections that transfer previously learned representations forward.

Progressive Neural Network

Now, to borrow the idea of PNN for continual knowledge graph embedding, we need to understand its two core ideas: create new weights and freeze existing weights. Following these, we have two steps:

  1. Expand the embedding matrices vⁿ ∈ ℝ^(|ℰⁿ| × dℰ) and wⁿ ∈ ℝ^(|ℛⁿ| × dℛ) to include the new entities and relations in the new task, where dℰ and dℛ are the embedding dimensions of entities and relations.
  2. Freeze previously learned embeddings to prevent their corruption in the new task.
Architectural Modification: PNN

Naive, isn’t it? When training the model on a new task, the existing embeddings of entities and relations stay fixed, and we simply fill the new embeddings into the expanded matrices.
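The two steps above can be sketched in plain NumPy. This is a minimal illustration, not the paper’s implementation: the function names and the mask-based freezing are my own, and real systems would freeze rows inside their optimizer instead.

```python
import numpy as np

def expand_embeddings(old_emb, num_new, dim):
    """PNN-style growth: append freshly initialized rows for the new
    entities (or relations) and return a mask marking which rows may
    still be updated. Old rows are frozen."""
    rng = np.random.default_rng(0)
    new_rows = rng.normal(scale=0.1, size=(num_new, dim))
    emb = np.vstack([old_emb, new_rows])
    trainable = np.zeros(len(emb), dtype=bool)
    trainable[len(old_emb):] = True          # only new rows can train
    return emb, trainable

def sgd_step(emb, grad, trainable, lr=0.01):
    """Apply a gradient step only to the unfrozen (new) rows."""
    emb[trainable] -= lr * grad[trainable]
    return emb

# 5 previously learned entities; 3 new entities arrive in task n
old = np.ones((5, 4))
emb, mask = expand_embeddings(old, num_new=3, dim=4)
emb = sgd_step(emb, np.ones_like(emb), mask)
print(emb[:5])   # frozen rows keep their previously learned values
```

After the step, the first five rows are untouched while the three new rows have moved, which is exactly the “freeze old, train new” behavior PNN prescribes.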

1.2 Architectural Modification: CWR

To avoid corruption, Copy Weight with ReInit (CWR) maintains a separate set of weights for the final layer of the network during each new learning task (temporary weights, TW), keeping them apart from the corresponding weights consolidated over prior tasks (consolidated weights, CW).

The spirit of CWR can be summed up in a phrase: keep the newest learned weights. To apply CWR to continual knowledge graph embedding, we can:

  1. In each learning task, resize and re-initialize the temporary embeddings (TE) according to the number of entities and relations in this task.
  2. After the task, merge TE into the consolidated embeddings (CE) by copying the embeddings of new entities and relations and averaging those of existing ones.

Consolidated Embeddings (CE), Temporary Embeddings (TE)

Architectural Modification: CWR
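The copy-or-average consolidation in step 2 can be sketched as follows. The dictionary representation and the function name are my own simplification; the paper operates on embedding matrices indexed by entity/relation IDs.

```python
import numpy as np

def cwr_consolidate(ce, te):
    """Merge temporary embeddings (TE) into consolidated embeddings (CE):
    brand-new entities/relations are copied, previously seen ones are
    averaged with their consolidated values (the CWR idea)."""
    merged = dict(ce)
    for name, vec in te.items():
        if name in merged:
            merged[name] = (merged[name] + vec) / 2.0   # average existing
        else:
            merged[name] = vec.copy()                   # copy new
    return merged

# toy example: 'paris' was seen before, 'tokyo' is new in this task
CE = {"paris": np.array([1.0, 1.0])}
TE = {"paris": np.array([3.0, 3.0]), "tokyo": np.array([2.0, 2.0])}
CE = cwr_consolidate(CE, TE)
print(CE["paris"])  # [2. 2.]
```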

2.1 Regularization: L2R

L2 Regularization (L2R) adds a regularization term to the task loss when updating model weights, which encourages the trained weights not to deviate from their previous values.

When applying L2R to continual knowledge graph embedding, the training loss is:

Where e ∈ ℰ^(n-1), r ∈ ℛ^(n-1), and λ is a regularization scaling hyper-parameter.
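In code, the idea is just a squared-distance penalty between the current embeddings and their values from task n−1, summed over all previously seen entities and relations. A minimal sketch (the function name and dictionary layout are my own):

```python
import numpy as np

def l2r_loss(task_loss, emb_now, emb_prev, lam=0.1):
    """Task loss plus an L2 penalty pulling the embeddings of previously
    seen entities/relations (e ∈ ℰ^{n-1}, r ∈ ℛ^{n-1}) back toward their
    values from the previous task. lam is the scaling hyper-parameter λ."""
    penalty = sum(
        np.sum((emb_now[i] - emb_prev[i]) ** 2) for i in emb_prev
    )
    return task_loss + lam * penalty

prev = {"e1": np.array([1.0, 0.0])}
now = {"e1": np.array([1.5, 0.0]),      # drifted by 0.5 -> penalized
       "e2": np.array([0.3, 0.3])}      # new entity -> no penalty
print(l2r_loss(task_loss=0.8, emb_now=now, emb_prev=prev, lam=0.1))  # 0.825
```

Note that new entities and relations are outside ℰ^{n-1} and ℛ^{n-1}, so they are free to move wherever the new task pushes them.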

2.2 Regularization: SI

Synaptic Intelligence (SI) extends L2R by weighting each parameter’s penalty by that parameter’s contribution to the reduction in loss over past learning tasks.

The loss function when applying SI to continual knowledge graph embedding:

e ∈ ℰ^(n-1), r ∈ ℛ^(n-1), λ is a regularization scaling hyper-parameter, and Ω is the per-parameter regularization strength.
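The only change relative to L2R is the per-parameter weight Ω inside the penalty. A hedged sketch (the Ω values here are hand-picked for illustration; SI accumulates them along the training trajectory):

```python
import numpy as np

def si_loss(task_loss, emb_now, emb_prev, omega, lam=0.1):
    """SI variant of L2R: each old parameter's squared drift is scaled by
    its importance Ω before summing, so parameters that mattered for past
    tasks are anchored more strongly."""
    penalty = sum(
        float(np.sum(omega[i] * (emb_now[i] - emb_prev[i]) ** 2))
        for i in emb_prev
    )
    return task_loss + lam * penalty

prev = {"e1": np.array([1.0, 0.0])}
now = {"e1": np.array([1.5, 0.5])}
omega = {"e1": np.array([2.0, 0.0])}   # dim 0 was important, dim 1 was not
print(si_loss(0.8, now, prev, omega, lam=0.1))  # 0.85
```

With Ω set to zero in the second dimension, drift there is free; the same drift in the important first dimension is penalized twice as hard as plain L2R would.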

3. Replay: DGR

Deep Generative Replay (DGR) uses a generative model G to approximate the distribution of observed training examples, and trains a discriminative model (the solver) to perform the task.

Deep Generative Replay

The most important component of the DGR method is how to sample training examples, in other words, what the generative model G is. In this paper, a Variational Autoencoder (VAE) is used as the generative model. A VAE is a neural network that learns a latent distribution over its inputs and, applied to sequences of discrete tokens, can generate text. When it comes to knowledge graphs, we can treat each triple (h, r, t) as a sequence of discrete tokens (each entity and relation is a token), so the VAE can generate random triples.

A comparison between a simple autoencoder and a variational autoencoder is illustrated below.

Difference between simple autoencoder and variational autoencoder

The steps to apply VAE to continual knowledge graph embedding:

  1. An input triple (h, r, t) is first transformed into a token embedding sequence x = (v_h, w_r, v_t).
  2. Gated Recurrent Units (GRUs) encode and decode the triples to and from the latent space z. The encoder learns the posterior q(z|x), regularized by a KL-divergence term toward the prior; samples are drawn from the latent space, and the decoder then maximizes p(x|z) given those samples.
  3. The output sequences of the decoder are transformed back into triples using a softmax over all tokens.
Replay: DGR

A more detailed look at DGR:

Replay: DGR
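To make the replay pipeline concrete, here is a toy forward pass. Everything is an assumption for illustration: the weights are random and untrained, the GRUs of the paper are replaced by simple linear maps, and the tiny vocabulary is invented. It only shows the shapes of encode and decode, not a working generative model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["paris", "france", "tokyo", "capital_of", "located_in"]  # entities + relations
V, D, Z = len(VOCAB), 4, 2     # vocab size, token dim, latent dim

# hypothetical, untrained weights -- a real model would learn these
tok_emb = rng.normal(size=(V, D))
W_mu = rng.normal(size=(3 * D, Z))
W_logvar = rng.normal(size=(3 * D, Z))
W_dec = rng.normal(size=(Z, 3 * V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode(h, r, t):
    """Map a triple's token embeddings x = (v_h, w_r, v_t) to the
    parameters (mu, log sigma^2) of the posterior q(z|x)."""
    x = np.concatenate([tok_emb[VOCAB.index(s)] for s in (h, r, t)])
    return x @ W_mu, x @ W_logvar

def replay_triple():
    """Sample z from the prior and decode it into an (h, r, t) triple
    via a softmax over all tokens at each of the three positions."""
    z = rng.normal(size=Z)
    logits = (z @ W_dec).reshape(3, V)
    ids = [int(np.argmax(softmax(row))) for row in logits]
    return tuple(VOCAB[i] for i in ids)

mu, logvar = encode("paris", "capital_of", "france")
print(replay_triple())   # a random (h, r, t) triple of vocabulary tokens
```

In actual DGR, triples replayed this way are mixed into the training batches of the new task, so the embedding model keeps seeing (approximate) old data without storing it.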

Summary

I believe there is much room for exploring the intersection of continual learning and knowledge graph embedding and for applying continual learning techniques to knowledge graph embedding. Instead of being confined to GNN-based inductive techniques, borrowing from continual learning lets us see the problem from a higher vantage point. As the phrase goes:

Standing on the shoulders of giants.

