# KGloVe

(most of this page was contributed by Maria Angela Pellegrino)

# KGloVe : Global RDF vector space embeddings

Approach that adapts language modeling approaches to extract features from sequences of words to RDF graphs, exactly as the previous technique.

*Step 1 :* the building of a co-occurrence matrix from graph data

Firstly ** weighs** are added to the edges of the graph and compute

**starting from each node. Biasing these walks, they are made more meaningful, i.e. they are able to capture the most important infomation about the observed entities. Weight will bias he random walks. When a walk arrives in a vertex**

*approximate personalized PageRank scores**v*with out edges

*v_{o1}, …, v_{od}*, then the walk will follow edge

*v_{ol}*with a probability related to the normalized edge weight.

**Twelve**different strategies for assigning these weights to the edges of the graph are evaluated:

UNIFORM APPROCH

*uniform approach*=*object frequency split*: it is the same to not apply bias over the graph

EDGE-CENTRIC APPROACHES

*predicate frequency*: they count the number of times each predicate occurs (only as a predicate) and they weight the edges with this value. Edges with more common predicates are more often followed.*inverse predicate frequency*: similar to the previous one, but in this case the*rare*predicates will be followed often.*predicate-object frequency*: for each pair of a predicate and an object in the dataset, they count the number of times they appear as predicate-object. It is similar to the predicate frequency weight, but differentiating the objects as well.*inverse predicate-object frequency*

When computing the inverse metrics, not the absolute frequency is assigned, but its multiplicative inverse.

NODE-CENTRIC OBJECT FREQUENCY APPROACHES (also strategy 1)

*object frequency*: for each resource in the dataset, they count the number of times it occurs as the object of a triple. This techinique assign a score to each node. To weight the edges, they either*push*the number down (default) or*split*the number down to all in edges. It ignores the predicate labels and ensures that entities with high in degree get visited even more often.*inverse object frequency**inverse object frequency split*

NODE-CENTRIC PAGERANK APPROACHES

PageRank is computed based on links between the Wikipedia articles representing the same entities of the graph. To the page not linked to Wikipedia pages, a fixed PageRank value is assigned: 0.2.

*PageRank**inverse PageRank**PageRank split*: it does not depend only on the structure of the graph.*inverse PageRank split*

The PageRank score for the other nodes is then used as the absolute frequency in the matrix.

This procedure is then repeated on the graph with all edges reversed and the result is added to the co-occurrence matrix.

The obtained matrix is then normalized.

*Co-occurrence matrix creation by Personalized PageRank*

The co-occurrence matrix for textual data is obtained by linearly scanning through the text and counting the occurrence of context words in the context of each word. Instead the graph has not a linear structure.

To define it, the Personalized PageRank has been used. It determines how important nodes in the context of a focus node. The PageRank is used to find important nodes in a direct graph. At its heart, it works by simulating random walkers over the graph and observing where these random walkers end up. The simplified page rank problem is solved by finding the stationary solution to

p^{(k+1)} = P^{T} p^{(k)}

where P is a nxn matrix where n is the node number in the graph filled with zeros excepts for positions *i,j* for which there exists an arc from i to j and these positions contain 1/deg(i) with deg(i) is the out degree of the node *i*.

To solve the problem of *dangling nodes*, i.e. nodes with zero out degree, a node could continue from another node selected from a distribution *v*, called the *teleportation distribution*. Usually *v* is chosen to be a uniform distribution. To avoid ending in a cycle, a random jump is also performed with probability *p* to the focus node (this is a difference with the classical PageRank that consider the target node a node chosen with the same probability of all other nodes).
The Personalized PageRank assign a value to all nodes in the graph, ending up with a very large dense matrix with small values. To make this step faster, the co-occurrence matrix is created by a fast personalized PageRank Approximation.

** Co-occurrence matrix creation by a fast personalized PagerRank Approximation**
The effort of the algorithm is only used for the nodes which will receive a significant rank.

*b*is the focus node and the algorithm starts injecting a unit amount of paint to

*b*. The paint represents the random walk in the original algorithm. From this paint, an p-portion is retained and added to the value for

*b*in the related position in the ouput vector. The remaining (1-p)-portion is distributed uniformly over the out-links. This retain-and-distributed process is repeated recursively for all nodes. When a node has a zero out degree, the outgoing paint is discarder. The best performance is obtaining by:

- choosing the oder in which nodes are evaluated
- reusing the value of the output vector
- changing the uniform distribution with a biased one

*Step 2 :* GloVe is applied to the obtained co-occurrence matrix

** The GloVe model**
GloVe is designed for creating word embeddings from natural language texts. It creates the co-occurrence matrix evaluating the size of the context window, distinguishing left from right context and a weighting function to weight the contribution of each word co-occurrence.

where is a weighting function on co-occurrence counts of word *j* in the context of word *i* , value represented by , are word vectors, are context vectors, and are biases. The idea is that when two words co-occur often, their vectors’ dot product will be relatively high, meaning that the vectors are more similar to make the error factor smaller.

## Evaluation metrics

### 1. Machine Learning Tasks

*Classification*

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

*Regression*

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. Regression is used to predict continuous values.

### 2. Document similarity

It can be repeated the same speech done above for RDF2Vec about the most important document similarity indexes to compare with.

## Datasets

In *machine learning task* has been used 5 different datasets from different domains:
Cities,
Metacritic Movies,
Metacritic Albums,
AAUP
and Forbes

In *document similarity* task, as benchmark datasets is used the **LP50** [Lee, M., Pincombe, B., Welsh, M.: An empirical evaluation of models of text document similarity. Cognitive Science Society (2005)]

## Notice that…

- It uses
*global*informations - It ignores literals

# Datasets

Here you can find the vectors from computing RDF2vec and KGlove embeddings from DBpedia 2016-04 graph weighted in different way. For each weighting technique, the link to the zip file is provided.

For each entity in the graph, the text file in the zip archive contains a line with the entity name and the embedded vector.