(most of this page was contributed by Maria Angela Pellegrino)

RDF2Vec is a technique to embed RDF graphs. The original paper only considers uniform walks; later work also introduces biased ones.

The algorithm has three steps:

  1. Weighting of the graph (especially important for the biased version)
  2. Performing random walks on the graph (or computing graph kernels)
  3. Computing word2vec embeddings of these walks

The evaluation of these embeddings has been performed using ‘downstream’ machine learning tasks.

RDF2Vec

Step 1: starting from a graph, create sequences of entities that can be treated as sequences of words

Applied techniques: Graph walks and Weisfeiler-Lehman Subtree RDF Graph Kernels

Random graph walks

For each vertex v of the graph, generate all graph walks of depth d rooted in the vertex v using a breadth-first algorithm.

The final set of sequences for the given graph is the union of the sequences generated for all the vertices crossed by the walks.
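
A minimal sketch of walk extraction over a toy adjacency structure (entity and predicate names are made up for illustration). It samples uniform random walks, whereas the breadth-first variant described above enumerates all walks up to depth d.

    import random

    # Toy adjacency structure: subject -> list of (predicate, object) edges.
    graph = {
        "dbr:Berlin": [("dbo:country", "dbr:Germany")],
        "dbr:Germany": [("dbo:capital", "dbr:Berlin"), ("dbo:currency", "dbr:Euro")],
    }

    def random_walk(graph, start, depth):
        """Sample one uniform walk of at most `depth` hops rooted in `start`.
        Each hop appends the traversed predicate and the reached entity, so
        the walk reads like a sentence of graph tokens."""
        walk, current = [start], start
        for _ in range(depth):
            edges = graph.get(current, [])
            if not edges:                            # dead end: stop early
                break
            predicate, obj = random.choice(edges)    # uniform edge choice
            walk.extend([predicate, obj])
            current = obj
        return walk

    # The union of the walks generated for all vertices forms the corpus
    # that is later fed to word2vec (step 2).
    corpus = [random_walk(graph, v, depth=4) for v in graph for _ in range(10)]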

Weisfeiler-Lehman Subtree RDF Graph Kernels

The basic idea behind their computation is to evaluate the distance between two data instances by counting common subtrees in the graph. The kernel computes the number of subtrees shared between two or more graphs by using the Weisfeiler-Lehman test of graph isomorphism. To apply the algorithm, it must be adapted to the input graph: it has to handle directed edges, and the labels from two iterations can potentially be different while still representing the same subtree. To avoid this, at each iteration the neighboring labels of the previous iteration are compared to the current ones and, if they are identical, the label of the previous iteration is reused. For each vertex v, all the paths of depth d within the subgraph of v on the relabeled graph are extracted. Then the original label of the vertex v is set as the starting token of each path. This process is repeated h times.

The final set of sequences is the union of the sequences of all the vertices in each iteration.
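
A minimal sketch of the Weisfeiler-Lehman relabeling loop over the same kind of adjacency dict, assuming `labels` initially maps every node to its own label. The paper's RDF adaptation additionally reuses a label from the previous iteration when it denotes the same subtree; that refinement is omitted here for brevity.

    def wl_relabel(graph, labels, iterations):
        """Relabel a directed, edge-labeled graph for `iterations` rounds.
        Each round builds a signature from a node's label and the sorted
        labels of its outgoing edges and successors, then compresses equal
        signatures into one shared new label."""
        per_iteration = [dict(labels)]
        for it in range(1, iterations + 1):
            signatures = {
                node: labels[node] + "|" + "-".join(
                    sorted(f"{p}>{labels.get(o, o)}" for p, o in graph.get(node, [])))
                for node in graph
            }
            compressed, labels = {}, {}
            for node, sig in signatures.items():
                compressed.setdefault(sig, f"wl{it}_{len(compressed)}")
                labels[node] = compressed[sig]
            per_iteration.append(dict(labels))
        return per_iteration   # one relabeling per iteration

Paths of depth d are then extracted from every relabeled graph and prefixed with the original label of their root vertex.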

Step 2: apply Word2Vec to the obtained sequences to transform them into numerical vectors

Word2Vec

Word2vec is a shallow neural network with a single hidden layer for learning a low-dimensional, dense representation of words with two essential properties: (a) similar words are close in the vector space, and (b) relations between pairs of words can be represented as vectors as well, allowing for arithmetic operations in the vector space. Word2vec estimates the likelihood of a sequence of entities appearing in the graph. There are two different algorithms: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. The CBOW model predicts a target word from the context words within a given window: for each word of the vocabulary, the model estimates the probability of that word being the target word. The Skip-gram model does the inverse of the CBOW model and tries to predict the context words from the target word.

Hierarchical softmax and negative sampling are used as optimizations.

Once the training is finished, all the words (here, the graph entities) are projected into a lower-dimensional feature space, and semantically similar words (or entities) are positioned close to each other.
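
A minimal training sketch for step 2, assuming gensim (version 4 or later) is installed; the tiny inline corpus and all hyperparameters are illustrative only.

    from gensim.models import Word2Vec

    # Stand-in for the corpus of walks produced in step 1.
    corpus = [["dbr:Berlin", "dbo:country", "dbr:Germany"],
              ["dbr:Germany", "dbo:capital", "dbr:Berlin"]]

    model = Word2Vec(
        sentences=corpus,   # each walk is one "sentence" of graph tokens
        vector_size=200,    # dimensionality of the embedding space
        window=5,           # context window around the target token
        sg=1,               # 1 = skip-gram, 0 = CBOW
        hs=0, negative=25,  # negative sampling instead of hierarchical softmax
        min_count=1,
        epochs=5,
    )

    vector = model.wv["dbr:Berlin"]                      # embedding of one entity
    print(model.wv.most_similar("dbr:Berlin", topn=2))   # nearest tokens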

Evaluation metrics

It is evaluated on three different tasks:

  1. standard machine-learning tasks
  2. entity and document modeling
  3. content-based recommender systems

For each of them, it achieves strong results.

2.1 Entity similarity

In most embedding techniques, semantically related entities appear close to each other in the feature space; as a consequence, the problem of calculating entity similarity reduces to calculating the distance between two instances. A standard choice is the cosine similarity measure applied to the vectors of the entities. Formally, the similarity between two entities e1 and e2, with vectors V1 and V2, is calculated as the cosine similarity between the vectors V1 and V2.
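
A minimal sketch of this similarity, using numpy:

    import numpy as np

    def cosine_similarity(v1, v2):
        """Cosine of the angle between two entity vectors."""
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # e.g. cosine_similarity(model.wv["dbr:Berlin"], model.wv["dbr:Germany"])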

2.2 Document similarity

Document similarity is based on entity similarity: two documents are considered to be similar if many entities of one document are similar to at least one entity in the other document. More precisely, we identify the most similar pairs of entities in the two documents and ignore all the other pairwise similarity values. Given two documents d1 and d2, the similarity between the documents sim(d1, d2) is calculated as follows (a sketch follows the list below):

  1. Extract the sets of entities E1 and E2 in the documents d1 and d2 .
  2. Calculate the similarity score sim(e1i, e2j) for each pair of entities in document d1 and d2, where e1i∈ E1 and e2j ∈ E2.
  3. For each entity e1i in d1 identify the maximum similarity to an entity in d2 max_sim(e1i, e2j ∈ E2 ), and vice versa.
  4. Calculate the similarity score between the documents d1 and d2.
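
A minimal sketch of the procedure above. How the maxima are combined in step 4 is an assumption here (the average of the maximal similarities in both directions), not necessarily the paper's exact formula.

    def document_similarity(E1, E2, sim):
        """Aggregate entity similarities into a document similarity score.
        E1 and E2 are lists of entity vectors and `sim` is a pairwise
        similarity function such as cosine_similarity above."""
        max_1 = [max(sim(e1, e2) for e2 in E2) for e1 in E1]     # step 3, d1 -> d2
        max_2 = [max(sim(e2, e1) for e1 in E1) for e2 in E2]     # step 3, d2 -> d1
        return (sum(max_1) + sum(max_2)) / (len(E1) + len(E2))   # step 4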

The most important document similarity baselines used for comparison are:

  • TF-IDF: Distributional baseline algorithm.

  • AnnOv: Similarity score based on annotation overlap.

  • Explicit Semantic Analysis (ESA) [Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: IJCAI. vol. 7, pp. 1606–1611 (2007)]

  • GED: semantic similarity using a Graph Edit Distance based measure [Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In: Proceedings of the 7th ACM international conference on Web search and data mining. pp. 543–552. ACM (2014)]

  • Salient Semantic Analysis (SSA), Latent Semantic Analysis (LSA) [Hassan, S., Mihalcea, R.: Semantic relatedness using salient semantic analysis. In: AAAI (2011)]

  • Graph-based Semantic Similarity (GBSS) [Paul, C., Rettinger, A., Mogadala, A., Knoblock, C.A., Szekely, P.: Efficient graph-based document similarity. In: International Semantic Web Conference. pp. 334–349. Springer (2016)]

2.3 Entity relatedness

One approach is to assume that two entities are related if they often appear in the same context. To do so, the probability p(e1|e2) has to be calculated for each pair of entities e1, e2. This probability depends strongly on the embedding model itself; in RDF2Vec, two formulas are provided, one for CBOW and one for skip-gram (a simplified sketch follows the list below). Other state-of-the-art graph-based entity relatedness approaches are:

  • baseline: computes entity relatedness as a function of distance between the entities in the network.
  • KORE: calculates keyphrase overlap relatedness, as described in the original KORE paper.
  • CombIC: semantic similarity using a Graph Edit Distance based measure [Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In: Proceedings of the 7th ACM international conference on Web search and data mining. pp. 543–552. ACM (2014)].
  • ER: Exclusivity-based relatedness [Hulpuş, I., Prangnawarat, N., Hayes, C.: Path-based semantic relatedness on linked data and its use to word and entity disambiguation. In: International Semantic Web Conference. pp. 442–457. Springer (2015)].
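
As an illustration only, the sketch below instantiates p(e1|e2) as a softmax over dot products of the learned entity vectors; the paper's CBOW and skip-gram formulas involve the network's internal weight matrices, so this is an approximation rather than the published definition.

    import numpy as np

    def relatedness(e1, e2, vectors):
        """Approximate p(e1 | e2) with a softmax over dot products,
        where `vectors` maps every entity to its numpy embedding."""
        entities = list(vectors)
        scores = np.array([np.dot(vectors[e2], vectors[e]) for e in entities])
        scores -= scores.max()                         # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return float(probs[entities.index(e1)])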

3 Recommender Systems

Given that the items to be recommended are linked to a LOD dataset, information from LOD can be exploited to determine which items are considered to be similar to the ones that the user has consumed in the past, making it possible to discover hidden information and implicit relations between objects. The cosine similarity between the latent vectors representing the items can be interpreted as a measure of reciprocal proximity and then exploited to produce recommendations (a minimal top-N sketch follows the list below). Recommender systems to be compared with are:

  • SLIM is a Sparse LInear Method for top-N recommendation that learns a sparse coefficient matrix for the items involved in the system, relying only on the users' purchase/rating profiles and solving an L1-norm and L2-norm regularized optimization problem. [Ning, X., Karypis, G.: SLIM: sparse linear methods for top-n recommender systems. In: 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011. pp. 497–506 (2011), http://dx.doi.org/10.1109/ICDM.2011.134]
  • BPR-SSLIM is a Sparse LInear Method using item Side information (SSLIM) with the Bayesian Personalized Ranking (BPR) optimization criterion.
  • SPRank is a hybrid recommender system that solves the top-N recommendation problem in a learning-to-rank fashion, exploiting the freely available knowledge in Linked Open Data to build semantic path-based features. [Di Noia, T., Ostuni, V.C., Tomeo, P., Di Sciascio, E.: Sprank: Semantic path-based ranking for top-n recommendations using linked open data. ACM Transactions on Intelligent Systems and Technology (TIST) (2016)]
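
A minimal top-N sketch of the idea described above, assuming a dict item_vectors from item identifiers to embedding vectors and a set user_items of already consumed items; scoring unseen items by their maximal cosine similarity to a consumed item is an illustrative choice, not the paper's exact setup.

    import numpy as np

    def recommend(user_items, item_vectors, k=10):
        """Rank unseen items by their best cosine similarity to the items
        the user has already consumed and return the top k."""
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        scores = {
            item: max(cos(vec, item_vectors[seen]) for seen in user_items)
            for item, vec in item_vectors.items() if item not in user_items
        }
        return sorted(scores, key=scores.get, reverse=True)[:k]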

Datasets

In the document similarity task, the LP50 benchmark dataset is used [Lee, M., Pincombe, B., Welsh, M.: An empirical evaluation of models of text document similarity. Cognitive Science Society (2005)].

In the entity relatedness task, the KORE benchmark dataset is used. The dataset consists of 21 main entities, whose relatedness to 20 other entities each has been manually assessed. The relatedness is not given as a numeric estimate; instead, the related entities are ranked from the closest concept to the farthest one. [Hoffart, J., Seufert, S., Nguyen, D.B., Theobald, M., Weikum, G.: Kore: keyphrase overlap relatedness for entity disambiguation. In: Proceedings of the 21st ACM international conference on Information and knowledge management. pp. 545–554. ACM (2012)]

In the recommendation task, the following datasets are used:

  • MovieLens for movies
  • LibraryThing for books
  • Last.fm for music.

The original datasets are enriched with side information thanks to the item mapping and linking to DBpedia, whose dump is available here; details about the mapping techniques are in [Ostuni, V.C., Noia, T.D., Sciascio, E.D., Mirizzi, R.: Top-n recommendations from implicit feedback leveraging linked open data. In: ACM RecSys ’13. pp. 85–92 (2013)]. The datasets are finally preprocessed to guarantee a fair comparison with the state-of-the-art approaches. In general, the authors propose to (i) remove popularity biases from the evaluation by not considering the top 1% most popular items, (ii) reduce the sparsity of the MovieLens dataset in order to have at least one sparser test dataset, and (iii) remove from LibraryThing and Last.fm users with fewer than five ratings and items rated fewer than five times.
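
A minimal sketch of these filters with pandas, assuming a ratings table with hypothetical columns user, item and rating; it only illustrates the rules above.

    import pandas as pd

    ratings = pd.read_csv("ratings.csv")   # hypothetical input file

    # (i) drop the top 1% most popular items to reduce popularity bias
    counts = ratings["item"].value_counts()
    popular = counts.head(int(len(counts) * 0.01)).index
    ratings = ratings[~ratings["item"].isin(popular)]

    # (iii) keep users with at least five ratings and items rated at least
    # five times (applied to LibraryThing and Last.fm in the paper)
    user_ok = ratings["user"].map(ratings["user"].value_counts()) >= 5
    item_ok = ratings["item"].map(ratings["item"].value_counts()) >= 5
    ratings = ratings[user_ok & item_ok]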

Notice that…

  1. It exploits only local information
  2. It ignores literals
  3. The generation of the entities’ vectors is task and dataset independent
  4. Graph kernels are not scalable for large datasets
  5. It embeds only the nodes of the graph

Pre-computed vectors

Here you can find the vectors obtained by computing RDF2vec and KGlove embeddings from the DBpedia 2016-04 graph, weighted in different ways. For each weighting technique, the link to the zip file is provided.

For each entity in the graph, the text file in the zip archive contains a line with the entity name and its embedding vector.
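
A minimal loading sketch, assuming each line holds the entity name followed by whitespace-separated vector components; adjust the parsing if the published files use a different delimiter.

    import numpy as np

    def load_embeddings(path):
        """Parse one of the published text files into a dict entity -> vector."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                if len(parts) < 2:        # skip malformed or empty lines
                    continue
                vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
        return vectors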

RDF2Vec