Graph Embedding Evaluation Framework
(most of this page was contributed by Maria Angela Pellegrino and Martina Garofalo)
Overview
Once you have defined a new RDF embedding technique, or you want to compare two existing methods, you should define a set of tests to run and analyze the results.
This framework simplifies that work: it provides machine learning and semantic tasks to evaluate your vectors.
The implemented tasks are:
- Machine Learning
  - Classification
  - Regression
  - Clustering
- Semantic tasks
  - Entity Relatedness
  - Document similarity
  - Semantic analogies
For each task, the parameters, the procedure, the files used as gold standard and the metrics used to evaluate the output are presented.
Embedding framework
Link to the project : the project is available on GitHub
To run it :
Libraries :
numpy==1.14.0
pandas==0.22.0
scipy==1.1.0
scikit-learn==0.19.2
Python version : Python 2.7.3
Parameters:
--vectors_file, path of the file where your vectors are stored (mandatory). File format: one line per entity, containing the entity and its vector
--vectors_size, default=200, length of each vector
--top_k, default=2, used in SemanticAnalogies: the predicted vector is compared with the top k closest vectors to establish whether the prediction is correct or not
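For example, a run with the default settings might look like the following (assuming the entry point is main.py, as described in "Structure of the project" below):

python main.py --vectors_file path/to/vectors.txt --vectors_size 200 --top_k 2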
To customize the distance metric and the analogy function :
You have to redefine your own main.
You can use any of the distance metrics accepted by scipy.spatial.distance.cdist.
Your analogy function has to take:
- 3 vectors (or matrices of vectors) used to forecast the fourth vector,
- the index (or indices) of these vectors in the data matrix,
- the data matrix that contains all the vectors,
- the top_k, i.e., the number of vectors you want to use to check if the predicted vector is close to one in your dataset,
and it must return the indices of the top_k vectors closest to the predicted one. A sketch of such a function is given below.
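As an illustration only (not part of the framework), a custom analogy function with this signature could rank candidates by cosine distance via scipy instead of the default dot product; the names mirror the default function shown in the Semantic Analogies section, and a, b, c are assumed to be matrices of vectors:

import numpy as np
from scipy.spatial.distance import cdist

def cosine_analogy_function(a, b, c, index_a, index_b, index_c, data, top_k):
    # predicted fourth vectors, one per analogy in the batch
    pred_vec = np.array(b) - np.array(a) + np.array(c)
    # cosine distance between every vector in data and every predicted vector
    dist = cdist(data, pred_vec, metric='cosine')
    for k in range(len(a)):
        # exclude the three query entities from the candidates
        dist[index_a[k], k] = np.inf
        dist[index_b[k], k] = np.inf
        dist[index_c[k], k] = np.inf
    # indices of the top_k vectors with the smallest cosine distance
    return np.argsort(dist, axis=0)[:top_k].T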
Classification
Parameters
-
Datasets used as gold standard
Useful information related to the datasets is listed here.
The datasets can be downloaded following this link.
The datasets are explained in the paper [Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference. Springer (2016)].
The ones used are:
- Cities
- Metacritic movies
- Metacritic albums
- Forbes
- AAUP (only salary information)
and are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    data = random sampling of data (using seeds from 1 to 10 for reproducibility)
    for each model (NB, K-NN with k=3, C45 and SVM with C in {10^-3, 10^-2, 0.1, 1.0, 10.0, 10^2, 10^3}):
        the model is created
        the model is trained using cv = 10
        results are collected
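A minimal scikit-learn sketch of this loop, assuming data is the matrix of vectors for the gold-standard entities and labels their classes (C4.5 is approximated by a decision tree, since scikit-learn has no C4.5 implementation):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# models and configurations listed above
models = [('NB', GaussianNB()),
          ('K-NN', KNeighborsClassifier(n_neighbors=3)),
          ('C45', DecisionTreeClassifier())]
models += [('SVM C=%g' % c, SVC(C=c)) for c in [1e-3, 1e-2, 0.1, 1.0, 10.0, 1e2, 1e3]]

for name, model in models:
    # 10-fold cross-validation with accuracy as score
    scores = cross_val_score(model, data, labels, cv=10, scoring='accuracy')
    print('%s accuracy: %.4f' % (name, np.mean(scores)))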
Output metric
Accuracy is computed
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Classification
- model_name : [NB, K-NN, C45, SVM]
- model_configuration : null for NB and C45, k=3 for K-NN and current C value for SVM
- score_type : accuracy
- score_value : the actual score
Reference
The datasets used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Regression
Parameters
-
Datasets used as gold standard
Useful information related to the datasets is listed here.
The datasets can be downloaded following this link.
The datasets are explained in the paper [Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference. Springer (2016)].
The ones used are:
- Cities
- Metacritic movies
- Metacritic albums
- Forbes
- AAUP (only salary information)
and are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    data = random sampling of data (using seeds from 1 to 10 for reproducibility)
    for each model (LR, K-NN with k=3 and M5):
        the model is created
        the model is trained using cv = 10
        results are collected
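A minimal scikit-learn sketch of this loop, assuming data is the matrix of vectors for the gold-standard entities and targets the numeric values to predict; M5 is not available in scikit-learn, so a decision-tree regressor stands in for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

models = [('LR', LinearRegression()),
          ('K-NN', KNeighborsRegressor(n_neighbors=3)),
          ('M5-like tree', DecisionTreeRegressor())]

for name, model in models:
    # 10-fold cross-validation; the negated MSE returned by scikit-learn is
    # converted back into a root mean squared error
    scores = cross_val_score(model, data, targets, cv=10, scoring='neg_mean_squared_error')
    print('%s RMSE: %.4f' % (name, np.mean(np.sqrt(-scores))))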
Output metric
Root mean squared error is computed
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Regression
- model_name : [LR, K-NN, M5]
- model_configuration : null for LR and M5, k=3 for K-NN
- score_type : root mean squared error
- score_value : the actual score
Reference
The datasets used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Clustering
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
- Same entities contained in the datasets Cities, Metacritic movies, Metacritic albums, AAUP and Forbes, split into 5 clusters
- Cities and Countries retrieved, respectively, from the DBpedia SPARQL endpoint through the queries:
prefix dbo:<http://dbpedia.org/ontology/>
SELECT DISTINCT ?c{
?c a dbo:City
}
and
prefix dbo:<http://dbpedia.org/ontology/>
SELECT DISTINCT ?c{
?c a dbo:PopulatedPlace
}
- Cities and Countries as before, but balancing the 2 clusters (retrieving only 2000 Cities)
- Football and Basketball teams retrieved, respectively, from the DBpedia SPARQL endpoint through the queries:
select distinct ?c where {
?c a <http://dbpedia.org/ontology/SportsTeam>.
BIND(STR(?c) AS ?string )
filter(contains(?string, 'football_team'))
}
and
select distinct ?c where {
?c a <http://dbpedia.org/ontology/SportsTeam>.
BIND(STR(?c) AS ?string )
filter(contains(?string, 'basketball_team'))
}
The datasets are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    ignored = entities in file_as_gold_standard and not in vectors
    for each model (DB, KMeans, AgglomerativeClustering, WardHierarchicalClustering and SpectralClustering):
        the model is created
        the model is fit
        for each ignored entity:
            a new (unused) cluster is created
        results are collected
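A minimal sketch of one evaluation round, assuming data holds the vectors of the gold-standard entities and gold_clusters their expected cluster labels (KMeans is shown; the other models follow the same fit/score pattern):

from sklearn.cluster import KMeans
from sklearn import metrics

# fit one of the clustering models on the entity vectors
model = KMeans(n_clusters=5)
predicted = model.fit_predict(data)

# compare the predicted clusters with the gold-standard ones
ari = metrics.adjusted_rand_score(gold_clusters, predicted)
ami = metrics.adjusted_mutual_info_score(gold_clusters, predicted)
fms = metrics.fowlkes_mallows_score(gold_clusters, predicted)
homogeneity, completeness, v_measure = metrics.homogeneity_completeness_v_measure(gold_clusters, predicted)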
Output metric
Adjusted_rand_index, adjusted_mutual_info_score, fowlkes_mallows_score, homogeneity_score, completeness_score and v_measure_score are computed. To read more details about the metrics you can follow this link.
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Clustering
- model_name : [DB, KMeans, AgglomerativeClustering, WardHierarchicalClustering, SpectralClustering]
- model_configuration : used metric
- adjusted_rand_index : the actual score
- adjusted_mutual_info_score : the actual score
- fowlkes_mallows_score : the actual score
- homogeneity_score : the actual score
- completeness_score : the actual score
- v_measure_score : the actual score
Entity relatedness
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
The dataset used is KORE, which contains 21 entities (actors, companies, TV series and video games) and, for each of them, 20 ranked related entities.
The original dataset contains words; the dataset used here contains the DBpedia entities linked to those words. It is available in the Git repository.
Procedure
data = intersection between the entities in vectors and in KORE
ignored = entities in KORE and not in vectors
for each group in data (a group is one of the 21 entities together with the 20 ranked related entities attached to it):
    distances : the distance score is computed for all the pairs
    sorted_distances : the distances are sorted from the closest entity to the farthest one
    for each ignored entity:
        the distance between the current entity and the ignored one is set to the maximum
    the actual ranking and the one in KORE are compared
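A minimal sketch of the comparison for one group, under the assumption that entity_vec is the vector of the current KORE entity and related_vecs holds the vectors of its 20 related entities in gold-standard order:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import kendalltau

# cosine distance between the current entity and each of its related entities
distances = cdist([entity_vec], related_vecs, metric='cosine')[0]

# ranking induced by the embedding (1 = closest) vs. the gold-standard ranking 1..20
predicted_rank = distances.argsort().argsort() + 1
gold_rank = np.arange(1, len(related_vecs) + 1)

correlation, p_value = kendalltau(gold_rank, predicted_rank)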
Output metric
Kendall tau correlation is computed between the actual ranking and the one in KORE
The output file is a CSV file with header:
- task_name : Entity relatedness
- entity_name : the current entity
- kendalltau_correlation : the actual score
- kendalltau_pvalue : the actual p-value
Reference
The dataset used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Document similarity
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
The dataset used is LP50. The original dataset can be downloaded here. The zip file contains the 50 documents used in the evaluation and the statistics computed manually by university students, who evaluated the similarity among documents by assigning to each pair a score in the range [1,5], where 5 means maximum similarity. The dataset is explained here.
The first step of the procedure is the extraction of the entities from the documents. The sets of entities used in this framework are the output of the annotator xLisa, as presented in this paper, and the output of the tool is the following.
The statistics are scanned and, for each pair of documents, the mean of the ratings given by the students is computed and then used as gold standard in the evaluation. The output of this elaboration is available in the Git repository.
Procedure
for each pair of documents (doc1, doc2):
    set1 = set of entities in doc1 and in vectors
    set2 = set of entities in doc2 and in vectors
    for each entity1 in set1:
        for each entity2 in set2:
            distance_score = distance between (entity1, entity2)
        sorted_distance_scores = sort of the distance scores between entity1 and all the entities in set2
        min_distance_1 = the minimum is picked and stored
    for each entity2 in set2:
        for each entity1 in set1:
            distance_score = distance between (entity2, entity1)
        sorted_distance_scores = sort of the distance scores between entity2 and all the entities in set1
        min_distance_2 = the minimum is picked and stored
    document_similarity = (min_distance_1 + min_distance_2) / (|set1| + |set2|)
    results are collected
For each annotation in the JSON file, a weight is also provided. The same procedure is repeated also considering the weights in the entity distance computation: min_distance = min_distance / (weight1 + weight2)
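One possible reading of the pseudocode above, as an illustrative sketch only: it assumes that set1 and set2 are matrices holding the vectors of the entities annotated in doc1 and doc2, and that min_distance_1 and min_distance_2 accumulate the per-entity minima:

from scipy.spatial.distance import cdist

# pairwise distances between all entities of the two documents
dist = cdist(set1, set2, metric='cosine')

# for each entity, keep only the distance to its closest entity in the other document
min_distance_1 = dist.min(axis=1).sum()
min_distance_2 = dist.min(axis=0).sum()

document_similarity = (min_distance_1 + min_distance_2) / (len(set1) + len(set2))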
Output metric
Pearson and Spearman correlation and their harmonic mean are used to evaluate the obtained distance scores against the gold-standard ones.
The output file is a CSV file with header:
- task_name : Document similarity
- conf : with or without weights
- pearson_score : the actual score
- spearman_score : the actual score
- harmonic_mean : harmonic mean between pearson and spearman scores
Reference
The dataset used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Semantic Analogies
The test is based on quadruplets (word1, word2, word3, word4), and the idea is to check whether, working on the first three, it is possible to predict the fourth word with a low error.
A practical example is:
X = vector("queen") − vector("woman") +vector("man")
and check if X is near to “king”.
If the vector manipulation is able to predict the right vector, the embedding technique that produced these vectors is able to preserve the semantics of the embedded entities in the vectors.
Both syntactic and semantic analogies could be considered.
In the framework we consider only semantic analogies.
The datasets used are:
- capital and country
- currency
- city and state
Parameters
Analogy function to predict the fourth vector.
Default :
import numpy as np

def default_analogy_function(a, b, c, index_a, index_b, index_c, data, top_k):
    # predicted fourth vectors, one per analogy in the batch
    pred_vec = np.array(b) - np.array(a) + np.array(c)
    # dot-product similarity between every vector in data and every predicted vector
    dist = np.dot(data, pred_vec.T)
    for k in range(len(a)):
        # exclude the three query entities from the candidates
        dist[index_a[k], k] = -np.Inf
        dist[index_b[k], k] = -np.Inf
        dist[index_c[k], k] = -np.Inf
    # indices of the top_k most similar vectors to the predicted ones
    return np.argsort(-dist, axis=0)[:top_k].T
Input:
- a, b, c : vectors, or matrices of vectors to speed up the computation
- index_a, index_b and index_c : the indices of a, b and c in data
- data : the matrix of all the vectors
- top_k : the number of vectors to take into account to check if the predicted vector is close to one of these actual k vectors
Output: the indices in data of the top_k vectors closest to the predicted one
Datasets used as gold standard
The original datasets are available here. All the words have been substituted with DBpedia entities and the obtained dataset is available in the Git repository.
Procedure
for each file_as_gold_standard:
    right_answers = 0
    for each (word1, word2, word3, word4):
        predicted_vecs = analogy_function(vec(word1), vec(word2), vec(word3))
        for each pred_vector in predicted_vecs:
            if the right answer (vec(word4)) is equal to pred_vector:
                right_answers++
    results are collected
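An illustrative sketch of this loop, assuming quadruplets holds index quadruplets (ind_a, ind_b, ind_c, ind_d) into the data matrix of vectors and reusing the default analogy function shown above:

right_answers, tot_answers = 0, 0
for ind_a, ind_b, ind_c, ind_d in quadruplets:
    a, b, c = data[[ind_a]], data[[ind_b]], data[[ind_c]]
    predicted = default_analogy_function(a, b, c, [ind_a], [ind_b], [ind_c], data, top_k)
    # the answer is counted as right if the expected entity is among the top_k candidates
    if ind_d in predicted[0]:
        right_answers += 1
    tot_answers += 1

accuracy = right_answers / float(tot_answers)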
Output metric
Semantic accuracy (correct_answer/total_answers) is computed.
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Semantic Analogies
- top_k : the actual value of top_k
- right_answers : the actual value
- tot_answers : the actual value
- accuracy : correct_answer/total_answers
Reference
The datasets used, the procedure and the output metric are based on the word2vec evaluation
Summary
Task | Classification | Regression | Clustering | Semantic Analogies | Document Similarity | Entity Relatedness |
---|---|---|---|---|---|---|
Input | - | - | distance function | analogy function | distance function | distance function |
Gold standard | Cities, Metacritic movies, Metacritic albums, AAUP, Forbes | Cities, Metacritic movies, Metacritic albums, AAUP, Forbes | 1. Cities, Metacritic movies, Metacritic albums, AAUP, Forbes 2. Cities and Countries 3. Football and Basketball teams | datasets provided by word2vec | LP50 | KORE |
Methods and configurations | GaussianNB, K-NN (k=3), SVM with C in {10^-3, 10^-2, 0.1, 1, 10, 10^2, 10^3}, C45 | LR, M5, K-NN (k=3) | DB, k-means, agglomerative clustering, ward hierarchical clustering, spectral clustering | - | - | - |
Evaluation metric | accuracy | root mean squared error | adjusted rand index, adjusted mutual info score, fowlkes mallows score, homogeneity score, completeness score, v measure score | accuracy | Pearson score, Spearman score and their harmonic mean | Kendall tau correlation |
Structure of the project
![Structure of the framework](./img/structureFramework.png)
main.py instantiates the distance function used to measure how distant two vectors are and the analogy function used in the Semantic Analogies task. It manages the parameters and instantiates the evaluator_manager.
The evaluator_manager.py reads the vectors file, runs all the tasks sequentially or in parallel, and creates the output directory, calling it results_
Each task is in a separate folder, and each of them is constituted by: a manager that supervises the work and organizes the output, a data_manager that reads the files used as gold standard and merges them with the actual vectors, and a model that computes the task and provides the output to the manager.