Graph Embedding Evaluation Framework
(most of this page was contributed by Maria Angela Pellegrino and Martina Garofalo)
Overview
Once you have defined a new RDF embedding technique, or you want to compare two existing methods, you should define a set of tests to run and analyze the results.
This framework simplifies that work: it provides machine learning and semantic tasks to evaluate your vectors.
The implemented tasks are:
- Machine Learning
  - Classification
  - Regression
  - Clustering
- Semantic tasks
  - Entity Relatedness
  - Document similarity
  - Semantic analogies
For each task, the parameters, the procedure, the files used as gold standard and the metrics used to evaluate the output are presented.
Embedding framework
Link to the project : the project is available on GitHub
To run it :
Libraries :
numpy==1.14.0
pandas==0.22.0
scipy==1.1.0
scikit-learn==0.19.2
Python version : Python 2.7.3
Parameters:
--vectors_file, path of the file where your vectors are stored (mandatory). File format: one line per entity, containing the entity and its vector
--vectors_size, default=200, length of each vector
--top_k, default=2, used in SemanticAnalogies: the predicted vector is compared with the top k closest vectors to establish whether the prediction is correct or not
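For example, a run with the default settings might look like the following (assuming the entry point is main.py, as described in "Structure of the project" below):

python main.py --vectors_file path/to/vectors.txt --vectors_size 200 --top_k 2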
To customize the distance metric and the analogy function :
You have to redefine your own main.
You can use any of the distance metrics accepted by scipy.spatial.distance.cdist.
Your analogy function has to take:
- 3 vectors (or matrices of vectors) used to forecast the fourth vector,
- the index (or indices) of these vectors in the data matrix,
- the data matrix that contains all the vectors,
- the top_k, i.e., the number of vectors you want to use to check if the predicted vector is close to one in your dataset,
and it must return the indices of the top_k vectors closest to the predicted one. A sketch of such a function is given below.
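As an illustration only (not part of the framework), a custom analogy function with this signature could rank candidates by cosine distance via scipy instead of the default dot product; the names mirror the default function shown in the Semantic Analogies section, and a, b, c are assumed to be matrices of vectors:

import numpy as np
from scipy.spatial.distance import cdist

def cosine_analogy_function(a, b, c, index_a, index_b, index_c, data, top_k):
    # predicted fourth vectors, one per analogy in the batch
    pred_vec = np.array(b) - np.array(a) + np.array(c)
    # cosine distance between every vector in data and every predicted vector
    dist = cdist(data, pred_vec, metric='cosine')
    for k in range(len(a)):
        # exclude the three query entities from the candidates
        dist[index_a[k], k] = np.inf
        dist[index_b[k], k] = np.inf
        dist[index_c[k], k] = np.inf
    # indices of the top_k vectors with the smallest cosine distance
    return np.argsort(dist, axis=0)[:top_k].T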
Classification
Parameters
-
Datasets used as gold standard
Useful information related to the datasets is listed here.
The datasets can be downloaded following this link.
The datasets are explained in the paper [Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference. Springer (2016)].
The ones used are:
- Cities
- Metacritic movies
- Metacritic albums
- Forbes
- AAUP (only salary information)
and are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    data = random sampling of data (using seeds from 1 to 10 for reproducibility)
    for each model (NB, K-NN with k=3, C45 and SVM with C in {10^-3, 10^-2, 0.1, 1.0, 10.0, 10^2, 10^3}):
        the model is created
        the model is trained using cv = 10
        results are collected
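A minimal scikit-learn sketch of this loop, assuming data is the matrix of vectors for the gold-standard entities and labels their classes (C4.5 is approximated by a decision tree, since scikit-learn has no C4.5 implementation):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# models and configurations listed above
models = [('NB', GaussianNB()),
          ('K-NN', KNeighborsClassifier(n_neighbors=3)),
          ('C45', DecisionTreeClassifier())]
models += [('SVM C=%g' % c, SVC(C=c)) for c in [1e-3, 1e-2, 0.1, 1.0, 10.0, 1e2, 1e3]]

for name, model in models:
    # 10-fold cross-validation with accuracy as score
    scores = cross_val_score(model, data, labels, cv=10, scoring='accuracy')
    print('%s accuracy: %.4f' % (name, np.mean(scores)))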
Output metric
Accuracy is computed
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Classification
- model_name : [NB, K-NN, C45, SVM]
- model_configuration : null for NB and C45, k=3 for K-NN and current C value for SVM
- score_type : accuracy
- score_value : the actual score
Reference
The datasets used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Regression
Parameters
-
Datasets used as gold standard
Useful information related to the datasets is listed here.
The datasets can be downloaded following this link.
The datasets are explained in the paper [Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference. Springer (2016)].
The ones used are:
- Cities
- Metacritic movies
- Metacritic albums
- Forbes
- AAUP (only salary information)
and are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    data = random sampling of data (using seeds from 1 to 10 for reproducibility)
    for each model (LR, K-NN with k=3 and M5):
        the model is created
        the model is trained using cv = 10
        results are collected
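A minimal scikit-learn sketch of this loop, assuming data is the matrix of vectors for the gold-standard entities and targets the numeric values to predict; M5 is not available in scikit-learn, so a decision-tree regressor stands in for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

models = [('LR', LinearRegression()),
          ('K-NN', KNeighborsRegressor(n_neighbors=3)),
          ('M5-like tree', DecisionTreeRegressor())]

for name, model in models:
    # 10-fold cross-validation; the negated MSE returned by scikit-learn is
    # converted back into a root mean squared error
    scores = cross_val_score(model, data, targets, cv=10, scoring='neg_mean_squared_error')
    print('%s RMSE: %.4f' % (name, np.mean(np.sqrt(-scores))))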
Output metric
Root mean squared error is computed
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Regression
- model_name : [LR, K-NN, M5]
- model_configuration : null for LR and M5, k=3 for K-NN
- score_type : root mean squared error
- score_value : the actual score
Reference
The datasets used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Clustering
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
- Same entities contained in the datasets Cities, Metacritic movies, Metacritic albums, AAUP and Forbes, split into 5 clusters
- Cities and Countries retrieved, respectively, from the DBpedia SPARQL endpoint through the queries:
prefix dbo:<http://dbpedia.org/ontology/>
SELECT DISTINCT ?c{
?c a dbo:City
}
and
prefix dbo:<http://dbpedia.org/ontology/>
SELECT DISTINCT ?c{
?c a dbo:PopulatedPlace
}
- Cities and Countries as before, but balancing the 2 clusters (retrieving only 2000 Cities)
- Football and Basketball teams retrieved, respectively, from the DBpedia SPARQL endpoint through the queries:
select distinct ?c where {
?c a <http://dbpedia.org/ontology/SportsTeam>.
BIND(STR(?c) AS ?string )
filter(contains(?string, 'football_team'))
}
and
select distinct ?c where {
?c a <http://dbpedia.org/ontology/SportsTeam>.
BIND(STR(?c) AS ?string )
filter(contains(?string, 'basketball_team'))
}
The datasets are available in the Git repository.
Procedure
for each file_as_gold_standard:
    data = intersection between the entities in vectors and in file_as_gold_standard
    ignored = entities in file_as_gold_standard and not in vectors
    for each model (DB, KMeans, AgglomerativeClustering, WardHierarchicalClustering and SpectralClustering):
        the model is created
        the model is fit
        for each ignored entity:
            a new (unused) cluster is created
        results are collected
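A minimal sketch of one evaluation round, assuming data holds the vectors of the gold-standard entities and gold_clusters their expected cluster labels (KMeans is shown; the other models follow the same fit/score pattern):

from sklearn.cluster import KMeans
from sklearn import metrics

# fit one of the clustering models on the entity vectors
model = KMeans(n_clusters=5)
predicted = model.fit_predict(data)

# compare the predicted clusters with the gold-standard ones
ari = metrics.adjusted_rand_score(gold_clusters, predicted)
ami = metrics.adjusted_mutual_info_score(gold_clusters, predicted)
fms = metrics.fowlkes_mallows_score(gold_clusters, predicted)
homogeneity, completeness, v_measure = metrics.homogeneity_completeness_v_measure(gold_clusters, predicted)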
Output metric
Adjusted_rand_index, adjusted_mutual_info_score, fowlkes_mallows_score, homogeneity_score, completeness_score and v_measure_score are computed. To read more details about the metrics you can follow this link.
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Clustering
- model_name : [DB, KMeans, AgglomerativeClustering, WardHierarchicalClustering, SpectralClustering]
- model_configuration : used metric
- adjusted_rand_index : the actual score
- adjusted_mutual_info_score : the actual score
- fowlkes_mallows_score : the actual score
- homogeneity_score : the actual score
- completeness_score : the actual score
- v_measure_score : the actual score
Entity relatedness
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
The dataset used is KORE, which contains 21 entities (actors, companies, TV series and video games) and, for each of them, 20 ranked related entities.
The original dataset contains words; the dataset used here contains the DBpedia entities linked to those words. It is available in the Git repository.
Procedure
data = intersection between the entities in vectors and in KORE
ignored = entities in KORE and not in vectors
for each group in data (a group is one of the 21 entities together with the 20 ranked related entities attached to it):
    distances : the distance score is computed for all the pairs
    sorted_distances : the distances are sorted from the closest entity to the farthest one
    for each ignored entity:
        the distance between the current entity and the ignored one is set to the maximum
    the actual ranking and the one in KORE are compared
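A minimal sketch of the comparison for one group, under the assumption that entity_vec is the vector of the current KORE entity and related_vecs holds the vectors of its 20 related entities in gold-standard order:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import kendalltau

# cosine distance between the current entity and each of its related entities
distances = cdist([entity_vec], related_vecs, metric='cosine')[0]

# ranking induced by the embedding (1 = closest) vs. the gold-standard ranking 1..20
predicted_rank = distances.argsort().argsort() + 1
gold_rank = np.arange(1, len(related_vecs) + 1)

correlation, p_value = kendalltau(gold_rank, predicted_rank)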
Output metric
Kendall tau correlation is computed between the actual ranking and the one in KORE
The output file is a CSV file with header:
- task_name : Entity relatedness
- entity_name : the current entity
- kendalltau_correlation : the actual score
- kendalltau_pvalue : the actual p-value
Reference
The dataset used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Document similarity
Parameters
Distance metric used to compute the distance score among vectors. Default: cosine
Datasets used as gold standard
The dataset used is LP50. The original dataset can be downloaded here. The zip file contains the 50 documents used in the evaluation and the statistics computed manually by university students, who evaluated the similarity among documents by assigning to each pair a score in the range [1,5], where 5 means maximum similarity. The dataset is explained here.
The first step of the procedure is the extraction of the entities from the documents. The sets of entities used in this framework are the output of the annotator xLisa, as presented in this paper, and the output of the tool is the following.
The statistics are scanned and, for each pair of documents, the mean of the ratings given by the students is computed and then used as gold standard in the evaluation. The output of this elaboration is available in the Git repository.
Procedure
for each pair of documents (doc1, doc2):
    set1 = set of entities in doc1 and in vectors
    set2 = set of entities in doc2 and in vectors
    for each entity1 in set1:
        for each entity2 in set2:
            distance_score = distance between (entity1, entity2)
        sorted_distance_scores = sort of the distance scores between entity1 and all the entities in set2
        min_distance_1 = the minimum is picked and stored
    for each entity2 in set2:
        for each entity1 in set1:
            distance_score = distance between (entity2, entity1)
        sorted_distance_scores = sort of the distance scores between entity2 and all the entities in set1
        min_distance_2 = the minimum is picked and stored
    document_similarity = (min_distance_1 + min_distance_2) / (|set1| + |set2|)
    results are collected
For each annotation in the JSON file, a weight is also provided. The same procedure is repeated also considering the weights in the entity distance computation: min_distance = min_distance / (weight1 + weight2)
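One possible reading of the pseudocode above, as an illustrative sketch only: it assumes that set1 and set2 are matrices holding the vectors of the entities annotated in doc1 and doc2, and that min_distance_1 and min_distance_2 accumulate the per-entity minima:

from scipy.spatial.distance import cdist

# pairwise distances between all entities of the two documents
dist = cdist(set1, set2, metric='cosine')

# for each entity, keep only the distance to its closest entity in the other document
min_distance_1 = dist.min(axis=1).sum()
min_distance_2 = dist.min(axis=0).sum()

document_similarity = (min_distance_1 + min_distance_2) / (len(set1) + len(set2))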
Output metric
Pearson and Spearman correlation and their harmonic mean are used to evaluate the obtained distance scores against the gold-standard ones.
The output file is a CSV file with header:
- task_name : Document similarity
- conf : with or without weights
- pearson_score : the actual score
- spearman_score : the actual score
- harmonic_mean : harmonic mean between pearson and spearman scores
Reference
The dataset used, the models, their configurations and the output metric are based on the RDF2vec evaluation
Semantic Analogies
The test is based on quadruplets (word1, word2, word3, word4), and the idea is to check whether, working on the first three, it is possible to predict the fourth word with a low error.
A practical example is:
X = vector("queen") − vector("woman") +vector("man")
and check if X is near to “king”.
If the vector manipulation is able to predict the right vector, the embedding technique that produced these vectors is able to preserve the semantics of the embedded entities in the vectors.
Both syntactic and semantic analogies could be considered.
In the framework we consider only semantic analogies.
The datasets used are:
- capital and country
- currency
- city and state
Parameters
Analogy function to predict the fourth vector.
Default :
import numpy as np

def default_analogy_function(a, b, c, index_a, index_b, index_c, data, top_k):
    # predicted fourth vectors, one per analogy in the batch
    pred_vec = np.array(b) - np.array(a) + np.array(c)
    # dot-product similarity between every vector in data and every predicted vector
    dist = np.dot(data, pred_vec.T)
    for k in range(len(a)):
        # exclude the three query entities from the candidates
        dist[index_a[k], k] = -np.Inf
        dist[index_b[k], k] = -np.Inf
        dist[index_c[k], k] = -np.Inf
    # indices of the top_k most similar vectors to the predicted ones
    return np.argsort(-dist, axis=0)[:top_k].T
Input:
- a, b, c : vectors, or matrices of vectors to speed up the computation
- index_a, index_b and index_c : the indices of a, b and c in data
- data : the matrix of all the vectors
- top_k : the number of vectors to take into account to check if the predicted vector is close to one of these actual k vectors
Output: the indices in data of the top_k vectors closest to the predicted one
Datasets used as gold standard
The original datasets are available here. All the words have been substituted with DBpedia entities and the obtained dataset is available in the Git repository.
Procedure
for each file_as_gold_standard:
    right_answers = 0
    for each (word1, word2, word3, word4):
        predicted_vecs = analogy_function(vec(word1), vec(word2), vec(word3))
        for each pred_vector in predicted_vecs:
            if the right answer (vec(word4)) is equal to pred_vector:
                right_answers++
    results are collected
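An illustrative sketch of this loop, assuming quadruplets holds index quadruplets (ind_a, ind_b, ind_c, ind_d) into the data matrix of vectors and reusing the default analogy function shown above:

right_answers, tot_answers = 0, 0
for ind_a, ind_b, ind_c, ind_d in quadruplets:
    a, b, c = data[[ind_a]], data[[ind_b]], data[[ind_c]]
    predicted = default_analogy_function(a, b, c, [ind_a], [ind_b], [ind_c], data, top_k)
    # the answer is counted as right if the expected entity is among the top_k candidates
    if ind_d in predicted[0]:
        right_answers += 1
    tot_answers += 1

accuracy = right_answers / float(tot_answers)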
Output metric
Semantic accuracy (correct_answer/total_answers) is computed.
The output file is a CSV file for each file_as_gold_standard with header:
- task_name : Semantic Analogies
- top_k : the actual value of top_k
- right_answers : the actual value
- tot_answers : the actual value
- accuracy : correct_answer/total_answers
Reference
The datasets used, the procedure and the output metric are based on the word2vec evaluation
Summary
Task | Classification | Regression | Clustering | Semantic Analogies | Document Similarity | Entity Relatedness |
---|---|---|---|---|---|---|
Input | - | - | distance function | analogy function | distance function | distance function |
Gold standard | Cities, Metacritic movies, Metacritic albums, AAUP, Forbes | Cities, Metacritic movies, Metacritic albums, AAUP, Forbes | 1. Cities, Metacritic movies, Metacritic albums, AAUP, Forbes 2. Cities and Countries 3. Football and Basketball teams | datasets provided by word2vec | LP50 | KORE |
Methods and configurations | GaussianNB, K-NN (k=3), SVM with C in {10^-3, 10^-2, 0.1, 1, 10, 10^2, 10^3}, C45 | LR, M5, K-NN (k=3) | DB, k-means, agglomerative clustering, ward hierarchical clustering, spectral clustering | - | - | - |
Evaluation metric | accuracy | root mean squared error | adjusted rand index, adjusted mutual info score, fowlkes mallows score, homogeneity score, completeness score, v measure score | accuracy | Pearson score, Spearman score and their harmonic mean | Kendall tau correlation |
Structure of the project
![Structure of the framework](./img/structureFramework.png)
main.py instantiates the distance function used to measure how distant two vectors are and the analogy function used in the Semantic Analogies task. It manages the parameters and instantiates the evaluator_manager.
The evaluator_manager.py reads the vectors file, runs all the tasks sequentially or in parallel, and creates the output directory, calling it results_
Each task is in a separate folder, and each of them is constituted by: a manager that supervises the work and organizes the output, a data_manager that reads the files used as gold standard and merges them with the actual vectors, and a model that computes the task and provides the output to the manager.