Graph Embeddings

Model by Modzy

This model can be used to explore possible relationships between entities, such as finding people who share similar interests or finding biological interactions between pairs of proteins. Graphs are particularly useful for describing relational entities, and graph embedding is an approach used to transform a graph’s structure into a format digestible by an AI model, whilst preserving the graph’s properties. Graph structures can vary widely in scale, specificity, and subject, which makes graph embedding a difficult task.

  • Description

    Product Description

    PERFORMANCE METRICS

    96.8% Area Under Curve (AUC) – The AUC can be interpreted as the probability that the model ranks a random positive example higher than a random negative example.

    25.8% Macro-F1 Score – The Macro-F1 score is the arithmetic mean of the F1 scores for each label. It is a primary metric used in multilabel classification.

    This model achieved a Macro-F1 score of 0.2581 on the BlogCatalog dataset and an Area Under Curve (AUC) score of 0.9680 on the Facebook dataset.
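The two metric definitions above can be sketched in a few lines of plain Python. This is a minimal illustration of the definitions only, not the evaluation code used for the reported scores; the function names `auc_by_ranking` and `macro_f1` and the convention of counting tied scores as half a win are assumptions made for the example.

```python
from itertools import product


def auc_by_ranking(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties are counted as half a correct ranking (an assumption of this sketch).
    """
    wins = 0.0
    for sp, sn in product(scores_pos, scores_neg):
        if sp > sn:
            wins += 1.0
        elif sp == sn:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def macro_f1(y_true, y_pred, labels):
    """Arithmetic mean of the per-label F1 scores.

    y_true and y_pred are lists of label sets, one set per example.
    """
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # A label with no true or predicted occurrences contributes 0.0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because Macro-F1 weights every label equally, rare labels with poor F1 pull the average down as much as common ones, which is one reason Macro-F1 values on many-label datasets tend to be much lower than AUC values.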

    OVERVIEW:

    This model generates graph embeddings by first building a NetworkX graph from the edge data in the input text file. It then generates random walks from each node of the graph, governed by the parameters of the node2vec algorithm. These parameters set the size of the embedding space, the number of walks starting at each node, the length of each walk, and the probabilities that a walk will return to a previously visited node or move on to a new node. With the values used by this model, a random walk is approximately four times more likely to visit a new node than a previously visited one, i.e. the walks behave more like a depth-first search than a breadth-first search. A shallow neural network, the Skip-Gram network, is trained on the walks to generate the node embeddings. The embedding of an edge is the Hadamard (element-wise) product of the embeddings of its two endpoint nodes. The final output is a 128-dimensional embedding of each node and a 128-dimensional embedding of each edge and non-edge between the graph’s nodes.
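The walk generation and edge-embedding steps described above can be sketched as follows. This is a simplified, standard-library illustration of a node2vec-style second-order walk on an unweighted graph, not this model's implementation. The names `biased_walk` and `hadamard`, the adjacency-dict representation, and the values `p=4.0`, `q=1.0` in the usage note are all assumptions; those values are chosen only so that each onward step carries four times the weight of a step back to the previous node, and the model's actual settings are not stated here.

```python
import random


def biased_walk(adj, start, length, p, q, rng):
    """One node2vec-style walk over adj, a dict mapping node -> set of neighbours.

    p is the return parameter and q the in-out parameter of node2vec.
    Edge weights are ignored in this sketch (unweighted graph assumed).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay within one hop of the previous node
            else:
                weights.append(1.0 / q)  # move further away from the previous node
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def hadamard(u, v):
    """Element-wise product of two node embeddings, used as the edge embedding."""
    return [a * b for a, b in zip(u, v)]
```

For example, `biased_walk({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}, 1, 10, 4.0, 1.0, random.Random(0))` returns a list of 10 nodes in which every consecutive pair is an edge of the graph. Since the Hadamard product is symmetric, the edge (u, v) and the edge (v, u) receive the same embedding.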

    TRAINING:

    This model trains a new Skip-Gram network on the random walks generated for each input graph. The Skip-Gram network is a neural network with input and output layers of size N and one hidden layer of size D, where N is the number of nodes and D is the number of dimensions of the embedding space, so the input-to-hidden weight matrix has shape N×D. Given an input node, the network predicts the probability that each node appears within a context window around it in the random walks. The network is trained for 5 epochs using stochastic gradient descent. After training, the output layer is discarded, and the rows of the N×D hidden-layer weight matrix are used as the embeddings of the graph’s nodes.
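To make the training setup more concrete, the sketch below shows two pieces of it in plain Python: how (input node, context node) training pairs can be drawn from a walk, and why the rows of the N×D weight matrix act as node embeddings. It is an illustration under stated assumptions rather than the model's code: the function names `skipgram_pairs` and `hidden_activation` are hypothetical, and the window size is a free parameter whose value for this model is not given.

```python
def skipgram_pairs(walk, window):
    """All (input node, context node) pairs within `window` steps in one walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


def hidden_activation(one_hot, weights):
    """Multiply a one-hot input of size N by an N x D weight matrix.

    Because exactly one input is 1, the result is simply the matching row
    of `weights`, which is why the rows serve as the node embeddings.
    """
    dims = len(weights[0])
    return [sum(x * weights[i][d] for i, x in enumerate(one_hot)) for d in range(dims)]
```

With a window of 1, the walk `["a", "b", "c"]` yields the pairs `("a", "b")`, `("b", "a")`, `("b", "c")`, and `("c", "b")`.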

    VALIDATION:

    This model was validated empirically on multiple networks by verifying that similar nodes and similar edges lie near each other in the embedding space.

    INPUT SPECIFICATION

    The input(s) to this model must adhere to the following specifications:

    Filename      Maximum Size    Accepted Format(s)
    edges.txt     10M             .txt

    The input file should contain the word “graph” or “digraph” in the first line, and each subsequent line should contain the names of the two nodes that are the endpoints of an edge, along with an optional edge weight, for example: graph 1 2 1 1 2 1 2 0 2 1 1 3 4
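A reader for this input format might look like the sketch below. It is not the model's parser: the function name `parse_edge_list` is hypothetical, and the sketch assumes whitespace-separated fields, a first line that is exactly "graph" or "digraph" (compared case-insensitively), blank lines that can be skipped, and a default weight of 1.0 when none is given.

```python
def parse_edge_list(text):
    """Parse edge-list text into (is_directed, [(u, v, weight), ...]).

    Node names are kept as strings. A missing weight defaults to 1.0,
    which is an assumption of this sketch.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    kind = lines[0].lower()
    if kind not in ("graph", "digraph"):
        raise ValueError("first line must be 'graph' or 'digraph'")
    edges = []
    for ln in lines[1:]:
        parts = ln.split()
        if len(parts) == 2:
            u, v = parts
            weight = 1.0
        elif len(parts) == 3:
            u, v = parts[0], parts[1]
            weight = float(parts[2])
        else:
            raise ValueError("expected 'u v' or 'u v weight', got: " + ln)
        edges.append((u, v, weight))
    return kind == "digraph", edges
```

For a made-up input such as `"graph\n1 2\n2 3 0.5\n"`, this returns an undirected flag together with the edges `("1", "2", 1.0)` and `("2", "3", 0.5)`.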

    OUTPUT DETAILS

    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1000G           .json

    The output file, “results.json”, will contain the node and edge embedding vectors. The top level keys are “Node Embeddings” and “Edge Embeddings”. The value corresponding to the “Node Embeddings” key is an object in which keys are node names and values are corresponding embedding vectors. The value corresponding to the “Edge Embeddings” key is an object in which keys are edges denoted by their corresponding node names, and values are corresponding embedding vectors.
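Once "results.json" is available, the node embeddings can be consumed as in the sketch below, which loads the two top-level objects and finds the node whose embedding is closest to a given node by cosine similarity. The helper names `load_embeddings`, `cosine`, and `most_similar` are hypothetical, and cosine similarity is just one reasonable choice of closeness measure. The exact string format of the edge keys is not specified above, so the sketch does not try to parse them.

```python
import json
import math


def load_embeddings(json_text):
    """Return (node_embeddings, edge_embeddings) from the results JSON text."""
    data = json.loads(json_text)
    return data["Node Embeddings"], data["Edge Embeddings"]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(node, node_embeddings):
    """Name of the other node whose embedding is closest to `node`'s."""
    target = node_embeddings[node]
    others = [name for name in node_embeddings if name != node]
    return max(others, key=lambda name: cosine(target, node_embeddings[name]))
```

In practice the JSON text would come from reading the output file, for example `load_embeddings(open("results.json").read())`, and each vector would have 128 entries.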