Graph Embedding

Embedding algorithms are rooted in the mathematical concept of embedding, the goal is to abstract (aka embedding) data in high-dimensional space to low-dimension space. Graph embedding is to extract high-dimensional data and morph it into low-dimensional data, the work is usually done against and to represent a node or a struct. From the name of graph embedding algorithms, you can tell what they are trying to achieve, for instance, node2vec is to abstract node to vector, literally, and struc2vec is to abstract a struct or high-dimension structure into a vector.

There are multiple graph embedding algorithms supported, such as:

  • Random Walk
  • Node2Vec
  • Struc2Vec
  • Line

Among these algorithms the Randon Walk is starting from a random node in the graph, by following certain rule(s), randomly walk on the graph and eventually forming a set of paths, which can be used as a training sample for embedding. Ultipa offers multiple random walking mechanism:

  • Random Walk
  • node2vec random walk
  • path-template based random walk
  • struc2vec random walk

algo(random_walk)

  Basic  

Configuration items for Random Walk operation:

Item Data Type Specification Description
<walk_num> int >0 Number of walks
<walk_length> int >0 Length per walk
<edge_property_name> string edge property (Optional) The edge property to Random-walk along
<limit> int >0; -1 The maximum number of results to return; -1: return all the results

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port

algo(random_walk_node2vec)

  AI  

Configuration items for Node2Vec-Random Walk operation:

Item Data Type Specification Description
<buffer_size> int >0 The number of walks to complete before start training
<walk_num> int >0 Number of walks
<walk_length> int >0 Length per walk
<p> float >0 The bigger the value, the lower the probability to walk backward
<q> float >0 The probability to walk away (farther or deeper)
>1:same-depth walk
<1:deeper
<edge_property_name> string edge property (Optional) The edge property to Random-walk along

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port

algo(random_walk_struc2vec)

  AI  

Configuration items for Struc2Vec-Random Walk operation:

Item Data Type Specification Description
<walk_num> int >0 Number of walks
<walk_length> int >0 Length per walk
<k> int >0 The number of layer (depth). Use a number smaller or equal to graph-wide avg. shortest path amongst all nodes.Bigger k means much longer time to process! Start from 1 and grow incrementally
<stay_probability> float (0, 1) The probability to stay at the current layer (depth)

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port

algo(node2vec)

  AI     Very Expensive  

Node2Vec is a graph embedding method that combines DFS neighborhood and BFS neighborhood - by leveraging both DFS and BFS random walk to generate training samples, and using skip-gram and SDG’s word2vec training model to assign embedding vector to vertices in the graph.

Node2Vec relies on Node2Vec random walk algorithm, more specifically, node2vec random walk is considered the 1st step of execution during the overall node2vec execution process. And this is true to Struc2Vec vs. Struc2Vec Random Walk as well.

Configuration items for Node2Vec operation:

Item Data Type Specification Description
<walk_num> int >0 Number of walks
<walk_length> int >0 Length per walk
<p> float >0 The bigger the value, the lower the probability to walk backward
<q> float >0 The probability to walk away (farther or deeper)
>1:same-depth walk
<1:deeper
<window_size> int >0 The window size
<dimension> int >0 example: 100
<learning_rate> double (0, 1) Recommend: 0.025
<min_learning_rate> double (0, 1) example: learning_rate * 0.0001
<sub_sample_alpha> double / -1 as default,can be 0.001 in word2vec, the equation is (sqrt(x / 0.001) + 1) * (0.001 / x)
<resolution> int / Recommend: 10 or 100
<neg_num> int (0, 10) The number of words in negative sampling phrase
<loop_num> int (0, 100) The loop number
<min_frequency> int >=0 To remove nodes that occur fewer than 'min_frequency'
<edge_property_name> string edge property (numeric type) The edge property to Random-walk along
<limit> int >0; -1 The maximum number of results to return; -1: return all the results

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port

algo(struc2vec)

  AI     Very Expensive  

Struc2Vec algorithm uses a novel and flexible framework for learning latent representations for the structural identity of nodes. It uses a hierarchy to measure node similarity at different scales, and constructs a multilayer graph to encode structural similarities and generate a structural context for nodes.

Configuration items for Struc2Vec operation:

Item Data Type Specification Description
<walk_num> int >0 Number of walks
<walk_length> int >0 Length per walk
<k> int >0 The number of layer(depth). Use a number smaller or equal to graph-wide avg. shortest path amongst all nodes.Bigger k means much longer time to process! Start from 1 and grow incrementally
<stay_probability> float (0, 1) The probability to stay at the current layer(depth)
<window_size> int > 0 window size
<dimension> int > 0 example: 100
<learning_rate> double (0, 1) Recommend: 0.025
<min_learning_rate> double (0, 1) example: learning * rate * 0.0001
<sub_sample_alpha> double / -1 as default,can be 0.001 in word2vec, the equation is (sqrt(x / 0.001) + 1) _ (0.001 / x)
<resolution> int Recommend: 10 or 100
<neg_num> int >0 number of words in negative sampling phrase
<loop_num> int >0 loop number
<min_frequency> int >= 0 To remove nodes that occur fewer than 'min_frequency'

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port

algo(line)

  AI  

LINE is also a graph-embedding training model, it leverages BFS information (1-hop or 2-hop similarity) to construct training samples, and uses a training method that's similar to Node2Vec's to generate node's embedding vector.

Configuration items for LINE operation:

Item Data Type Specification Description
<edge_property_name> string edge property edge property name
<dimension> int >0 dimension of the embedded vector, 10~n10 (e.g : 100)
<resolution> int >0 Recommend: 10 or 100
<start_alpha> double > 0 0.025 recommended
<neg_num> int >0 number of words in negative sampling phrase
<total_sample> int (0,10) number of samples
<train_order> int 1; 2 1: first order proximity
2: second order proximity

Validity of write_back():

Approach Desitation
To database /
To disk file RPC port