Node Attribute Clustering


This is the data and code necessary to reproduce the main results in the paper “Unsupervised Learning with Network-Aware Embeddings.”

If you just want to use the library for your own clustering tasks, you only need the network_clustering.py file. Once it is in your path, this is the simplest working example:

import network_clustering as nc
tensor = nc._make_tensor(G, O, labels)
distance_matrix = nc.compute_distances(tensor, "nvd+emb")
clusters = nc.cluster(distance_matrix)

This assumes that G is a valid undirected networkx Graph object, O is a pandas dataframe with one row per observation and one column per node in G, and labels is a list of ground truth labels, one for each row of O (if you don’t have a ground truth, pass a vector of constants). The code will store in distance_matrix the distances between your observations, calculated using t-SNE embeddings in a GE space, and will then run DBSCAN on them.
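For example, a minimal and purely illustrative way to build inputs with the expected shapes (toy graph and made-up values, not real data) could look like this:

import networkx as nx
import pandas as pd

# A small undirected graph with four nodes
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0)])

# One row per observation, one column per node of G
O = pd.DataFrame(
    [[0.1, 0.9, 0.2, 0.8],
     [0.7, 0.1, 0.6, 0.2],
     [0.2, 0.8, 0.1, 0.9]],
    columns=list(G.nodes()),
)

# One ground-truth label per row of O (use a constant vector if you have none)
labels = [0, 1, 0]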

Note that you might want to use a different method to calculate the distance; check the code to see the valid options. You might also want to fiddle with the DBSCAN parameters (passed as optional arguments to the nc.cluster function). Finally, this is really slow: for a fast version you’ll need to wait for a Julia version of the library (or help me with the Python one).
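As a sketch, assuming nc.cluster forwards keyword arguments to scikit-learn’s DBSCAN (the parameter names below are an assumption; check the actual signature in network_clustering.py), tuning could look like:

# Hypothetical keyword arguments; verify against the nc.cluster signature
clusters = nc.cluster(distance_matrix, eps=0.5, min_samples=5)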

To replicate the results of the paper, you will need the full content of the archive.

The “data” folder contains the data for Figure 4. There are three datasets: tradeatlas, littlesis, and tivoli. For each dataset you have two files.

  • “*_network.csv” files contain the graph structure in edgelist format: the first column is the first node in the edge and the second column is the second node. The networks are undirected. Only tivoli has a third column with the weight of the edge.
  • “*_nodeattributes.csv” files contain the node attributes. These are one value per node for the first n columns (with n being the number of nodes in the network). The last column, called “label”, represents the ground truth, i.e. the objective of the prediction (see the loading sketch below).
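As a rough sketch, assuming the CSVs are comma-separated with a header row (check the actual files before relying on this), they could be loaded into the G, O, and labels objects expected by the library like this:

import networkx as nx
import pandas as pd

# Edge list: first two columns are the endpoints (tivoli also has a weight column)
edges = pd.read_csv("data/littlesis_network.csv")
G = nx.from_pandas_edgelist(edges, source=edges.columns[0], target=edges.columns[1])

# Node attributes: one column per node, plus a final "label" column with the ground truth
attrs = pd.read_csv("data/littlesis_nodeattributes.csv")
labels = attrs["label"].tolist()
O = attrs.drop(columns=["label"])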

The four scripts are the following:

  • “network_clustering.py”: this is the library developed for the paper and should not be run directly.
  • “01_fig_3_tab_a1.py”: this reproduces the results on synthetic networks, for Figure 3 and Table A1 in the main paper. Runs without parameters, with Python.
  • “02_fig_4.py”: this reproduces the results on real world networks, for Figure 4 in the main paper. Runs without parameters, with Python.
  • “03_fig_5.jl”: this reproduces the runtime scalability results, for Figure 5 in the main paper. Runs without parameters, with Julia.

These are the dependencies you need to have to successfully run all experiments:

Python: numpy, scipy, pandas, networkx, scikit-learn, pytorch, pytorch geometric.
Julia: Laplacians, LinearAlgebra, Graphs, SimpleWeightedGraphs.
