init_v0.0

This commit is contained in:
Chengbin Hou 2018-11-17 12:30:56 +00:00
parent 0634d25c6a
commit 857859f2c4
44 changed files with 14224 additions and 2 deletions

21
LICENSE Normal file

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2017 THUNLP
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

242
README.md

@ -1,2 +1,240 @@
# OpenANE
Attributed Network Embedding
# OpenNE: An open source toolkit for Network Embedding
This repository provides a standard NE/NRL (Network Representation Learning) training and testing framework. In this framework, we unify the input and output interfaces of different NE models and provide scalable options for each model. Moreover, we implement typical NE models under this framework based on TensorFlow, which enables these models to be trained with GPUs.
We develop this toolkit according to the settings of DeepWalk. The implemented or modified models include [DeepWalk](https://github.com/phanein/deepwalk), [LINE](https://github.com/tangjianpku/LINE), [node2vec](https://github.com/aditya-grover/node2vec), [GraRep](https://github.com/ShelsonCao/GraRep), [TADW](https://github.com/thunlp/TADW) and [GCN](https://github.com/tkipf/gcn). We will implement more representative NE models continuously according to our released [NRL paper list](https://github.com/thunlp/nrlpapers). Specifically, we welcome other researchers to contribute NE models into this toolkit based on our framework. We will announce the contribution in this project.
## Requirements
- numpy==1.13.1
- networkx==2.0
- scipy==0.19.1
- tensorflow==1.3.0
- gensim==3.0.1
- scikit-learn==0.19.0
## Usage
#### General Options
You can check out the other options available to use with *OpenNE* using:

    python src/main.py --help
- --input, the input file of a network;
- --graph-format, the format of input graph, adjlist or edgelist;
- --output, the output file of representation (GCN doesn't need it);
- --representation-size, the number of latent dimensions to learn for each node; the default is 128
- --method, the NE model to learn, including deepwalk, line, node2vec, grarep, tadw and gcn;
- --directed, treat the graph as directed; this is an action;
- --weighted, treat the graph as weighted; this is an action;
- --label-file, the file of node label; ignore this option if not testing;
- --clf-ratio, the ratio of training data for node classification; the default is 0.5;
- --epochs, the training epochs of LINE and GCN; the default is 5;
#### Example
To run "node2vec" on the BlogCatalog network and evaluate the learned representations on the multi-label node classification task, run the following command in the home directory of this project:

    python src/main.py --method node2vec --label-file data/blogCatalog/bc_labels.txt --input data/blogCatalog/bc_adjlist.txt --graph-format adjlist --output vec_all.txt --q 0.25 --p 0.25

To run "gcn" on the Cora network and evaluate the learned representations on the node classification task, run the following command in the home directory of this project:

    python src/main.py --method gcn --label-file data/cora/cora_labels.txt --input data/cora/cora_edgelist.txt --graph-format edgelist --feature-file data/cora/cora.features --epochs 200 --output vec_all.txt --clf-ratio 0.1
#### Specific Options
DeepWalk and node2vec:
- --number-walks, the number of random walks to start at each node; the default is 10;
- --walk-length, the length of random walk started at each node; the default is 80;
- --workers, the number of parallel processes; the default is 8;
- --window-size, the window size of skip-gram model; the default is 10;
- --q, only for node2vec; the default is 1.0;
- --p, only for node2vec; the default is 1.0;
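
To make `--number-walks` and `--walk-length` concrete, here is a minimal sketch of uniform (DeepWalk-style) walk generation. It is an illustration only, not the code in `src/`: the function name `simulate_walks` and the `adj` dictionary layout are assumptions, and the node2vec bias controlled by `--p` and `--q` is deliberately left out.

```python
import random


def simulate_walks(adj, num_walks=10, walk_length=80, seed=0):
    """Generate uniform random walks.

    adj: dict mapping each node to a list of its neighbors (hypothetical layout).
    num_walks: how many walks start at each node (cf. --number-walks).
    walk_length: maximum number of nodes in one walk (cf. --walk-length).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit the start nodes in a fresh order on every pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # isolated node: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


if __name__ == "__main__":
    toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
    for walk in simulate_walks(toy, num_walks=1, walk_length=5):
        print(" ".join(walk))
```

The resulting walks play the role of sentences for the skip-gram model, with `--window-size` setting its context window.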
LINE:
- --negative-ratio, the default is 5;
- --order, 1 for the 1st-order, 2 for the 2nd-order and 3 for 1st + 2nd; the default is 3;
- --no-auto-save, no early save when training LINE; this is an action; when training LINE, we will calculate F1 scores every epoch. If current F1 is the best F1, the embeddings will be saved.
GraRep:
- --kstep, use k-step transition probability matrix (make sure representation-size%k-step == 0).
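
As a rough illustration of what a k-step transition probability matrix is, the sketch below row-normalizes an adjacency matrix and raises it to successive powers. The helper names are hypothetical and plain Python lists are used instead of NumPy. The divisibility check reflects an assumption about why the constraint exists: GraRep concatenates one equally sized block per step, so each block would get `representation-size / kstep` dimensions.

```python
def row_normalize(matrix):
    """Turn each row of a non-negative matrix into a probability distribution."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def k_step_transitions(adjacency, kstep):
    """Return [P^1, P^2, ..., P^kstep], where P is the 1-step transition matrix."""
    p = row_normalize(adjacency)
    powers = [p]
    for _ in range(kstep - 1):
        powers.append(matmul(powers[-1], p))
    return powers


def per_step_dim(representation_size, kstep):
    """Dimensions available to each step (assumed equal split across steps)."""
    if representation_size % kstep != 0:
        raise ValueError("representation-size must be divisible by kstep")
    return representation_size // kstep


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 3-node path graph
    for step, p in enumerate(k_step_transitions(path, 2), start=1):
        print(step, p)
    print(per_step_dim(128, 4))
```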
TADW:
- --lamb, lamb is a hyperparameter in TADW that controls the weight of regularization terms.
GCN:
- --feature-file, The file of node features;
- --epochs, the training epochs of GCN; the default is 5;
- --dropout, dropout rate;
- --weight-decay, weight for l2-loss of embedding matrix;
- --hidden, number of units in the first hidden layer.
#### Input
The supported input format is an edgelist or an adjlist:

    edgelist: node1 node2 <weight_float, optional>
    adjlist: node n1 n2 n3 ... nk

The graph is assumed to be undirected and unweighted by default. These options can be changed by setting the appropriate flags.
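
The two formats can be parsed with a few lines of Python. This is a minimal sketch under stated assumptions, not the loader used by *OpenNE*: the function names are hypothetical, fields are assumed to be whitespace-separated, and the `directed` / `weighted` arguments only mimic the meaning of the `--directed` and `--weighted` flags.

```python
def parse_edgelist(lines, weighted=False):
    """Parse 'node1 node2 [weight]' lines into (src, dst, weight) triples."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        weight = float(parts[2]) if weighted and len(parts) > 2 else 1.0
        edges.append((parts[0], parts[1], weight))
    return edges


def parse_adjlist(lines):
    """Parse 'node n1 n2 ... nk' lines into (src, dst, weight) triples."""
    edges = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        for dst in parts[1:]:
            edges.append((parts[0], dst, 1.0))
    return edges


def build_adjacency(edges, directed=False):
    """Build a dict-of-dicts adjacency; undirected graphs store both directions."""
    adj = {}
    for src, dst, weight in edges:
        adj.setdefault(src, {})[dst] = weight
        adj.setdefault(dst, {})
        if not directed:
            adj[dst][src] = weight
    return adj


if __name__ == "__main__":
    edges = parse_edgelist(["1 2", "2 3 0.5"], weighted=True)
    print(build_adjacency(edges))
```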
If the model needs additional features, the supported feature input format is as follows (**feature_i** should be a float number):

    node feature_1 feature_2 ... feature_n
#### Output
The output file has *n+1* lines for a graph with *n* nodes.
The first line has the following format:

    num_of_nodes dim_of_representation

The next *n* lines are as follows:

    node_id dim1 dim2 ... dimd

where dim1, ... , dimd is the *d*-dimensional representation learned by *OpenNE*.
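
A reader for this output format might look like the sketch below. The function name `load_embeddings` is hypothetical, and the consistency checks against the header line are an addition for illustration rather than something *OpenNE* is known to perform.

```python
def load_embeddings(lines):
    """Parse the output format: a header line, then one 'node_id dim1 ... dimd' line per node."""
    iterator = iter(lines)
    header = next(iterator).split()
    num_nodes, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in iterator:
        parts = line.split()
        if not parts:
            continue  # tolerate trailing blank lines
        if len(parts) != dim + 1:
            raise ValueError("expected {} values for node {}".format(dim, parts[0]))
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != num_nodes:
        raise ValueError("header announces {} nodes, found {}".format(num_nodes, len(vectors)))
    return vectors


if __name__ == "__main__":
    sample = ["2 3", "a 0.1 0.2 0.3", "b 1 2 3"]
    print(load_embeddings(sample))
```

To read a file such as `vec_all.txt`, pass the open file object: `load_embeddings(open("vec_all.txt"))`.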
#### Evaluation
If you want to evaluate the learned node representations, provide the node labels with --label-file. A portion of the nodes (default: 50%, see --clf-ratio) is used to train a classifier, and the F1 scores are computed on the remaining nodes.
The supported input label format is

    node label1 label2 label3...
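
For reference, the Micro-F1 and Macro-F1 scores reported in the tables below can be computed from true and predicted label sets as in this sketch. It is a standalone illustration in plain Python, not the evaluation code of *OpenNE*; the function name is hypothetical, and labels that never occur are given an F1 of 0 by assumption.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-label predictions.

    y_true, y_pred: sequences of label collections, one entry per test node.
    """
    y_true = [set(t) for t in y_true]
    y_pred = [set(p) for p in y_pred]
    labels = set().union(*y_true, *y_pred)
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        for label in pred:
            if label in truth:
                tp[label] += 1
            else:
                fp[label] += 1
        for label in truth - pred:
            fn[label] += 1

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2.0 * true_pos / denom if denom else 0.0

    # Micro-F1 pools the counts over all labels; Macro-F1 averages per-label F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro


if __name__ == "__main__":
    truth = [{"a"}, {"b"}, {"a", "b"}]
    predicted = [{"a"}, {"a"}, {"b"}]
    print(f1_scores(truth, predicted))
```

Micro-F1 is dominated by frequent labels, while Macro-F1 weights every label equally, which is why the two columns can rank methods differently.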
## Comparisons with other implementations
Running environment: <br />
BlogCatalog: CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz. <br />
Wiki, Cora: CPU: Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz. <br />
We show the node classification results of various methods in different datasets. We set representation dimension to 128, **kstep=4** in GraRep.
Note that both GCN (a semi-supervised NE model) and TADW need additional text features as inputs. Thus, we evaluate these two models on Cora, in which each node has text information. We use 10% labeled data to train GCN.
[BlogCatalog](http://leitang.net/social_dimension.html): 10312 nodes, 333983 edges, 39 labels, undirected:
- data/blogCatalog/bc_adjlist.txt
- data/blogCatalog/bc_edgelist.txt
- data/blogCatalog/bc_labels.txt
|Algorithm | Time| Micro-F1 | Macro-F1|
|:------------|-------------:|------------:|-------:|
|[DeepWalk](https://github.com/phanein/deepwalk) | 271s | 0.385 | 0.238|
|[LINE 1st+2nd](https://github.com/tangjianpku/LINE) | 2008s | 0.398 | 0.235|
|[Node2vec](https://github.com/aditya-grover/node2vec) | 2623s | 0.404| 0.264|
|[GraRep](https://github.com/ShelsonCao/GraRep) | - | - | - |
|OpenNE(DeepWalk) | 986s | 0.394 | 0.249|
|OpenNE(LINE 1st+2nd) | 1555s | 0.390 | 0.253|
|OpenNE(node2vec) | 3501s | 0.405 | 0.275|
|OpenNE(GraRep) | 4178s | 0.393 | 0.230 |
[Wiki](https://github.com/thunlp/MMDW/tree/master/data) (Wiki dataset is provided by [LBC project](http://www.cs.umd.edu/~sen/lbc-proj/LBC.html). But the original link failed.): 2405 nodes, 17981 edges, 19 labels, directed:
- data/wiki/Wiki_edgelist.txt
- data/wiki/Wiki_category.txt
|Algorithm | Time| Micro-F1 | Macro-F1|
|:------------|-------------:|------------:|-------:|
|[DeepWalk](https://github.com/phanein/deepwalk) | 52s | 0.669 | 0.560|
|[LINE 2nd](https://github.com/tangjianpku/LINE) | 70s | 0.576 | 0.387|
|[node2vec](https://github.com/aditya-grover/node2vec) | 32s | 0.651 | 0.541|
|[GraRep](https://github.com/ShelsonCao/GraRep) | 19.6s | 0.633 | 0.476|
|OpenNE(DeepWalk) | 42s | 0.658 | 0.570|
|OpenNE(LINE 2nd) | 90s | 0.661 | 0.521|
|OpenNE(Node2vec) | 33s | 0.655 | 0.538|
|OpenNE(GraRep) | 23.7s | 0.649 | 0.507 |
[Cora](https://linqs.soe.ucsc.edu/data): 2708 nodes, 5429 edges, 7 labels, directed:
- data/cora/cora_edgelist.txt
- data/cora/cora.features
- data/cora/cora_labels.txt
|Algorithm | Dropout | Weight_decay | Hidden | Dimension | Time| Accuracy |
|:------------|-------------:|-------:|-------:|-------:|-------:|-------:|
| [TADW](https://github.com/thunlp/TADW) | - | - | - | 80*2 | 13.9s | 0.780 |
| [GCN](https://github.com/tkipf/gcn) | 0.5 | 5e-4 | 16 | - | 4.0s | 0.790 |
| OpenNE(TADW) | - | - | - | 80*2 | 20.8s | 0.791 |
| OpenNE(GCN) | 0.5 | 5e-4 | 16 | - | 5.5s | 0.789 |
| OpenNE(GCN) | 0 | 5e-4 | 16 | - | 6.1s | 0.779 |
| OpenNE(GCN) | 0.5 | 1e-4 | 16 | - | 5.4s | 0.783 |
| OpenNE(GCN) | 0.5 | 5e-4 | 64 | - | 6.5s | 0.779 |
## Citing
If you find *OpenNE* useful for your research, please consider citing the following papers:
@InProceedings{perozzi2014deepwalk,
Title = {Deepwalk: Online learning of social representations},
Author = {Perozzi, Bryan and Al-Rfou, Rami and Skiena, Steven},
Booktitle = {Proceedings of KDD},
Year = {2014},
Pages = {701--710}
}
@InProceedings{tang2015line,
Title = {Line: Large-scale information network embedding},
Author = {Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu},
Booktitle = {Proceedings of WWW},
Year = {2015},
Pages = {1067--1077}
}
@InProceedings{grover2016node2vec,
Title = {node2vec: Scalable feature learning for networks},
Author = {Grover, Aditya and Leskovec, Jure},
Booktitle = {Proceedings of KDD},
Year = {2016},
Pages = {855--864}
}
@article{kipf2016semi,
Title = {Semi-Supervised Classification with Graph Convolutional Networks},
Author = {Kipf, Thomas N and Welling, Max},
journal = {arXiv preprint arXiv:1609.02907},
Year = {2016}
}
@InProceedings{cao2015grarep,
Title = {Grarep: Learning graph representations with global structural information},
Author = {Cao, Shaosheng and Lu, Wei and Xu, Qiongkai},
Booktitle = {Proceedings of CIKM},
Year = {2015},
Pages = {891--900}
}
@InProceedings{yang2015network,
Title = {Network representation learning with rich text information},
Author = {Yang, Cheng and Liu, Zhiyuan and Zhao, Deli and Sun, Maosong and Chang, Edward},
Booktitle = {Proceedings of IJCAI},
Year = {2015}
}
@Article{tu2017network,
Title = {Network representation learning: an overview},
Author = {TU, Cunchao and YANG, Cheng and LIU, Zhiyuan and SUN, Maosong},
Journal = {SCIENTIA SINICA Informationis},
Volume = {47},
Number = {8},
Pages = {980--996},
Year = {2017}
}
## Sponsor
This research is supported by Tencent, MSRA and NSFC.
<img src="http://logonoid.com/images/tencent-logo.png" width = "300" height = "30" alt="tencent" align=center />
<img src="http://net.pku.edu.cn/~xjl/images/msra.png" width = "200" height = "100" alt="MSRA" align=center />
<img src="http://www.dragon-star.eu/wp-content/uploads/2014/04/NSFC_logo.jpg" width = "100" height = "80" alt="NSFC" align=center />

2708
data/cora/cora_adjlist.txt Normal file

File diff suppressed because it is too large Load Diff

2708
data/cora/cora_attr.txt Normal file

File diff suppressed because one or more lines are too long

2708
data/cora/cora_label.txt Normal file

File diff suppressed because it is too large Load Diff

38
requirements.txt Normal file

@ -0,0 +1,38 @@
setuptools==39.1.0 #tensorflow 1.10.0 has requirement setuptools<=39.1.0, but you'll have setuptools 39.2.0 which is incompatible
absl-py==0.2.2
astor==0.6.2
backports.weakref==1.0.post1
bleach==1.5.0
decorator==4.3.0
#enum34==1.1.6 # enum34 is not necessary for python > 3.4
funcsigs==1.0.2
gast==0.2.0
grpcio==1.12.1
html5lib==0.9999999
Markdown==2.6.11
mock==2.0.0
numpy==1.14.5
pbr==4.0.4
protobuf==3.6.0
scipy==1.1.0
six==1.11.0
#sklearn==0.0
termcolor==1.1.0
Werkzeug==0.14.1
# we update to the latest version of the following packages @18 Oct 2018
# the original one: https://github.com/williamleif/GraphSAGE
#futures==3.2.0
networkx==2.2
tensorflow==1.10.0
tensorboard==1.10.0
gensim==3.0.1
scikit-learn==0.19.0 #0.20.0 is OK but may get some warnings
# if you want to use your gpu for speeding up, simply try one of the following conda commands
# tested in python==3.6.6
# either -> conda install tensorflow-gpu==1.10.0 #this version will help you to install cuda and cudnn
# for cuda driver compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
# e.g. if driver 384.xx -> conda install tensorflow-gpu=1.10.0 cudatoolkit=9.0
# or -> simply build from docker image: docker pull tensorflow/tensorflow:1.10.0-gpu-py3
# ref: https://www.tensorflow.org/install/docker#gpu_support

2
src/libnrl/__init__.py Normal file

@ -0,0 +1,2 @@
from __future__ import print_function
from __future__ import division

153
src/libnrl/aane.py Normal file

@ -0,0 +1,153 @@
# -*- coding: utf-8 -*-
import numpy as np
from scipy import sparse
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from math import ceil
'''
#-----------------------------------------------------------------------------
# modified by Chengbin Hou 2018
# part of code was originally forked from https://github.com/xhuang31/AANE_Python
#-----------------------------------------------------------------------------
'''
class AANE:
    """Jointly embed Net and Attri into embedding representation H
        H = AANE(Net,Attri,d).function()
        H = AANE(Net,Attri,d,lambd,rho).function()
        H = AANE(Net,Attri,d,lambd,rho,maxiter).function()
        H = AANE(Net,Attri,d,lambd,rho,maxiter,'Att').function()
        H = AANE(Net,Attri,d,lambd,rho,maxiter,'Att',splitnum).function()
    :param Net: the weighted adjacency matrix
    :param Attri: the attribute information matrix with row denotes nodes
    :param d: the dimension of the embedding representation
    :param lambd: the regularization parameter
    :param rho: the penalty parameter
    :param maxiter: the maximum number of iteration
    :param 'Att': refers to conduct Initialization from the SVD of Attri
    :param splitnum: the number of pieces we split the SA for limited cache
    :return: the embedding representation H
    Copyright 2017 & 2018, Xiao Huang and Jundong Li.
        $Revision: 1.0.2 $  $Date: 2018/02/19 00:00:00 $
    """
    def __init__(self, graph, dim=100, lambd=0.05, rho=5, mode='comb', *varargs):
        # the paper says lambd should not be too large; suggested range [0, 0.1]; lambd=0 -> attrpure
        self.d = dim
        self.look_back_list = graph.look_back_list  # look back node id for A and X
        if mode == 'comb':
            print('==============AANE-comb mode: jointly learn emb from both structure and attribute info========')
            Net = sparse.csr_matrix(graph.getA())
            Attri = sparse.csr_matrix(graph.getX())
        elif mode == 'pure':
            print('======================AANE-pure mode: learn emb from structure info purely====================')
            Net = graph.getA()
            Attri = Net
        else:
            # fail loudly instead of exiting silently with status 0
            raise ValueError("mode must be 'comb' or 'pure', got {!r}".format(mode))
        self.maxiter = 2  # Max num of iteration
        [self.n, m] = Attri.shape  # n = Total num of nodes, m = attribute category num
        Net = sparse.lil_matrix(Net)
        Net.setdiag(np.zeros(self.n))
        Net = csc_matrix(Net)
        Attri = csc_matrix(Attri)
        self.lambd = lambd  # Initial regularization parameter (was hard-coded, ignoring the argument)
        self.rho = rho  # Initial penalty parameter (was hard-coded, ignoring the argument)
        splitnum = 1  # number of pieces we split the SA for limited cache
        if len(varargs) >= 4 and varargs[3] == 'Att':
            sumcol = np.arange(m)
            np.random.shuffle(sumcol)
            self.H = svds(Attri[:, sumcol[0:min(10 * self.d, m)]], self.d)[0]  # 'd' was undefined here
        else:
            sumcol = Net.sum(0)
            self.H = svds(Net[:, sorted(range(self.n), key=lambda k: sumcol[0, k], reverse=True)[0:min(10 * self.d, self.n)]], self.d)[0]
        if len(varargs) >= 2:  # both lambd and rho are read, so two values are needed
            self.lambd = varargs[0]
            self.rho = varargs[1]
            if len(varargs) >= 3:
                self.maxiter = varargs[2]
                if len(varargs) >= 5:
                    splitnum = varargs[4]
        self.block = min(int(ceil(float(self.n) / splitnum)), 7575)  # Treat at least each 7575 nodes as a block
        self.splitnum = int(ceil(float(self.n) / self.block))
        with np.errstate(divide='ignore'):  # inf will be ignored
            self.Attri = Attri.transpose() * sparse.diags(np.ravel(np.power(Attri.power(2).sum(1), -0.5)))
        self.Z = self.H.copy()
        self.affi = -1  # Index for affinity matrix sa
        self.U = np.zeros((self.n, self.d))
        self.nexidx = np.split(Net.indices, Net.indptr[1:-1])
        self.Net = np.split(Net.data, Net.indptr[1:-1])
        self.vectors = {}
        self.function()  # run aane
    '''################# Update functions #################'''
    def updateH(self):
        xtx = np.dot(self.Z.transpose(), self.Z) * 2 + self.rho * np.eye(self.d)
        for blocki in range(self.splitnum):  # Split nodes into different Blocks
            indexblock = self.block * blocki  # Index for splitting blocks
            if self.affi != blocki:
                self.sa = self.Attri[:, range(indexblock, indexblock + min(self.n - indexblock, self.block))].transpose() * self.Attri
                self.affi = blocki
            sums = self.sa.dot(self.Z) * 2
            for i in range(indexblock, indexblock + min(self.n - indexblock, self.block)):
                neighbor = self.Z[self.nexidx[i], :]  # the set of adjacent nodes of node i
                for j in range(1):
                    normi_j = np.linalg.norm(neighbor - self.H[i, :], axis=1)  # norm of h_i^k-z_j^k
                    nzidx = normi_j != 0  # Non-equal Index
                    if np.any(nzidx):
                        normi_j = (self.lambd * self.Net[i][nzidx]) / normi_j[nzidx]
                        self.H[i, :] = np.linalg.solve(xtx + normi_j.sum() * np.eye(self.d), sums[i - indexblock, :] + (
                            neighbor[nzidx, :] * normi_j.reshape((-1, 1))).sum(0) + self.rho * (
                            self.Z[i, :] - self.U[i, :]))
                    else:
                        self.H[i, :] = np.linalg.solve(xtx, sums[i - indexblock, :] + self.rho * (
                            self.Z[i, :] - self.U[i, :]))
    def updateZ(self):
        xtx = np.dot(self.H.transpose(), self.H) * 2 + self.rho * np.eye(self.d)
        for blocki in range(self.splitnum):  # Split nodes into different Blocks
            indexblock = self.block * blocki  # Index for splitting blocks
            if self.affi != blocki:
                self.sa = self.Attri[:, range(indexblock, indexblock + min(self.n - indexblock, self.block))].transpose() * self.Attri
                self.affi = blocki
            sums = self.sa.dot(self.H) * 2
            for i in range(indexblock, indexblock + min(self.n - indexblock, self.block)):
                neighbor = self.H[self.nexidx[i], :]  # the set of adjacent nodes of node i
                for j in range(1):
                    normi_j = np.linalg.norm(neighbor - self.Z[i, :], axis=1)  # norm of h_i^k-z_j^k
                    nzidx = normi_j != 0  # Non-equal Index
                    if np.any(nzidx):
                        normi_j = (self.lambd * self.Net[i][nzidx]) / normi_j[nzidx]
                        self.Z[i, :] = np.linalg.solve(xtx + normi_j.sum() * np.eye(self.d), sums[i - indexblock, :] + (
                            neighbor[nzidx, :] * normi_j.reshape((-1, 1))).sum(0) + self.rho * (
                            self.H[i, :] + self.U[i, :]))
                    else:
                        self.Z[i, :] = np.linalg.solve(xtx, sums[i - indexblock, :] + self.rho * (
                            self.H[i, :] + self.U[i, :]))
    def function(self):
        self.updateH()
        '''################# Iterations #################'''
        for __ in range(self.maxiter - 1):
            self.updateZ()
            self.U = self.U + self.H - self.Z
            self.updateH()
        #-------save emb to self.vectors and return
        ind = 0
        for id in self.look_back_list:
            self.vectors[id] = self.H[ind]
            ind += 1
        return self.vectors
    def save_embeddings(self, filename):
        '''
        save embeddings to file
        '''
        with open(filename, 'w') as fout:  # the file is closed even if a write fails
            node_num = len(self.vectors.keys())
            fout.write("{} {}\n".format(node_num, self.d))  # self.dim is never set in this class; the dimension is self.d
            for node, vec in self.vectors.items():
                fout.write("{} {}\n".format(node, ' '.join([str(x) for x in vec])))

131
src/libnrl/abrw.py Normal file

@ -0,0 +1,131 @@
# -*- coding: utf-8 -*-
import numpy as np
import time
from numpy import linalg as la
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from gensim.models import Word2Vec
from . import walker
import networkx as nx
from libnrl.utils import *
import multiprocessing
'''
#-----------------------------------------------------------------------------
# author: Chengbin Hou @ SUSTech 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
def multiprocessor_argpartition(vec):
    topk = 20
    print('len of vec...',len(vec))
    return np.argpartition(vec, -topk)[-topk:]
class ABRW(object):
    def __init__(self, graph, dim, alpha, topk, path_length, num_paths, **kwargs):
        self.g = graph
        self.alpha = float(alpha)
        self.topk = int(topk)
        kwargs["workers"] = kwargs.get("workers", 1)
        self.P = self.biasedTransProb()  # obtain biased transition probs mat
        weighted_walker = walker.BiasedWalker(g=self.g, P=self.P, workers=kwargs["workers"])  # instance weighted walker
        # generate sentences according to biased transition probs mat P
        sentences = weighted_walker.simulate_walks(num_walks=num_paths, walk_length=path_length)
        # skip-gram parameters
        kwargs["sentences"] = sentences
        kwargs["min_count"] = kwargs.get("min_count", 0)
        kwargs["size"] = kwargs.get("size", dim)
        kwargs["sg"] = 1  # use skip-gram; but see deepwalk which uses 'hs' = 1
        self.size = kwargs["size"]
        # learning embedding by skip-gram model
        print("Learning representation...")
        word2vec = Word2Vec(**kwargs)
        # save emb for later eval
        self.vectors = {}
        for word in self.g.G.nodes():
            self.vectors[word] = word2vec.wv[word]  # save emb
        del word2vec
    #----------------------------------------key of our method---------------------------------------------
    def biasedTransProb(self):
        '''
        given: A and X --> P_A and P_X
        research question: how to combine A and X in a more principled way
        general idea: Attribute Biased Random Walk
            i.e. a walker based on a mixed transition matrix by P=alpha*P_A + (1-alpha)*P_X
        result: ABRW-transition matrix; P
        *** questions: 1) what about if we have some single nodes i.e. some rows of P_A gives 0s
                       2) the similarity/distance metric to obtain P_X
                       3) alias sampling as used in node2vec for speeding up, but this is the case
                          if each row of P gives many 0s
                          --> how to make each row of P a pdf and meanwhile sparse
        '''
        print("obtaining biased transition probs mat...")
        t1 = time.time()
        A = self.g.get_adj_mat()  # adj/struc info mat
        P_A = row_as_probdist(A)  # if single node, return [0, 0, 0 ..] we will fix this later
        X = self.g.get_attr_mat()  # attr info mat
        X_compressed = X  # if a speed-up is needed, try svd or pca for compression, at the cost of some accuracy
        #X_compressed = self.g.preprocessAttrInfo(X=X, dim=200, method='pca') #svd or pca for dim reduction; follow TADW setting use svd with dim=200
        from sklearn.metrics.pairwise import linear_kernel, cosine_similarity, cosine_distances, euclidean_distances  # we may try diff metrics
        #ref http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise
        #t1=time.time()
        X_sim = cosine_similarity(X_compressed, X_compressed)
        #t2=time.time()
        #print('======no need pre proce', t2-t1)
        #way5: a faster implementation of way5 by Zeyu Dong
        topk = self.topk
        print('way5 remain self---------topk = ', topk)
        t1 = time.time()
        cutoff = np.partition(X_sim, -topk, axis=1)[:,-topk:].min(axis=1)
        # compare every row with its own cutoff; a 1-D cutoff would broadcast along columns instead
        X_sim[X_sim < cutoff[:, np.newaxis]] = 0
        t2 = time.time()
        P_X = row_as_probdist(X_sim)
        t3 = time.time()
        for i in range(P_X.shape[0]):
            sum_row = P_X[i].sum()
            if sum_row != 1.0:  # to avoid some numerical issue...
                delta = 1.0 - sum_row  # delta is a very small number, say 1e-10 or even less...
                P_X[i][i] = P_X[i][i] + delta  # the diagonal must be the largest entry of that row + delta --> almost no effect
        t4 = time.time()
        print('topk time: ',t2-t1 ,'row normalize time: ',t3-t2, 'dealing numerical issue time: ', t4-t3)
        del A, X, X_compressed, X_sim
        #=====================================core of our idea========================================
        print('------alpha for P = alpha * P_A + (1-alpha) * P_X----: ', self.alpha)
        n = self.g.get_num_nodes()
        P = np.zeros((n,n), dtype=float)
        for i in range(n):
            if (P_A[i] == 0).all():  # single node case if the whole row are 0s
            #if P_A[i].sum() == 0:
                P[i] = P_X[i]  # use 100% attr info to compensate
            else:  # non-single node case; use (1.0-self.alpha) attr info to compensate
                P[i] = self.alpha * P_A[i] + (1.0-self.alpha) * P_X[i]
        print('# of single nodes for P_A: ', n - P_A.sum(axis=1).sum(), ' # of non-zero entries of P_A: ', np.count_nonzero(P_A))
        print('# of single nodes for P_X: ', n - P_X.sum(axis=1).sum(), ' # of non-zero entries of P_X: ', np.count_nonzero(P_X))
        t5 = time.time()
        print('ABRW biased transition prob preprocessing time: {:.2f}s'.format(t5-t4))
        return P
    def save_embeddings(self, filename):
        fout = open(filename, 'w')
        node_num = len(self.vectors.keys())
        fout.write("{} {}\n".format(node_num, self.size))
        for node, vec in self.vectors.items():
            fout.write("{} {}\n".format(node,
                                        ' '.join([str(x) for x in vec])))
        fout.close()

244
src/libnrl/asne.py Normal file

@ -0,0 +1,244 @@
# -*- coding: utf-8 -*-
'''
Tensorflow implementation of Social Network Embedding framework (SNE)
@author: Lizi Liao (liaolizi.llz@gmail.com)
part of code was originally forked from https://github.com/lizi-git/ASNE
modified by Chengbin Hou 2018
1) convert OpenANE data format to ASNE data format
2) compatible with latest tensorflow 1.2
3) add more comments
4) support eval testing set during each xx epoches
5) as ASNE paper stated, we add two hidden layers with softsign activation func
'''
import math
import numpy as np
import tensorflow as tf
from sklearn.base import BaseEstimator, TransformerMixin
from .classify import ncClassifier, lpClassifier, read_node_label
from sklearn.linear_model import LogisticRegression
def format_data_from_OpenANE_to_ASNE(g, dim):
    '''
    convert OpenANE data format to ASNE data format
    g: OpenANE graph data structure
    dim: final embedding dim
    '''
    attr_Matrix = g.getX()
    #attr_Matrix = g.preprocessAttrInfo(attr_Matrix, dim=200, method='svd') #similar to aane, the same preprocessing
    #print('with this preprocessing, ASNE can get better result, as well as, faster speed----------------')
    id_N = attr_Matrix.shape[0]  # n nodes
    attr_M = attr_Matrix.shape[1]  # m features
    edge_num = len(g.G.edges)  # total edges for training
    X={}  # one-to-one correspondence
    X['data_id_list'] = np.zeros(edge_num)  # start node list for training
    X['data_label_list'] = np.zeros(edge_num)  # end node list for training
    X['data_attr_list'] = np.zeros([edge_num, attr_M])  # attr corresponds to start node
    edgelist = [edge for edge in g.G.edges]
    i = 0
    for edge in edgelist:  # training sample = start node, end node, start node attr
        X['data_id_list'][i] = edge[0]
        X['data_label_list'][i] = edge[1]
        X['data_attr_list'][i] = attr_Matrix[ g.look_up_dict[edge[0]] ][:]
        i += 1
    X['data_id_list'] = X['data_id_list'].reshape(-1).astype(int)
    X['data_label_list'] = X['data_label_list'].reshape(-1,1).astype(int)
    nodes={}  # one-to-one correspondence
    nodes['node_id'] = g.look_back_list  # n nodes
    nodes['node_attr'] = list(attr_Matrix)  # m features -> n*m
    id_embedding_size = int(dim/2)
    attr_embedding_size = int(dim/2)
    print('id_embedding_size', id_embedding_size, 'attr_embedding_size', attr_embedding_size)
    return X, nodes, id_N, attr_M, id_embedding_size, attr_embedding_size
def add_layer(inputs, in_size, out_size, activation_function=None):
    # add one more layer and return the output of this layer
    Weights = tf.Variable(tf.random_uniform([in_size, out_size], -1.0, 1.0))  # init as paper stated
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs
class ASNE(BaseEstimator, TransformerMixin):
    def __init__(self, graph, dim, alpha = 1.0, batch_size=128, learning_rate=0.001,
                 n_neg_samples=10, epoch=100, random_seed=2018, X_test=0, Y_test=0, task='nc', nc_ratio=0.5, lp_ratio=0.9, label_file=''):
        # bind params to class
        X, nodes, id_N, attr_M, id_embedding_size, attr_embedding_size = format_data_from_OpenANE_to_ASNE(g=graph, dim=dim)
        self.node_N = id_N  # n
        self.attr_M = attr_M  # m
        self.X_train = X  # {'data_id_list': [], 'data_label_list': [], 'data_attr_list': []}
        self.nodes = nodes  # {'node_id': [], 'node_attr: []'}
        self.id_embedding_size = id_embedding_size  # set to dim/2
        self.attr_embedding_size = attr_embedding_size  # set to dim/2
        self.vectors = {}
        self.dim = dim
        self.look_back_list = graph.look_back_list  # from OpenANE data structure
        self.alpha = alpha  # set to 1.0 by default
        self.n_neg_samples = n_neg_samples  # set to 10 by default
        self.batch_size = batch_size  # set to 128 by default
        self.learning_rate = learning_rate
        self.epoch = epoch  # set to 100 by default
        self.random_seed = random_seed
        self._init_graph()  # init all variables in a tensorflow graph
        self.task = task
        self.nc_ratio = nc_ratio
        self.lp_ratio = lp_ratio
        if self.task == 'lp':  # if not lp task, we do not need to keep testing edges
            self.X_test = X_test
            self.Y_test = Y_test
            self.train()  # train our tf asne model-----------------
        elif self.task == 'nc' or self.task == 'nclp':
            self.X_nc_label, self.Y_nc_label = read_node_label(label_file)
            self.train()  # train our tf asne model-----------------
    def _init_graph(self):
        '''
        Init a tensorflow Graph containing: input data, variables, model, loss, optimizer
        '''
        self.graph = tf.Graph()
        #with self.graph.as_default(), tf.device('/gpu:0'):
        with self.graph.as_default():
            # Set graph level random seed
            tf.set_random_seed(self.random_seed)
            # Input data.
            self.train_data_id = tf.placeholder(tf.int32, shape=[None])  # batch_size * 1
            self.train_data_attr = tf.placeholder(tf.float32, shape=[None, self.attr_M])  # batch_size * attr_M
            self.train_labels = tf.placeholder(tf.int32, shape=[None, 1])  # batch_size * 1
            # Variables.
            network_weights = self._initialize_weights()
            self.weights = network_weights
            # Model.
            # Look up embeddings for node_id.
            self.id_embed = tf.nn.embedding_lookup(self.weights['in_embeddings'], self.train_data_id)  # batch_size * id_dim
            self.attr_embed = tf.matmul(self.train_data_attr, self.weights['attr_embeddings'])  # batch_size * attr_dim
            self.embed_layer = tf.concat([self.id_embed, self.alpha * self.attr_embed], 1)  # batch_size * (id_dim + attr_dim) #an error due to old tf!
            ## can add hidden_layers component here!
            #0) no hidden layer
            #1) 128
            #2) 256+128 ##--------paper stated it used two hidden layers with activation function softsign....
            #3) 512+256+128
            len_h1_in = self.id_embedding_size+self.attr_embedding_size
            len_h1_out = 256
            len_h2_in = len_h1_out
            len_h2_out = 128
            self.h1 = add_layer(inputs=self.embed_layer, in_size=len_h1_in, out_size=len_h1_out, activation_function=tf.nn.softsign)
            self.h2 = add_layer(inputs=self.h1, in_size=len_h2_in, out_size=len_h2_out, activation_function=tf.nn.softsign)
            # Compute the loss, using a sample of the negative labels each time.
            # note: h2 has 128 units while out_embeddings has id_dim + attr_dim columns, so the shapes only match when dim == 128
            self.loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights = self.weights['out_embeddings'], biases = self.weights['biases'],
                                       inputs = self.h2, labels = self.train_labels, num_sampled = self.n_neg_samples, num_classes=self.node_N))
            # Optimizer.
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8).minimize(self.loss)  # tune these parameters?
            # print("AdamOptimizer")
            # init
            init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
            self.sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))
            self.sess.run(init)
def _initialize_weights(self):
all_weights = dict()
all_weights['in_embeddings'] = tf.Variable(tf.random_uniform([self.node_N, self.id_embedding_size], -1.0, 1.0)) # id_N * id_dim
all_weights['attr_embeddings'] = tf.Variable(tf.random_uniform([self.attr_M,self.attr_embedding_size], -1.0, 1.0)) # attr_M * attr_dim
all_weights['out_embeddings'] = tf.Variable(tf.truncated_normal([self.node_N, self.id_embedding_size + self.attr_embedding_size],
stddev=1.0 / math.sqrt(self.id_embedding_size + self.attr_embedding_size)))
all_weights['biases'] = tf.Variable(tf.zeros([self.node_N]))
return all_weights
def partial_fit(self, X): # fit a batch
feed_dict = {self.train_data_id: X['batch_data_id'], self.train_data_attr: X['batch_data_attr'],
self.train_labels: X['batch_data_label']}
loss, opt = self.sess.run((self.loss, self.optimizer), feed_dict=feed_dict)
return loss
def get_random_block_from_data(self, data, batch_size): # currently unused
start_index = np.random.randint(0, len(data) - batch_size)
return data[start_index:(start_index + batch_size)]
def train(self): # fit a dataset
self.Embeddings = []
print('Using in + out embedding')
for epoch in range( self.epoch ):
total_batch = int( len(self.X_train['data_id_list']) / self.batch_size) # batches per epoch; int() drops any partial batch
# print('total_batch in 1 epoch: ', total_batch)
# Loop over all batches
for i in range(total_batch):
# generate a batch data
batch_xs = {}
start_index = np.random.randint(0, len(self.X_train['data_id_list']) - self.batch_size)
batch_xs['batch_data_id'] = self.X_train['data_id_list'][start_index:(start_index + self.batch_size)] #generate batch data
batch_xs['batch_data_attr'] = self.X_train['data_attr_list'][start_index:(start_index + self.batch_size)]
batch_xs['batch_data_label'] = self.X_train['data_label_list'][start_index:(start_index + self.batch_size)]
# Fit training using batch data
cost = self.partial_fit(batch_xs)
# Collect the embeddings after each epoch
Embeddings_out = self.getEmbedding('out_embedding', self.nodes)
Embeddings_in = self.getEmbedding('embed_layer', self.nodes)
self.Embeddings = Embeddings_out + Embeddings_in # element-wise sum of the two (not a mean) as the final embedding; concatenation is a to-do
#print('training tensorflow asne model, epoch: ', epoch+1 , ' / ', self.epoch)
#to save training time, evaluation on testing data at each epoch is disabled
#-----------every xx epochs, save embeddings {node_id1: [], node_id2: [], ...}----------
if (epoch+1)%1 == 0 and epoch != 0: #every xx epochs (currently 1), skipping the first epoch
print('@@@ epoch ------- ', epoch+1 , ' / ', self.epoch)
ind = 0
for id in self.nodes['node_id']: #self.nodes['node_id']=self.look_back_list
self.vectors[id] = self.Embeddings[ind]
ind += 1
#self.eval(vectors=self.vectors)
print('please note that: the final embedding returned and its output file are not the best embedding!')
print('for the best embeddings, please check which epoch got the best eval metric(s)......')
def getEmbedding(self, type, nodes):
if type == 'embed_layer':
feed_dict = {self.train_data_id: nodes['node_id'], self.train_data_attr: nodes['node_attr']}
Embedding = self.sess.run(self.embed_layer, feed_dict=feed_dict)
return Embedding
if type == 'out_embedding':
Embedding = self.sess.run(self.weights['out_embeddings']) #sess.run to get embeddings from tf
return Embedding # nodes_number * (id_dim + attr_dim)
def save_embeddings(self, filename):
'''
save embeddings to file
'''
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.dim))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,' '.join([str(x) for x in vec])))
fout.close()
def eval(self, vectors):
#------nc task
if self.task == 'nc' or self.task == 'nclp':
print("Training nc classifier using {:.2f}% node labels...".format(self.nc_ratio*100))
clf = ncClassifier(vectors=vectors, clf=LogisticRegression()) #use Logistic Regression as clf; we may choose SVM or more advanced ones
clf.split_train_evaluate(self.X_nc_label, self.Y_nc_label, self.nc_ratio)
#------lp task
if self.task == 'lp':
#X_test, Y_test = read_edge_label(args.label_file) #enable this if you want to load your own lp testing data, see classify.py
print("During embedding we have used {:.2f}% links and the remaining will be left for lp evaluation...".format(self.lp_ratio*100))
clf = lpClassifier(vectors=vectors) #similarity/distance metric as clf; basically, lp is a binary clf problem
clf.evaluate(self.X_test, self.Y_test)
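
The `save_embeddings` methods in this commit all write the same plain-text layout: a header line with the node count and the dimension, then one line per node. A minimal sketch of that layout follows, assuming hypothetical helper names `write_embeddings` and `read_embeddings` (neither exists in the repository) and using an in-memory buffer in place of a real file:

```python
import io


def write_embeddings(fout, vectors, dim):
    """Write {node: sequence of floats} as a header line plus one row per node."""
    fout.write("{} {}\n".format(len(vectors), dim))
    for node, vec in vectors.items():
        fout.write("{} {}\n".format(node, " ".join(str(x) for x in vec)))


def read_embeddings(fin):
    """Read the layout back; node ids come back as strings."""
    node_num, dim = (int(x) for x in fin.readline().split())
    vectors = {}
    for line in fin:
        parts = line.split()
        if not parts:  # tolerate a trailing blank line
            continue
        assert len(parts) == dim + 1
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == node_num
    return vectors


# Round trip through an in-memory buffer (hypothetical toy data).
buf = io.StringIO()
write_embeddings(buf, {"a": [0.5, 1.0], "b": [2.0, -1.5]}, dim=2)
buf.seek(0)
loaded = read_embeddings(buf)
```

With a real file, wrapping the calls in `with open(filename, 'w') as fout:` would also close the handle if a write fails part-way.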

96
src/libnrl/attrcomb.py Normal file

@ -0,0 +1,96 @@
# -*- coding: utf-8 -*-
import numpy as np
import time
import networkx as nx
from . import node2vec, line, grarep
'''
#-----------------------------------------------------------------------------
# author: Chengbin Hou 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
class ATTRCOMB(object):
def __init__(self, graph, dim, comb_method='concat', num_paths=10, comb_with='deepWalk'):
self.g = graph
self.dim = dim
self.num_paths = num_paths
print("Learning representation...")
self.vectors = {}
print('attr naively combined method ', comb_method, '=====================')
if comb_method == 'concat':
print('comb_method == concat by default; dim/2 from attr and dim/2 from nrl.............')
attr_embeddings = self.train_attr(dim=int(self.dim/2))
nrl_embeddings = self.train_nrl(dim=int(self.dim/2), comb_with=comb_with)
embeddings = np.concatenate((attr_embeddings, nrl_embeddings), axis=1)
print('shape of embeddings', embeddings.shape)
elif comb_method == 'elementwise-mean':
print('comb_method == elementwise-mean.............')
attr_embeddings = self.train_attr(dim=self.dim)
nrl_embeddings = self.train_nrl(dim=self.dim, comb_with=comb_with) #we may try deepWalk, node2vec, line, etc.
embeddings = np.add(attr_embeddings, nrl_embeddings)/2.0
print('shape of embeddings', embeddings.shape)
elif comb_method == 'elementwise-max':
print('comb_method == elementwise-max.............')
attr_embeddings = self.train_attr(dim=self.dim)
nrl_embeddings = self.train_nrl(dim=self.dim, comb_with=comb_with) #we may try deepWalk, node2vec, line, etc.
embeddings = np.maximum(attr_embeddings, nrl_embeddings) #element-wise max; both arrays have the same shape
print('shape of embeddings', embeddings.shape)
else:
print('error, no comb_method was found....')
exit(0)
for key, ind in self.g.look_up_dict.items():
self.vectors[key] = embeddings[ind]
def train_attr(self, dim):
X = self.g.getX()
X_compressed = self.g.preprocessAttrInfo(X=X, dim=dim, method='svd') #svd or pca for dim reduction
print('X_compressed shape: ', X_compressed.shape)
return np.array(X_compressed) #n*dim matrix, each row corresponding to node ID stored in graph.look_back_list
def train_nrl(self, dim, comb_with):
print('attr naively combined with ', comb_with, '=====================')
if comb_with == 'deepWalk':
model = node2vec.Node2vec(graph=self.g, path_length=80, num_paths=self.num_paths, dim=dim, workers=4, window=10, dw=True)
nrl_embeddings = []
for key in self.g.look_back_list:
nrl_embeddings.append(model.vectors[key])
return np.array(nrl_embeddings)
elif comb_with == 'node2vec':
model = node2vec.Node2vec(graph=self.g, path_length=80, num_paths=self.num_paths, dim=dim, workers=4, p=0.8, q=0.8, window=10)
nrl_embeddings = []
for key in self.g.look_back_list:
nrl_embeddings.append(model.vectors[key])
return np.array(nrl_embeddings)
else:
print('error, no comb_with was found....')
print('to do.... line, grarep, and etc...')
exit(0)
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.dim))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,
' '.join([str(x) for x in vec])))
fout.close()
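
The three `comb_method` options of `ATTRCOMB` can be illustrated without any dependencies. The sketch below uses plain Python lists and a hypothetical `combine` helper (not part of the repository); on NumPy arrays, `np.concatenate`, `np.add(...)/2.0` and `np.maximum` do the same work.

```python
def combine(attr_rows, nrl_rows, method="concat"):
    """Combine two equally long lists of embedding rows, one row per node."""
    if len(attr_rows) != len(nrl_rows):
        raise ValueError("both inputs need one row per node")
    pairs = list(zip(attr_rows, nrl_rows))
    if method == "concat":
        # the row width doubles: attribute part first, structure part second
        return [list(a) + list(b) for a, b in pairs]
    if method == "elementwise-mean":
        return [[(x + y) / 2.0 for x, y in zip(a, b)] for a, b in pairs]
    if method == "elementwise-max":
        return [[max(x, y) for x, y in zip(a, b)] for a, b in pairs]
    raise ValueError("unknown comb_method: {}".format(method))


# Toy data: two nodes, two dimensions each (made-up values).
attr = [[1.0, 4.0], [0.0, -2.0]]
nrl = [[3.0, 2.0], [-1.0, 5.0]]
concat = combine(attr, nrl, "concat")
mean = combine(attr, nrl, "elementwise-mean")
maxi = combine(attr, nrl, "elementwise-max")
```

Raising `ValueError` for an unknown method, rather than printing and calling `exit(0)`, lets the caller see the failure; that choice is a suggestion, not what the class above does.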

38
src/libnrl/attrpure.py Normal file

@ -0,0 +1,38 @@
# -*- coding: utf-8 -*-
import numpy as np
import time
import networkx as nx
'''
#-----------------------------------------------------------------------------
# author: Chengbin Hou 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
class ATTRPURE(object):
def __init__(self, graph, dim):
self.g = graph
self.dim = dim
print("Learning representation...")
self.vectors = {}
embeddings = self.train()
for key, ind in self.g.look_up_dict.items():
self.vectors[key] = embeddings[ind]
def train(self):
X = self.g.getX()
X_compressed = self.g.preprocessAttrInfo(X=X, dim=self.dim, method='svd') #svd or pca for dim reduction
return X_compressed #n*dim matrix, each row corresponding to node ID stored in graph.look_back_list
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.dim))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,
' '.join([str(x) for x in vec])))
fout.close()

235
src/libnrl/classify.py Normal file

@ -0,0 +1,235 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import numpy as np
import math
import random
import networkx as nx
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='sklearn')
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score, classification_report, roc_curve, auc
from sklearn.preprocessing import MultiLabelBinarizer
'''
#-----------------------------------------------------------------------------
# part of code was originally forked from https://github.com/thunlp/OpenNE
# modified by Chengbin Hou 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
# node classification classifier
class ncClassifier(object):
def __init__(self, vectors, clf):
self.embeddings = vectors
self.clf = TopKRanker(clf) #here clf is LR
self.binarizer = MultiLabelBinarizer(sparse_output=True)
def split_train_evaluate(self, X, Y, train_precent, seed=0):
state = np.random.get_state()
training_size = int(train_precent * len(X))
#np.random.seed(seed)
shuffle_indices = np.random.permutation(np.arange(len(X)))
X_train = [X[shuffle_indices[i]] for i in range(training_size)]
Y_train = [Y[shuffle_indices[i]] for i in range(training_size)]
X_test = [X[shuffle_indices[i]] for i in range(training_size, len(X))]
Y_test = [Y[shuffle_indices[i]] for i in range(training_size, len(X))]
self.train(X_train, Y_train, Y)
np.random.set_state(state) #restore the global RNG state saved above so this split does not affect later random draws
return self.evaluate(X_test, Y_test)
def train(self, X, Y, Y_all):
self.binarizer.fit(Y_all) #to support multi-labels; fit learns the mapping {orig category: binarized vec}
X_train = [self.embeddings[x] for x in X]
Y = self.binarizer.transform(Y) #the binarizer was fitted on Y_all, so here we only transform
self.clf.fit(X_train, Y)
def predict(self, X, top_k_list):
X_ = np.asarray([self.embeddings[x] for x in X])
# see TopKRanker(OneVsRestClassifier)
Y = self.clf.predict(X_, top_k_list=top_k_list) # the top k probs to be output...
return Y
def evaluate(self, X, Y):
top_k_list = [len(l) for l in Y] #multi-labels, diff len of labels of each node
Y_ = self.predict(X, top_k_list) #pred val of X_test i.e. Y_pred
Y = self.binarizer.transform(Y) #true val i.e. Y_test
averages = ["micro", "macro", "samples", "weighted"]
results = {}
for average in averages:
results[average] = f1_score(Y, Y_, average=average)
# print('Results, using embeddings of dimensionality', len(self.embeddings[X[0]]))
print(results)
return results
class TopKRanker(OneVsRestClassifier): #original LR or SVM is for binary clf
def predict(self, X, top_k_list): #re-define predict func of OneVsRestClassifier
probs = np.asarray(super(TopKRanker, self).predict_proba(X))
all_labels = []
for i, k in enumerate(top_k_list):
probs_ = probs[i, :]
labels = self.classes_[probs_.argsort()[-k:]].tolist() #denote labels
probs_[:] = 0 #reset probs_ to all 0
probs_[labels] = 1 #reset probs_ to 1 if labels denoted...
all_labels.append(probs_)
return np.asarray(all_labels)
'''
#note: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in samples with no true labels
#see: https://stackoverflow.com/questions/43162506/undefinedmetricwarning-f-score-is-ill-defined-and-being-set-to-0-0-in-labels-wi
'''
'''
import matplotlib.pyplot as plt
def plt_roc(y_test, y_score):
"""
calculate AUC value and plot the ROC curve
"""
fpr, tpr, threshold = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.stackplot(fpr, tpr, color='steelblue', alpha = 0.5, edgecolor = 'black')
plt.plot(fpr, tpr, color='black', lw = 1)
plt.plot([0,1],[0,1], color = 'red', linestyle = '--')
plt.text(0.5,0.3,'ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
return roc_auc
'''
# link prediction binary classifier
class lpClassifier(object):
def __init__(self, vectors):
self.embeddings = vectors
def evaluate(self, X_test, Y_test, seed=0): #clf here is simply a similarity/distance metric
state = np.random.get_state()
#np.random.seed(seed)
test_size = len(X_test)
#shuffle_indices = np.random.permutation(np.arange(test_size))
#X_test = [X_test[shuffle_indices[i]] for i in range(test_size)]
#Y_test = [Y_test[shuffle_indices[i]] for i in range(test_size)]
Y_true = [int(i) for i in Y_test]
Y_probs = []
for i in range(test_size):
start_node_emb = np.array(self.embeddings[X_test[i][0]]).reshape(-1,1)
end_node_emb = np.array(self.embeddings[X_test[i][1]]).reshape(-1,1)
score = cosine_similarity(start_node_emb, end_node_emb) #ranging from [-1, +1]
Y_probs.append( (score+1)/2.0 ) #rescale to [0, 1]; passing the raw score as y_score
#to sklearn's roc_auc_score yields the same result
roc = roc_auc_score(y_true = Y_true, y_score = Y_probs)
if roc < 0.5:
roc = 1.0 - roc #since lp is binary clf task, just predict the opposite if<0.5
print("roc=", "{:.9f}".format(roc))
#plt_roc(Y_true, Y_probs) #enable to plot roc curve and return auc value
def norm(a):
sum = 0.0
for i in range(len(a)):
sum = sum + a[i] * a[i]
return math.sqrt(sum)
def cosine_similarity(a, b):
sum = 0.0
for i in range(len(a)):
sum = sum + a[i] * b[i]
#return sum/(norm(a) * norm(b))
return sum/(norm(a) * norm(b) + 1e-20) #the tiny 1e-20 term avoids division by zero for all-zero vectors
'''
#cosine_similarity above is implemented by us...
#or try sklearn....
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity, cosine_distances, euclidean_distances # we may try diff metrics
#ref http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise
'''
def lp_train_test_split(graph, ratio=0.5, neg_pos_link_ratio=1.0, test_pos_links_ratio=0.1):
#randomly split links/edges into training set and testing set
#*** note: we do not assume every node must be connected after removing links
#*** hence, the resulting graph might have a few isolated nodes --> more realistic scenario
#*** e.g. a user who has just signed up to a website has no links to others
#graph: OpenANE graph data structure
#ratio: fraction of links kept for training; in [0, 1]
#neg_pos_link_ratio: 1.0 means neg-links/pos-links = 1.0 i.e. the balanced case; in [0, +inf)
g = graph
test_pos_links = int(nx.number_of_edges(g.G) * test_pos_links_ratio)
print("test_pos_links_ratio {:.2f}, test_pos_links {:.2f}, neg_pos_link_ratio is {:.2f}, links for training {:.2f}%,".format(test_pos_links_ratio, test_pos_links, neg_pos_link_ratio, ratio*100))
test_pos_sample = []
test_neg_sample = []
#random.seed(2018) #generate testing set that contains both pos and neg samples
test_pos_sample = random.sample(list(g.G.edges()), test_pos_links) #random.sample needs a sequence, so materialize the edge view first
#test_neg_sample = random.sample(list(nx.classes.function.non_edges(g.G)), int(test_size * neg_pos_link_ratio)) #using nx build-in func, not efficient, to do...
#more efficient way:
test_neg_sample = []
num_neg_sample = int(test_pos_links * neg_pos_link_ratio)
num = 0
while num < num_neg_sample:
pair_nodes = np.random.choice(g.look_back_list, size=2, replace=False)
if pair_nodes not in g.G.edges():
num += 1
test_neg_sample.append(list(pair_nodes))
test_edge_pair = test_pos_sample + test_neg_sample
test_edge_label = list(np.ones(len(test_pos_sample))) + list(np.zeros(len(test_neg_sample)))
print('before removing, the # of links: ', nx.number_of_edges(g.G), '; the # of single nodes: ', g.numSingleNodes())
g.G.remove_edges_from(test_pos_sample) #training set should NOT contain testing set i.e. delete testing pos samples
g.simulate_sparsely_linked_net(link_reserved = ratio) #simulate sparse net
print('after removing, the # of links: ', nx.number_of_edges(g.G), '; the # of single nodes: ', g.numSingleNodes())
print("# training links {0}; # positive testing links {1}; # negative testing links {2},".format(nx.number_of_edges(g.G), len(test_pos_sample), len(test_neg_sample)))
return g.G, test_edge_pair, test_edge_label
#---------------------------------utils for downstream tasks--------------------------------
def load_embeddings(filename):
fin = open(filename, 'r')
node_num, size = [int(x) for x in fin.readline().strip().split()]
vectors = {}
while 1:
l = fin.readline()
if l == '':
break
vec = l.strip().split(' ')
assert len(vec) == size+1
vectors[vec[0]] = [float(x) for x in vec[1:]]
fin.close()
assert len(vectors) == node_num
return vectors
def read_node_label(filename):
fin = open(filename, 'r')
X = []
Y = []
while 1:
l = fin.readline()
if l == '':
break
vec = l.strip().split(' ')
X.append(vec[0])
Y.append(vec[1:])
fin.close()
return X, Y
def read_edge_label(filename):
fin = open(filename, 'r')
X = []
Y = []
while 1:
l = fin.readline()
if l == '':
break
vec = l.strip().split(' ')
X.append(vec[:2])
Y.append(vec[2])
fin.close()
return X, Y
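
`lpClassifier.evaluate` scores each candidate link by the cosine similarity of its two node embeddings and rescales the score from [-1, 1] to [0, 1]. The dependency-free sketch below restates that scoring with a hypothetical `to_unit_interval` helper, and checks the point made in the comments: the rescaling is monotone, so it leaves the ranking of links, and therefore the ROC AUC, unchanged.

```python
import math


def cosine_similarity(a, b, eps=1e-20):
    """Cosine of the angle between a and b; eps guards against zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def to_unit_interval(score):
    """Map a cosine score from [-1, 1] to [0, 1]."""
    return (score + 1.0) / 2.0


# Toy node pairs (made-up vectors): same direction, orthogonal, opposite.
pairs = [([1.0, 0.0], [2.0, 0.0]),
         ([1.0, 0.0], [0.0, 3.0]),
         ([1.0, 0.0], [-1.0, 0.0])]
scores = [cosine_similarity(a, b) for a, b in pairs]
probs = [to_unit_interval(s) for s in scores]

# The order of the links by raw score and by rescaled score is identical.
order_by_score = sorted(range(len(scores)), key=lambda i: scores[i])
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i])
```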

193
src/libnrl/downstream.py Normal file

@ -0,0 +1,193 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import math
import random
import warnings
import networkx as nx
import numpy as np
from sklearn.metrics import (accuracy_score, auc, classification_report,
f1_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
warnings.filterwarnings(
action='ignore', category=UserWarning, module='sklearn')
'''
#-----------------------------------------------------------------------------
# by Chengbin Hou 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
# node classification classifier
class ncClassifier(object):
def __init__(self, vectors, clf):
self.embeddings = vectors
self.clf = TopKRanker(clf) # here clf is LR
self.binarizer = MultiLabelBinarizer(sparse_output=True)
def split_train_evaluate(self, X, Y, train_precent, seed=0):
state = np.random.get_state()
training_size = int(train_precent * len(X))
# np.random.seed(seed)
shuffle_indices = np.random.permutation(np.arange(len(X)))
X_train = [X[shuffle_indices[i]] for i in range(training_size)]
Y_train = [Y[shuffle_indices[i]] for i in range(training_size)]
X_test = [X[shuffle_indices[i]] for i in range(training_size, len(X))]
Y_test = [Y[shuffle_indices[i]] for i in range(training_size, len(X))]
self.train(X_train, Y_train, Y)
np.random.set_state(state) # restore the global RNG state saved above so this split does not affect later random draws
return self.evaluate(X_test, Y_test)
def train(self, X, Y, Y_all):
# to support multi-labels, fit means dict mapping {orig cat: binarized vec}
self.binarizer.fit(Y_all)
X_train = [self.embeddings[x] for x in X]
# the binarizer was fitted on Y_all, so here we only transform
Y = self.binarizer.transform(Y)
self.clf.fit(X_train, Y)
def predict(self, X, top_k_list):
X_ = np.asarray([self.embeddings[x] for x in X])
# see TopKRanker(OneVsRestClassifier)
# the top k probs to be output...
Y = self.clf.predict(X_, top_k_list=top_k_list)
return Y
def evaluate(self, X, Y):
# multi-labels, diff len of labels of each node
top_k_list = [len(l) for l in Y]
Y_ = self.predict(X, top_k_list) # pred val of X_test i.e. Y_pred
Y = self.binarizer.transform(Y) # true val i.e. Y_test
averages = ["micro", "macro", "samples", "weighted"]
results = {}
for average in averages:
results[average] = f1_score(Y, Y_, average=average)
# print('Results, using embeddings of dimensionality', len(self.embeddings[X[0]]))
print(results)
return results
class TopKRanker(OneVsRestClassifier): # original LR or SVM is for binary clf
def predict(self, X, top_k_list): # re-define predict func of OneVsRestClassifier
probs = np.asarray(super(TopKRanker, self).predict_proba(X))
all_labels = []
for i, k in enumerate(top_k_list):
probs_ = probs[i, :]
labels = self.classes_[
probs_.argsort()[-k:]].tolist() # denote labels
probs_[:] = 0 # reset probs_ to all 0
probs_[labels] = 1 # reset probs_ to 1 if labels denoted...
all_labels.append(probs_)
return np.asarray(all_labels)
# link prediction binary classifier
class lpClassifier(object):
def __init__(self, vectors):
self.embeddings = vectors
# clf here is simply a similarity/distance metric
def evaluate(self, X_test, Y_test, seed=0):
state = np.random.get_state()
# np.random.seed(seed)
test_size = len(X_test)
# shuffle_indices = np.random.permutation(np.arange(test_size))
# X_test = [X_test[shuffle_indices[i]] for i in range(test_size)]
# Y_test = [Y_test[shuffle_indices[i]] for i in range(test_size)]
Y_true = [int(i) for i in Y_test]
Y_probs = []
for i in range(test_size):
start_node_emb = np.array(
self.embeddings[X_test[i][0]]).reshape(-1, 1)
end_node_emb = np.array(
self.embeddings[X_test[i][1]]).reshape(-1, 1)
# ranging from [-1, +1]
score = cosine_similarity(start_node_emb, end_node_emb)
# rescale to [0, 1]; passing the raw score as y_score
Y_probs.append((score + 1) / 2.0)
# to sklearn's roc_auc_score yields the same result
roc = roc_auc_score(y_true=Y_true, y_score=Y_probs)
if roc < 0.5:
roc = 1.0 - roc # since lp is binary clf task, just predict the opposite if<0.5
print("roc=", "{:.9f}".format(roc))
# plt_roc(Y_true, Y_probs) #enable to plot roc curve and return auc value
def norm(a):
sum = 0.0
for i in range(len(a)):
sum = sum + a[i] * a[i]
return math.sqrt(sum)
def cosine_similarity(a, b):
sum = 0.0
for i in range(len(a)):
sum = sum + a[i] * b[i]
# return sum/(norm(a) * norm(b))
# the tiny 1e-100 term avoids division by zero for all-zero vectors
return sum / (norm(a) * norm(b) + 1e-100)
'''
#cosine_similarity above is implemented by us...
#or try sklearn....
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity, cosine_distances, euclidean_distances # we may try diff metrics
#ref http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise
'''
def lp_train_test_split(graph, ratio=0.8, neg_pos_link_ratio=1.0):
# randomly split links/edges into training set and testing set
# *** note: we do not assume every node must be connected after removing links
# *** hence, the resulting graph might have a few isolated nodes --> more realistic scenario
# *** e.g. a user who has just signed up to a website has no links to others
# graph: OpenANE graph data structure
# ratio: fraction of links kept for training; in [0, 1]
# neg_pos_link_ratio: 1.0 means neg-links/pos-links = 1.0 i.e. the balanced case; in [0, +inf)
g = graph
print("links for training {:.2f}%, and links for testing {:.2f}%, neg_pos_link_ratio is {:.2f}".format(
ratio * 100, (1 - ratio) * 100, neg_pos_link_ratio))
test_pos_sample = []
test_neg_sample = []
train_size = int(ratio * len(g.G.edges))
test_size = len(g.G.edges) - train_size
# random.seed(2018) #generate testing set that contains both pos and neg samples
test_pos_sample = random.sample(list(g.G.edges()), int(test_size))  # random.sample needs a sequence, so materialize the edge view first
# test_neg_sample = random.sample(list(nx.classes.function.non_edges(g.G)), int(test_size * neg_pos_link_ratio)) #using nx build-in func, not efficient, to do...
# more efficient way:
test_neg_sample = []
num_neg_sample = int(test_size * neg_pos_link_ratio)
num = 0
while num < num_neg_sample:
pair_nodes = np.random.choice(g.look_back_list, size=2, replace=False)
if pair_nodes not in g.G.edges():
num += 1
test_neg_sample.append(list(pair_nodes))
test_edge_pair = test_pos_sample + test_neg_sample
test_edge_label = list(np.ones(len(test_pos_sample))) + \
list(np.zeros(len(test_neg_sample)))
print('before removing, the # of links: ', g.numDiEdges(),
'; the # of single nodes: ', g.numSingleNodes())
# training set should NOT contain testing set i.e. delete testing pos samples
g.G.remove_edges_from(test_pos_sample)
print('after removing, the # of links: ', g.numDiEdges(),
'; the # of single nodes: ', g.numSingleNodes())
print("# training links {0}; # positive testing links {1}; # negative testing links {2},".format(
g.numDiEdges(), len(test_pos_sample), len(test_neg_sample)))
return g.G, test_edge_pair, test_edge_label
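
`lp_train_test_split` builds its negative test set by rejection sampling: draw two distinct nodes at random and keep the pair if it is not an existing link. The sketch below shows the same idea on plain Python containers, assuming an undirected simple graph. The name `sample_negative_links` is hypothetical, and unlike the loop above this variant also rejects a pair that was already drawn, which is a design choice of the sketch rather than the behaviour of the function above.

```python
import random


def sample_negative_links(nodes, edges, num_samples, rng=None):
    """Draw num_samples distinct node pairs that are not in edges (undirected)."""
    rng = rng or random.Random()
    existing = {frozenset(e) for e in edges}
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if num_samples > max_possible:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in existing:  # the set also drops repeated draws
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]


# Path graph 0-1-2-3: exactly three node pairs are not linked.
toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 2), (2, 3)]
neg = sample_negative_links(toy_nodes, toy_edges, 3, rng=random.Random(2018))
```

Checking the request against the number of available non-edges up front avoids an endless loop on dense graphs.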


@ -0,0 +1,2 @@
from __future__ import print_function
from __future__ import division

168
src/libnrl/gcn/gcnAPI.py Normal file

@ -0,0 +1,168 @@
import numpy as np
from .utils import *
from . import models
import time
import scipy.sparse as sp
import networkx as nx # explicit import for nx.adjacency_matrix in preprocess_data
import tensorflow as tf
class GCN(object):
def __init__(self, graph, learning_rate=0.01, epochs=200,
hidden1=16, dropout=0.5, weight_decay=5e-4, early_stopping=10,
max_degree=3, clf_ratio=0.1):
"""
learning_rate: Initial learning rate
epochs: Number of epochs to train
hidden1: Number of units in hidden layer 1
dropout: Dropout rate (1 - keep probability)
weight_decay: Weight for L2 loss on embedding matrix
early_stopping: Tolerance for early stopping (# of epochs)
max_degree: Maximum Chebyshev polynomial degree
"""
self.graph = graph
self.clf_ratio = clf_ratio
self.learning_rate = learning_rate
self.epochs = epochs
self.hidden1 = hidden1
self.dropout = dropout
self.weight_decay = weight_decay
self.early_stopping = early_stopping
self.max_degree = max_degree
self.preprocess_data()
self.build_placeholders()
# Create model
self.model = models.GCN(self.placeholders, input_dim=self.features[2][1], hidden1=self.hidden1, weight_decay=self.weight_decay, logging=True)
# Initialize session
self.sess = tf.Session()
# Init variables
self.sess.run(tf.global_variables_initializer())
cost_val = []
# Train model
for epoch in range(self.epochs):
t = time.time()
# Construct feed dictionary
feed_dict = self.construct_feed_dict(self.train_mask)
feed_dict.update({self.placeholders['dropout']: self.dropout})
# Training step
outs = self.sess.run([self.model.opt_op, self.model.loss, self.model.accuracy], feed_dict=feed_dict)
# Validation
cost, acc, duration = self.evaluate(self.val_mask)
cost_val.append(cost)
# Print results
print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
"train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
"val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))
''' #early stopping is disabled: something seems wrong with it; to do...
if epoch > self.early_stopping and cost_val[-1] > np.mean(cost_val[-(self.early_stopping+1):-1]):
print("Early stopping...")
break
'''
print("Optimization Finished!")
# Testing
test_cost, test_acc, test_duration = self.evaluate(self.test_mask)
print("Test set results:", "cost=", "{:.5f}".format(test_cost),
"accuracy=", "{:.5f}".format(test_acc), "time=", "{:.5f}".format(test_duration))
# Define model evaluation function
def evaluate(self, mask):
t_test = time.time()
feed_dict_val = self.construct_feed_dict(mask)
outs_val = self.sess.run([self.model.loss, self.model.accuracy], feed_dict=feed_dict_val)
return outs_val[0], outs_val[1], (time.time() - t_test)
def build_placeholders(self):
num_supports = 1
self.placeholders = {
'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(self.features[2], dtype=tf.int64)),
'labels': tf.placeholder(tf.float32, shape=(None, self.labels.shape[1])),
'labels_mask': tf.placeholder(tf.int32),
'dropout': tf.placeholder_with_default(0., shape=()),
# helper variable for sparse dropout
'num_features_nonzero': tf.placeholder(tf.int32)
}
def build_label(self):
g = self.graph.G
look_up = self.graph.look_up_dict
labels = []
label_dict = {}
label_id = 0
for node in g.nodes():
labels.append((node, g.nodes[node]['label']))
for l in g.nodes[node]['label']:
if l not in label_dict:
label_dict[l] = label_id
label_id += 1
self.labels = np.zeros((len(labels), label_id))
self.label_dict = label_dict
for node, l in labels:
node_id = look_up[node]
for ll in l:
l_id = label_dict[ll]
self.labels[node_id][l_id] = 1
def build_train_val_test(self):
"""
build train_mask test_mask val_mask
"""
train_precent = self.clf_ratio
training_size = int(train_precent * self.graph.G.number_of_nodes())
state = np.random.get_state()
np.random.seed(0)
shuffle_indices = np.random.permutation(np.arange(self.graph.G.number_of_nodes()))
np.random.set_state(state)
look_up = self.graph.look_up_dict
g = self.graph.G
def sample_mask(begin, end):
mask = np.zeros(g.number_of_nodes())
for i in range(begin, end):
mask[shuffle_indices[i]] = 1
return mask
# nodes_num = len(self.labels)
# self.train_mask = sample_mask('train', nodes_num)
# self.val_mask = sample_mask('valid', nodes_num)
# self.test_mask = sample_mask('test', nodes_num)
self.train_mask = sample_mask(0, training_size-100)
self.val_mask = sample_mask(training_size-100, training_size)
self.test_mask = sample_mask(training_size, g.number_of_nodes())
def preprocess_data(self):
"""
adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask
y_train, y_val, y_test can merge to y
"""
g = self.graph.G
look_back = self.graph.look_back_list
self.features = np.vstack([g.nodes[look_back[i]]['feature']
for i in range(g.number_of_nodes())])
self.features = preprocess_features(self.features)
self.build_label()
self.build_train_val_test()
adj = nx.adjacency_matrix(g) # the type of graph
self.support = [preprocess_adj(adj)]
def construct_feed_dict(self, labels_mask):
"""Construct feed dictionary."""
feed_dict = dict()
feed_dict.update({self.placeholders['labels']: self.labels})
feed_dict.update({self.placeholders['labels_mask']: labels_mask})
feed_dict.update({self.placeholders['features']: self.features})
feed_dict.update({self.placeholders['support'][i]: self.support[i] for i in range(len(self.support))})
feed_dict.update({self.placeholders['num_features_nonzero']: self.features[1].shape})
return feed_dict
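
The early-stopping rule that is commented out in the training loop compares the latest validation loss with the mean of the losses just before it. A minimal sketch of that criterion on a plain list follows. `should_stop` and `patience` are hypothetical names for this illustration; `patience` plays the role of `early_stopping`, and the warm-up condition is slightly simplified compared with the `epoch > self.early_stopping` check.

```python
def should_stop(val_losses, patience):
    """True if the latest loss exceeds the mean of the `patience` losses before it."""
    if len(val_losses) < patience + 1:
        return False  # not enough history yet
    recent = val_losses[-(patience + 1):-1]
    return val_losses[-1] > sum(recent) / len(recent)


# Made-up validation losses for illustration.
still_improving = should_stop([1.0, 0.9, 0.8, 0.7], patience=3)
got_worse = should_stop([1.0, 0.9, 0.8, 0.95], patience=3)
too_short = should_stop([0.5, 2.0], patience=3)
```

Inside the training loop this would be called once per epoch on `cost_val`, followed by `break` when it returns `True`.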

27
src/libnrl/gcn/inits.py Normal file

@ -0,0 +1,27 @@
import tensorflow as tf
import numpy as np
def uniform(shape, scale=0.05, name=None):
"""Uniform init."""
initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
return tf.Variable(initial, name=name)
def glorot(shape, name=None):
"""Glorot & Bengio (AISTATS 2010) init."""
init_range = np.sqrt(6.0/(shape[0]+shape[1]))
initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
return tf.Variable(initial, name=name)
def zeros(shape, name=None):
"""All zeros."""
initial = tf.zeros(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)
def ones(shape, name=None):
"""All ones."""
initial = tf.ones(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)
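
The `glorot` initializer above draws weights uniformly from [-r, r] with r = sqrt(6 / (fan_in + fan_out)). The sketch below reproduces only that arithmetic with the standard library, as an illustration of the range rather than a replacement for the TensorFlow version. `glorot_range` and `glorot_matrix` are hypothetical names.

```python
import math
import random


def glorot_range(fan_in, fan_out):
    """Half-width r of the uniform interval [-r, r]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_matrix(fan_in, fan_out, rng=None):
    """A fan_in x fan_out list of lists with entries drawn from U(-r, r)."""
    rng = rng or random.Random()
    limit = glorot_range(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# With fan_in + fan_out = 6 the range is sqrt(6/6) = 1.
weights = glorot_matrix(4, 2, rng=random.Random(0))
```

Wider layers get a smaller range, which keeps the variance of activations roughly constant from layer to layer.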

188
src/libnrl/gcn/layers.py Normal file

@ -0,0 +1,188 @@
from .inits import *
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
# global unique layer ID dictionary for layer name assignment
_LAYER_UIDS = {}
def get_layer_uid(layer_name=''):
"""Helper function, assigns unique layer IDs."""
if layer_name not in _LAYER_UIDS:
_LAYER_UIDS[layer_name] = 1
return 1
else:
_LAYER_UIDS[layer_name] += 1
return _LAYER_UIDS[layer_name]
def sparse_dropout(x, keep_prob, noise_shape):
"""Dropout for sparse tensors."""
random_tensor = keep_prob
random_tensor += tf.random_uniform(noise_shape)
dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool)
pre_out = tf.sparse_retain(x, dropout_mask)
return pre_out * (1./keep_prob)
def dot(x, y, sparse=False):
"""Wrapper for tf.matmul (sparse vs dense)."""
if sparse:
res = tf.sparse_tensor_dense_matmul(x, y)
else:
res = tf.matmul(x, y)
return res
class Layer(object):
"""Base layer class. Defines basic API for all layer objects.
Implementation inspired by keras (http://keras.io).
# Properties
name: String, defines the variable scope of the layer.
logging: Boolean, switches Tensorflow histogram logging on/off
# Methods
_call(inputs): Defines computation graph of layer
(i.e. takes input, returns output)
__call__(inputs): Wrapper for _call()
_log_vars(): Log all variables
"""
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
layer = self.__class__.__name__.lower()
name = layer + '_' + str(get_layer_uid(layer))
self.name = name
self.vars = {}
logging = kwargs.get('logging', False)
self.logging = logging
self.sparse_inputs = False
def _call(self, inputs):
return inputs
def __call__(self, inputs):
with tf.name_scope(self.name):
if self.logging and not self.sparse_inputs:
tf.summary.histogram(self.name + '/inputs', inputs)
outputs = self._call(inputs)
if self.logging:
tf.summary.histogram(self.name + '/outputs', outputs)
return outputs
def _log_vars(self):
for var in self.vars:
tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])
class Dense(Layer):
"""Dense layer."""
def __init__(self, input_dim, output_dim, placeholders, dropout=0., sparse_inputs=False,
act=tf.nn.relu, bias=False, featureless=False, **kwargs):
super(Dense, self).__init__(**kwargs)
if dropout:
self.dropout = placeholders['dropout']
else:
self.dropout = 0.
self.act = act
self.sparse_inputs = sparse_inputs
self.featureless = featureless
self.bias = bias
# helper variable for sparse dropout
self.num_features_nonzero = placeholders['num_features_nonzero']
with tf.variable_scope(self.name + '_vars'):
self.vars['weights'] = glorot([input_dim, output_dim],
name='weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')
if self.logging:
self._log_vars()
def _call(self, inputs):
x = inputs
# dropout
if self.sparse_inputs:
x = sparse_dropout(x, 1-self.dropout, self.num_features_nonzero)
else:
x = tf.nn.dropout(x, 1-self.dropout)
# transform
output = dot(x, self.vars['weights'], sparse=self.sparse_inputs)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class GraphConvolution(Layer):
"""Graph convolution layer."""
def __init__(self, input_dim, output_dim, placeholders, dropout=0.,
sparse_inputs=False, act=tf.nn.relu, bias=False,
featureless=False, **kwargs):
super(GraphConvolution, self).__init__(**kwargs)
if dropout:
self.dropout = placeholders['dropout']
else:
self.dropout = 0.
self.act = act
self.support = placeholders['support']
self.sparse_inputs = sparse_inputs
self.featureless = featureless
self.bias = bias
# helper variable for sparse dropout
self.num_features_nonzero = placeholders['num_features_nonzero']
with tf.variable_scope(self.name + '_vars'):
for i in range(len(self.support)):
self.vars['weights_' + str(i)] = glorot([input_dim, output_dim],
name='weights_' + str(i))
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')
if self.logging:
self._log_vars()
def _call(self, inputs):
x = inputs
# dropout
if self.sparse_inputs:
x = sparse_dropout(x, 1-self.dropout, self.num_features_nonzero)
else:
x = tf.nn.dropout(x, 1-self.dropout)
# convolve
supports = list()
for i in range(len(self.support)):
if not self.featureless:
pre_sup = dot(x, self.vars['weights_' + str(i)],
sparse=self.sparse_inputs)
else:
pre_sup = self.vars['weights_' + str(i)]
support = dot(self.support[i], pre_sup, sparse=True)
supports.append(support)
output = tf.add_n(supports)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)

20
src/libnrl/gcn/metrics.py Normal file

@ -0,0 +1,20 @@
import tensorflow as tf
def masked_softmax_cross_entropy(preds, labels, mask):
"""Softmax cross-entropy loss with masking."""
loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
loss *= mask
return tf.reduce_mean(loss)
def masked_accuracy(preds, labels, mask):
"""Accuracy with masking."""
correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
accuracy_all = tf.cast(correct_prediction, tf.float32)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
accuracy_all *= mask
return tf.reduce_mean(accuracy_all)
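Both metrics above rely on the same trick: dividing the mask by its mean before taking the mean of the product gives the average over the masked entries only. The pure-Python sketch below checks this on made-up numbers; `masked_mean` is a hypothetical helper, not a function from this commit.

```python
def masked_mean(values, mask):
    """Mean of `values` over entries whose mask is 1, via the rescaled mask.

    Assumes at least one mask entry is non-zero (otherwise the mask mean is 0).
    """
    mask = [float(m) for m in mask]
    mean_mask = sum(mask) / len(mask)
    scaled = [m / mean_mask for m in mask]          # mask /= mean(mask)
    weighted = [v * s for v, s in zip(values, scaled)]
    return sum(weighted) / len(values)              # mean over ALL entries


losses = [0.2, 1.0, 0.6, 3.0]   # made-up per-node losses
mask = [1, 0, 1, 0]             # only nodes 0 and 2 are labelled
print(masked_mean(losses, mask))
```

The result equals the plain average of the two labelled losses, (0.2 + 0.6) / 2, even though the final mean runs over all four nodes.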

179
src/libnrl/gcn/models.py Normal file

@ -0,0 +1,179 @@
from .layers import *
from .metrics import *
flags = tf.app.flags
FLAGS = flags.FLAGS
class Model(object):
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
name = self.__class__.__name__.lower()
self.name = name
logging = kwargs.get('logging', False)
self.logging = logging
self.vars = {}
self.placeholders = {}
self.layers = []
self.activations = []
self.inputs = None
self.outputs = None
self.loss = 0
self.accuracy = 0
self.optimizer = None
self.opt_op = None
def _build(self):
raise NotImplementedError
def build(self):
""" Wrapper for _build() """
with tf.variable_scope(self.name):
self._build()
# Build sequential layer model
self.activations.append(self.inputs)
for layer in self.layers:
hidden = layer(self.activations[-1])
self.activations.append(hidden)
self.outputs = self.activations[-1]
# Store model variables for easy access
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
# Build metrics
self._loss()
self._accuracy()
self.opt_op = self.optimizer.minimize(self.loss)
def predict(self):
pass
def _loss(self):
raise NotImplementedError
def _accuracy(self):
raise NotImplementedError
def save(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = saver.save(sess, "tmp/%s.ckpt" % self.name)
print("Model saved in file: %s" % save_path)
def load(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = "tmp/%s.ckpt" % self.name
saver.restore(sess, save_path)
print("Model restored from file: %s" % save_path)
class MLP(Model):
def __init__(self, placeholders, input_dim, **kwargs):
super(MLP, self).__init__(**kwargs)
self.inputs = placeholders['features']
self.input_dim = input_dim
# self.input_dim = self.inputs.get_shape().as_list()[1] # To be supported in future Tensorflow versions
self.output_dim = placeholders['labels'].get_shape().as_list()[1]
self.placeholders = placeholders
self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
self.build()
def _loss(self):
# Weight decay loss
for var in self.layers[0].vars.values():
self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
# Cross entropy error
self.loss += masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _accuracy(self):
self.accuracy = masked_accuracy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _build(self):
self.layers.append(Dense(input_dim=self.input_dim,
output_dim=FLAGS.hidden1,
placeholders=self.placeholders,
act=tf.nn.relu,
dropout=True,
sparse_inputs=True,
logging=self.logging))
self.layers.append(Dense(input_dim=FLAGS.hidden1,
output_dim=self.output_dim,
placeholders=self.placeholders,
act=lambda x: x,
dropout=True,
logging=self.logging))
def predict(self):
return tf.nn.softmax(self.outputs)
class GCN(Model):
def __init__(self, placeholders, input_dim, hidden1, weight_decay, **kwargs):
super(GCN, self).__init__(**kwargs)
self.inputs = placeholders['features']
self.hidden1 = hidden1
self.weight_decay = weight_decay
self.input_dim = input_dim
# self.input_dim = self.inputs.get_shape().as_list()[1] # To be supported in future Tensorflow versions
self.output_dim = placeholders['labels'].get_shape().as_list()[1]
self.placeholders = placeholders
self.optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
self.build()
def _loss(self):
# Weight decay loss
for var in self.layers[0].vars.values():
self.loss += self.weight_decay * tf.nn.l2_loss(var)
# Cross entropy error
self.loss += masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _accuracy(self):
self.accuracy = masked_accuracy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _build(self):
self.layers.append(GraphConvolution(input_dim=self.input_dim,
output_dim=self.hidden1,
placeholders=self.placeholders,
act=tf.nn.relu,
dropout=True,
sparse_inputs=True,
logging=self.logging))
self.layers.append(GraphConvolution(input_dim=self.hidden1,
output_dim=self.output_dim,
placeholders=self.placeholders,
act=lambda x: x,
dropout=True,
logging=self.logging))
def predict(self):
return tf.nn.softmax(self.outputs)

107
src/libnrl/gcn/train.py Normal file

@ -0,0 +1,107 @@
from __future__ import division
from __future__ import print_function
import time
import tensorflow as tf
from gcn.utils import *
from gcn.models import GCN, MLP
# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)
# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('dataset', 'cora', 'Dataset string.') # 'cora', 'citeseer', 'pubmed'
flags.DEFINE_string('model', 'gcn', 'Model string.') # 'gcn', 'gcn_cheby', 'dense'
flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
flags.DEFINE_integer('epochs', 200, 'Number of epochs to train.')
flags.DEFINE_integer('hidden1', 16, 'Number of units in hidden layer 1.')
flags.DEFINE_float('dropout', 0.5, 'Dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 5e-4, 'Weight for L2 loss on embedding matrix.')
flags.DEFINE_integer('early_stopping', 10, 'Tolerance for early stopping (# of epochs).')
flags.DEFINE_integer('max_degree', 3, 'Maximum Chebyshev polynomial degree.')
# Load data
adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data(FLAGS.dataset)
# Some preprocessing
features = preprocess_features(features)
if FLAGS.model == 'gcn':
support = [preprocess_adj(adj)]
num_supports = 1
model_func = GCN
elif FLAGS.model == 'gcn_cheby':
support = chebyshev_polynomials(adj, FLAGS.max_degree)
num_supports = 1 + FLAGS.max_degree
model_func = GCN
elif FLAGS.model == 'dense':
support = [preprocess_adj(adj)] # Not used
num_supports = 1
model_func = MLP
else:
raise ValueError('Invalid argument for model: ' + str(FLAGS.model))
# Define placeholders
placeholders = {
'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
'labels_mask': tf.placeholder(tf.int32),
'dropout': tf.placeholder_with_default(0., shape=()),
'num_features_nonzero': tf.placeholder(tf.int32) # helper variable for sparse dropout
}
# Create model
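# GCN in this repo takes hidden1 and weight_decay explicitly; MLP still reads them from FLAGS
if model_func is GCN:
    model = model_func(placeholders, input_dim=features[2][1], hidden1=FLAGS.hidden1,
                       weight_decay=FLAGS.weight_decay, logging=True)
else:
    model = model_func(placeholders, input_dim=features[2][1], logging=True)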
# Initialize session
sess = tf.Session()
# Define model evaluation function
def evaluate(features, support, labels, mask, placeholders):
t_test = time.time()
feed_dict_val = construct_feed_dict(features, support, labels, mask, placeholders)
outs_val = sess.run([model.loss, model.accuracy], feed_dict=feed_dict_val)
return outs_val[0], outs_val[1], (time.time() - t_test)
# Init variables
sess.run(tf.global_variables_initializer())
cost_val = []
# Train model
for epoch in range(FLAGS.epochs):
t = time.time()
# Construct feed dictionary
feed_dict = construct_feed_dict(features, support, y_train, train_mask, placeholders)
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
# Training step
outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict)
# Validation
cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders)
cost_val.append(cost)
# Print results
print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
"train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
"val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))
if epoch > FLAGS.early_stopping and cost_val[-1] > np.mean(cost_val[-(FLAGS.early_stopping+1):-1]):
print("Early stopping...")
break
print("Optimization Finished!")
# Testing
test_cost, test_acc, test_duration = evaluate(features, support, y_test, test_mask, placeholders)
print("Test set results:", "cost=", "{:.5f}".format(test_cost),
"accuracy=", "{:.5f}".format(test_acc), "time=", "{:.5f}".format(test_duration))

152
src/libnrl/gcn/utils.py Normal file

@ -0,0 +1,152 @@
import numpy as np
import pickle as pkl
import networkx as nx
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
import sys
def parse_index_file(filename):
"""Parse index file."""
index = []
for line in open(filename):
index.append(int(line.strip()))
return index
def sample_mask(idx, l):
"""Create mask."""
mask = np.zeros(l)
mask[idx] = 1
return np.array(mask, dtype=bool)
def load_data(dataset_str):
"""Load data."""
names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
objects = []
for i in range(len(names)):
with open("data/ind.{}.{}".format(dataset_str, names[i]), 'rb') as f:
if sys.version_info > (3, 0):
objects.append(pkl.load(f, encoding='latin1'))
else:
objects.append(pkl.load(f))
x, y, tx, ty, allx, ally, graph = tuple(objects)
test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
test_idx_range = np.sort(test_idx_reorder)
if dataset_str == 'citeseer':
# Fix citeseer dataset (there are some isolated nodes in the graph)
# Find isolated nodes, add them as zero-vecs into the right position
test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder)+1)
tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
tx_extended[test_idx_range-min(test_idx_range), :] = tx
tx = tx_extended
ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
ty_extended[test_idx_range-min(test_idx_range), :] = ty
ty = ty_extended
features = sp.vstack((allx, tx)).tolil()
features[test_idx_reorder, :] = features[test_idx_range, :]
adj = nx.adjacency_matrix(nx.from_dict_of_lists(graph))
labels = np.vstack((ally, ty))
labels[test_idx_reorder, :] = labels[test_idx_range, :]
idx_test = test_idx_range.tolist()
idx_train = range(len(y))
idx_val = range(len(y), len(y)+500)
train_mask = sample_mask(idx_train, labels.shape[0])
val_mask = sample_mask(idx_val, labels.shape[0])
test_mask = sample_mask(idx_test, labels.shape[0])
y_train = np.zeros(labels.shape)
y_val = np.zeros(labels.shape)
y_test = np.zeros(labels.shape)
y_train[train_mask, :] = labels[train_mask, :]
y_val[val_mask, :] = labels[val_mask, :]
y_test[test_mask, :] = labels[test_mask, :]
return adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask
def sparse_to_tuple(sparse_mx):
"""Convert sparse matrix to tuple representation."""
def to_tuple(mx):
if not sp.isspmatrix_coo(mx):
mx = mx.tocoo()
coords = np.vstack((mx.row, mx.col)).transpose()
values = mx.data
shape = mx.shape
return coords, values, shape
if isinstance(sparse_mx, list):
for i in range(len(sparse_mx)):
sparse_mx[i] = to_tuple(sparse_mx[i])
else:
sparse_mx = to_tuple(sparse_mx)
return sparse_mx
def preprocess_features(features):
"""Row-normalize feature matrix and convert to tuple representation"""
rowsum = np.array(features.sum(1))
r_inv = np.power(rowsum, -1).flatten()
r_inv[np.isinf(r_inv)] = 0.
r_mat_inv = sp.diags(r_inv)
features = sp.coo_matrix(features)
features = r_mat_inv.dot(features)
return sparse_to_tuple(features)
def normalize_adj(adj):
"""Symmetrically normalize adjacency matrix."""
adj = sp.coo_matrix(adj)
rowsum = np.array(adj.sum(1))
d_inv_sqrt = np.power(rowsum, -0.5).flatten()
d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()
def preprocess_adj(adj):
"""Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation."""
adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))
return sparse_to_tuple(adj_normalized)
def construct_feed_dict(features, support, labels, labels_mask, placeholders):
"""Construct feed dictionary."""
feed_dict = dict()
feed_dict.update({placeholders['labels']: labels})
feed_dict.update({placeholders['labels_mask']: labels_mask})
feed_dict.update({placeholders['features']: features})
feed_dict.update({placeholders['support'][i]: support[i] for i in range(len(support))})
feed_dict.update({placeholders['num_features_nonzero']: features[1].shape})
return feed_dict
def chebyshev_polynomials(adj, k):
"""Calculate Chebyshev polynomials up to order k. Return a list of sparse matrices (tuple representation)."""
print("Calculating Chebyshev polynomials up to order {}...".format(k))
adj_normalized = normalize_adj(adj)
laplacian = sp.eye(adj.shape[0]) - adj_normalized
largest_eigval, _ = eigsh(laplacian, 1, which='LM')
scaled_laplacian = (2. / largest_eigval[0]) * laplacian - sp.eye(adj.shape[0])
t_k = list()
t_k.append(sp.eye(adj.shape[0]))
t_k.append(scaled_laplacian)
def chebyshev_recurrence(t_k_minus_one, t_k_minus_two, scaled_lap):
s_lap = sp.csr_matrix(scaled_lap, copy=True)
return 2 * s_lap.dot(t_k_minus_one) - t_k_minus_two
for i in range(2, k+1):
t_k.append(chebyshev_recurrence(t_k[-1], t_k[-2], scaled_laplacian))
return sparse_to_tuple(t_k)
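As a worked example of what `preprocess_adj` above computes, the sketch below applies the same symmetric normalization, D^{-1/2} (A + I) D^{-1/2}, to a 3-node path graph using plain lists. It assumes an undirected (symmetric) adjacency matrix; the helper names are hypothetical and the dense representation is for illustration only, since the original works on SciPy sparse matrices.

```python
import math


def add_self_loops(adj):
    """Return A + I for a dense list-of-lists adjacency matrix."""
    n = len(adj)
    return [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def normalize_adj_dense(adj):
    """Return D^{-1/2} A D^{-1/2}; rows with zero degree are left as zeros."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2 (made-up example).
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
normalized = normalize_adj_dense(add_self_loops(path))
for row in normalized:
    print(row)
```

With self-loops the degrees are 2, 3 and 2, so the entry for edge (0, 1) becomes 1 / sqrt(2 * 3) and the self-loop of node 0 becomes 1 / 2.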

160
src/libnrl/graph.py Normal file

@ -0,0 +1,160 @@
"""
commonly used graph APIs based on NetworkX;
use g.xxx to access the commonly used APIs offered by us;
use g.G.xxx to access NetworkX APIs;
by Chengbin Hou 2018 <chengbin.hou10@foxmail.com>
"""
import time
import random
import numpy as np
import scipy.sparse as sp
import networkx as nx
class Graph(object):
def __init__(self):
self.G = None #to access NetworkX graph data structure
self.look_up_dict = {} #use node ID to find index via g.look_up_dict['0']
self.look_back_list = [] #use index to find node ID via g.look_back_list[0]
#--------------------------------------------------------------------------------------
#--------------------commonly used APIs that will modify graph-------------------------
#--------------------------------------------------------------------------------------
def node_mapping(self):
""" node id and index mapping;
based on the order given by networkx G.nodes();
NB: updating is needed if any node is added/removed;
"""
i = 0 #node index
self.look_up_dict = {} #init
self.look_back_list = [] #init
for node_id in self.G.nodes(): #node id
self.look_up_dict[node_id] = i
self.look_back_list.append(node_id)
i += 1
def read_adjlist(self, path, directed=False):
""" read adjacency list format graph;
support unweighted and (un)directed graph;
format: see https://networkx.github.io/documentation/stable/reference/readwrite/adjlist.html
NB: weighted graphs are not supported
"""
if directed:
self.G = nx.read_adjlist(path, create_using=nx.DiGraph())
else:
self.G = nx.read_adjlist(path, create_using=nx.Graph())
self.node_mapping() #update node id index mapping
def read_edgelist(self, path, weighted=False, directed=False):
""" read edge list format graph;
support (un)weighted and (un)directed graph;
format: see https://networkx.github.io/documentation/stable/reference/readwrite/edgelist.html
"""
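# an explicit data spec is needed to parse a third column as a float edge weight
data = (('weight', float),) if weighted else True
create_using = nx.DiGraph() if directed else nx.Graph()
self.G = nx.read_edgelist(path, create_using=create_using, data=data)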
self.node_mapping() #update node id index mapping
def read_node_attr(self, path):
""" read node attributes and store as NetworkX graph {'node_id': {'attr': values}}
input file format: node_id1 attr1 attr2 ... attrM
node_id2 attr1 attr2 ... attrM
"""
with open(path, 'r') as fin:
for l in fin.readlines():
vec = l.split()
self.G.nodes[vec[0]]['attr'] = np.array([float(x) for x in vec[1:]])
def read_node_label(self, path):
""" todo... read node labels and store as NetworkX graph {'node_id': {'label': values}}
input file format: node_id1 labels
node_id2 labels
with open(path, 'r') as fin:
for l in fin.readlines():
vec = l.split()
self.G.nodes[vec[0]]['label'] = np.array([float(x) for x in vec[1:]])
"""
pass #to do...
def remove_edge(self, ratio=0.0):
""" randomly remove edges/links
ratio: the percentage of edges to be removed
edges_removed: return removed edges, each of which is a pair of nodes
"""
num_edges_removed = int( ratio * self.G.number_of_edges() )
#random.seed(2018)
edges_removed = random.sample(list(self.G.edges()), num_edges_removed) #random.sample needs a sequence, not an edge view
print('before removing, the # of edges: ', self.G.number_of_edges())
self.G.remove_edges_from(edges_removed)
print('after removing, the # of edges: ', self.G.number_of_edges())
return edges_removed
def remove_node_attr(self, ratio):
""" todo... randomly remove node attributes;
"""
pass #to do...
def remove_node(self, ratio):
""" todo... randomly remove nodes;
#self.node_mapping() #update node id index mapping is needed
"""
pass #to do...
#------------------------------------------------------------------------------------------
#--------------------commonly used APIs that will not modify graph-------------------------
#------------------------------------------------------------------------------------------
def get_adj_mat(self, is_sparse=True):
""" return adjacency matrix;
use 'csr' format for sparse matrix
"""
if is_sparse:
return nx.to_scipy_sparse_matrix(self.G, nodelist=self.look_back_list, format='csr', dtype='float64')
else:
return nx.to_numpy_matrix(self.G, nodelist=self.look_back_list, dtype='float64')
def get_attr_mat(self, is_sparse=True):
""" return attribute matrix;
use 'csr' format for sparse matrix
"""
attr_dense_narray = np.vstack([self.G.nodes[self.look_back_list[i]]['attr'] for i in range(self.get_num_nodes())])
if is_sparse:
return sp.csr_matrix(attr_dense_narray, dtype='float64')
else:
return np.matrix(attr_dense_narray, dtype='float64')
def get_num_nodes(self):
""" return the number of nodes """
return nx.number_of_nodes(self.G)
def get_num_edges(self):
""" return the number of edges """
return nx.number_of_edges(self.G)
def get_num_isolates(self):
""" return the number of isolated nodes """
return len(list(nx.isolates(self.G)))
def get_isdirected(self):
""" return True if it is directed graph """
return nx.is_directed(self.G)
def get_isweighted(self):
""" return True if it is weighted graph """
return nx.is_weighted(self.G)
def get_neighbors(self, node):
""" return neighbors connected to a node """
return list(nx.neighbors(self.G, node))
def get_common_neighbors(self, node1, node2):
""" return common neighbors of two nodes """
return list(nx.common_neighbors(self.G, node1, node2))
def get_centrality(self, centrality_type='degree'):
""" todo... return specified type of centrality
see https://networkx.github.io/documentation/stable/reference/algorithms/centrality.html
"""
pass #to do...
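The two lookup structures kept by `Graph.node_mapping` above are inverses of each other: `look_up_dict` maps a node ID to its row index, and `look_back_list` maps the index back to the ID. A minimal sketch of that invariant, independent of NetworkX (`build_mapping` and the node IDs are hypothetical):

```python
def build_mapping(node_ids):
    """Return (look_up_dict, look_back_list) for node IDs in iteration order."""
    look_up = {}
    look_back = []
    for index, node_id in enumerate(node_ids):
        look_up[node_id] = index
        look_back.append(node_id)
    return look_up, look_back


look_up, look_back = build_mapping(['a', 'c', 'b'])
print(look_up['c'], look_back[2])
```

Because the adjacency and attribute matrices are built with `nodelist=self.look_back_list`, row `i` of either matrix belongs to node `look_back_list[i]`, so the mapping has to be rebuilt whenever nodes are added or removed.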


@ -0,0 +1,117 @@
## GraphSage: Representation Learning on Large Graphs
#### Authors: [William L. Hamilton](http://stanford.edu/~wleif) (wleif@stanford.edu), [Rex Ying](http://joy-of-thinking.weebly.com/) (rexying@stanford.edu)
#### [Project Website](http://snap.stanford.edu/graphsage/)
#### [Alternative reference PyTorch implementation](https://github.com/williamleif/graphsage-simple/)
### Overview
This directory contains code necessary to run the GraphSage algorithm.
GraphSage can be viewed as a stochastic generalization of graph convolutions, and it is especially useful for massive, dynamic graphs that contain rich feature information.
See our [paper](https://arxiv.org/pdf/1706.02216.pdf) for details on the algorithm.
*Note:* GraphSage now also has better support for training on smaller, static graphs and graphs that don't have node features.
The original algorithm and paper are focused on the task of inductive generalization (i.e., generating embeddings for nodes that were not present during training),
but many benchmarks/tasks use simple static graphs that do not necessarily have features.
To support this use case, GraphSage now includes optional "identity features" that can be used with or without other node attributes.
Including identity features will increase the runtime, but also potentially increase performance (at the usual risk of overfitting).
See the section on "Running the code" below.
*Note:* GraphSage is intended for use on large graphs (>100,000 nodes). The overhead of subsampling will start to outweigh its benefits on smaller graphs.
The example_data subdirectory contains a small example of the protein-protein interaction data,
which includes 3 training graphs + one validation graph and one test graph.
The full Reddit and PPI datasets (described in the paper) are available on the [project website](http://snap.stanford.edu/graphsage/).
If you make use of this code or the GraphSage algorithm in your work, please cite the following paper:
@inproceedings{hamilton2017inductive,
author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
title = {Inductive Representation Learning on Large Graphs},
booktitle = {NIPS},
year = {2017}
}
### Requirements
Recent versions of TensorFlow, numpy, scipy, sklearn, and networkx are required (but networkx must be <=1.11). You can install all the required packages using the following command:
$ pip install -r requirements.txt
To guarantee that you have the right package versions, you can use [docker](https://docs.docker.com/) to easily set up a virtual environment. See the Docker subsection below for more info.
#### Docker
If you do not have [docker](https://docs.docker.com/) installed, you will need to install it. (Just click on the preceding link; the installation is pretty painless.)
You can run GraphSage inside a [docker](https://docs.docker.com/) image. After cloning the project, build and run the image as follows:
$ docker build -t graphsage .
$ docker run -it graphsage bash
or start a Jupyter Notebook instead of bash:
$ docker run -it -p 8888:8888 graphsage
You can also run the GPU image using [nvidia-docker](https://github.com/NVIDIA/nvidia-docker):
$ docker build -t graphsage:gpu -f Dockerfile.gpu .
$ nvidia-docker run -it graphsage:gpu bash
### Running the code
The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSage, respectively.
If your benchmark/task does not require generalizing to unseen data, we recommend you try setting the "--identity_dim" flag to a value in the range [64,256].
This flag will make the model embed unique node ids as attributes, which will increase the runtime and number of parameters but also potentially increase the performance.
Note that you should set this flag and *not* try to pass dense one-hot vectors as features (due to sparsity).
The "dimension" of identity features specifies how many parameters there are per node in the sparse identity-feature lookup table.
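As a rough illustration of the cost described above, the identity-feature lookup table adds one `identity_dim`-sized vector per node. The helper below is hypothetical and the node count is made up; the actual table in the code may differ slightly (for example, by a padding row).

```python
def identity_table_params(num_nodes, identity_dim):
    """Approximate number of trainable values in the identity-feature table."""
    if num_nodes < 0 or identity_dim < 0:
        raise ValueError("num_nodes and identity_dim must be non-negative")
    return num_nodes * identity_dim


# Extra parameters for a hypothetical 10,000-node graph at the suggested sizes.
for dim in (64, 128, 256):
    print(dim, identity_table_params(10000, dim))
```

The count grows linearly in both the number of nodes and `--identity_dim`, which is why the flag increases both runtime and the risk of overfitting.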
Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance.
We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate).
*Note:* For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the `--sigmoid` flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting.
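The difference between the two settings can be seen on a single node's class scores. This sketch assumes the default setting corresponds to a softmax over classes and `--sigmoid` to independent per-class sigmoids; the logits are made up.

```python
import math


def softmax(logits):
    """Class probabilities that compete with each other and sum to 1."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent per-class probabilities; several can be high at once."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 2.0, -3.0]   # a node that plausibly belongs to classes 0 and 1
print(softmax(logits))
print(sigmoid(logits))
```

With softmax, the two likely classes split the probability mass and neither exceeds 0.5, whereas the sigmoid scores let both be predicted at the same time.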
#### Input format
At minimum, the code requires a --train_prefix option, which identifies the following data files:
* <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
* <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers.
* <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes.
* <train_prefix>-feats.npy [optional] --- A numpy-stored array of node features; ordering given by id_map.json. Can be omitted and only identity features will be used.
* <train_prefix>-walks.txt [optional] --- A text file specifying random walk co-occurrences, one pair per line (only for the unsupervised version of GraphSage).
To run the model on a new dataset, you need to make data files in the format described above.
To run random walks for the unsupervised model (and to generate the <prefix>-walks.txt file),
you can use the `run_walks` function in `graphsage.utils`.
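Before launching a run on a new dataset, it can help to check that the files listed above exist for a given prefix. The sketch below only encodes the file-name suffixes from that list; `missing_inputs` and the example prefix are hypothetical and not part of the GraphSage code.

```python
import os

# Suffixes taken from the input-format list above.
REQUIRED_SUFFIXES = ("-G.json", "-id_map.json", "-class_map.json")
OPTIONAL_SUFFIXES = ("-feats.npy", "-walks.txt")


def missing_inputs(train_prefix):
    """Return the required input files that do not exist for this prefix."""
    return [train_prefix + suffix for suffix in REQUIRED_SUFFIXES
            if not os.path.isfile(train_prefix + suffix)]


# Hypothetical prefix; prints the required files that still need to be created.
print(missing_inputs(os.path.join("my_data", "my_graph")))
```

An empty list means all required files are present; the optional files can be checked the same way with `OPTIONAL_SUFFIXES`.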
#### Model variants
The user must also specify a --model, the variants of which are described in detail in the paper:
* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)
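The max-pooling and mean-pooling variants differ only in the element-wise reduction applied across a node's sampled neighbors. The sketch below shows just that reduction step on made-up vectors; it leaves out the learned transformation that the pooling aggregators apply to each neighbor first, and the helper names are hypothetical.

```python
def elementwise_reduce(neigh_vecs, reducer):
    """Apply `reducer` to each feature dimension across all neighbor vectors."""
    return [reducer(column) for column in zip(*neigh_vecs)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Three sampled neighbors with two features each (made-up numbers).
neighbors = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(elementwise_reduce(neighbors, max))
print(elementwise_reduce(neighbors, mean))
```

Max keeps the strongest signal per feature, while the mean smooths over all sampled neighbors.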
#### Logging directory
Finally, a --base_log_dir should be specified (it defaults to the current directory).
The output of the model and log files will be stored in a subdirectory of the base_log_dir.
The path to the logged data will be of the form `<sup/unsup>-<data_prefix>/graphsage-<model_description>/`.
The supervised model will output F1 scores, while the unsupervised model will train embeddings and store them.
The unsupervised embeddings will be stored in a numpy-formatted file named val.npy, with val.txt specifying the order of the embeddings as a per-line list of node ids.
Note that the full log outputs and stored embeddings can be 5-10Gb in size (on the full data when running with the unsupervised variant).
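To look up the embedding of a particular node, the row order in val.txt has to be matched against the rows of val.npy. A minimal sketch of the text-file half, assuming one node id per line as described above (the helper names are hypothetical):

```python
def read_node_order(path):
    """Return node ids in file order, one id per non-empty line."""
    with open(path) as fin:
        return [line.strip() for line in fin if line.strip()]


def row_index_by_node(path):
    """Map each node id to the row of its embedding in the stored array."""
    return {node_id: row for row, node_id in enumerate(read_node_order(path))}
```

With the array loaded separately via `numpy.load` on val.npy, the embedding of a node is then the row at `row_index_by_node(...)[node_id]`. Node ids are kept as strings here; convert them if your graph uses integer ids.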
#### Using the output of the unsupervised models
The unsupervised variants of GraphSage will output embeddings to the logging directory as described above.
These embeddings can then be used in downstream machine learning applications.
The `eval_scripts` directory contains examples of feeding the embeddings into simple logistic classifiers.
#### Acknowledgements
This code base was originally forked from https://github.com/tkipf/gcn/, and we owe many thanks to Thomas Kipf for making his code available.
We also thank Yuanfang Li and Xin Li who contributed to a course project that was based on this work.
Please see the [paper](https://arxiv.org/pdf/1706.02216.pdf) for funding details and additional (non-code related) acknowledgements.


@ -0,0 +1,40 @@
from __future__ import print_function
from __future__ import division
import numpy as np
import tensorflow as tf
#default parameters
#seed = 2018
#np.random.seed(seed)
#tf.set_random_seed(seed)
log_device_placement = False
# follow the original code by the paper author https://github.com/williamleif/GraphSAGE
# we follow the opt parameters given by papers GCN and graphSAGE
# note: citeseer+pubmed all follow the same parameters as cora, see their papers
# tensorflow + Adam optimizer + Random weight init + row norm of attr
epochs = 100
dim_1 = 64 #dim = dim1+dim2 = 128
dim_2 = 64
samples_1 = 25
samples_2 = 10
dropout = 0.5
weight_decay = 0.0001
learning_rate = 0.0001
batch_size = 128 #reduce (e.g. to 64) if you run out of memory; default=512
normalize = True #row norm of node attributes/features
#other parameters that the paper did not mention; we also follow the defaults in https://github.com/williamleif/GraphSAGE
model_size = 'small'
max_degree = 100
neg_sample_size = 20
random_context = True
validate_batch_size = 64 #reduce if you run out of memory; default=256
validate_iter = 5000
max_total_steps = 10**10
n2v_test_epochs = 1
identity_dim = 0
train_prefix = ''
base_log_dir = ''
#print_every = 50


@ -0,0 +1,450 @@
import tensorflow as tf
from libnrl.graphsage.layers import Layer, Dense
from libnrl.graphsage.inits import glorot, zeros
class MeanAggregator(Layer):
"""
Aggregates via mean followed by matmul and non-linearity.
"""
def __init__(self, input_dim, output_dim, neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu,
name=None, concat=False, **kwargs):
super(MeanAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([neigh_input_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
neigh_means = tf.reduce_mean(neigh_vecs, axis=1)
# [nodes] x [out_dim]
from_neighs = tf.matmul(neigh_means, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class GCNAggregator(Layer):
"""
Aggregates via mean followed by matmul and non-linearity.
The same matmul parameters are used for the self vector and the neighbor vectors.
"""
def __init__(self, input_dim, output_dim, neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(GCNAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
with tf.variable_scope(self.name + name + '_vars'):
self.vars['weights'] = glorot([neigh_input_dim, output_dim],
name='neigh_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
means = tf.reduce_mean(tf.concat([neigh_vecs,
tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)
# [nodes] x [out_dim]
output = tf.matmul(means, self.vars['weights'])
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class MaxPoolingAggregator(Layer):
""" Aggregates via max-pooling over MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(MaxPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 512
elif model_size == "big":
hidden_dim = self.hidden_dim = 1024
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
neigh_h = tf.reduce_max(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class MeanPoolingAggregator(Layer):
""" Aggregates via mean-pooling over MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(MeanPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 512
elif model_size == "big":
hidden_dim = self.hidden_dim = 1024
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
neigh_h = tf.reduce_mean(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class TwoMaxLayerPoolingAggregator(Layer):
""" Aggregates via pooling over two MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(TwoMaxLayerPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim_1 = self.hidden_dim_1 = 512
hidden_dim_2 = self.hidden_dim_2 = 256
elif model_size == "big":
hidden_dim_1 = self.hidden_dim_1 = 1024
hidden_dim_2 = self.hidden_dim_2 = 512
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim_1,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
self.mlp_layers.append(Dense(input_dim=hidden_dim_1,
output_dim=hidden_dim_2,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim_2, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim_2))
neigh_h = tf.reduce_max(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class SeqAggregator(Layer):
""" Aggregates via a standard LSTM.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(SeqAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 128
elif model_size == "big":
hidden_dim = self.hidden_dim = 256
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')  # use the argument: self.output_dim is only assigned below
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
self.cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
dims = tf.shape(neigh_vecs)
batch_size = dims[0]
initial_state = self.cell.zero_state(batch_size, tf.float32)
used = tf.sign(tf.reduce_max(tf.abs(neigh_vecs), axis=2))
length = tf.reduce_sum(used, axis=1)
length = tf.maximum(length, tf.constant(1.))
length = tf.cast(length, tf.int32)
with tf.variable_scope(self.name) as scope:
try:
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
self.cell, neigh_vecs,
initial_state=initial_state, dtype=tf.float32, time_major=False,
sequence_length=length)
except ValueError:
scope.reuse_variables()
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
self.cell, neigh_vecs,
initial_state=initial_state, dtype=tf.float32, time_major=False,
sequence_length=length)
batch_size = tf.shape(rnn_outputs)[0]
max_len = tf.shape(rnn_outputs)[1]
out_size = int(rnn_outputs.get_shape()[2])
index = tf.range(0, batch_size) * max_len + (length - 1)
flat = tf.reshape(rnn_outputs, [-1, out_size])
neigh_h = tf.gather(flat, index)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
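The forward pass shared by the aggregators above (transform the self vector and the pooled neighbor vectors separately, then add or concatenate) can be sketched in NumPy. This is a minimal illustration of the mean variant only, assuming no dropout and no bias; `mean_aggregate` and its argument names are hypothetical and do not exist in this package.

```python
import numpy as np


def mean_aggregate(self_vecs, neigh_vecs, w_self, w_neigh, concat=False):
    """NumPy sketch of the mean aggregator's forward pass (no dropout, no bias).

    self_vecs:  [nodes, input_dim]
    neigh_vecs: [nodes, num_sampled_neighbors, neigh_input_dim]
    """
    neigh_means = neigh_vecs.mean(axis=1)          # [nodes, neigh_input_dim]
    from_neighs = neigh_means @ w_neigh            # [nodes, output_dim]
    from_self = self_vecs @ w_self                 # [nodes, output_dim]
    if concat:
        output = np.concatenate([from_self, from_neighs], axis=1)
    else:
        output = from_self + from_neighs
    return np.maximum(output, 0.0)                 # ReLU


rng = np.random.default_rng(0)
self_vecs = rng.normal(size=(4, 8))
neigh_vecs = rng.normal(size=(4, 10, 8))
w_self = rng.normal(size=(8, 16))
w_neigh = rng.normal(size=(8, 16))

summed = mean_aggregate(self_vecs, neigh_vecs, w_self, w_neigh)
concatenated = mean_aggregate(self_vecs, neigh_vecs, w_self, w_neigh, concat=True)
print(summed.shape, concatenated.shape)
```

With `concat=True` the output width doubles, which is worth keeping in mind when sizing the next layer or a bias vector.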

@ -0,0 +1,112 @@
# -*- coding: utf-8 -*-
'''
#-----------------------------------------------------------------------------
# author: Chengbin Hou @ SUSTech 2018
# Email: Chengbin.Hou10@foxmail.com
# we provide utils to transform the original data into graphSAGE format
# you may easily use these APIs as we demonstrated in main.py of OpenANE
# the APIs are designed for the unsupervised setting; for the supervised setting, please complete the 'label' to-do code...
#-----------------------------------------------------------------------------
'''
from networkx.readwrite import json_graph
import json
import random
import networkx as nx
import numpy as np
from libnrl.graphsage import unsupervised_train
def add_train_val_test_to_G(graph, test_perc=0.0, val_perc=0.1): #unsupervised training needs no test data
G = graph.G #take out nx G
random.seed(2018)
num_nodes = nx.number_of_nodes(G)
test_ind = set(random.sample(range(0, num_nodes), int(num_nodes*test_perc))) #sets give O(1) membership tests in the loop below
val_ind = set(random.sample(range(0, num_nodes), int(num_nodes*val_perc)))
for ind in range(0, num_nodes):
id = graph.look_back_list[ind]
if ind in test_ind:
G.nodes[id]['test'] = True
G.nodes[id]['val'] = False
elif ind in val_ind:
G.nodes[id]['test'] = False
G.nodes[id]['val'] = True
else:
G.nodes[id]['test'] = False
G.nodes[id]['val'] = False
## Make sure the graph has edge train_removed annotations
## (some datasets might already have this..)
print("Loaded data.. now preprocessing..")
for edge in G.edges():
if (G.nodes[edge[0]]['val'] or G.nodes[edge[1]]['val'] or
G.nodes[edge[0]]['test'] or G.nodes[edge[1]]['test']):
G[edge[0]][edge[1]]['train_removed'] = True
else:
G[edge[0]][edge[1]]['train_removed'] = False
return G
def run_random_walks(G, num_walks=50, walk_len=5):
nodes = [n for n in G.nodes() if not G.nodes[n]["val"] and not G.nodes[n]["test"]]
G = G.subgraph(nodes)
pairs = []
for count, node in enumerate(nodes):
if G.degree(node) == 0:
continue
for i in range(num_walks):
curr_node = node
for j in range(walk_len):
if len(list(G.neighbors(curr_node))) == 0: #isolated nodes often appear in real-world graphs
break
next_node = random.choice(list(G.neighbors(curr_node))) #changed due to compatibility
#next_node = random.choice(G.neighbors(curr_node))
# self co-occurrences are useless
if curr_node != node:
pairs.append((node,curr_node))
curr_node = next_node
if count % 1000 == 0:
print("Done walks for", count, "nodes")
return pairs
def tranform_data_for_graphsage(graph):
G = add_train_val_test_to_G(graph) #given OpenANE graph --> obtain graphSAGE graph
#G_json = json_graph.node_link_data(G) #train_data[0] in unsupervised_train.py
id_map = graph.look_up_dict
#conversion = lambda n : int(n) # compatible with networkx >2.0
#id_map = {conversion(k):int(v) for k,v in id_map.items()} # due to graphSAGE requirement
feats = np.array([G.nodes[id]['feature'] for id in id_map.keys()])
normalize = True #already declared in __init__.py
if normalize and feats is not None:
print("-------------row norm of node attributes/features------------------")
from sklearn.preprocessing import StandardScaler
train_inds = [id_map[n] for n in G.nodes() if not G.nodes[n]['val'] and not G.nodes[n]['test']]
train_feats = feats[train_inds]
scaler = StandardScaler()
scaler.fit(train_feats)
feats = scaler.transform(feats)
#feats1 = nx.get_node_attributes(G,'test')
#feats2 = nx.get_node_attributes(G,'val')
walks = run_random_walks(G, num_walks=50, walk_len=5) #use the default parameters in graphSAGE
class_map = 0 #TODO: use sklearn to binarize the class labels; not needed for unsupervised training
return G, feats, id_map, walks, class_map
def graphsage_unsupervised_train(graph, graphsage_model = 'graphsage_mean'):
train_data = tranform_data_for_graphsage(graph)
#from unsupervised_train.py
vectors = unsupervised_train.train(train_data, test_data=None, model = graphsage_model)
return vectors
'''
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.size))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,
' '.join([str(x) for x in vec])))
fout.close()
'''
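The context pairs that `run_random_walks` produces can be illustrated without networkx. The sketch below mirrors the walk logic on a plain adjacency dict; `random_walk_pairs` and `toy_graph` are hypothetical names used only for this illustration, and the small `num_walks`/`walk_len` values are chosen to keep the output short.

```python
import random


def random_walk_pairs(adj, num_walks=2, walk_len=3, seed=2018):
    """Collect (start, visited) pairs from short random walks.

    `adj` is a plain dict mapping node -> list of neighbors; it stands in
    for the networkx graph used above.
    """
    rng = random.Random(seed)
    pairs = []
    for node, neighbors in adj.items():
        if not neighbors:
            continue  # isolated node: no walk can start here
        for _ in range(num_walks):
            curr = node
            for _ in range(walk_len):
                if not adj[curr]:
                    break
                nxt = rng.choice(adj[curr])
                if curr != node:  # self co-occurrences are useless
                    pairs.append((node, curr))
                curr = nxt
    return pairs


toy_graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2], 4: []}
pairs = random_walk_pairs(toy_graph)
print(len(pairs), pairs[:5])
```

Each pair couples a walk's start node with a node visited later in the same walk; the isolated node `4` never appears in any pair.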

@ -0,0 +1,30 @@
import tensorflow as tf
import numpy as np
# DISCLAIMER:
# Parts of this code file are derived from
# https://github.com/tkipf/gcn
# which is under the same MIT license as GraphSAGE
def uniform(shape, scale=0.05, name=None):
"""Uniform init."""
initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
return tf.Variable(initial, name=name)
def glorot(shape, name=None):
"""Glorot & Bengio (AISTATS 2010) init."""
init_range = np.sqrt(6.0/(shape[0]+shape[1]))
initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
return tf.Variable(initial, name=name)
def zeros(shape, name=None):
"""All zeros."""
initial = tf.zeros(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)
def ones(shape, name=None):
"""All ones."""
initial = tf.ones(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)
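The range used by `glorot` above is sqrt(6 / (fan_in + fan_out)). As a quick check of that formula, here is a NumPy-only sketch; `glorot_uniform` is a hypothetical stand-in that returns a plain array rather than a TensorFlow variable.

```python
import numpy as np


def glorot_uniform(shape, rng):
    """Sample a weight matrix uniformly from [-limit, limit]."""
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape), limit


weights, limit = glorot_uniform((64, 128), np.random.default_rng(0))
print(weights.shape, float(limit))
```

For a 64 x 128 matrix the limit is sqrt(6 / 192), roughly 0.177, so wider layers start from smaller weights.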

@ -0,0 +1,116 @@
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from libnrl.graphsage.inits import zeros
flags = tf.app.flags
FLAGS = flags.FLAGS
# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was heavily inspired by the keras package
# global unique layer ID dictionary for layer name assignment
_LAYER_UIDS = {}
def get_layer_uid(layer_name=''):
"""Helper function, assigns unique layer IDs."""
if layer_name not in _LAYER_UIDS:
_LAYER_UIDS[layer_name] = 1
return 1
else:
_LAYER_UIDS[layer_name] += 1
return _LAYER_UIDS[layer_name]
class Layer(object):
"""Base layer class. Defines basic API for all layer objects.
Implementation inspired by keras (http://keras.io).
# Properties
name: String, defines the variable scope of the layer.
logging: Boolean, switches Tensorflow histogram logging on/off
# Methods
_call(inputs): Defines computation graph of layer
(i.e. takes input, returns output)
__call__(inputs): Wrapper for _call()
_log_vars(): Log all variables
"""
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging', 'model_size'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
layer = self.__class__.__name__.lower()
name = layer + '_' + str(get_layer_uid(layer))
self.name = name
self.vars = {}
logging = kwargs.get('logging', False)
self.logging = logging
self.sparse_inputs = False
def _call(self, inputs):
return inputs
def __call__(self, inputs):
with tf.name_scope(self.name):
if self.logging and not self.sparse_inputs:
tf.summary.histogram(self.name + '/inputs', inputs)
outputs = self._call(inputs)
if self.logging:
tf.summary.histogram(self.name + '/outputs', outputs)
return outputs
def _log_vars(self):
for var in self.vars:
tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])
class Dense(Layer):
"""Dense layer."""
def __init__(self, input_dim, output_dim, dropout=0.,
act=tf.nn.relu, placeholders=None, bias=True, featureless=False,
sparse_inputs=False, **kwargs):
super(Dense, self).__init__(**kwargs)
self.dropout = dropout
self.act = act
self.featureless = featureless
self.bias = bias
self.input_dim = input_dim
self.output_dim = output_dim
# helper variable for sparse dropout
self.sparse_inputs = sparse_inputs
if sparse_inputs:
self.num_features_nonzero = placeholders['num_features_nonzero']
with tf.variable_scope(self.name + '_vars'):
self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim),
dtype=tf.float32,
initializer=tf.contrib.layers.xavier_initializer(),
regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay))
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')
if self.logging:
self._log_vars()
def _call(self, inputs):
x = inputs
x = tf.nn.dropout(x, 1-self.dropout)
# transform
output = tf.matmul(x, self.vars['weights'])
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
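The layers above pass `1-self.dropout` to `tf.nn.dropout` because, in the TensorFlow 1.x API used here, the second argument is the keep probability rather than the drop rate. The NumPy sketch below illustrates that convention with inverted dropout; the `dropout` helper is hypothetical and written only for this example.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each entry with probability keep_prob and
    rescale the kept entries by 1 / keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
x = np.ones((2, 6))

dropout_rate = 0.5                      # the `dropout` argument of a layer
y = dropout(x, 1 - dropout_rate, rng)   # kept entries become 2.0, others 0.0
print(y)

no_drop = dropout(x, 1.0, rng)          # dropout=0. leaves the input unchanged
print(no_drop)
```

Rescaling the kept entries keeps the expected activation unchanged, which is why no extra scaling is needed at inference time.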

@ -0,0 +1,40 @@
import tensorflow as tf
# DISCLAIMER:
# Parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was heavily inspired by the keras package
def masked_logit_cross_entropy(preds, labels, mask):
"""Logit cross-entropy loss with masking."""
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels)
loss = tf.reduce_sum(loss, axis=1)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
loss *= mask
return tf.reduce_mean(loss)
def masked_softmax_cross_entropy(preds, labels, mask):
"""Softmax cross-entropy loss with masking."""
loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
loss *= mask
return tf.reduce_mean(loss)
def masked_l2(preds, actuals, mask):
"""L2 loss with masking."""
loss = tf.reduce_sum(tf.square(preds - actuals), axis=1)  # tf.nn.l2 does not exist; a per-example squared error is assumed here
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
loss *= mask
return tf.reduce_mean(loss)
def masked_accuracy(preds, labels, mask):
"""Accuracy with masking."""
correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
accuracy_all = tf.cast(correct_prediction, tf.float32)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
accuracy_all *= mask
return tf.reduce_mean(accuracy_all)
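Dividing the mask by its mean, as `masked_accuracy` does, makes the final `reduce_mean` equal to the accuracy over the masked-in rows only. A small NumPy sketch of that arithmetic follows; `masked_accuracy_np` and the toy arrays are hypothetical and exist only for this illustration.

```python
import numpy as np


def masked_accuracy_np(preds, labels, mask):
    """NumPy analogue of the masked accuracy above."""
    correct = (preds.argmax(axis=1) == labels.argmax(axis=1)).astype(np.float32)
    mask = mask.astype(np.float32)
    mask = mask / mask.mean()     # masked-in entries are up-weighted
    return float((correct * mask).mean())


preds = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
mask = np.array([1, 1, 0, 0])

# Rows 0 and 1 are masked in; only row 0 is predicted correctly,
# so the masked accuracy is 1/2 even though 3 of 4 rows are correct overall.
acc = masked_accuracy_np(preds, labels, mask)
print(acc)
```

Here the mask mean is 0.5, so each masked-in row gets weight 2 and the mean over all four rows is (2 + 0 + 0 + 0) / 4 = 0.5.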

@ -0,0 +1,320 @@
from __future__ import division
from __future__ import print_function
import numpy as np
np.random.seed(123)
class EdgeMinibatchIterator(object):
""" This minibatch iterator iterates over batches of sampled edges or
random pairs of co-occurring edges.
G -- networkx graph
id2idx -- dict mapping node ids to index in feature tensor
placeholders -- tensorflow placeholders object
context_pairs -- if not None, then a list of co-occurring node pairs (from random walks)
batch_size -- size of the minibatches
max_degree -- maximum size of the downsampled adjacency lists
n2v_retrain -- signals that the iterator is being used to add new embeddings to a n2v model
fixed_n2v -- signals that the iterator is being used to retrain n2v with only existing nodes as context
"""
def __init__(self, G, id2idx,
placeholders, context_pairs=None, batch_size=100, max_degree=25,
n2v_retrain=False, fixed_n2v=False,
**kwargs):
self.G = G
self.nodes = G.nodes()
self.id2idx = id2idx
self.placeholders = placeholders
self.batch_size = batch_size
self.max_degree = max_degree
self.batch_num = 0
self.nodes = np.random.permutation(G.nodes())
self.adj, self.deg = self.construct_adj()
self.test_adj = self.construct_test_adj()
if context_pairs is None:
edges = G.edges()
else:
edges = context_pairs
self.train_edges = self.edges = np.random.permutation(edges)
if not n2v_retrain:
self.train_edges = self._remove_isolated(self.train_edges)
self.val_edges = [e for e in G.edges() if G[e[0]][e[1]]['train_removed']]
else:
if fixed_n2v:
self.train_edges = self.val_edges = self._n2v_prune(self.edges)
else:
self.train_edges = self.val_edges = self.edges
print(len([n for n in G.nodes() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes')
print(len([n for n in G.nodes() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes')
self.val_set_size = len(self.val_edges)
def _n2v_prune(self, edges):
is_val = lambda n : self.G.node[n]["val"] or self.G.node[n]["test"]
return [e for e in edges if not is_val(e[1])]
def _remove_isolated(self, edge_list):
new_edge_list = []
missing = 0
for n1, n2 in edge_list:
if n1 not in self.G.node or n2 not in self.G.node:
missing += 1
continue
if (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \
and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \
and (not self.G.node[n2]['test'] or self.G.node[n2]['val']):
continue
else:
new_edge_list.append((n1,n2))
print("Unexpected missing:", missing)
return new_edge_list
def construct_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
deg = np.zeros((len(self.id2idx),))
for nodeid in self.G.nodes():
if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
continue
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)
if (not self.G[nodeid][neighbor]['train_removed'])])
deg[self.id2idx[nodeid]] = len(neighbors)
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj, deg
def construct_test_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
for nodeid in self.G.nodes():
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)])
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj
def end(self):
return self.batch_num * self.batch_size >= len(self.train_edges)
def batch_feed_dict(self, batch_edges):
batch1 = []
batch2 = []
for node1, node2 in batch_edges:
batch1.append(self.id2idx[node1])
batch2.append(self.id2idx[node2])
feed_dict = dict()
feed_dict.update({self.placeholders['batch_size'] : len(batch_edges)})
feed_dict.update({self.placeholders['batch1']: batch1})
feed_dict.update({self.placeholders['batch2']: batch2})
return feed_dict
def next_minibatch_feed_dict(self):
start_idx = self.batch_num * self.batch_size
self.batch_num += 1
end_idx = min(start_idx + self.batch_size, len(self.train_edges))
batch_edges = self.train_edges[start_idx : end_idx]
return self.batch_feed_dict(batch_edges)
def num_training_batches(self):
return len(self.train_edges) // self.batch_size + 1
def val_feed_dict(self, size=None):
edge_list = self.val_edges
if size is None:
return self.batch_feed_dict(edge_list)
else:
ind = np.random.permutation(len(edge_list))
val_edges = [edge_list[i] for i in ind[:min(size, len(ind))]]
return self.batch_feed_dict(val_edges)
def incremental_val_feed_dict(self, size, iter_num):
edge_list = self.val_edges
val_edges = edge_list[iter_num*size:min((iter_num+1)*size,
len(edge_list))]
return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(self.val_edges), val_edges
def incremental_embed_feed_dict(self, size, iter_num):
node_list = self.nodes
val_nodes = node_list[iter_num*size:min((iter_num+1)*size,
len(node_list))]
val_edges = [(n,n) for n in val_nodes]
return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(node_list), val_edges
def label_val(self):
train_edges = []
val_edges = []
for n1, n2 in self.G.edges():
if (self.G.node[n1]['val'] or self.G.node[n1]['test']
or self.G.node[n2]['val'] or self.G.node[n2]['test']):
val_edges.append((n1,n2))
else:
train_edges.append((n1,n2))
return train_edges, val_edges
def shuffle(self):
""" Re-shuffle the training set.
Also reset the batch number.
"""
self.train_edges = np.random.permutation(self.train_edges)
self.nodes = np.random.permutation(self.nodes)
self.batch_num = 0
class NodeMinibatchIterator(object):
"""
This minibatch iterator iterates over nodes for supervised learning.
G -- networkx graph
id2idx -- dict mapping node ids to integer values indexing feature tensor
placeholders -- standard tensorflow placeholders object for feeding
label_map -- map from node ids to class values (integer or list)
num_classes -- number of output classes
batch_size -- size of the minibatches
max_degree -- maximum size of the downsampled adjacency lists
"""
def __init__(self, G, id2idx,
placeholders, label_map, num_classes,
batch_size=100, max_degree=25,
**kwargs):
self.G = G
self.nodes = G.nodes()
self.id2idx = id2idx
self.placeholders = placeholders
self.batch_size = batch_size
self.max_degree = max_degree
self.batch_num = 0
self.label_map = label_map
self.num_classes = num_classes
self.adj, self.deg = self.construct_adj()
self.test_adj = self.construct_test_adj()
self.val_nodes = [n for n in self.G.nodes() if self.G.node[n]['val']]
self.test_nodes = [n for n in self.G.nodes() if self.G.node[n]['test']]
self.no_train_nodes_set = set(self.val_nodes + self.test_nodes)
self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set)
# don't train on nodes that only have edges to test set
self.train_nodes = [n for n in self.train_nodes if self.deg[id2idx[n]] > 0]
def _make_label_vec(self, node):
label = self.label_map[node]
if isinstance(label, list):
label_vec = np.array(label)
else:
label_vec = np.zeros((self.num_classes))
class_ind = self.label_map[node]
label_vec[class_ind] = 1
return label_vec
def construct_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
deg = np.zeros((len(self.id2idx),))
for nodeid in self.G.nodes():
if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
continue
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)
if (not self.G[nodeid][neighbor]['train_removed'])])
deg[self.id2idx[nodeid]] = len(neighbors)
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj, deg
def construct_test_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
for nodeid in self.G.nodes():
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)])
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj
def end(self):
return self.batch_num * self.batch_size >= len(self.train_nodes)
def batch_feed_dict(self, batch_nodes, val=False):
batch1id = batch_nodes
batch1 = [self.id2idx[n] for n in batch1id]
labels = np.vstack([self._make_label_vec(node) for node in batch1id])
feed_dict = dict()
feed_dict.update({self.placeholders['batch_size'] : len(batch1)})
feed_dict.update({self.placeholders['batch']: batch1})
feed_dict.update({self.placeholders['labels']: labels})
return feed_dict, labels
def node_val_feed_dict(self, size=None, test=False):
if test:
val_nodes = self.test_nodes
else:
val_nodes = self.val_nodes
if size is not None:
val_nodes = np.random.choice(val_nodes, size, replace=True)
# add a dummy neighbor
ret_val = self.batch_feed_dict(val_nodes)
return ret_val[0], ret_val[1]
def incremental_node_val_feed_dict(self, size, iter_num, test=False):
if test:
val_nodes = self.test_nodes
else:
val_nodes = self.val_nodes
val_node_subset = val_nodes[iter_num*size:min((iter_num+1)*size,
len(val_nodes))]
# add a dummy neighbor
ret_val = self.batch_feed_dict(val_node_subset)
return ret_val[0], ret_val[1], (iter_num+1)*size >= len(val_nodes), val_node_subset
def num_training_batches(self):
return len(self.train_nodes) // self.batch_size + 1
def next_minibatch_feed_dict(self):
start_idx = self.batch_num * self.batch_size
self.batch_num += 1
end_idx = min(start_idx + self.batch_size, len(self.train_nodes))
batch_nodes = self.train_nodes[start_idx : end_idx]
return self.batch_feed_dict(batch_nodes)
def incremental_embed_feed_dict(self, size, iter_num):
node_list = self.nodes
val_nodes = node_list[iter_num*size:min((iter_num+1)*size,
len(node_list))]
return self.batch_feed_dict(val_nodes), (iter_num+1)*size >= len(node_list), val_nodes
def shuffle(self):
""" Re-shuffle the training set.
Also reset the batch number.
"""
self.train_nodes = np.random.permutation(self.train_nodes)
self.batch_num = 0
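Both iterators above build a fixed-width neighbor table in `construct_adj`: every row has exactly `max_degree` entries, and rows without neighbors keep a padding index equal to the number of nodes. The NumPy sketch below shows the idea on a toy graph; `fixed_width_adj` and `toy` are hypothetical names, and the table uses an integer dtype for readability rather than the float array built above.

```python
import numpy as np


def fixed_width_adj(neighbors, max_degree, rng):
    """Build an [n + 1, max_degree] table of sampled neighbor indices.

    `neighbors` maps node index -> list of neighbor indices. Rows of nodes
    without neighbors, and the extra last row, keep the padding index n.
    """
    n = len(neighbors)
    adj = np.full((n + 1, max_degree), n, dtype=np.int64)
    for node, neigh in neighbors.items():
        if len(neigh) == 0:
            continue
        # Down-sample without replacement, up-sample with replacement.
        # (With exactly max_degree neighbors this sketch returns a permutation.)
        replace = len(neigh) < max_degree
        adj[node, :] = rng.choice(neigh, max_degree, replace=replace)
    return adj


toy = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0], 4: []}
adj = fixed_width_adj(toy, max_degree=2, rng=np.random.default_rng(0))
print(adj)
```

A fixed width lets the sampler gather neighbors with a single indexed lookup per hop, at the cost of repeating neighbors for low-degree nodes and dropping some for high-degree ones.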

@ -0,0 +1,504 @@
from collections import namedtuple
import tensorflow as tf
import math
import libnrl.graphsage.layers as layers
import libnrl.graphsage.metrics as metrics
from libnrl.graphsage.prediction import BipartiteEdgePredLayer
from libnrl.graphsage.aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator
from libnrl.graphsage.__init__ import * #import default parameters
'''
flags = tf.app.flags
FLAGS = FLAGS
'''
# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was heavily inspired by the keras package
class Model(object):
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging', 'model_size'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
name = self.__class__.__name__.lower()
self.name = name
logging = kwargs.get('logging', False)
self.logging = logging
self.vars = {}
self.placeholders = {}
self.layers = []
self.activations = []
self.inputs = None
self.outputs = None
self.loss = 0
self.accuracy = 0
self.optimizer = None
self.opt_op = None
def _build(self):
raise NotImplementedError
def build(self):
""" Wrapper for _build() """
with tf.variable_scope(self.name):
self._build()
# Build sequential layer model
self.activations.append(self.inputs)
for layer in self.layers:
hidden = layer(self.activations[-1])
self.activations.append(hidden)
self.outputs = self.activations[-1]
# Store model variables for easy access
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
# Build metrics
self._loss()
self._accuracy()
self.opt_op = self.optimizer.minimize(self.loss)
def predict(self):
pass
def _loss(self):
raise NotImplementedError
def _accuracy(self):
raise NotImplementedError
def save(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = saver.save(sess, "tmp/%s.ckpt" % self.name)
print("Model saved in file: %s" % save_path)
def load(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = "tmp/%s.ckpt" % self.name
saver.restore(sess, save_path)
print("Model restored from file: %s" % save_path)
class MLP(Model):
""" A standard multi-layer perceptron """
def __init__(self, placeholders, dims, categorical=True, **kwargs):
super(MLP, self).__init__(**kwargs)
self.dims = dims
self.input_dim = dims[0]
self.output_dim = dims[-1]
self.placeholders = placeholders
self.categorical = categorical
self.inputs = placeholders['features']
self.labels = placeholders['labels']
self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.build()
def _loss(self):
# Weight decay loss
for var in self.layers[0].vars.values():
self.loss += weight_decay * tf.nn.l2_loss(var)
# Cross entropy error
if self.categorical:
self.loss += metrics.masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
# L2
else:
diff = self.labels - self.outputs
self.loss += tf.reduce_sum(tf.sqrt(tf.reduce_sum(diff * diff, axis=1)))
def _accuracy(self):
if self.categorical:
self.accuracy = metrics.masked_accuracy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _build(self):
self.layers.append(layers.Dense(input_dim=self.input_dim,
output_dim=self.dims[1],
act=tf.nn.relu,
dropout=self.placeholders['dropout'],
sparse_inputs=False,
logging=self.logging))
self.layers.append(layers.Dense(input_dim=self.dims[1],
output_dim=self.output_dim,
act=lambda x: x,
dropout=self.placeholders['dropout'],
logging=self.logging))
def predict(self):
return tf.nn.softmax(self.outputs)
class GeneralizedModel(Model):
"""
Base class for models that aren't constructed from traditional, sequential layers.
Subclasses must set self.outputs in _build method
(Removes the layers idiom from build method of the Model class)
"""
def __init__(self, **kwargs):
super(GeneralizedModel, self).__init__(**kwargs)
def build(self):
""" Wrapper for _build() """
with tf.variable_scope(self.name):
self._build()
# Store model variables for easy access
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
# Build metrics
self._loss()
self._accuracy()
self.opt_op = self.optimizer.minimize(self.loss)
# SAGEInfo is a namedtuple that specifies the parameters
# of the recursive GraphSAGE layers
SAGEInfo = namedtuple("SAGEInfo",
['layer_name', # name of the layer (to get feature embedding etc.)
'neigh_sampler', # callable neigh_sampler constructor
'num_samples',
'output_dim' # the output (i.e., hidden) dimension
])
class SampleAndAggregate(GeneralizedModel):
"""
Base implementation of unsupervised GraphSAGE
"""
def __init__(self, placeholders, features, adj, degrees,
layer_infos, concat=True, aggregator_type="mean",
model_size="small", identity_dim=0,
**kwargs):
'''
Args:
- placeholders: Standard TensorFlow placeholder object.
- features: Numpy array with node features.
NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
- adj: Numpy array with adjacency lists (padded with random re-samples)
- degrees: Numpy array with node degrees.
- layer_infos: List of SAGEInfo namedtuples that describe the parameters of all
the recursive layers. See SAGEInfo definition above.
- concat: whether to concatenate during recursive iterations
- aggregator_type: how to aggregate neighbor information
- model_size: one of "small" and "big"
- identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
'''
super(SampleAndAggregate, self).__init__(**kwargs)
if aggregator_type == "mean":
self.aggregator_cls = MeanAggregator
elif aggregator_type == "seq":
self.aggregator_cls = SeqAggregator
elif aggregator_type == "maxpool":
self.aggregator_cls = MaxPoolingAggregator
elif aggregator_type == "meanpool":
self.aggregator_cls = MeanPoolingAggregator
elif aggregator_type == "gcn":
self.aggregator_cls = GCNAggregator
else:
raise Exception("Unknown aggregator: ", aggregator_type)
# get info from placeholders...
self.inputs1 = placeholders["batch1"]
self.inputs2 = placeholders["batch2"]
self.model_size = model_size
self.adj_info = adj
if identity_dim > 0:
self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
else:
self.embeds = None
if features is None:
if identity_dim == 0:
raise Exception("Must have a positive value for identity feature dimension if no input features given.")
self.features = self.embeds
else:
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
if not self.embeds is None:
self.features = tf.concat([self.embeds, self.features], axis=1)
self.degrees = degrees
self.concat = concat
self.dims = [(0 if features is None else features.shape[1]) + identity_dim]
self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
self.batch_size = placeholders["batch_size"]
self.placeholders = placeholders
self.layer_infos = layer_infos
self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
self.build()
def sample(self, inputs, layer_infos, batch_size=None):
""" Sample neighbors to be the supportive fields for multi-layer convolutions.
Args:
inputs: batch inputs
batch_size: the number of inputs (different for batch inputs and negative samples).
"""
if batch_size is None:
batch_size = self.batch_size
samples = [inputs]
# size of convolution support at each layer per node
support_size = 1
support_sizes = [support_size]
for k in range(len(layer_infos)):
t = len(layer_infos) - k - 1
support_size *= layer_infos[t].num_samples
sampler = layer_infos[t].neigh_sampler
node = sampler((samples[k], layer_infos[t].num_samples))
samples.append(tf.reshape(node, [support_size * batch_size,]))
support_sizes.append(support_size)
return samples, support_sizes
def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,
aggregators=None, name=None, concat=False, model_size="small"):
""" At each layer, aggregate hidden representations of neighbors to compute the hidden representations
at next layer.
Args:
samples: a list of samples of variable hops away for convolving at each layer of the
network. Length is the number of layers + 1. Each is a vector of node indices.
input_features: the input features for each sample of various hops away.
dims: a list of dimensions of the hidden representations from the input layer to the
final layer. Length is the number of layers + 1.
num_samples: list of number of samples for each layer.
support_sizes: the number of nodes to gather information from for each layer.
batch_size: the number of inputs (different for batch inputs and negative samples).
Returns:
The hidden representation at the final layer for all nodes in batch
"""
if batch_size is None:
batch_size = self.batch_size
# length: number of layers + 1
hidden = [tf.nn.embedding_lookup(input_features, node_samples) for node_samples in samples]
new_agg = aggregators is None
if new_agg:
aggregators = []
for layer in range(len(num_samples)):
if new_agg:
dim_mult = 2 if concat and (layer != 0) else 1
# aggregator at current layer
if layer == len(num_samples) - 1:
aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1], act=lambda x : x,
dropout=self.placeholders['dropout'],
name=name, concat=concat, model_size=model_size)
else:
aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1],
dropout=self.placeholders['dropout'],
name=name, concat=concat, model_size=model_size)
aggregators.append(aggregator)
else:
aggregator = aggregators[layer]
# hidden representation at current layer for all support nodes that are various hops away
next_hidden = []
# as layer increases, the number of support nodes needed decreases
for hop in range(len(num_samples) - layer):
dim_mult = 2 if concat and (layer != 0) else 1
neigh_dims = [batch_size * support_sizes[hop],
num_samples[len(num_samples) - hop - 1],
dim_mult*dims[layer]]
h = aggregator((hidden[hop],
tf.reshape(hidden[hop + 1], neigh_dims)))
next_hidden.append(h)
hidden = next_hidden
return hidden[0], aggregators
def _build(self):
labels = tf.reshape(
tf.cast(self.placeholders['batch2'], dtype=tf.int64),
[self.batch_size, 1])
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
true_classes=labels,
num_true=1,
num_sampled=neg_sample_size,
unique=False,
range_max=len(self.degrees),
distortion=0.75,
unigrams=self.degrees.tolist()))
# perform "convolution"
samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos)
samples2, support_sizes2 = self.sample(self.inputs2, self.layer_infos)
num_samples = [layer_info.num_samples for layer_info in self.layer_infos]
self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples,
support_sizes1, concat=self.concat, model_size=self.model_size)
self.outputs2, _ = self.aggregate(samples2, [self.features], self.dims, num_samples,
support_sizes2, aggregators=self.aggregators, concat=self.concat,
model_size=self.model_size)
neg_samples, neg_support_sizes = self.sample(self.neg_samples, self.layer_infos,
neg_sample_size)
self.neg_outputs, _ = self.aggregate(neg_samples, [self.features], self.dims, num_samples,
neg_support_sizes, batch_size=neg_sample_size, aggregators=self.aggregators,
concat=self.concat, model_size=self.model_size)
dim_mult = 2 if self.concat else 1
self.link_pred_layer = BipartiteEdgePredLayer(dim_mult*self.dims[-1],
dim_mult*self.dims[-1], self.placeholders, act=tf.nn.sigmoid,
bilinear_weights=False,
name='edge_predict')
self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1)
self.outputs2 = tf.nn.l2_normalize(self.outputs2, 1)
self.neg_outputs = tf.nn.l2_normalize(self.neg_outputs, 1)
def build(self):
self._build()
# TF graph management
self._loss()
self._accuracy()
self.loss = self.loss / tf.cast(self.batch_size, tf.float32)
grads_and_vars = self.optimizer.compute_gradients(self.loss)
clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var)
for grad, var in grads_and_vars]
self.grad, _ = clipped_grads_and_vars[0]
self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars)
def _loss(self):
for aggregator in self.aggregators:
for var in aggregator.vars.values():
self.loss += weight_decay * tf.nn.l2_loss(var)
self.loss += self.link_pred_layer.loss(self.outputs1, self.outputs2, self.neg_outputs)
tf.summary.scalar('loss', self.loss)
def _accuracy(self):
# shape: [batch_size]
aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
# shape : [batch_size x num_neg_samples]
self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, neg_sample_size])
_aff = tf.expand_dims(aff, axis=1)
self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
size = tf.shape(self.aff_all)[1]
_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
tf.summary.scalar('mrr', self.mrr)
class Node2VecModel(GeneralizedModel):
def __init__(self, placeholders, dict_size, degrees, name=None,
nodevec_dim=50, lr=0.001, **kwargs):
""" Simple version of Node2Vec/DeepWalk algorithm.
Args:
dict_size: the total number of nodes.
degrees: numpy array of node degrees, ordered as in the data's id_map
nodevec_dim: dimension of the vector representation of node.
lr: learning rate of optimizer.
"""
super(Node2VecModel, self).__init__(**kwargs)
self.placeholders = placeholders
self.degrees = degrees
self.inputs1 = placeholders["batch1"]
self.inputs2 = placeholders["batch2"]
self.batch_size = placeholders['batch_size']
self.hidden_dim = nodevec_dim
# following the tensorflow word2vec tutorial
self.target_embeds = tf.Variable(
tf.random_uniform([dict_size, nodevec_dim], -1, 1),
name="target_embeds")
self.context_embeds = tf.Variable(
tf.truncated_normal([dict_size, nodevec_dim],
stddev=1.0 / math.sqrt(nodevec_dim)),
name="context_embeds")
self.context_bias = tf.Variable(
tf.zeros([dict_size]),
name="context_bias")
self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
self.build()
def _build(self):
labels = tf.reshape(
tf.cast(self.placeholders['batch2'], dtype=tf.int64),
[self.batch_size, 1])
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
true_classes=labels,
num_true=1,
num_sampled=neg_sample_size,
unique=True,
range_max=len(self.degrees),
distortion=0.75,
unigrams=self.degrees.tolist()))
self.outputs1 = tf.nn.embedding_lookup(self.target_embeds, self.inputs1)
self.outputs2 = tf.nn.embedding_lookup(self.context_embeds, self.inputs2)
self.outputs2_bias = tf.nn.embedding_lookup(self.context_bias, self.inputs2)
self.neg_outputs = tf.nn.embedding_lookup(self.context_embeds, self.neg_samples)
self.neg_outputs_bias = tf.nn.embedding_lookup(self.context_bias, self.neg_samples)
self.link_pred_layer = BipartiteEdgePredLayer(self.hidden_dim, self.hidden_dim,
self.placeholders, bilinear_weights=False)
def build(self):
self._build()
# TF graph management
self._loss()
self._minimize()
self._accuracy()
def _minimize(self):
self.opt_op = self.optimizer.minimize(self.loss)
def _loss(self):
aff = tf.reduce_sum(tf.multiply(self.outputs1, self.outputs2), 1) + self.outputs2_bias
neg_aff = tf.matmul(self.outputs1, tf.transpose(self.neg_outputs)) + self.neg_outputs_bias
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.ones_like(aff), logits=aff)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.zeros_like(neg_aff), logits=neg_aff)
loss = tf.reduce_sum(true_xent) + tf.reduce_sum(negative_xent)
self.loss = loss / tf.cast(self.batch_size, tf.float32)
tf.summary.scalar('loss', self.loss)
def _accuracy(self):
# shape: [batch_size]
aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
# shape : [batch_size x num_neg_samples]
self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, neg_sample_size])
_aff = tf.expand_dims(aff, axis=1)
self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
size = tf.shape(self.aff_all)[1]
_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
tf.summary.scalar('mrr', self.mrr)
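Both `_accuracy` methods above compute a mean reciprocal rank (MRR) with a double `top_k` call: the first call sorts each row of affinities in descending order, and the second inverts that permutation to get the rank of every column. The true pair sits in the last column, so `ranks[:, -1]` is its zero-based rank. A minimal NumPy sketch of the same idea, assuming no ties among the scores (`mean_reciprocal_rank` is a hypothetical helper, not part of the repository):

```python
import numpy as np

def mean_reciprocal_rank(pos_aff, neg_aff):
    """MRR of the positive score against its negatives.

    pos_aff: shape [batch_size], affinity of each true pair.
    neg_aff: shape [batch_size, num_neg_samples], affinities to negatives.
    """
    pos_aff = np.asarray(pos_aff, dtype=float)
    neg_aff = np.asarray(neg_aff, dtype=float)
    # Put the positive score in the last column, as the models do.
    aff_all = np.concatenate([neg_aff, pos_aff[:, None]], axis=1)
    # Column indices ordered by descending score ...
    order = np.argsort(-aff_all, axis=1)
    # ... and the inverse permutation, which is the zero-based rank of each column.
    ranks = np.argsort(order, axis=1)
    return float(np.mean(1.0 / (ranks[:, -1] + 1)))
```

An MRR of 1.0 means the true pair outscored every negative sample in every row.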

@ -0,0 +1,29 @@
from __future__ import division
from __future__ import print_function
from libnrl.graphsage.layers import Layer
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
"""
Classes that are used to sample node neighborhoods
"""
class UniformNeighborSampler(Layer):
"""
Uniformly samples neighbors.
Assumes that adj lists are padded with random re-sampling
"""
def __init__(self, adj_info, **kwargs):
super(UniformNeighborSampler, self).__init__(**kwargs)
self.adj_info = adj_info
def _call(self, inputs):
ids, num_samples = inputs
adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
adj_lists = tf.slice(adj_lists, [0,0], [-1, num_samples])
return adj_lists
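The sampler above looks up the padded adjacency rows for a batch, shuffles the columns, and keeps the first `num_samples` of them. Because `tf.random_shuffle` permutes along the first dimension of the transposed matrix, every row in the batch shares the same column permutation. A minimal NumPy sketch of that behaviour, with `sample_neighbors` as a hypothetical stand-in for `_call`:

```python
import numpy as np

def sample_neighbors(adj, ids, num_samples, rng=None):
    """Pick `num_samples` neighbors per node from a padded adjacency matrix.

    adj: int array of shape [num_nodes, max_degree], padded by re-sampling.
    ids: int array of node indices in the batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    adj_lists = adj[ids]                        # [len(ids), max_degree]
    # One column permutation shared by all rows, mirroring the
    # transpose / shuffle / transpose sequence in the TensorFlow code.
    perm = rng.permutation(adj_lists.shape[1])
    return adj_lists[:, perm[:num_samples]]     # [len(ids), num_samples]
```

This assumes `num_samples` does not exceed `max_degree`; a larger value simply returns all `max_degree` columns.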

@ -0,0 +1,128 @@
from __future__ import division
from __future__ import print_function
from libnrl.graphsage.inits import zeros
from libnrl.graphsage.layers import Layer
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
class BipartiteEdgePredLayer(Layer):
def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid,
loss_fn='xent', neg_sample_weights=1.0,
bias=False, bilinear_weights=False, **kwargs):
"""
Basic class that applies skip-gram-like loss
(i.e., dot product of node+target and node and negative samples)
Args:
bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set to
false, it is assumed that input dimensions are the same and the affinity will be
based on dot product.
"""
super(BipartiteEdgePredLayer, self).__init__(**kwargs)
self.input_dim1 = input_dim1
self.input_dim2 = input_dim2
self.act = act
self.bias = bias
self.eps = 1e-7
# Margin for hinge loss
self.margin = 0.1
self.neg_sample_weights = neg_sample_weights
self.bilinear_weights = bilinear_weights
if dropout:
self.dropout = placeholders['dropout']
else:
self.dropout = 0.
# output a likelihood term
self.output_dim = 1
with tf.variable_scope(self.name + '_vars'):
# bilinear form
if bilinear_weights:
#self.vars['weights'] = glorot([input_dim1, input_dim2],
# name='pred_weights')
self.vars['weights'] = tf.get_variable(
'pred_weights',
shape=(input_dim1, input_dim2),
dtype=tf.float32,
initializer=tf.contrib.layers.xavier_initializer())
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if loss_fn == 'xent':
self.loss_fn = self._xent_loss
elif loss_fn == 'skipgram':
self.loss_fn = self._skipgram_loss
elif loss_fn == 'hinge':
self.loss_fn = self._hinge_loss
if self.logging:
self._log_vars()
def affinity(self, inputs1, inputs2):
""" Affinity score between batch of inputs1 and inputs2.
Args:
inputs1: tensor of shape [batch_size x feature_size].
"""
# shape: [batch_size, input_dim1]
if self.bilinear_weights:
prod = tf.matmul(inputs2, tf.transpose(self.vars['weights']))
self.prod = prod
result = tf.reduce_sum(inputs1 * prod, axis=1)
else:
result = tf.reduce_sum(inputs1 * inputs2, axis=1)
return result
def neg_cost(self, inputs1, neg_samples, hard_neg_samples=None):
""" For each input in batch, compute its affinity to each negative sample.
Returns:
Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities to
negative samples is computed.
"""
if self.bilinear_weights:
inputs1 = tf.matmul(inputs1, self.vars['weights'])
neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples))
return neg_aff
def loss(self, inputs1, inputs2, neg_samples):
""" negative sampling loss.
Args:
neg_samples: tensor of shape [num_neg_samples x input_dim2]. Negative samples for all
inputs in batch inputs1.
"""
return self.loss_fn(inputs1, inputs2, neg_samples)
def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
aff = self.affinity(inputs1, inputs2)
neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.ones_like(aff), logits=aff)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.zeros_like(neg_aff), logits=neg_aff)
loss = tf.reduce_sum(true_xent) + self.neg_sample_weights * tf.reduce_sum(negative_xent)
return loss
def _skipgram_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
aff = self.affinity(inputs1, inputs2)
neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
neg_cost = tf.log(tf.reduce_sum(tf.exp(neg_aff), axis=1))
loss = tf.reduce_sum(aff - neg_cost)
return loss
def _hinge_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
aff = self.affinity(inputs1, inputs2)
neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
diff = tf.nn.relu(tf.subtract(neg_aff, tf.expand_dims(aff, 1) - self.margin), name='diff')
loss = tf.reduce_sum(diff)
self.neg_shape = tf.shape(neg_aff)
return loss
def weights_norm(self):
return tf.norm(self.vars['weights'])
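The default `'xent'` loss above treats each true pair as a positive example and each (input, negative sample) pair as a negative example of a sigmoid classifier on the dot-product affinity. A minimal NumPy sketch of that computation for the non-bilinear case follows; the function names are hypothetical, and the stable cross-entropy formula is the standard one for logits rather than anything taken from this repository.

```python
import numpy as np

def sigmoid_xent_with_logits(logits, labels):
    # Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def neg_sampling_xent_loss(inputs1, inputs2, neg_samples, neg_sample_weights=1.0):
    """Summed cross-entropy over true pairs and negative samples.

    inputs1, inputs2: [batch_size, dim] embeddings of the paired nodes.
    neg_samples: [num_neg_samples, dim] embeddings shared by the whole batch.
    """
    aff = np.sum(inputs1 * inputs2, axis=1)    # [batch_size]
    neg_aff = inputs1 @ neg_samples.T          # [batch_size, num_neg_samples]
    true_xent = sigmoid_xent_with_logits(aff, np.ones_like(aff))
    negative_xent = sigmoid_xent_with_logits(neg_aff, np.zeros_like(neg_aff))
    return float(true_xent.sum() + neg_sample_weights * negative_xent.sum())
```

With all-zero embeddings every logit is 0, so each of the `batch_size * (1 + num_neg_samples)` terms contributes `log(2)`.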

@ -0,0 +1,293 @@
from __future__ import division
from __future__ import print_function
import os
import time
import tensorflow as tf
import numpy as np
from libnrl.graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from libnrl.graphsage.minibatch import EdgeMinibatchIterator
from libnrl.graphsage.neigh_samplers import UniformNeighborSampler
#from libnrl.graphsage.utils import load_data
from libnrl.graphsage.__init__ import * #import default parameters
# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
t_test = time.time()
feed_dict_val = minibatch_iter.val_feed_dict(size)
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)
return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)
'''
def incremental_evaluate(sess, model, minibatch_iter, size):
t_test = time.time()
finished = False
val_losses = []
val_mrrs = []
iter_num = 0
while not finished:
feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)
val_losses.append(outs_val[0])
val_mrrs.append(outs_val[2])
return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)
'''
def save_val_embeddings(sess, model, minibatch_iter, size, mod=""):
val_embeddings = []
finished = False
seen = set() # node ids whose embeddings have already been stored
nodes = []
iter_num = 0
name = "val"
while not finished:
feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.mrr, model.outputs1],
feed_dict=feed_dict_val)
#ONLY SAVE FOR embeds1 because of planetoid
for i, edge in enumerate(edges):
if not edge[0] in seen:
val_embeddings.append(outs_val[-1][i,:])
nodes.append(edge[0]) #nodes: a list; has order
seen.add(edge[0]) #seen: a set; NO order!!!
#if not os.path.exists(out_dir):
# os.makedirs(out_dir)
val_embeddings = np.vstack(val_embeddings)
print(val_embeddings.shape)
vectors = {}
for i, embedding in enumerate(val_embeddings):
vectors[nodes[i]] = embedding #warning: seen: a set; nodes: a list
return vectors
''' #if we want to save embs, modify the following code
np.save(out_dir + name + mod + ".npy", val_embeddings)
with open(out_dir + name + mod + ".txt", "w") as fp:
fp.write("\n".join(map(str,nodes)))
'''
def construct_placeholders():
# Define placeholders
placeholders = {
'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
'batch2' : tf.placeholder(tf.int32, shape=(None), name='batch2'),
# negative samples for all nodes in the batch
'neg_samples': tf.placeholder(tf.int32, shape=(None,),
name='neg_sample_size'),
'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
}
return placeholders
def train(train_data, test_data=None, model='graphsage_mean'):
print('---------- the graphsage model we used: ', model)
print('---------- parameters we used: epochs, dim_1+dim_2, samples_1, samples_2, dropout, weight_decay, learning_rate, batch_size, normalize',
epochs, dim_1+dim_2, samples_1, samples_2, dropout, weight_decay, learning_rate, batch_size, normalize)
G = train_data[0]
features = train_data[1] #note: features are in order of graph.look_up_list, since id_map = {k: v for v, k in enumerate(graph.look_back_list)}
id_map = train_data[2]
if not features is None:
# pad with dummy zero vector
features = np.vstack([features, np.zeros((features.shape[1],))])
random_context = False
context_pairs = train_data[3] if random_context else None
placeholders = construct_placeholders()
minibatch = EdgeMinibatchIterator(G,
id_map,
placeholders, batch_size=batch_size,
max_degree=max_degree,
num_neg_samples=neg_sample_size,
context_pairs = context_pairs)
adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")
if model == 'graphsage_mean':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, samples_1, dim_1),
SAGEInfo("node", sampler, samples_2, dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
model_size=model_size,
identity_dim = identity_dim,
logging=True)
elif model == 'gcn':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, samples_1, 2*dim_1),
SAGEInfo("node", sampler, samples_2, 2*dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="gcn",
model_size=model_size,
identity_dim = identity_dim,
concat=False,
logging=True)
elif model == 'graphsage_seq': #LSTM as stated in paper? very slow anyway...
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, samples_1, dim_1),
SAGEInfo("node", sampler, samples_2, dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
identity_dim = identity_dim,
aggregator_type="seq",
model_size=model_size,
logging=True)
elif model == 'graphsage_maxpool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, samples_1, dim_1),
SAGEInfo("node", sampler, samples_2, dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="maxpool",
model_size=model_size,
identity_dim = identity_dim,
logging=True)
elif model == 'graphsage_meanpool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, samples_1, dim_1),
SAGEInfo("node", sampler, samples_2, dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="meanpool",
model_size=model_size,
identity_dim = identity_dim,
logging=True)
elif model == 'n2v':
model = Node2VecModel(placeholders, features.shape[0],
minibatch.deg,
#2x because graphsage uses concat
nodevec_dim=2*dim_1,
lr=learning_rate)
else:
raise Exception('Error: model name unrecognized.')
config = tf.ConfigProto(log_device_placement=log_device_placement)
config.gpu_options.allow_growth = True
#config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
config.allow_soft_placement = True
# Initialize session
sess = tf.Session(config=config)
merged = tf.summary.merge_all()
#summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
# Init variables
sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})
# Train model
train_shadow_mrr = None
shadow_mrr = None
total_steps = 0
avg_time = 0.0
epoch_val_costs = []
train_adj_info = tf.assign(adj_info, minibatch.adj)
val_adj_info = tf.assign(adj_info, minibatch.test_adj)
for epoch in range(epochs):
minibatch.shuffle()
iter = 0
epoch_val_costs.append(0)
train_cost = 0
train_mrr = 0
train_shadow_mrr = 0
val_cost = 0
val_mrr = 0
shadow_mrr = 0
avg_time = 0
while not minibatch.end():
# Construct feed dictionary
feed_dict = minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: dropout})
t = time.time()
# Training step
outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all,
model.mrr, model.outputs1], feed_dict=feed_dict)
train_cost = outs[2]
train_mrr = outs[5]
if train_shadow_mrr is None:
train_shadow_mrr = train_mrr#
else:
train_shadow_mrr -= (1-0.99) * (train_shadow_mrr - train_mrr)
if iter % validate_iter == 0:
# Validation
sess.run(val_adj_info.op)
val_cost, ranks, val_mrr, duration = evaluate(sess, model, minibatch, size=validate_batch_size)
sess.run(train_adj_info.op)
epoch_val_costs[-1] += val_cost
if shadow_mrr is None:
shadow_mrr = val_mrr
else:
shadow_mrr -= (1-0.99) * (shadow_mrr - val_mrr)
#if total_steps % print_every == 0:
#summary_writer.add_summary(outs[0], total_steps)
# Print results
avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)
iter += 1
total_steps += 1
if total_steps > max_total_steps:
break
epoch += 1
print("Epoch:", '%04d' % epoch,
"train_loss=", "{:.5f}".format(train_cost),
"train_mrr=", "{:.5f}".format(train_mrr),
"train_mrr_ema=", "{:.5f}".format(train_shadow_mrr), # exponential moving average
"val_loss=", "{:.5f}".format(val_cost),
"val_mrr=", "{:.5f}".format(val_mrr),
"val_mrr_ema=", "{:.5f}".format(shadow_mrr), # exponential moving average
"time=", "{:.5f}".format(avg_time))
if total_steps > max_total_steps:
break
print("Optimization Finished!")
sess.run(val_adj_info.op)
#save_val_embeddings(sess, model, minibatch, validate_batch_size, log_dir())
return save_val_embeddings(sess, model, minibatch, validate_batch_size) #return embs
def graphsage_save_embeddings(self, filename): #to do...
pass
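The training loop above smooths the training and validation MRR with an exponential moving average, written as `shadow -= (1-0.99) * (shadow - value)`. That update is equivalent to `shadow = 0.99 * shadow + 0.01 * value`. A minimal sketch of the rule as a standalone function (`ema_update` is a hypothetical name; the default decay of 0.99 is the constant used in the loop):

```python
def ema_update(shadow, value, decay=0.99):
    """Return the updated moving average.

    A `shadow` of None means no value has been seen yet, so the first
    observation is used directly.
    """
    if shadow is None:
        return value
    # Same form as the training loop: move the average a fraction
    # (1 - decay) of the way toward the new value.
    return shadow - (1.0 - decay) * (shadow - value)
```

For example, starting from `None` and feeding the values 0.5 and then 1.0 gives 0.5 followed by approximately 0.505.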

@ -0,0 +1,117 @@
from __future__ import print_function
#-----------
#compatible with networkx >2.0 in line 18 and 32 by Chengbin
#compatible with latest random.choice in line 94 by Chengbin
#--------------
import numpy as np
import random
import json
import sys
import os
import networkx as nx
from networkx.readwrite import json_graph
version_info = list(map(int, nx.__version__.split('.')))
major = version_info[0]
minor = version_info[1]
#assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"
WALK_LEN=5
N_WALKS=50
def load_data(prefix, normalize=True, load_walks=False):
G_data = json.load(open(prefix + "-G.json"))
G = json_graph.node_link_graph(G_data)
'''
if isinstance(G.nodes()[0], int):
conversion = lambda n : int(n)
else:
conversion = lambda n : n
'''
conversion = lambda n : int(n) # compatible with networkx >2.0
if os.path.exists(prefix + "-feats.npy"):
feats = np.load(prefix + "-feats.npy")
else:
print("No features present.. Only identity features will be used.")
feats = None
id_map = json.load(open(prefix + "-id_map.json"))
id_map = {conversion(k):int(v) for k,v in id_map.items()}
walks = []
class_map = json.load(open(prefix + "-class_map.json"))
if isinstance(list(class_map.values())[0], list):
lab_conversion = lambda n : n
else:
lab_conversion = lambda n : int(n)
class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()}
## Remove all nodes that do not have val/test annotations
## (necessary because of networkx weirdness with the Reddit data)
broken_count = 0
for node in list(G.nodes()): # copy the node list: nodes are removed inside the loop
if not 'val' in G.nodes[node] or not 'test' in G.nodes[node]:
G.remove_node(node)
broken_count += 1
print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))
## Make sure the graph has edge train_removed annotations
## (some datasets might already have this..)
print("Loaded data.. now preprocessing..")
for edge in G.edges():
if (G.nodes[edge[0]]['val'] or G.nodes[edge[1]]['val'] or
G.nodes[edge[0]]['test'] or G.nodes[edge[1]]['test']):
G[edge[0]][edge[1]]['train_removed'] = True
else:
G[edge[0]][edge[1]]['train_removed'] = False
if normalize and not feats is None:
from sklearn.preprocessing import StandardScaler
train_ids = np.array([id_map[n] for n in G.nodes() if not G.nodes[n]['val'] and not G.nodes[n]['test']])
train_feats = feats[train_ids]
scaler = StandardScaler()
scaler.fit(train_feats)
feats = scaler.transform(feats)
if load_walks:
with open(prefix + "-walks.txt") as fp:
for line in fp:
walks.append(list(map(conversion, line.split()))) # list(): map() is lazy in Python 3
return G, feats, id_map, walks, class_map
def run_random_walks(G, nodes, num_walks=N_WALKS):
pairs = []
for count, node in enumerate(nodes):
if G.degree(node) == 0:
continue
for i in range(num_walks):
curr_node = node
for j in range(WALK_LEN):
next_node = random.choice(list(G.neighbors(curr_node))) #changed due to compatibility
#next_node = random.choice(G.neighbors(curr_node))
# self co-occurrences are useless
if curr_node != node:
pairs.append((node,curr_node))
curr_node = next_node
if count % 1000 == 0:
print("Done walks for", count, "nodes")
return pairs
if __name__ == "__main__": # TODO: this part needs rewriting; the walks could be regenerated on every run
""" Run random walks """
graph_file = sys.argv[1]
out_file = sys.argv[2]
G_data = json.load(open(graph_file))
G = json_graph.node_link_graph(G_data)
nodes = [n for n in G.nodes() if not G.nodes[n]["val"] and not G.nodes[n]["test"]]
G = G.subgraph(nodes)
pairs = run_random_walks(G, nodes)
with open(out_file, "w") as fp:
fp.write("\n".join([str(p[0]) + "\t" + str(p[1]) for p in pairs]))
#go to this file dir and run the following line in CMD
#python utils.py ../example_data/toy-ppi-G.json ../example_data/toy-ppi-walks.txt
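# Illustrative sketch, not part of the original file: the co-occurrence pair
# collection performed by run_random_walks above, reproduced on a toy graph.
# The adjacency dict `toy`, the function name `walk_pairs` and the `seed`
# argument are assumptions made for this example; only the standard library is used.

```python
import random


def walk_pairs(adj, nodes, num_walks=2, walk_len=3, seed=0):
    """Collect (start_node, visited_node) pairs from short uniform random walks."""
    rng = random.Random(seed)
    pairs = []
    for node in nodes:
        if not adj[node]:            # skip isolated nodes, as the original does
            continue
        for _ in range(num_walks):
            curr = node
            for _ in range(walk_len):
                nxt = rng.choice(adj[curr])
                if curr != node:     # self co-occurrences are skipped
                    pairs.append((node, curr))
                curr = nxt
    return pairs


# Toy undirected path 0 - 1 - 2 plus an isolated node 3.
toy = {0: [1], 1: [0, 2], 2: [1], 3: []}
pairs = walk_pairs(toy, list(toy))
```

# Each pair could then be written as "start<TAB>visited", one per line, which is
# the format the __main__ block above produces.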

68
src/libnrl/grarep.py Normal file

@ -0,0 +1,68 @@
import math
import numpy as np
from numpy import linalg as la
from sklearn.preprocessing import normalize
class GraRep(object):
def __init__(self, graph, Kstep, dim):
self.g = graph
self.Kstep = Kstep
assert dim%Kstep == 0
self.dim = int(dim/Kstep)
self.train()
def getAdjMat(self):
graph = self.g.G
node_size = self.g.node_size
look_up = self.g.look_up_dict
adj = np.zeros((node_size, node_size))
for edge in self.g.G.edges():
adj[look_up[edge[0]]][look_up[edge[1]]] = 1.0
adj[look_up[edge[1]]][look_up[edge[0]]] = 1.0
# ScaleSimMat
return np.matrix(adj/np.sum(adj, axis=1, keepdims=True)) # keepdims=True so each row is divided by its own sum
def GetProbTranMat(self, Ak):
probTranMat = np.log(Ak/np.tile(
np.sum(Ak, axis=0), (self.node_size, 1))) \
- np.log(1.0/self.node_size)
probTranMat[probTranMat < 0] = 0
probTranMat[np.isnan(probTranMat)] = 0 # NaN never compares equal to itself, so use np.isnan
return probTranMat
def GetRepUseSVD(self, probTranMat, alpha):
U, S, VT = la.svd(probTranMat)
Ud = U[:, 0:self.dim]
Sd = S[0:self.dim]
return np.array(Ud)*np.power(Sd, alpha).reshape((self.dim))
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.Kstep*self.dim))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,' '.join([str(x) for x in vec])))
fout.close()
def train(self):
self.adj = self.getAdjMat()
self.node_size = self.adj.shape[0]
self.Ak = np.matrix(np.identity(self.node_size))
self.RepMat = np.zeros((self.node_size, int(self.dim*self.Kstep)))
for i in range(self.Kstep):
print('Kstep =', i)
self.Ak = np.dot(self.Ak, self.adj)
probTranMat = self.GetProbTranMat(self.Ak)
Rk = self.GetRepUseSVD(probTranMat, 0.5)
Rk = normalize(Rk, axis=1, norm='l2')
self.RepMat[:, self.dim*i:self.dim*(i+1)] = Rk[:, :]
# get embeddings
self.vectors = {}
look_back = self.g.look_back_list
for i, embedding in enumerate(self.RepMat):
self.vectors[look_back[i]] = embedding
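# Illustrative sketch, not part of the original file: the clipped log-probability
# matrix that GetProbTranMat computes, written in plain Python for a 3-node example.
# The function name `positive_log_prob` and the explicit zero checks are assumptions
# of this sketch; the class above relies on NumPy and masks negative/NaN entries instead.

```python
import math


def positive_log_prob(Ak):
    """Return max(log(Ak[i][j] / colsum[j]) - log(1/n), 0) for every entry."""
    n = len(Ak)
    col_sums = [sum(Ak[i][j] for i in range(n)) for j in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Zero entries would give log(0); they are left at 0.0 here.
            if Ak[i][j] > 0 and col_sums[j] > 0:
                val = math.log(Ak[i][j] / col_sums[j]) - math.log(1.0 / n)
                out[i][j] = max(val, 0.0)
    return out


# Row-normalized adjacency of the path 0 - 1 - 2 (the 1-step transition matrix).
A = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
X = positive_log_prob(A)
nan = float("nan")
nan_equals_itself = (nan == nan)   # False, which is why an "== nan" mask matches nothing
```

# In the class above, each such matrix is then factorized with SVD and the
# per-step representations are concatenated.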

259
src/libnrl/line.py Normal file

@ -0,0 +1,259 @@
from __future__ import print_function
import random
import math
import numpy as np
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from .classify import ncClassifier, lpClassifier, read_node_label, read_edge_label
class _LINE(object):
def __init__(self, graph, rep_size=128, batch_size=1000, negative_ratio=5, order=3):
self.cur_epoch = 0
self.order = order
self.g = graph
self.node_size = graph.G.number_of_nodes()
self.rep_size = rep_size
self.batch_size = batch_size
self.negative_ratio = negative_ratio
self.gen_sampling_table()
self.sess = tf.Session()
cur_seed = random.getrandbits(32)
initializer = tf.contrib.layers.xavier_initializer(uniform=False, seed=cur_seed)
with tf.variable_scope("model", reuse=None, initializer=initializer):
self.build_graph()
self.sess.run(tf.global_variables_initializer())
def build_graph(self):
self.h = tf.placeholder(tf.int32, [None])
self.t = tf.placeholder(tf.int32, [None])
self.sign = tf.placeholder(tf.float32, [None])
cur_seed = random.getrandbits(32)
self.embeddings = tf.get_variable(name="embeddings"+str(self.order), shape=[self.node_size, self.rep_size], initializer = tf.contrib.layers.xavier_initializer(uniform = False, seed=cur_seed))
self.context_embeddings = tf.get_variable(name="context_embeddings"+str(self.order), shape=[self.node_size, self.rep_size], initializer = tf.contrib.layers.xavier_initializer(uniform = False, seed=cur_seed))
# self.h_e = tf.nn.l2_normalize(tf.nn.embedding_lookup(self.embeddings, self.h), 1)
# self.t_e = tf.nn.l2_normalize(tf.nn.embedding_lookup(self.embeddings, self.t), 1)
# self.t_e_context = tf.nn.l2_normalize(tf.nn.embedding_lookup(self.context_embeddings, self.t), 1)
self.h_e = tf.nn.embedding_lookup(self.embeddings, self.h)
self.t_e = tf.nn.embedding_lookup(self.embeddings, self.t)
self.t_e_context = tf.nn.embedding_lookup(self.context_embeddings, self.t)
self.second_loss = -tf.reduce_mean(tf.log_sigmoid(self.sign*tf.reduce_sum(tf.multiply(self.h_e, self.t_e_context), axis=1)))
self.first_loss = -tf.reduce_mean(tf.log_sigmoid(self.sign*tf.reduce_sum(tf.multiply(self.h_e, self.t_e), axis=1)))
if self.order == 1:
self.loss = self.first_loss
else:
self.loss = self.second_loss
optimizer = tf.train.AdamOptimizer(0.001)
self.train_op = optimizer.minimize(self.loss)
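# Illustrative sketch, not part of the original file: the objective built in
# build_graph, i.e. the mean of -log(sigmoid(sign * <h, t>)), evaluated in plain
# Python. `log_sigmoid` and `line_loss` are names assumed for this example; the
# model itself uses TensorFlow ops on embedding lookups.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def line_loss(head_vecs, tail_vecs, sign):
    """Mean negative log-sigmoid of sign * dot(head, tail) over a batch."""
    total = 0.0
    for h, t in zip(head_vecs, tail_vecs):
        score = sum(a * b for a, b in zip(h, t))
        total += log_sigmoid(sign * score)
    return -total / len(head_vecs)


# The same pair scored as a positive edge (sign=+1) and as a negative sample (sign=-1).
pos = line_loss([[1.0, 0.0]], [[1.0, 0.0]], sign=1.0)
neg = line_loss([[1.0, 0.0]], [[1.0, 0.0]], sign=-1.0)
```

# For order 1 the tail vectors come from the same embedding table as the heads;
# for order 2 they come from the separate context table.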
def train_one_epoch(self):
sum_loss = 0.0
batches = self.batch_iter()
batch_id = 0
for batch in batches:
h, t, sign = batch
feed_dict = {
self.h : h,
self.t : t,
self.sign : sign,
}
_, cur_loss = self.sess.run([self.train_op, self.loss],feed_dict)
sum_loss += cur_loss
batch_id += 1
print('epoch:{} sum of loss:{!s}'.format(self.cur_epoch, sum_loss))
self.cur_epoch += 1
def batch_iter(self):
look_up = self.g.look_up_dict
table_size = int(1e8) # integer, because it is used as a random.randint bound below
numNodes = self.node_size
edges = [(look_up[x[0]], look_up[x[1]]) for x in self.g.G.edges()]
data_size = self.g.G.number_of_edges()
edge_set = set([x[0]*numNodes+x[1] for x in edges])
shuffle_indices = np.random.permutation(np.arange(data_size))
# positive or negative mod
mod = 0
mod_size = 1 + self.negative_ratio
h = []
t = []
sign = 0
start_index = 0
end_index = min(start_index+self.batch_size, data_size)
while start_index < data_size:
if mod == 0:
sign = 1.
h = []
t = []
for i in range(start_index, end_index):
if not random.random() < self.edge_prob[shuffle_indices[i]]:
shuffle_indices[i] = self.edge_alias[shuffle_indices[i]]
cur_h = edges[shuffle_indices[i]][0]
cur_t = edges[shuffle_indices[i]][1]
h.append(cur_h)
t.append(cur_t)
else:
sign = -1.
t = []
for i in range(len(h)):
t.append(self.sampling_table[random.randint(0, table_size-1)])
yield h, t, [sign]
mod += 1
mod %= mod_size
if mod == 0:
start_index = end_index
end_index = min(start_index+self.batch_size, data_size)
def gen_sampling_table(self):
table_size = 1e8
power = 0.75
numNodes = self.node_size
print("Pre-processing for non-uniform negative sampling!")
node_degree = np.zeros(numNodes) # out degree
look_up = self.g.look_up_dict
for edge in self.g.G.edges():
node_degree[look_up[edge[0]]] += self.g.G[edge[0]][edge[1]]["weight"]
norm = sum([math.pow(node_degree[i], power) for i in range(numNodes)])
self.sampling_table = np.zeros(int(table_size), dtype=np.uint32)
p = 0
i = 0
for j in range(numNodes):
p += float(math.pow(node_degree[j], power)) / norm
while i < table_size and float(i) / table_size < p:
self.sampling_table[i] = j
i += 1
data_size = self.g.G.number_of_edges()
self.edge_alias = np.zeros(data_size, dtype=np.int32)
self.edge_prob = np.zeros(data_size, dtype=np.float32)
large_block = np.zeros(data_size, dtype=np.int32)
small_block = np.zeros(data_size, dtype=np.int32)
total_sum = sum([self.g.G[edge[0]][edge[1]]["weight"] for edge in self.g.G.edges()])
norm_prob = [self.g.G[edge[0]][edge[1]]["weight"]*data_size/total_sum for edge in self.g.G.edges()]
num_small_block = 0
num_large_block = 0
cur_small_block = 0
cur_large_block = 0
for k in range(data_size-1, -1, -1):
if norm_prob[k] < 1:
small_block[num_small_block] = k
num_small_block += 1
else:
large_block[num_large_block] = k
num_large_block += 1
while num_small_block and num_large_block:
num_small_block -= 1
cur_small_block = small_block[num_small_block]
num_large_block -= 1
cur_large_block = large_block[num_large_block]
self.edge_prob[cur_small_block] = norm_prob[cur_small_block]
self.edge_alias[cur_small_block] = cur_large_block
norm_prob[cur_large_block] = norm_prob[cur_large_block] + norm_prob[cur_small_block] -1
if norm_prob[cur_large_block] < 1:
small_block[num_small_block] = cur_large_block
num_small_block += 1
else:
large_block[num_large_block] = cur_large_block
num_large_block += 1
while num_large_block:
num_large_block -= 1
self.edge_prob[large_block[num_large_block]] = 1
while num_small_block:
num_small_block -= 1
self.edge_prob[small_block[num_small_block]] = 1
def get_embeddings(self):
vectors = {}
embeddings = self.embeddings.eval(session=self.sess)
# embeddings = self.sess.run(tf.nn.l2_normalize(self.embeddings.eval(session=self.sess), 1))
look_back = self.g.look_back_list
for i, embedding in enumerate(embeddings):
vectors[look_back[i]] = embedding
return vectors
class LINE(object):
def __init__(self, graph, rep_size=128, batch_size=1000, epoch=10, negative_ratio=5, order=3, label_file = None, clf_ratio = 0.5, auto_save = True):
self.rep_size = rep_size
self.order = order
self.best_result = 0
self.vectors = {}
if order == 3:
self.model1 = _LINE(graph, rep_size//2, batch_size, negative_ratio, order=1) # integer division: the size is used as a tensor shape
self.model2 = _LINE(graph, rep_size//2, batch_size, negative_ratio, order=2)
for i in range(epoch):
self.model1.train_one_epoch()
self.model2.train_one_epoch()
'''
if label_file:
self.get_embeddings()
X, Y = read_node_label(label_file)
print("Training classifier using {:.2f}% nodes...".format(clf_ratio*100))
clf = Classifier(vectors=self.vectors, clf=LogisticRegression())
result = clf.split_train_evaluate(X, Y, clf_ratio)
if result['macro'] > self.best_result:
self.best_result = result['macro']
if auto_save:
self.best_vector = self.vectors
'''
else:
self.model = _LINE(graph, rep_size, batch_size, negative_ratio, order=self.order)
for i in range(epoch):
self.model.train_one_epoch()
'''
if label_file:
self.get_embeddings()
X, Y = read_node_label(label_file)
print("Training classifier using {:.2f}% nodes...".format(clf_ratio*100))
clf = Classifier(vectors=self.vectors, clf=LogisticRegression())
result = clf.split_train_evaluate(X, Y, clf_ratio)
if result['macro'] > self.best_result:
self.best_result = result['macro']
if auto_save:
self.best_vector = self.vectors
'''
self.get_embeddings()
if auto_save and label_file:
#self.vectors = self.best_vector
pass
def get_embeddings(self):
self.last_vectors = self.vectors
self.vectors = {}
if self.order == 3:
vectors1 = self.model1.get_embeddings()
vectors2 = self.model2.get_embeddings()
for node in vectors1.keys():
self.vectors[node] = np.append(vectors1[node], vectors2[node])
else:
self.vectors = self.model.get_embeddings()
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.rep_size))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,
' '.join([str(x) for x in vec])))
fout.close()
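# Illustrative sketch, not part of the original file: the negative-sampling table
# from gen_sampling_table, where a node receives slots in proportion to
# degree ** 0.75. The small `table_size` and the padding of any slots left over
# by rounding are assumptions of this sketch (the original uses 1e8 slots in a
# NumPy array).

```python
def build_sampling_table(degrees, table_size=1000, power=0.75):
    """Fill a lookup table so that node j occupies about degree_j**power / norm of it."""
    weights = [d ** power for d in degrees]
    norm = sum(weights)
    table = []
    cum = 0.0
    i = 0
    for node, w in enumerate(weights):
        cum += w / norm
        while i < table_size and i / table_size < cum:
            table.append(node)
            i += 1
    # Floating-point rounding may leave the last slots unfilled; pad with the last node.
    while len(table) < table_size:
        table.append(len(degrees) - 1)
    return table


# Node 1 has 16x the degree of node 0 but only 16**0.75 = 8x the sampling weight.
table = build_sampling_table([1, 16])
```

# Drawing a negative sample is then a uniform index into `table`, which is what
# batch_iter does with self.sampling_table.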

47
src/libnrl/node2vec.py Normal file

@ -0,0 +1,47 @@
from __future__ import print_function
import time
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
from . import walker
class Node2vec(object):
def __init__(self, graph, path_length, num_paths, dim, p=1.0, q=1.0, dw=False, **kwargs):
kwargs["workers"] = kwargs.get("workers", 1)
if dw:
kwargs["hs"] = 1
p = 1.0
q = 1.0
self.graph = graph
if dw:
self.walker = walker.BasicWalker(graph, workers=kwargs["workers"])
else:
self.walker = walker.Walker(graph, p=p, q=q, workers=kwargs["workers"])
print("Preprocess transition probs...")
self.walker.preprocess_transition_probs()
sentences = self.walker.simulate_walks(num_walks=num_paths, walk_length=path_length)
kwargs["sentences"] = sentences
kwargs["min_count"] = kwargs.get("min_count", 0)
kwargs["size"] = kwargs.get("size", dim)
kwargs["sg"] = 1
self.size = kwargs["size"]
print("Learning representation...")
word2vec = Word2Vec(**kwargs)
self.vectors = {}
for word in graph.G.nodes():
self.vectors[word] = word2vec.wv[word]
del word2vec
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.size))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,
' '.join([str(x) for x in vec])))
fout.close()
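# Illustrative sketch, not part of the original file: the walks above are passed
# to Word2Vec as sentences with sg=1 (skip-gram), which learns from
# (center, context) pairs inside a window. The helper `skipgram_pairs` is a
# hypothetical name and only shows which pairs a fixed window yields; it is not
# gensim's implementation.

```python
def skipgram_pairs(walk, window=2):
    """List (center, context) pairs for every position in a walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


pairs = skipgram_pairs(["a", "b", "c", "d"], window=1)
```

# Longer walks and wider windows produce more pairs per node, which is the
# trade-off controlled by path_length and the Word2Vec window setting.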

128
src/libnrl/tadw.py Normal file

@ -0,0 +1,128 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import math
import numpy as np
from numpy import linalg as la
from sklearn.preprocessing import normalize
from .gcn.utils import *
'''
#-----------------------------------------------------------------------------
# part of code was originally forked from https://github.com/thunlp/OpenNE
# modified by Chengbin Hou 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
class TADW(object):
def __init__(self, graph, dim, lamb=0.2):
self.g = graph
self.lamb = lamb
self.dim = dim
self.train()
def getAdj(self): # changed to use the shared data preprocessing, which gave better results than the original code kept below
'''
graph = self.g.G
node_size = self.g.node_size
look_up = self.g.look_up_dict
adj = np.zeros((node_size, node_size))
for edge in self.g.G.edges():
adj[look_up[edge[0]]][look_up[edge[1]]] = 1.0
adj[look_up[edge[1]]][look_up[edge[0]]] = 1.0
# ScaleSimMat
return adj/np.sum(adj, axis=1) # the original way; may sometimes cause numerical errors
'''
A = self.g.getA()
return self.g.rowAsPDF(A)
def getT(self): #changed with the same data preprocessing method
g = self.g.G
look_back = self.g.look_back_list
self.features = np.vstack([g.nodes[look_back[i]]['feature']
for i in range(g.number_of_nodes())])
self.preprocessFeature() #call the orig data preprocessing method
return self.features.T
'''
#changed with the same data preprocessing method, see self.g.preprocessAttrInfo(X=X, dim=200, method='svd')
#seems get better result?
X = self.g.getX()
self.features = self.g.preprocessAttrInfo(X=X, dim=200, method='svd') #svd or pca for dim reduction
return np.transpose(self.features)
'''
def preprocessFeature(self): # the original data preprocessing method
U, S, VT = la.svd(self.features)
Ud = U[:, 0:200]
Sd = S[0:200]
self.features = np.array(Ud)*Sd.reshape(200)
def save_embeddings(self, filename):
fout = open(filename, 'w')
node_num = len(self.vectors.keys())
fout.write("{} {}\n".format(node_num, self.dim))
for node, vec in self.vectors.items():
fout.write("{} {}\n".format(node,' '.join([str(x) for x in vec])))
fout.close()
def train(self):
self.adj = self.getAdj()
# M=(A+A^2)/2 where A is the row-normalized adjacency matrix
self.M = (self.adj + np.dot(self.adj, self.adj))/2
# T is feature_size*node_num, text features
self.T = self.getT() #transpose of self.features!!!
self.node_size = self.adj.shape[0]
self.feature_size = self.features.shape[1]
self.W = np.random.randn(self.dim, self.node_size)
self.H = np.random.randn(self.dim, self.feature_size)
# Update
for i in range(20): #trade-off between acc and speed, 20-50
print('Iteration ', i)
# Update W
B = np.dot(self.H, self.T)
drv = 2 * np.dot(np.dot(B, B.T), self.W) - \
2*np.dot(B, self.M.T) + self.lamb*self.W
Hess = 2*np.dot(B, B.T) + self.lamb*np.eye(self.dim)
drv = np.reshape(drv, [self.dim*self.node_size, 1])
rt = -drv
dt = rt
vecW = np.reshape(self.W, [self.dim*self.node_size, 1])
while np.linalg.norm(rt, 2) > 1e-4:
dtS = np.reshape(dt, (self.dim, self.node_size))
Hdt = np.reshape(np.dot(Hess, dtS), [self.dim*self.node_size, 1])
at = np.dot(rt.T, rt)/np.dot(dt.T, Hdt)
vecW = vecW + at*dt
rtmp = rt
rt = rt - at*Hdt
bt = np.dot(rt.T, rt)/np.dot(rtmp.T, rtmp)
dt = rt + bt * dt
self.W = np.reshape(vecW, (self.dim, self.node_size))
# Update H
drv = np.dot((np.dot(np.dot(np.dot(self.W, self.W.T),self.H),self.T)
- np.dot(self.W, self.M.T)), self.T.T) + self.lamb*self.H
drv = np.reshape(drv, (self.dim*self.feature_size, 1))
rt = -drv
dt = rt
vecH = np.reshape(self.H, (self.dim*self.feature_size, 1))
while np.linalg.norm(rt, 2) > 1e-4:
dtS = np.reshape(dt, (self.dim, self.feature_size))
Hdt = np.reshape(np.dot(np.dot(np.dot(self.W, self.W.T), dtS), np.dot(self.T, self.T.T))
+ self.lamb*dtS, (self.dim*self.feature_size, 1))
at = np.dot(rt.T, rt)/np.dot(dt.T, Hdt)
vecH = vecH + at*dt
rtmp = rt
rt = rt - at*Hdt
bt = np.dot(rt.T, rt)/np.dot(rtmp.T, rtmp)
dt = rt + bt * dt
self.H = np.reshape(vecH, (self.dim, self.feature_size))
self.Vecs = np.hstack((normalize(self.W.T), normalize(np.dot(self.T.T, self.H.T))))
# get embeddings
self.vectors = {}
look_back = self.g.look_back_list
for i, embedding in enumerate(self.Vecs):
self.vectors[look_back[i]] = embedding
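# Illustrative sketch, not part of the original file: the target matrix
# M = (A + A^2) / 2 from train(), computed in plain Python for a 3-node star
# graph. `matmul` and `row_normalize` are helper names assumed for this example,
# and the sketch assumes no all-zero rows; the class itself works on NumPy arrays.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def row_normalize(adj):
    """Divide every row by its sum (assumes no all-zero rows)."""
    return [[x / float(sum(row)) for x in row] for row in adj]


# Star graph: node 0 is connected to nodes 1 and 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
A = row_normalize(adj)
A2 = matmul(A, A)
M = [[(a + b) / 2.0 for a, b in zip(ra, rb)] for ra, rb in zip(A, A2)]
```

# Every row of M still sums to 1, since it averages the 1-step and 2-step
# transition probabilities.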

260
src/libnrl/utils.py Normal file

@ -0,0 +1,260 @@
# -*- coding: utf-8 -*-
import time # needed by dim_reduction below
import numpy as np
from scipy import sparse
# from sklearn.model_selection import train_test_split
'''
#-----------------------------------------------------------------------------
# Chengbin Hou @ SUSTech 2018
# Email: Chengbin.Hou10@foxmail.com
#-----------------------------------------------------------------------------
'''
# ---------------------------------utils for calculation--------------------------------
def row_as_probdist(mat):
"""Make each row of the matrix sum to 1.0, i.e., a probability distribution.
Supports both dense and sparse matrices.
Parameters
----------
mat : scipy sparse matrix or dense matrix or numpy array
The matrix to be normalized
Note
----
For row with all entries 0, we normalize it to a vector with all entries 1/n
Returns
-------
dense or sparse matrix:
return dense matrix if input is dense matrix or numpy array
return sparse matrix for sparse matrix input
"""
row_sum = np.array(mat.sum(axis=1)) # type: np.array
zero_rows = row_sum == 0
row_sum[zero_rows] = 1
diag = sparse.dia_matrix((1 / row_sum, 0), (mat.shape[0], mat.shape[0]))
mat = diag.dot(mat)
mat += sparse.bsr_matrix(zero_rows.astype(int)).T.dot(sparse.bsr_matrix(np.repeat(1 / mat.shape[1], mat.shape[1])))
return mat
def pairwise_similarity(mat, type='cosine'):
# XXX: possible to integrate pairwise_similarity with top_k to enhance performance?
if type == 'cosine': # supports sparse and dense matrices
from sklearn.metrics.pairwise import cosine_similarity
result = cosine_similarity(mat, dense_output=True)
elif type == 'jaccard':
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics.pairwise import pairwise_distances
# n_jobs=-1 means using all CPU for parallel computing
result = pairwise_distances(mat.todense(), metric=jaccard_similarity_score, n_jobs=-1)
elif type == 'euclidean':
from sklearn.metrics.pairwise import euclidean_distances
# note: similarity = - distance
# other version: similarity = 1 - 2 / pi * arctan(distance)
result = euclidean_distances(mat)
result = -result
# result = 1 - 2 / np.pi * np.arctan(result)
elif type == 'manhattan':
from sklearn.metrics.pairwise import manhattan_distances
# note: similarity = - distance
# other version: similarity = 1 - 2 / pi * arctan(distance)
result = manhattan_distances(mat)
result = -result
# result = 1 - 2 / np.pi * np.arctan(result)
else:
print('Please choose from: cosine, jaccard, euclidean or manhattan')
return 'Not found!'
return result
# ---------------------------------utils for preprocessing--------------------------------
def node_auxi_to_attr(fin, fout):
""" TODO...
-> read auxi info associated with each node;
-> preprocessing auxi via:
1) NLP for sentences; or 2) one-hot for discrete features;
-> then becomes node attr with m dim, and store them into attr file
"""
# https://radimrehurek.com/gensim/apiref.html
# word2vec, doc2vec: convert sentences to vectors
# text2vec, tfidf: convert discrete features to vectors
pass
def simulate_incomplete_stru():
pass
def simulate_incomplete_attr():
pass
def simulate_noisy_world():
pass
# ---------------------------------utils for downstream tasks--------------------------------
# XXX: read and save using panda or numpy
def read_edge_label_downstream(filename):
fin = open(filename, 'r')
X = []
Y = []
while 1:
line = fin.readline()
if line == '':
break
vec = line.strip().split(' ')
X.append(vec[:2])
Y.append(vec[2])
fin.close()
return X, Y
def read_node_label_downstream(filename):
""" may be used in node classification task;
part of labels for training clf and
the result served as ground truth;
note: similar method can be found in graph.py -> read_node_label
"""
fin = open(filename, 'r')
X = []
Y = []
while 1:
line = fin.readline()
if line == '':
break
vec = line.strip().split(' ')
X.append(vec[0])
Y.append(vec[1:])
fin.close()
return X, Y
def store_embedddings(vectors, filename, dim):
""" store embeddings to file
"""
fout = open(filename, 'w')
num_nodes = len(vectors.keys())
fout.write("{} {}\n".format(num_nodes, dim))
for node, vec in vectors.items():
fout.write("{} {}\n".format(node, ' '.join([str(x) for x in vec])))
fout.close()
print('store the resulting embeddings in file: ', filename)
def load_embeddings(filename):
""" load embeddings from file
"""
fin = open(filename, 'r')
num_nodes, size = [int(x) for x in fin.readline().strip().split()]
vectors = {}
while 1:
line = fin.readline()
if line == '':
break
vec = line.strip().split(' ')
assert len(vec) == size + 1
vectors[vec[0]] = [float(x) for x in vec[1:]]
fin.close()
assert len(vectors) == num_nodes
return vectors
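# Illustrative sketch, not part of the original file: a round trip through the
# plain-text embedding format used by store_embedddings and load_embeddings
# (a "num_nodes dim" header, then one "node v1 v2 ..." line per node).
# `write_embeddings` and `read_embeddings` are names assumed for this example,
# and an in-memory buffer stands in for a file on disk.

```python
import io


def write_embeddings(vectors, fout, dim):
    """Write the header line followed by one line per node."""
    fout.write("{} {}\n".format(len(vectors), dim))
    for node, vec in vectors.items():
        fout.write("{} {}\n".format(node, " ".join(str(x) for x in vec)))


def read_embeddings(fin):
    """Parse the format written above; node IDs come back as strings."""
    num_nodes, dim = [int(x) for x in fin.readline().split()]
    vectors = {}
    for line in fin:
        parts = line.split()
        if not parts:
            continue
        assert len(parts) == dim + 1
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == num_nodes
    return vectors


buf = io.StringIO()
write_embeddings({"a": [0.5, -1.0], "b": [2.0, 0.25]}, buf, dim=2)
buf.seek(0)
loaded = read_embeddings(buf)
```

# Note that integer node IDs are read back as strings, so callers may need to
# convert keys before looking up nodes in a graph.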
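#----------------- Below: for the functions you moved into utils, any problems are described in the comments; functions without such notes have no known issues for now and can be left alone -----------------------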
def generate_edges_for_linkpred(graph, edges_removed, balance_ratio=1.0):
''' given a graph and edges_removed;
generate non_edges not in [both graph and edges_removed];
return all_test_samples including [edges_removed (pos samples), non_edges (neg samples)];
return format X=[[1,2],[2,4],...] Y=[1,0,...] where Y tells whether the corresponding node pair has an edge
'''
g = graph
num_edges_removed = len(edges_removed)
num_non_edges = int(balance_ratio * num_edges_removed)
num = 0
#np.random.seed(2018)
non_edges = []
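# store node pairs as frozensets: graph edges are tuples while sampled pairs are lists, so a plain list lookup never matches them (assumes an undirected graph)
exist_edges = set(frozenset(e) for e in list(g.G.edges())+list(edges_removed))
while num < num_non_edges:
non_edge = list(np.random.choice(g.look_back_list, size=2, replace=False))
if frozenset(non_edge) not in exist_edges: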
num += 1
non_edges.append(non_edge)
test_node_pairs = edges_removed + non_edges
test_edge_labels = list(np.ones(num_edges_removed)) + list(np.zeros(num_non_edges))
return test_node_pairs, test_edge_labels
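# Illustrative sketch, not part of the original file: sampling negative
# (non-edge) node pairs for link prediction with constant-time membership checks.
# `sample_non_edges` is a hypothetical helper; unlike the function above it also
# rejects duplicate negatives, and it assumes an undirected graph without self-loops.

```python
import random


def sample_non_edges(nodes, edges, num_samples, seed=0):
    """Return `num_samples` distinct node pairs that are not in `edges`."""
    rng = random.Random(seed)
    existing = {frozenset(e) for e in edges}          # direction-agnostic lookup
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if num_samples > max_possible:
        raise ValueError("not enough non-edges to sample from")
    non_edges = set()
    while len(non_edges) < num_samples:
        u, v = rng.sample(nodes, 2)                   # two distinct nodes
        pair = frozenset((u, v))
        if pair not in existing:
            non_edges.add(pair)
    return [sorted(p) for p in non_edges]


nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3)]
neg = sample_non_edges(nodes, edges, 3)
labels = [1] * len(edges) + [0] * len(neg)            # positives first, then negatives
```

# The explicit size check avoids an endless loop when the graph is too dense to
# supply the requested number of negatives.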
def dim_reduction(mat, dim=128, method='pca'):
''' dimensionality reduction: PCA, SVD, etc...
dim = # of columns
'''
print('START dimensionality reduction using ' + method + ' ......')
t1 = time.time()
if method == 'pca':
from sklearn.decomposition import PCA
pca = PCA(n_components=dim, svd_solver='auto', random_state=None)
mat_reduced = pca.fit_transform(mat) #sklearn pca auto remove mean, no need to preprocess
elif method == 'svd':
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=dim, n_iter=5, random_state=None)
mat_reduced = svd.fit_transform(mat)
else: #to do... more methods... e.g. random projection, ica, t-sne...
raise ValueError('dimensionality reduction method not found: ' + method) # otherwise mat_reduced would be undefined at the return below
t2 = time.time()
print('END dimensionality reduction: {:.2f}s'.format(t2-t1))
return mat_reduced
def row_normalized(mat, is_transition_matrix=False):
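''' to do...
known issues: 1) in this scenario sparse matrices are slower than dense ones (at least with the code I wrote);
2) testing with a dense matrix shows that the sum of all entries is not an integer, so it seems my earlier clumsy method is still needed to compensate;
3) with is_transition_matrix, all-zero rows need to be assigned a value; for sparse matrices this is slightly problematic because mat[i, :] = p cannot be assigned directly
'''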
p = 1.0/mat.shape[0] #probability = 1/num of rows
norms = np.asarray(mat.sum(axis=1)).ravel()
for i, norm in enumerate(norms):
if norm != 0:
mat[i, :] /= norm
else:
if is_transition_matrix:
mat[i, :] = p #every row of transition matrix should sum up to 1
else:
pass #do nothing; keep all-zero row
return mat
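''' the clumsy method is as follows '''
def rowAsPDF(mat): #make each row sum up to 1, i.e. a probability distribution
mat = np.array(mat, dtype=np.float64) # float dtype so the in-place division is not truncated for integer input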
for i in range(mat.shape[0]):
sum_row = mat[i,:].sum()
if sum_row !=0:
mat[i,:] = mat[i,:]/sum_row #if a row [0, 1, 1, 1] -> [0, 1/3, 1/3, 1/3] -> may have some small issue...
else:
# to do...
# for node without any link... remain row as [0, 0, 0, 0] OR set to [1/n, 1/n, 1/n...]??
pass
if mat[i,:].sum() != 1.00: #small trick to make sure each row is a pdf (the clumsy method)
error = 1.00 - mat[i,:].sum()
mat[i,-1] += error
return mat
def sparse_to_dense():
''' to dense np.matrix format; please fill this in, and remember to use dtype float64'''
import scipy.sparse as sp
pass
def dense_to_sparse():
''' to sparse CSR format; please fill this in, and remember to use dtype float64'''
import scipy.sparse as sp
pass
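# Illustrative sketch, not part of the original file: row normalization with an
# explicit policy for all-zero rows, the open question noted in row_normalized and
# rowAsPDF above. `row_normalize` and its `uniform_for_zero_rows` flag are names
# assumed for this example; plain lists stand in for NumPy/SciPy matrices.

```python
def row_normalize(rows, uniform_for_zero_rows=False):
    """Scale each row to sum to 1; all-zero rows stay zero or become uniform."""
    n_cols = len(rows[0])
    out = []
    for row in rows:
        total = float(sum(row))
        if total != 0:
            out.append([x / total for x in row])
        elif uniform_for_zero_rows:
            out.append([1.0 / n_cols] * n_cols)   # isolated node: uniform distribution
        else:
            out.append([0.0] * n_cols)            # keep the all-zero row
    return out


P = row_normalize([[0, 1, 1, 1], [0, 0, 0, 0]], uniform_for_zero_rows=True)
```

# A transition matrix needs the uniform option so every row remains a valid
# distribution; for a plain similarity matrix, keeping zero rows may be preferable.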

327
src/libnrl/walker.py Normal file

@ -0,0 +1,327 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import multiprocessing
import random
import time
from itertools import chain
import numpy as np
import networkx as nx
'''
#-----------------------------------------------------------------------------
# part of code was originally forked from https://github.com/thunlp/OpenNE
# modified by Chengbin Hou @ SUSTech 2018
# Email: Chengbin.Hou10@foxmail.com
# ***class BiasedWalker was created by Chengbin Hou
# ***we realize two ways to do ABRW
# 1) naive sampling (also multi-processor version)
# 2) alias sampling (similar to node2vec)
#-----------------------------------------------------------------------------
'''
def deepwalk_walk_wrapper(class_instance, walk_length, start_node):
return class_instance.deepwalk_walk(walk_length, start_node) # return the walk so a worker pool could collect it
# ===========================================ABRW-weighted-walker============================================
class BiasedWalker: # ------ our method
def __init__(self, g, P, workers):
self.G = g.G # nx data structure
self.P = P # biased transition probability; n*n; each row is a pdf for a node
self.workers = workers
self.node_size = g.node_size
self.look_back_list = g.look_back_list
self.look_up_dict = g.look_up_dict
# alias sampling for ABRW-------------------------------------------------------------------
def simulate_walks(self, num_walks, walk_length):
self.P_G = nx.to_networkx_graph(self.P, create_using=nx.DiGraph()) # create a new nx graph based on ABRW transition prob matrix
t1 = time.time()
self.preprocess_transition_probs() # note: we simply adapt node2vec
t2 = time.time()
print('Time for construct alias table: {:.2f}'.format(t2-t1))
walks = []
nodes = list(self.P_G.nodes())
print('Walk iteration:')
for walk_iter in range(num_walks):
print(str(walk_iter+1), '/', str(num_walks))
random.shuffle(nodes)
for node in nodes:
walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))
for i in range(len(walks)): # use the index to retrieve the original node ID
for j in range(len(walks[i])): # walks can differ in length when a walk stops early
walks[i][j] = self.look_back_list[int(walks[i][j])]
return walks
def node2vec_walk(self, walk_length, start_node): # to do...
G = self.P_G # more efficient way instead of copy from node2vec
alias_nodes = self.alias_nodes
walk = [start_node]
while len(walk) < walk_length:
cur = walk[-1]
cur_nbrs = list(G.neighbors(cur))
if len(cur_nbrs) > 0:
walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
else:
break
return walk
def preprocess_transition_probs(self):
G = self.P_G
alias_nodes = {}
for node in G.nodes():
unnormalized_probs = [G[node][nbr]['weight'] for nbr in G.neighbors(node)]
norm_const = sum(unnormalized_probs)
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
alias_nodes[node] = alias_setup(normalized_probs)
self.alias_nodes = alias_nodes
'''
#naive sampling for ABRW-------------------------------------------------------------------
def weighted_walk(self, start_node):
#
#Simulate a weighted walk starting from start node.
#
G = self.G
look_up_dict = self.look_up_dict
look_back_list = self.look_back_list
node_size = self.node_size
walk = [start_node]
while len(walk) < self.walk_length:
cur_node = walk[-1] #the last one entry/node
cur_ind = look_up_dict[cur_node] #key -> index
pdf = self.P[cur_ind,:] #the pdf of node with ind
#pdf = np.random.randn(18163)+10 #......test multiprocessor
#pdf = pdf / pdf.sum() #......test multiprocessor
#next_ind = int( np.array( nx.utils.random_sequence.discrete_sequence(n=1,distribution=pdf) ) )
next_ind = np.random.choice(len(pdf), 1, p=pdf)[0] #faster than nx
#next_ind = 0 #......test multiprocessor
next_node = look_back_list[next_ind] #index -> key
walk.append(next_node)
return walk
def simulate_walks(self, num_walks, walk_length):
#
#Repeatedly simulate weighted walks from each node.
#
G = self.G
self.num_walks = num_walks
self.walk_length = walk_length
self.walks = [] #what we all need later as input to skip-gram
nodes = list(G.nodes())
print('Walk iteration:')
for walk_iter in range(num_walks):
t1 = time.time()
random.shuffle(nodes)
for node in nodes: #for single cpu, if # of nodes < 2000 (speed up) or nodes > 20000 (avoid memory error)
self.walks.append(self.weighted_walk(node)) #for single cpu, if # of nodes < 2000 (speed up) or nodes > 20000 (avoid memory error)
#pool = multiprocessing.Pool(processes=3) #use all cpu by defalut or specify processes = xx
#self.walks.append(pool.map(self.weighted_walk, nodes)) #ref: https://stackoverflow.com/questions/8533318/multiprocessing-pool-when-to-use-apply-apply-async-or-map
#pool.close()
#pool.join()
t2 = time.time()
print(str(walk_iter+1), '/', str(num_walks), ' each itr last for: {:.2f}s'.format(t2-t1))
#self.walks = list(chain.from_iterable(self.walks)) #unlist...[[[x,x],[x,x]]] -> [x,x], [x,x]
return self.walks
'''
# ===========================================deepWalk-walker============================================
class BasicWalker:
def __init__(self, G, workers):
self.G = G.G
self.node_size = G.get_num_nodes()
self.look_up_dict = G.look_up_dict
def deepwalk_walk(self, walk_length, start_node):
'''
Simulate a random walk starting from start node.
'''
G = self.G
look_up_dict = self.look_up_dict
node_size = self.node_size
walk = [start_node]
while len(walk) < walk_length:
cur = walk[-1]
cur_nbrs = list(G.neighbors(cur))
if len(cur_nbrs) > 0:
walk.append(random.choice(cur_nbrs))
else:
break
return walk
def simulate_walks(self, num_walks, walk_length):
'''
Repeatedly simulate random walks from each node.
'''
G = self.G
walks = []
nodes = list(G.nodes())
print('Walk iteration:')
for walk_iter in range(num_walks):
# pool = multiprocessing.Pool(processes = 4)
print(str(walk_iter+1), '/', str(num_walks))
random.shuffle(nodes)
for node in nodes:
# walks.append(pool.apply_async(deepwalk_walk_wrapper, (self, walk_length, node, )))
walks.append(self.deepwalk_walk(walk_length=walk_length, start_node=node))
# pool.close()
# pool.join()
# print(len(walks))
return walks
# ===========================================node2vec-walker============================================
class Walker:
def __init__(self, G, p, q, workers):
self.G = G.G
self.p = p
self.q = q
self.node_size = G.node_size
self.look_up_dict = G.look_up_dict
def node2vec_walk(self, walk_length, start_node):
'''
Simulate a random walk starting from start node.
'''
G = self.G
alias_nodes = self.alias_nodes
alias_edges = self.alias_edges
look_up_dict = self.look_up_dict
node_size = self.node_size
walk = [start_node]
while len(walk) < walk_length:
cur = walk[-1]
cur_nbrs = list(G.neighbors(cur))
if len(cur_nbrs) > 0:
if len(walk) == 1:
walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
else:
prev = walk[-2]
pos = (prev, cur)
next = cur_nbrs[alias_draw(alias_edges[pos][0],
alias_edges[pos][1])]
walk.append(next)
else:
break
return walk
def simulate_walks(self, num_walks, walk_length):
'''
Repeatedly simulate random walks from each node.
'''
G = self.G
walks = []
nodes = list(G.nodes())
print('Walk iteration:')
for walk_iter in range(num_walks):
print(str(walk_iter+1), '/', str(num_walks))
random.shuffle(nodes)
for node in nodes:
walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))
return walks
def get_alias_edge(self, src, dst):
'''
Get the alias edge setup lists for a given edge.
'''
G = self.G
p = self.p
q = self.q
unnormalized_probs = []
for dst_nbr in G.neighbors(dst):
if dst_nbr == src:
unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)
elif G.has_edge(dst_nbr, src):
unnormalized_probs.append(G[dst][dst_nbr]['weight'])
else:
unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)
norm_const = sum(unnormalized_probs)
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
return alias_setup(normalized_probs)
def preprocess_transition_probs(self):
'''
Preprocessing of transition probabilities for guiding the random walks.
'''
G = self.G
alias_nodes = {}
for node in G.nodes():
unnormalized_probs = [G[node][nbr]['weight'] for nbr in G.neighbors(node)]
norm_const = sum(unnormalized_probs)
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
alias_nodes[node] = alias_setup(normalized_probs)
alias_edges = {}
triads = {}
look_up_dict = self.look_up_dict
node_size = self.node_size
for edge in G.edges():
alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])
self.alias_nodes = alias_nodes
self.alias_edges = alias_edges
return
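The second-order bias computed in `get_alias_edge` can be sketched without NetworkX. This is a minimal illustration, assuming a plain adjacency dict in place of `self.G`; the names `second_order_weights`, `normalize` and the toy graph are hypothetical and not part of the repository.

```python
def second_order_weights(adj, prev, cur, p, q):
    """Unnormalized weights for leaving `cur` after arriving from `prev`.

    adj maps node -> {neighbor: edge weight} (hypothetical stand-in for the graph).
    """
    weights = {}
    for nxt, w in adj[cur].items():
        if nxt == prev:          # stepping back to the previous node
            weights[nxt] = w / p
        elif prev in adj[nxt]:   # nxt is also connected to prev
            weights[nxt] = w
        else:                    # nxt moves further away from prev
            weights[nxt] = w / q
    return weights


def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy undirected graph stored with both directions, unit weights.
TOY_ADJ = {
    'a': {'b': 1, 'c': 1},
    'b': {'a': 1, 'c': 1, 'd': 1},
    'c': {'a': 1, 'b': 1},
    'd': {'b': 1},
}

if __name__ == "__main__":
    w = second_order_weights(TOY_ADJ, prev='a', cur='b', p=0.5, q=2.0)
    print(w)
    print(normalize(w))
```

With a small `p` the walk tends to return to the node it came from, while a large `q` discourages moving to nodes that are not connected to the previous node.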
def alias_setup(probs):
'''
Compute utility lists for non-uniform sampling from discrete distributions.
Refer to https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/
for details
'''
K = len(probs)
q = np.zeros(K, dtype=np.float32)
J = np.zeros(K, dtype=np.int32)
smaller = []
larger = []
for kk, prob in enumerate(probs):
q[kk] = K*prob
if q[kk] < 1.0:
smaller.append(kk)
else:
larger.append(kk)
while len(smaller) > 0 and len(larger) > 0:
small = smaller.pop()
large = larger.pop()
J[small] = large
q[large] = q[large] + q[small] - 1.0
if q[large] < 1.0:
smaller.append(large)
else:
larger.append(large)
return J, q
def alias_draw(J, q):
'''
Draw sample from a non-uniform discrete distribution using alias sampling.
'''
K = len(J)
kk = int(np.floor(np.random.rand()*K))
if np.random.rand() < q[kk]:
return kk
else:
return J[kk]
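To see that `alias_setup` and `alias_draw` reproduce the target distribution, the two functions can be re-stated with the standard library only and checked empirically. This is a sketch under stated assumptions: lists replace the NumPy arrays, a seeded `random.Random` replaces `np.random`, and `empirical_distribution` is a hypothetical helper added for the check.

```python
import random


def alias_setup(probs):
    """Build the alias table J and the threshold table q for a discrete distribution."""
    K = len(probs)
    q = [K * prob for prob in probs]
    J = [0] * K
    smaller = [k for k, v in enumerate(q) if v < 1.0]
    larger = [k for k, v in enumerate(q) if v >= 1.0]
    while smaller and larger:
        small = smaller.pop()
        large = larger.pop()
        J[small] = large                      # column 'small' is topped up by 'large'
        q[large] = q[large] + q[small] - 1.0  # mass left in 'large' after topping up
        if q[large] < 1.0:
            smaller.append(large)
        else:
            larger.append(large)
    return J, q


def alias_draw(J, q, rng):
    """Draw one index in O(1): pick a column uniformly, then keep it or take its alias."""
    kk = int(rng.random() * len(J))
    return kk if rng.random() < q[kk] else J[kk]


def empirical_distribution(probs, num_samples=100000, seed=0):
    """Sample repeatedly and return the observed frequency of each outcome."""
    rng = random.Random(seed)
    J, q = alias_setup(probs)
    counts = [0] * len(probs)
    for _ in range(num_samples):
        counts[alias_draw(J, q, rng)] += 1
    return [c / num_samples for c in counts]


if __name__ == "__main__":
    print(empirical_distribution([0.1, 0.2, 0.7]))
```

The observed frequencies should be close to the input probabilities; the setup cost is linear in the number of outcomes and each draw is constant time, which is why the walker precomputes one table per node and per edge.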

258
src/main.py Normal file

@ -0,0 +1,258 @@
'''
demo of using (attributed) Network Embedding methods;
STEP1: load data -->
STEP2: prepare data -->
STEP3: learn node embeddings -->
STEP4: downstream evaluations
python src/main.py --method abrw --save-emb True
by Chengbin Hou 2018 <chengbin.hou10@foxmail.com>
'''
import time
import random
import numpy as np
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
from sklearn.linear_model import LogisticRegression #to do... 1) put it in downstream.py; and 2) try SVM...
from libnrl.classify import ncClassifier, lpClassifier, read_node_label
from libnrl.graph import *
from libnrl.utils import *
from libnrl import abrw #ANE method; Attributed Biased Random Walk
from libnrl import tadw #ANE method
from libnrl import aane #ANE method
from libnrl import asne #ANE method
from libnrl.gcn import gcnAPI #ANE method
from libnrl.graphsage import graphsageAPI #ANE method
from libnrl import attrcomb #ANE method
from libnrl import attrpure #NE method simply use svd or pca for dim reduction
from libnrl import node2vec #PNE method; including deepwalk and node2vec
from libnrl import line #PNE method
from libnrl.grarep import GraRep #PNE method
#from libnrl import TriDNR #to do... ANE method
#https://github.com/dfdazac/dgi #to do... ANE method
def parse_args():
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter, conflict_handler='resolve')
#-----------------------------------------------general settings--------------------------------------------------
parser.add_argument('--graph-format', default='adjlist', choices=['adjlist', 'edgelist'],
help='graph/network format')
parser.add_argument('--graph-file', default='data/cora/cora_adjlist.txt',
help='graph/network file')
parser.add_argument('--attribute-file', default='data/cora/cora_attr.txt',
help='node attribute/feature file')
parser.add_argument('--label-file', default='data/cora/cora_label.txt',
help='node label file')
parser.add_argument('--emb-file', default='emb/unnamed_node_embs.txt',
help='node embeddings file; suggest: data_method_dim_embs.txt')
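#type=bool would turn any non-empty string (even 'False') into True, so the text is parsed explicitly
parser.add_argument('--save-emb', default=False, type=lambda s: s.lower() in ('true', 't', 'yes', '1'),
help='save emb to disk if True')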
parser.add_argument('--dim', default=128, type=int,
help='node embeddings dimensions')
parser.add_argument('--task', default='lp_and_nc', choices=['none', 'lp', 'nc', 'lp_and_nc'],
help='choices of downstream tasks: none, lp, nc, lp_and_nc')
parser.add_argument('--link-remove', default=0.1, type=float,
help='simulate randomly missing links if necessary; a ratio ranging [0.0, 1.0]')
#parser.add_argument('--attr-remove', default=0.0, type=float,
# help='simulate randomly missing attributes if necessary; a ratio ranging [0.0, 1.0]')
parser.add_argument('--link-reserved', default=0.7, type=float, #read as args.link_reserved by the asne branch in main()
help='for lp task, train/test split, a ratio ranging [0.0, 1.0]')
parser.add_argument('--label-reserved', default=0.7, type=float,
help='for nc task, train/test split, a ratio ranging [0.0, 1.0]')
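#type=bool would turn any non-empty string (even 'False') into True, so the text is parsed explicitly
parser.add_argument('--directed', default=False, type=lambda s: s.lower() in ('true', 't', 'yes', '1'),
help='directed or undirected graph')
parser.add_argument('--weighted', default=False, type=lambda s: s.lower() in ('true', 't', 'yes', '1'),
help='weighted or unweighted graph')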
#-------------------------------------------------method settings-----------------------------------------------------------
parser.add_argument('--method', default='abrw', choices=['node2vec', 'deepwalk', 'line', 'gcn', 'grarep', 'tadw',
'abrw', 'asne', 'aane', 'attrpure', 'attrcomb', 'graphsage'],
help='choices of Network Embedding methods')
parser.add_argument('--ABRW-topk', default=30, type=int,
help='for each node, keep its top-k most attribute-similar nodes; ranging [0, # of nodes]')
parser.add_argument('--ABRW-alpha', default=0.8, type=float,
help='balance struc and attr info; ranging [0, 1]')
parser.add_argument('--TADW-lamb', default=0.2, type=float,
help='balance struc and attr info; ranging [0, inf]')
parser.add_argument('--AANE-lamb', default=0.05, type=float,
help='balance struc and attr info; ranging [0, inf]')
parser.add_argument('--AANE-rho', default=5, type=float,
help='penalty parameter; ranging [0, inf]')
parser.add_argument('--AANE-mode', default='comb', type=str,
help='choices of mode: comb, pure')
parser.add_argument('--ASNE-lamb', default=1.0, type=float,
help='balance struc and attr info; ranging [0, inf]')
parser.add_argument('--AttrComb-mode', default='concat', type=str,
help='choices of mode: concat, elementwise-mean, elementwise-max')
parser.add_argument('--Node2Vec-p', default=0.5, type=float,
help='return parameter p; trade-off BFS and DFS; grid search [0.25; 0.50; 1; 2; 4]')
parser.add_argument('--Node2Vec-q', default=0.5, type=float,
help='in-out parameter q; trade-off BFS and DFS; grid search [0.25; 0.50; 1; 2; 4]')
parser.add_argument('--GraRep-kstep', default=4, type=int,
help='use k-step transition probability matrix')
parser.add_argument('--LINE-order', default=3, type=int,
help='choices of the order(s): 1 for 1st order, 2 for 2nd order, 3 for 1st+2nd order')
parser.add_argument('--LINE-no-auto-save', action='store_true',
help='do not save the best embeddings when training LINE')
parser.add_argument('--LINE-negative-ratio', default=5, type=int,
help='the negative ratio')
#for walk based methods; some Word2Vec SkipGram parameters are not specified here
parser.add_argument('--number-walks', default=10, type=int,
help='# of random walks of each node')
parser.add_argument('--walk-length', default=80, type=int,
help='length of each random walk')
parser.add_argument('--window-size', default=10, type=int,
help='window size of skipgram model')
parser.add_argument('--workers', default=24, type=int,
help='# of parallel processes.')
#for deep learning based methods; parameters about layers and neurons used are not specified here
parser.add_argument('--learning-rate', default=0.001, type=float,
help='learning rate')
parser.add_argument('--batch-size', default=128, type=int,
help='batch size')
parser.add_argument('--epochs', default=100, type=int,
help='epochs')
parser.add_argument('--dropout', default=0.5, type=float,
help='dropout rate (1 - keep probability)')
parser.add_argument('--weight-decay', type=float, default=0.0001,
help='weight for L2 loss on embedding matrix')
args = parser.parse_args()
return args
def main(args):
g = Graph() #see graph.py for commonly-used APIs and use g.G to access NetworkX APIs
print('\nSummary of all settings: ', args)
#---------------------------------------STEP1: load data-----------------------------------------------------
print('\nSTEP1: start loading data......')
t1 = time.time()
#load graph structure info------
if args.graph_format == 'adjlist':
g.read_adjlist(path=args.graph_file, directed=args.directed)
elif args.graph_format == 'edgelist':
g.read_edgelist(path=args.graph_file, weighted=args.weighted, directed=args.directed)
#load node attribute info------
is_ane = args.method in ('abrw', 'tadw', 'gcn', 'graphsage', 'attrpure', 'attrcomb', 'asne', 'aane')
if is_ane:
assert args.attribute_file != ''
g.read_node_attr(args.attribute_file)
#load node label info------
#to do... similar to attribute {'key_attribute': value}, label also loaded as {'key_label': value}
t2 = time.time()
print('STEP1: end loading data; time cost: {:.2f}s'.format(t2-t1))
#---------------------------------------STEP2: prepare data----------------------------------------------------
print('\nSTEP2: start preparing data for link pred task......')
t1 = time.time()
test_node_pairs=[]
test_edge_labels=[]
if args.task == 'lp' or args.task == 'lp_and_nc':
edges_removed = g.remove_edge(ratio=args.link_remove)
test_node_pairs, test_edge_labels = generate_edges_for_linkpred(graph=g, edges_removed=edges_removed, balance_ratio=1.0)
t2 = time.time()
print('STEP2: end preparing data; time cost: {:.2f}s'.format(t2-t1))
#-----------------------------------STEP3: upstream embedding task-------------------------------------------------
print('\nSTEP3: start learning embeddings......')
print('the graph: ', args.graph_file, '\nthe # of nodes: ', g.get_num_nodes(),
'\nthe # of edges used during embedding (edges may be removed if lp task): ', g.get_num_edges(),
'\nthe # of isolated nodes: ', g.get_num_isolates(), '\nis directed graph: ', g.get_isdirected(), '\nthe model used: ', args.method)
t1 = time.time()
model = None
if args.method == 'abrw':
model = abrw.ABRW(graph=g, dim=args.dim, alpha=args.ABRW_alpha, topk=args.ABRW_topk, num_paths=args.number_walks,
path_length=args.walk_length, workers=args.workers, window=args.window_size)
elif args.method == 'attrpure':
model = attrpure.ATTRPURE(graph=g, dim=args.dim)
elif args.method == 'attrcomb':
model = attrcomb.ATTRCOMB(graph=g, dim=args.dim, comb_with='deepwalk',
num_paths=args.number_walks, comb_method=args.AttrComb_mode) #concat, elementwise-mean, elementwise-max
elif args.method == 'asne':
if args.task == 'nc':
model = asne.ASNE(graph=g, dim=args.dim, alpha=args.ASNE_lamb, epoch=args.epochs, learning_rate=args.learning_rate, batch_size=args.batch_size,
X_test=None, Y_test=None, task=args.task, nc_ratio=args.label_reserved, lp_ratio=args.link_reserved, label_file=args.label_file)
else:
#X_test_lp and Y_test_lp were never defined in this file; pass the lp test set prepared in STEP2 instead
model = asne.ASNE(graph=g, dim=args.dim, alpha=args.ASNE_lamb, epoch=args.epochs, learning_rate=args.learning_rate, batch_size=args.batch_size,
X_test=test_node_pairs, Y_test=test_edge_labels, task=args.task, nc_ratio=args.label_reserved, lp_ratio=args.link_reserved, label_file=args.label_file)
elif args.method == 'aane':
model = aane.AANE(graph=g, dim=args.dim, lambd=args.AANE_lamb, mode=args.AANE_mode)
elif args.method == 'tadw':
model = tadw.TADW(graph=g, dim=args.dim, lamb=args.TADW_lamb)
elif args.method == 'deepwalk':
model = node2vec.Node2vec(graph=g, path_length=args.walk_length,
num_paths=args.number_walks, dim=args.dim,
workers=args.workers, window=args.window_size, dw=True)
elif args.method == 'node2vec':
model = node2vec.Node2vec(graph=g, path_length=args.walk_length, num_paths=args.number_walks, dim=args.dim,
workers=args.workers, p=args.Node2Vec_p, q=args.Node2Vec_q, window=args.window_size)
elif args.method == 'grarep':
model = GraRep(graph=g, Kstep=args.GraRep_kstep, dim=args.dim)
elif args.method == 'line':
if args.label_file and not args.LINE_no_auto_save:
model = line.LINE(g, epoch = args.epochs, rep_size=args.dim, order=args.LINE_order,
label_file=args.label_file, clf_ratio=args.label_reserved)
else:
model = line.LINE(g, epoch = args.epochs, rep_size=args.dim, order=args.LINE_order)
elif args.method == 'graphsage':
model = graphsageAPI.graphsage_unsupervised_train(graph=g, graphsage_model = 'graphsage_mean')
#we follow the default parameters, see __init__.py in graphsage file
#choices: graphsage_mean, gcn ......
#model.save_embeddings(args.emb_file) #to do...
elif args.method == 'gcn':
model = graphsageAPI.graphsage_unsupervised_train(graph=g, graphsage_model = 'gcn') #graphsage-gcn
else:
print('no method was found...')
exit(1) #non-zero exit code so that callers can detect the failure
'''
elif args.method == 'gcn': #OR use graphsage-gcn as in graphsage method...
assert args.label_file != '' #must have node label
assert args.feature_file != '' #different from previous ANE methods
g.read_node_label(args.label_file) #gcn is an end-to-end supervised ANE method
model = gcnAPI.GCN(graph=g, dropout=args.dropout,
weight_decay=args.weight_decay, hidden1=args.hidden,
epochs=args.epochs, clf_ratio=args.label_reserved)
#gcn does not have model.save_embeddings() func
'''
if args.save_emb:
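emb_path = args.emb_file + time.strftime(' %Y%m%d-%H%M%S', time.localtime()) #timestamp suffix keeps earlier runs from being overwritten
model.save_embeddings(emb_path)
print('Save node embeddings in file: ', emb_path) #report the actual file name, including the timestamp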
t2 = time.time()
print('STEP3: end learning embeddings; time cost: {:.2f}s'.format(t2-t1))
#---------------------------------------STEP4: downstream task-----------------------------------------------
print('\nSTEP4: start evaluating ......: ')
print('selected task(s) (nc: node classification; lp: link prediction): ', args.task)
t1 = time.time()
if args.method != 'semi_supervised_gcn': #except semi-supervised methods, we will get emb first, and then eval emb
vectors = 0
if args.method == 'graphsage' or args.method == 'gcn': #to do... run without this 'if'
vectors = model
else:
vectors = model.vectors #for other methods....
del model, g
#------lp task
if args.task == 'lp' or args.task == 'lp_and_nc':
#X_test_lp, Y_test_lp = read_edge_label(args.label_file) #enable this if you want to load your own lp testing data, see classify.py
print('During embedding we have used {:.2f}% links and the remaining will be left for lp evaluation...'.format((1-args.link_remove)*100))
clf = lpClassifier(vectors=vectors) #similarity/distance metric as clf; basically, lp is a binary clf problem
clf.evaluate(test_node_pairs, test_edge_labels)
#------nc task
if args.task == 'nc' or args.task == 'lp_and_nc':
X, Y = read_node_label(args.label_file)
print('Training nc classifier using {:.2f}% node labels...'.format(args.label_reserved*100))
clf = ncClassifier(vectors=vectors, clf=LogisticRegression()) #use Logistic Regression as clf; we may choose SVM or more advanced ones
clf.split_train_evaluate(X, Y, args.label_reserved)
t2 = time.time()
print('STEP4: end evaluating; time cost: {:.2f}s'.format(t2-t1))
if __name__ == '__main__':
#random.seed(2018)
#np.random.seed(2018)
main(parse_args())
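Boolean options such as `--save-emb`, `--directed` and `--weighted` receive their value as text on the command line. Note that `bool('False')` is `True` in Python, so a plain `type=bool` cannot switch such an option off. A minimal sketch of an explicit converter follows; `str2bool` and `build_parser` are hypothetical names used only for this illustration.

```python
from argparse import ArgumentParser, ArgumentTypeError


def str2bool(value):
    """Hypothetical helper: map common true/false spellings to a bool."""
    text = str(value).strip().lower()
    if text in ('true', 't', 'yes', 'y', '1'):
        return True
    if text in ('false', 'f', 'no', 'n', '0'):
        return False
    raise ArgumentTypeError('expected a boolean value, got {!r}'.format(value))


def build_parser():
    """Parser with a single boolean option, mirroring --save-emb."""
    parser = ArgumentParser()
    parser.add_argument('--save-emb', default=False, type=str2bool,
                        help='save emb to disk if True')
    return parser


if __name__ == '__main__':
    parser = build_parser()
    print(parser.parse_args(['--save-emb', 'True']).save_emb)
    print(parser.parse_args(['--save-emb', 'False']).save_emb)
    print(bool('False'))  # any non-empty string is truthy
```

An alternative design is `action='store_true'`, as already used for `--LINE-no-auto-save`, which removes the value entirely but changes the command line shown in the module docstring.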

63
src/vis.py Normal file

@ -0,0 +1,63 @@
import pandas as pd
import tensorflow as tf
import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import os
def read_node_label(filename):
with open(filename, 'r') as f:
node_label = {} #dict
for l in f.readlines():
vec = l.split()
node_label[int(vec[0])] = ' '.join(vec[1:]) #plain label text; str(vec[1:]) would store the list repr such as "['3']"
return node_label
def read_node_emb(filename):
with open(filename, 'r') as f:
node_emb = {} #dict
next(f) #skip the header line: num_of_nodes, dim
for l in f.readlines():
vec = l.split()
node_emb[int(vec[0])] = [float(i) for i in vec[1:]]
return node_emb
# load the node label and saved embeddings
label_file = './data/cora/cora_label.txt'
emb_file = './emb/abrw.txt'
label_dict = read_node_label(label_file)
emb_dict = read_node_emb(emb_file)
if label_dict.keys() != emb_dict.keys():
print('ERROR: node ids in the label file and the embedding file do not match; please check again')
exit(1)
#embeddings = np.array([i for i in emb_dict.values()], dtype=np.float32)
embeddings = np.array([emb_dict[i] for i in sorted(emb_dict.keys(), reverse=False)], dtype=np.float32)
labels = [label_dict[i] for i in sorted(label_dict.keys(), reverse=False)]
# save embeddings and labels
os.makedirs('emb/log', exist_ok=True) # to_csv does not create missing directories
emb_df = pd.DataFrame(embeddings)
emb_df.to_csv('emb/log/embeddings.tsv', sep='\t', header=False, index=False)
lab_df = pd.Series(labels, name='label')
lab_df.to_frame().to_csv('emb/log/node_labels.tsv', header=False, index=False)
# save tf variable
embeddings_var = tf.Variable(embeddings, name='embeddings')
sess = tf.Session()
saver = tf.train.Saver([embeddings_var])
sess.run(embeddings_var.initializer)
saver.save(sess, os.path.join('emb/log', "model.ckpt"), 1)
# configure tf projector
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embeddings'
embedding.metadata_path = 'node_labels.tsv'
projector.visualize_embeddings(tf.summary.FileWriter('emb/log'), config)
# type "tensorboard --logdir=emb/log" in CMD and have fun :)
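The alignment step in `vis.py` (sorting both dictionaries by node id so that row i of the embeddings matches row i of the labels) can be sketched with the standard library alone. This is an illustration under assumptions: the embedding file is taken to start with a whitespace-separated `num_of_nodes dim` header, and `align` plus the in-memory sample data are hypothetical.

```python
import io


def read_node_emb(handle):
    """Parse 'node_id v1 v2 ...' rows that follow a 'num_of_nodes dim' header line."""
    header = next(handle).split()
    num_nodes, dim = int(header[0]), int(header[1])
    node_emb = {}
    for line in handle:
        parts = line.split()
        if not parts:  # tolerate blank lines
            continue
        node_emb[int(parts[0])] = [float(x) for x in parts[1:]]
    return num_nodes, dim, node_emb


def align(node_emb, node_label):
    """Return embeddings and labels as parallel lists ordered by node id."""
    if node_emb.keys() != node_label.keys():
        raise ValueError('node ids in the embedding and label files do not match')
    order = sorted(node_emb)
    return [node_emb[i] for i in order], [node_label[i] for i in order]


if __name__ == '__main__':
    sample = io.StringIO('3 2\n2 0.5 0.6\n0 0.1 0.2\n1 0.3 0.4\n')
    num_nodes, dim, emb = read_node_emb(sample)
    labels = {0: 'x', 1: 'y', 2: 'x'}
    vectors, ordered_labels = align(emb, labels)
    print(num_nodes, dim, vectors, ordered_labels)
```

Raising an exception on mismatched ids, rather than printing and exiting, is a design choice of this sketch that makes the check reusable from other scripts.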