## Twanvl.nl

**Machine learning for network data**

ter verkrijging van de graad van doctor aan de Radboud Universiteit Nijmegen, op gezag van de rector magnificus prof. dr. Th.L.M. Engelen, volgens besluit van het college van decanen in het openbaar te verdedigen op maandag 9 februari 2015 om 14:30 uur precies Twan van Laarhoven geboren op 1 februari 1985 Prof. dr. Tom Heskes Dr. Elena Marchiori Prof. dr. Lutgarde Buydens (voorzitter)Prof. dr. Marcel ReindersDr. Jean-Philippe Vert SIKS Dissertation Series No. TODO The research reported in this thesis has been carried out under the auspicesof the Dutch Research School for Information and Knowledge Systems (SIKS),and the institute for Computing and Information Science (iCIS) of the RadboudUniversity Nijmegen.

This research was supported by NWO grant Graphs for Multi Task Learning(612.066.927).

2014 Twan van Laarhoven Networks are a convenient and flexible representation for many datasets. Net-work data often concerns interactions or relations. These can be relations be-tween people, such as being friend; connections between neurons in the brain;or simply the notion of two things being similar to each other. But also the inter-actions between proteins, drugs, etc. can be considered as a network. Figure gives an illustration of the different kinds of network datasets that are used inthis thesis. Many more networks exist.

(a) Zachary's karate club. The nodes are people, (b) A subset of a protein–protein interaction net-and edges are associations.

work. Nodes are proteins, edges indicate physi-cal interactions.

(c) A drug–target interaction network. Square (d) A nearest neighbor network for a subset ofnodes are drugs, circles are target proteins. the MNIST digit classification dataset. Nodes areEdges indicate interaction.

digit images, and there is an edge between eachimage and its two nearest neighbors.

Figure 1.1: An illustration of different network datasets.

Chapter 1. Introduction In general, a network (also called a graph), consists of a set of nodes and a set of connections between pairs of nodes. These connections, called edges, canbe directed or undirected, and they may be associated with a weight. Sometimesthere is additional structure, for example extra information about the individualnodes.

Given a network, there are several machine learning questions that can be asked. What sets these machine learning problems apart from traditional graphtheory is that they involve some kind of inference. The goal is to look beyondthe network at hand, and discover something about the underlying structure. Ofcourse, this relies on the assumption that there is a certain underlying structure.

If this assumption is wrong, then it may be impossible to say anything aboutthe network. Additionally, we can never get certainty about the results, and wehave to settle for very likely ones.

In this thesis we look at two different such machine learning problems on In we look at finding clusters in networks. Clustering is a prototypical unsupervised machine learning problem. The goal is to split a dataset intoclusters, where the elements of the clusters are similar in some sense. In oursetting, the elements of the dataset are the nodes in a network, and similarity isdefined by the edges of that network.

When performing clustering, the underlying assumption is that there actu- ally is a cluster structure in the network. Many networks have a cluster structureat certain scales, but there are also networks without well-defined clusters.

We will use an approach based on the optimization of a quality function. We will not consider how to perform this optimization problem. Instead, this partof the thesis focuses on the more fundamental issue of choosing the right qualityfunction to optimize in the first place.

In we will look at predicting links in bipartite networks. A bipartite network is one where the nodes can be split into two sets, and the only edgesare between nodes in different sets. Many biological networks take this form,in particular the interaction network of drugs and target proteins which weconsider in this part of the thesis.

A peculiar aspect of this interaction network is that only positive interactions is known. That is, if there is an edge between two nodes, this means that theyare know that to interact. But if there is no edge, then it does not mean thatthere is no interaction, instead, it can mean that this interaction has not beenexperimentally tested.

So, there might be many missing links in the network. The goal of link prediction is to infer the likely missing links, so that experimentalists know which interactions they should test. This can be seen as a more traditionalsupervised machine learning problem, where task is to predict whether thereis an edge between any two nodes, based on features of the nodes and of thepotential edge. The training data are then the existing edges in the network inaddition to side information about the nodes.

To be able to perform this link prediction we rely on the assumption that ‘similar' nodes interact with the same or similar sets other nodes. In the case ofdrug–target interaction this is a sensible assumption. These interactions happenbecause a drug compound can attach to a protein, which depends on the three-dimensional shape of (part of) the drug molecule. And drugs with a similarchemical structure will likely have a similar shape.

**Network clustering with quality**

**What is network clustering?**

In this first part of the thesis we look at finding clusters in networks, and inparticular at an approach based on quality functions.

Clustering is the task of finding clusters in a dataset. Informally, a cluster is a set of objects that are close to each other. In the network setting, we caninterpret closeness by the connections: a cluster is a set of tightly connectednodes. Clustering is an unsupervised machine learning problem. That meansthat no labels or other annotations are used. The only input is the network itself.

This problem goes by various names: graph partitioning, graph clustering, network clustering and community detection. The former names come froma computer science field, and focus on the abstract properties of graphs. Onthe other hand, the term community detection comes from the social sciences.

In that field scientists have tried to find structure in the behavior of groups ofpeople and social animals. People form communities, groups of people that allknow each other, and chiefly interact among themselves.

Even when objects are described by a set of attributes, the clustering problem can be transformed into a graph clustering task. This is done by constructing asimilarity graph, for instance the k-nearest neighbor graph (see e.g. Brito et al.,Franti, Virmajoki, and Hautamaki, The observant reader may notice that we haven't yet defined what a ‘cluster' actually is, so far a ‘cluster' is an abstract thing. In general there are severaldifferent notions of cluster and therefore of clustering. The simplest is what wecall hard clustering. In hard clustering the goal is graph partitioning. That is, aclustering of a graph is a partition of its nodes. The clusters in such a clustering 1This is not the only possible way to cluster networks. An alternative is to consider two nodes to be close if they connect to similar sets of other nodes. But we will not treat that interpretation inthis work.

Chapter 2. What is network clustering? are non-overlapping sets nodes, and each node is in exactly one cluster.

The alternative is what we call soft clustering, where nodes can belong to more than one cluster, memberships can be fuzzy, and clusters can overlap. Wewill save the treatment of soft clustering until Chapter and work in the hardclustering setting until then.

Even if we know what a clustering is, it doesn't yet tell us what a good clus- tering is. In contrast to classification, there is no clear definition of the qualityof a clustering. One might define a quality function, which gives a score to anyclustering of a graph. But this just shifts the problem to deciding what a goodquality function is.

Several different such quality functions have been defined in the literature, and graph clustering is often performed by directly optimizing one of them. The‘optimal' clustering is then one that maximizes the qualityFinding this opti-mal clustering is a computationally intractable discrete optimization problem(at least for the quality functions for which hardness is known, see for instance(Brandes et al., Fortunato, Schaeffer, Therefore all effectiveand scalable methods are based on heuristic and approximate techniques. Werefer the reader to the surveys on this topic, e.g., Fortunato von Luxburgand Schaeffer

**Applications of network clustering**

As mentioned in the introduction, one natural application of network clusteringis in finding communities in social networks. A classic example of such a clus-tering is Zachary's karate club (Zachary, In this network the nodes are themembers of a karate club, and edges indicate their friendships. A conflict led tothis club splitting in two, which we can take to be a clustering of the network.

Nowadays, online social networks exist with millions of nodes and orders of magnitude more edges. There has been a lot of interest from social scientists inanalyzing these networks.

Similar to these social networks are the network of links on the world wide web, or the network of citations between scientific articles or authors. In thesenetworks it is possible to find communities based on areas of common interest.

Another application area is in brain connectivity. Using Magnetic Resonance Imaging (MRI), more and more detailed brain connectivity networks are beingdeveloped. In these networks the nodes are small areas of the brain, typically thevoxels observed by the MRI scanner. These voxels still contain many neurons, 2Throughout this thesis we will use maximization. Some objective functions are traditionally minimized, in those cases we negate them.

2.2. Issues with clustering and so they have no real physical meaning. But at a slightly larger scale, tightlyconnected brain regions are thought to perform a common function. Networkclustering can be used to find these regions.

At a smaller scale, there are biological interaction networks. In these net- works the nodes are things like proteins or drugs, and edges indicate knowninteractions. In human cells, proteins can form complexes, which are clusters ofdifferent proteins that physically interact to perform a common function. It ishard to observe these complexes directly, so network clustering methods can beused to find them.

Besides these kinds of data that are naturally described as networks, there are many datasets that consist of tables of objects with attributes. Even in this case,the clustering problem can be transformed into a network clustering task. Thisis done by constructing a similarity graph, for instance the k-nearest neighborgraph (see e.g. Brito et al., Franti, Virmajoki, and Hautamaki, As thename suggests, in this nearest neighbor network each node is connected to the kother nodes that are most similar to it, according to some similarity metric. Anadvantage of this network transformation compared to directly clustering thedata, is that the similar metric only has to be accurate locally. This allows oneto find clusters that have a very complicated shape in the original data space, aslong as they are sufficiently separated.

**Issues with clustering**

A recurring problem for any kind of clustering is determining the number ofclusters. For some clustering methods the number of clusters is an explicit pa-rameter, usually called When given a network, these methods will find kclusters, no more and no less.

Other methods can automatically determine the number of clusters, by se- lecting the clustering with the highest quality among all possible clusterings. Itmight seem like this neatly solves the problem of determining the number ofclusters. But this is not necessarily true.

For instance, the ‘modularity' quality function has a so called resolution limit (Fortunato and Barthélemy, Optimizing this quality function we willnever be able to observe small communities, and as the total network becomeslarger, so does the size of the smallest observable cluster. Other quality functionshave a similar resolution bias, either towards large clusters or towards smallones.

3This is a different parameter from the k in the k-nearest neighbor graph.

Chapter 2. What is network clustering? Real world networks do not describe some grand unchanging truth about the world. Rather, they contain a sample taken at a specific time, based on noisymeasurements. So there might be missing edges, as well as incorrect edges. Forexample, edges in a protein interactions are determined by a statistical signifi-cance threshold, interactions that fall just below this threshold are not included.

New experiments can change the network. Or think of a social network, wherewho is friends with who can change all the time.

Additionally, in many cases the set of nodes in the network is just a subset of all possible nodes. For example, only the proteins measured so far are includedin most interaction networks, and citation networks are often restricted to acertain research areas.

This uncertainty about networks has an important implication for network clustering. It means that methods should be robust, that is to say, not sensitiveto these errors. Ideally, a small change to the network should result in only asmall change to the clustering. Similarly, considering a larger or different subsetof nodes, should not

**Outline of this first part of the thesis**

In

**Chapter**we try to make it more precise what good graph clustering quality

functions are. We do this by investigating properties that intuitively ought to be

satisfied by any graph clustering quality function. Two axioms tailored for graph

clustering quality functions are introduced, and the four axioms introduced in

previous work on distance based clustering are reformulated and generalized

for the graph setting. We show that modularity, a standard quality function

for graph clustering, does not satisfy all of these six properties. This motivates

the derivation of a new family of quality functions, adaptive scale modularity,

which does satisfy the proposed axioms. Adaptive scale modularity has two

parameters, which give greater flexibility in the kinds of clusterings that can be

found. Standard graph clustering quality functions, such as normalized cut and

unnormalized cut, are obtained as special cases of adaptive scale modularity.

Next we take a more experimental angle. There are not many real world datasets for which a ground truth clustering is know. But we can use a randommodel to generate graphs with a known clustering structure. Results of a recentcomparative experimental assessment of methods applied to such benchmarkgraphs indicate that the two best methods use different quality functions, but asimilar local search-based optimization procedure, called the Louvain method.

This observation motivates the following research question: given the Louvainoptimization procedure, how much does the choice of the quality function in-fluence the results, and in what way? 2.3. Outline of this first part of the thesis In

**Chapter**we address this question empirically in a broad graph clustering

context, that is, when graphs are either given as such or are k-nearest neighborgraphs generated from a given (classification) dataset. We consider normalizedcut, modularity, and infomap; as well as two new objective functions. We showthat all these objective functions have a resolution bias, that is, they tend to prefereither small or large clusters. When removing this bias, by forcing the objectivefunction to generate a given number of clusters, The Louvain method achievessimilar performance across the considered objective functions on benchmarknetworks with built-in community structure. These results indicate that theresolution bias is the most important difference between objective functions ingraph clustering when using the Louvain method.

In

**Chapter**we apply these new methods to a concrete network clustering

problem. In particular, we experimentally analyze the performance of the clus-tering algorithms when applied to protein-protein interaction (PPI) networks.

We use publicly available yeast protein networks from past studies, as well asthe present BioGRID database. Furthermore, to test robustness of the methods,various types of randomly perturbed networks obtained from the BioGRID dataare also considered. Results of extensive experiments show improved or com-petitive performance over MCL, a state-of-the-art algorithm for complex detec-tion in PPI networks, in particular on BioGRID data, where the w-log-v methodintroduced in Chapter obtains excellent accuracy and robustness performance.

In

**Chapter**we gradually shift our focus from hard clustering over to soft

clustering. We will look at so called resolution-limit-free quality functions donot suffer from a resolution limit. This property was previously introduced forhard clustering, that is, graph partitioning. In this chapter we investigate theresolution-limit-free property in the context of Non-negative Matrix Factoriza-tion (NMF) for hard and soft graph clustering. To use NMF in the hard cluster-ing setting, a common approach is to assign each node to its highest membershipcluster. We show that in this case symmetric NMF is not resolution-limit-free,but that it becomes so when hardness constraints are used as part of the opti-mization. In soft clustering, nodes can belong to more than one cluster, withvarying degrees of membership. In this setting resolution-limit-free turns outto be too strong a property. Therefore we propose locality, which states thatchanging one part of the graph does not affect the clustering of other parts ofthe graph. We argue that this is a desirable property, provide conditions un-der which NMF quality functions are local, and propose a novel class of localprobabilistic NMF quality functions.

**Axioms for hard clustering**

We investigate properties that intuitively ought to be satisfied by graph clustering qual-ity functions, i.e. functions that assign a score to a clustering of a graph. Graph cluster-ing, also known as network community detection, is often performed by optimizing sucha function. Two axioms tailored for graph clustering quality functions are introduced,and the four axioms introduced in previous work on distance based clustering are re-formulated and generalized for the graph setting. We show that modularity, a standardquality function for graph clustering, does not satisfy all of these six properties. Thismotivates the derivation of a new family of quality functions, adaptive scale modularity,which does satisfy the proposed axioms. Adaptive scale modularity has two parameters,which give greater flexibility in the kinds of clusterings that can be found. Standardgraph clustering quality functions, such as normalized cut and unnormalized cut, areobtained as special cases of adaptive scale modularity.

In general, the results of our investigation indicate that the considered axiomatic framework covers existing ‘good' quality functions for graph clustering, and can be usedto derive an interesting new family of quality functions.

Following the work by Kleinberg there have been various contributions tothe theoretical foundation and analysis of clustering, such as axiomatic frame-works for quality functions (Ackerman and Ben-David, for criteria to com-pare clusterings (Meila, uniqueness theorems for specific types of clus-tering (Zadeh and Ben-David, Ackerman, Ben-David, and Loker, Carlsson et al., taxonomy of clustering paradigms (Ackerman, Ben-David, This chapter is based on van Laarhoven and Marchiori "Axioms for Graph Clustering Quality Functions", published in the Journal of Machine Learning Research.

Chapter 3. Axioms for hard clustering and Loker, and characterization of diversification systems (Gollapudiand Sharma, Kleinberg focused on clustering functions, which are functions from a dis- tance function to a clustering. He showed that there are no clustering functionsthat simultaneously satisfy three intuitive properties: scale invariance, consis-tency and richness. Ackerman and Ben-David continued on this work,and showed that the impossibility result does not apply when formulating theseproperties in terms of quality functions instead of clustering functions, whereconsistency is replaced with a weaker property called monotonicity.

Both of these previous works are formulated in terms of distance functions over a fixed domain. In this chapter we focus on weighted graphs, where theweight of an edge indicates the strength of a connection. The clustering problemon graphs is also known as network community detection.

Graphs provide additional freedoms over distance functions. In particular, it is possible for two points to be unrelated, indicated by a weight of 0. Thesezero-weight edges in turn make it natural to consider graphs over different setsof nodes as part of a larger graph. Secondly, we can allow for self loops. Selfloops can indicate internal edges in a node. This notation is used for instanceby Blondel et al. where a graph is contracted based on a fine-grainedclustering.

In this setting, where edges with weight 0 are possible, Kleinberg's impos- sibility result does not apply. This can be seen by considering the connectedcomponents of a graph. This is a graph clustering function that satisfies allthree of Kleinberg's axioms: scale invariance, consistency and richness (see Sec-tion Our focus is on the investigation of graph clustering quality functions, which are functions from a graph and a clustering to a real number ‘quality'. A no-table example is modularity (Newman and Girvan, In particular we askwhich properties of quality functions intuitively ought to hold, and which areoften assumed to hold when reasoning informally about graph clustering. Suchproperties might be called axioms for graph clustering.

The rest of this chapter is organized as follows: Section gives basic defi- nitions. Next, Section discusses different ways in which properties could beformulated.

In Section we propose an axiomatic framework that consists of six proper- ties of graph clustering quality functions: the (adaption of) the four axioms fromKleinberg and Ackerman and Ben-David (permutation invariance,scale invariance, richness and monotonicity); and two additional properties spe-cific for the graph setting (continuity and the weak locality).

3.1. Introduction Then, in Section we show that modularity does not satisfy the mono- tonicity and weak locality properties.

This result motivates the analysis of variants of modularity, leading to the derivation of a new parametric quality function in Section that satisfies allproperties. This quality function, which we call adaptive scale modularity, hastwo parameters, M and γ which can be tuned to control the resolution of theclustering. We show that quality functions similar to normalized cut and un-normalized cut are obtained in the limit when M goes to zero and to infinity,respectively. Furthermore, setting γ to 0 yields a parametric quality functionsimilar to that proposed by Reichardt and Bornholdt

**Related work**

Previous axiomatic studies of clustering quality functions have focused mainlyon hierarchical clustering and on weakest and strongest link style quality func-tions (Kleinberg, Ackerman and Ben-David, Zadeh and Ben-David,Carlsson et al., Papers in this line of work that focused also onthe partitional setting include Puzicha, Hofmann, and Buhmann, Acker-man, Ben-David, Brânzei, et al., Ackerman, Ben-David, Loker, and Sabato,Puzicha, Hofmann, and Buhmann investigated a particular classof clustering quality functions obtained by requiring the function to decomposeinto a certain additive form. Ackerman, Ben-David, Brânzei, et al. consid-ered clustering in the weighted setting, in which every data point is assigned areal valued weight. They performed a theoretical analysis on the influence ofweighted data on standard clustering algorithms. Ackerman, Ben-David, Loker,and Sabato analyzed robustness of clustering algorithms to the additionof a small set of points, and investigated the robustness of popular clusteringmethods.

All these studies are framed in terms of distance (or similarity and dissimi- larity) functions.

Bubeck and von Luxburg studied statistical consistency of clustering methods. They introduced the so-called nearest neighbor clustering and showedits consistency also for standard graph based quality functions, such as normal-ized cut, ratio cut, and modularity. Here we do not focus on properties ofmethods to optimize clustering quality, but on natural properties that qualityfunctions for graph clustering should satisfy.

Related works on graph clustering quality functions mainly focus on the so- called resolution limit, that is, the tendency of a quality function to prefer eithersmall or large clusters. In particular, Fortunato and Barthélemy provedthat modularity may not detect clusters smaller than a scale which depends on Chapter 3. Axioms for hard clustering the total size of the network and on the degree of interconnectedness of theclusters.

To mitigate the resolution limit phenomenon, the quality function may be extended with a so-called resolution parameter. For example, Reichardt andBornholdt proposed a formulation of graph clustering (therein called net-work community detection) based on principles from statistical mechanics. Thisinterpretation leads to the introduction of a family of quality functions with aparameter that allows to control the clustering resolution. In Section wewill show that this extension is a special case of adaptive scale modularity.

Traag, Van Dooren, and Nesterov formalized the notion of resolution- free quality functions, that is, not suffering from the resolution limit, and pro-vided a characterization of this class of quality functions. Their notion is essen-tially an axiom, and we will discuss the relation to our axioms in Section

**Definitions and notation**

Throughout this part of the thesis we will be talking about graphs. We willrestrict ourselves to undirected graph, but we do allow weighted edges. For-mally a (weighted) graph is a pair (V, A) of a finite set V of nodes and a functionA : V × V → R>0 of edge weights. For compactness we view A as an adjacencymatrix, and write aij = A(i, j). Edges with larger weights represent strongerconnections, so aij = 0 means that there is no edge between nodes i and j. Notethat this is the opposite of the convention used in distance based clustering. Weexplicitly allow for self loops, that is, nodes for which aii > 0.

all i, j ∈ V 0. Conversely, given a set V 0 ⊆ V, the subgraph of G = (V, A) inducedby V is the subgraph of G who's set of nodes is V 0. That is, (V 0, A0), where A0is the same as A restricted to the nodes in V 0.

If G0 = (V 0, A0) is a subgraph of G = (V, A), then the neighborhood of G0 is the set of nodes in V that have an edge to some node in V 0, together with thenodes in V. That is, neighborhood(G0, G) = V 0 ∪ {i ∈ V ∃j∈V 0(aij > 0)}. We callthe subgraph of G induced by the neighborhood of G0 the neighborhood graph ofG0 in G.

If G0 is a common subgraph of both G1 and G2, and the neighborhood graph of G0 in G1 is equal to the neighborhood graph of G0 in G2, then we say that G0is a common subgraph with consistent neighborhood.

A (hard) clustering C of a graph G = (V, A) is a partition of its nodes. That is, S C = V and for all c1, c2 ∈ C, c1 ∩ c2 6= ∅ if and only if c1 = c2. When twonodes i and j are in the same cluster in clustering C, i.e. when i, j ∈ c for somec ∈ C, then we write i ∼C j. Otherwise we write i 6∼C j.

3.3. On the form of axioms A clustering C is a refinement of a clustering D, written C v D, if for every cluster c ∈ C there is a cluster d ∈ D such that c ⊆ d.

A graph clustering quality function (or objective function) Q is a function from graphs G and clusterings of G to real numbers. We adopt the convention thata higher quality indicates a ‘better' clustering. As a generalization, we willsometimes work with parameterized families of quality functions. A single qualityfunction can be seen as a family with no parameters.

An edge between two nodes in the same cluster is called a within cluster edge, an edge between nodes in different clusters is a between cluster edge.

The strength of a node is the sum of weights of all edges incident to it, si(G) = ij. For unweighted graphs, the strength of a node is equal to its degree.

The volume of a cluster c is the sum of degrees of nodes in c, vc(G) = ij. The total volume of a graph is vV (G). For an unweighted undi- rected graph this is equal to twice the number of edges.

The within cluster weight is the total weight of within cluster edges, wc(G) = ij. For readability we leave the argument G in v and w implicit when it is clear from context.

**On the form of axioms**

There are three different ways to state potential axioms for clustering: 1. As a property of clustering functions, as in Kleinberg, For example, scale invariance of a clustering function ˆ C would be written as " ˆ C(αG), for all graphs G, α > 0". I.e. the optimal clustering is invariantunder scaling of edge weights.

2. As a property of the values of a quality function Q, as in Ackerman and Ben-David For example "Q(G, C) = Q(αG, C), for all graphs G, allclustering C of G, and α > 0". I.e. the quality is invariant under scaling ofedge weights.

3. As a property of the relation between qualities of different clustering, or equivalently, as a property of an ordering of clusterings for a particulargraph. For example "Q(G, C) > Q(G, D) ⇒ Q(αG, C) > Q(αG, D)".I.e.

the ‘better than' relation for clusterings is invariant under scaling of edgeweights.

The third form is slightly more flexible than the other two. Any quality function that satisfies a property in the second style will also satisfy the cor-responding property in the third style, but the converse is not true. Note also Chapter 3. Axioms for hard clustering that if D is not restricted in a property in the third style, then one can take C(G) = argmax Q(G, C) to obtain a clustering function and an axiom in the first style.

Most properties are more easily stated and proved in the second, absolute, style. Therefore, we adopt the second style unless doing so requires us to makespecific choices.

**Axioms for graph clustering quality functions**

Kleinberg defined three axioms for distance based clustering functions. In Ack-erman and Ben-David the authors reformulated these into four axioms forclustering quality functions. These axioms can easily be adapted to the graphsetting.

The first property that one expects for graph clustering is that the quality of a clustering depends only on the graph, that is, only on the weight of edges be-tween nodes, not on the identity of nodes. We formalize this in the permutationinvariance axiom,

**Definition 3.1**(Permutation invariance)

**.**A graph clustering quality function Q is

permutation invariant if for all graphs G = (V, A) and all isomorphisms f : V → V 0,

it is the case that Q(G, C) = Q(f(G), f(C)); where f is extended to graphs and cluster-

ings by f(C) = {{f(i) i ∈ c} c ∈ C} and f((V, A)) = (V 0, (i, j) 7→ A(f−1(i), f−1(j))).

The second property, scale invariance, requires that the quality doesn't change when edge weights are scaled uniformly. This is an intuitive axiom when onethinks in terms of units: a graph with edges in "m/s" can be scaled to a graphwith edges in "km/h". The quality should not be affected by such a transfor-mation, perhaps up to a change in units.

Ackerman and Ben-David defined scale invariance by insisting that the quality stays equal when distances are scaled. In contrast, in Puzicha, Hofmann,and Buhmann the quality should scale proportional with the scaling ofdistances. We generalize both of these previous definitions by only consideringthe relations between the quality of two clusterings.

**Definition 3.2**(Scale invariance)

**.**A graph clustering quality function Q is scale

invariant if for all graphs G = (V, A), all clusterings C1, C2 of G and all constants

α > 0, Q(G, C1) 6 Q(G, C2) if and only if Q(αG, C1) 6 Q(αG, C2). Where αG =

(V, (i, j) 7→ αA(i, j)) is a graph with edge weights scaled by a factor α.

This formulation is flexible enough for single quality functions. However, families of quality functions could have parameters that are also scale depen- 3.4. Axioms for graph clustering quality functions dent. For such families we therefore propose to use as an axiom a more flexibleproperty that also allows the parameters to be scaled,

**Definition 3.3**(Scale invariant family)

**.**A family of quality function QP param-

eterized by P ∈ P is scale invariant if for all constants P ∈ P and α > 0 there

is a P 0 ∈ P such that for all graphs G = (V, A), and all clusterings C1, C2 of G,

QP(G, C1) 6 QP(G, C2) if and only if QP0(αG, C1) 6 QP0(αG, C2).

Thirdly, we want to rule out trivial quality functions. This is done by requir- ing richness, i.e. that by changing the edge weights any clustering can be madeoptimal for that quality function.

**Definition 3.4**(Richness)

**.**A graph clustering quality function Q is rich if for all sets

V and all non-trivial partitions C∗ of V, there is a graph G = (V, A) such that C∗ is the

Q-optimal clustering of V, i.e. argmax Q(G, C) = C∗.

The last axiom that Ackerman and Ben-David consider is by far the most interesting. Intuitively, we expect that when the edges within a cluster arestrengthened, or when edges between clusters are weakened, that this does notdecrease the quality. Formally we call such a change of a graph a consistentimprovement,

**Definition 3.5**(Consistent improvement)

**.**Let G = (V, A) be a graph and C a

clustering of G. A graph G0 = (V, A0) is a C-consistent improvement of G if for all

nodes i and j, a0]ij > a

ij whenever i ∼C j and a 0[ ij whenever i 6∼C j.

We say that a quality function that does not decrease under consistent im- provement is monotonic. In previous work this axiom is often called consistency.

**Definition 3.6**(Monotonicity)

**.**A graph clustering quality function Q is monotonic

if for all graphs G, all clusterings C of G and all C-consistent improvements G0 of G it

is the case that Q(G0, C) > Q(G, C).

In the graph setting it also becomes natural to look at combining differentgraphs. With distance functions this is impossible, since it is not clear what thedistance between nodes from the two different sets should be. But for graphswe can take the edge weight between nodes not in both graphs to be zero, inwhich case the two graphs are subgraphs of a larger graph with a consistentneighborhood.

Consider adding nodes to one side of a large network, then we would not want the clustering on the other side of the network to change if there is no Chapter 3. Axioms for hard clustering direct connection. For example, if a new protein is discovered in yeast, then theclustering of unrelated proteins in humans should remain the same. Similarly,we can consider any two graphs with disjoint node sets as one larger graph.

Then the quality of clusterings of the two original graphs should relate directlyto quality on the combined graph.

In general, local changes to a graph should have only local consequences to a clustering. Or in other words, the contribution of a single cluster to the totalquality should only depend on nodes in the neighborhood of that cluster.

**Definition 3.7**(Weak locality)

**.**A graph clustering quality function Q is weakly

local if for all graphs G1 = (V1, A1), G2 = (V2, A2), and common subgraphs Gs =

(Vs, As) of G1 and G2 with a consistent neighborhood, and for all clusterings Cc, Dc

of Vs, C1 of V1 Vs and C2 of V2 Vs, if Q(G1, Cc ∪ C1) > Q(G1, Dc ∪ C1) then

Q(G2, Cc ∪ C2) > Q(G2, Dc ∪ C2).

Any quality function that has a preference for a fixed number of clusters will not be weakly local. On the other hand, a quality function that is written as asum over clusters, where each summand depends only on properties of nodesand edges in one cluster and not on global properties, is weakly local.

Ackerman, Ben-David, and Loker defined a similar locality property for clustering functions. Their definition differs from ours in three ways. First ofall, they looked at k-clustering, where the number of clusters is given and fixed.

Secondly, their locality property only implies a consistent clustering when therest of the graph is removed, corresponding to V2 = V1 ∩ Vc. They do notconsider the other direction, where more nodes and edges are added. Finally,their locality property requires only a common subgraph Gs of both graphs, notthat this subgraph has a consistent neighborhood. That means that clusteringfunctions should also give the same results if edges with one endpoint in Gs areremoved.

**Relation to resolution-limit-free quality functions**

Traag, Van Dooren, and Nesterov introduced the notion of resolution-limit-free quality functions, which is similar to weak locality. They then showed thatresolution-limit-free quality functions do not suffer from the resolution limit asdescribed by Fortunato and Barthélemy Their definition is as follows.

**Definition 3.8**(Resolution-limit-free)

**.**Call a clustering C of a graph G Q-optimal

if for all clustering C0 of G we have that Q(G, C) > Q(G, C0). Let C be a Q-optimal

clustering of a graph G1. Then the quality function Q is called resolution-limit-free if

for each subgraph G2 induced by D ⊂ C, the partition D is also Q-optimal.

3.4. Axioms for graph clustering quality functions There are three differences compared to our weak locality property. First of all, Definition refers only to the optimal clustering, not to the quality, i.e. itis a property in the style of Kleinberg. Secondly, weak locality does not requirethat G2 be a subgraph of G1. Weak locality is stronger in that sense. Thirdly,and perhaps most importantly, in the subgraph G2 induced by D ⊂ C, edgesfrom a node in D to nodes not in D will be removed. That means that whilethe induced subgraph is a subgraph of both G1 and G2, it does not necessarilyhave a consistent neighborhood. So in this sense weak locality is weaker thanresolution-limit-freedom.

The notion of resolution-limit-free quality functions was born out of the need to avoid the resolution limit of graph clustering. And indeed weak locality isnot enough to guarantee that a quality function is free from this resolution limit.

We could look at a stronger version of locality, which does not require that the subgraph Gs has a consistent neighborhood in G1 and G2. Such a stronglocality property would imply resolution-limit-freeness. However, it is a verystrong property in that it rules out many sensible quality functions. In partic-ular, a strongly local quality function can not depend on the weight of edgesentering or leaving a cluster, because that weight can be different in anothergraph containing the subgraph induced by that cluster.

The solution used by Traag, Van Dooren, and Nesterov is to use the number of nodes instead of the volume of a cluster. In this way they obtain a resolution-limit-free variant of the Potts model by Reichardt and Bornholdt whichthey call the constant Potts model. But this comes at the cost of scale invariance.

In the context of graphs, perhaps the most intuitive clustering function is findingthe connected components of a graph. As a quality function, we could write Qcoco(G, C) = 1[C = ˆ where the function ˆ Ccoco yields the connected components of a graph, and 1[x] is the indicator function that is 1 when x holds, and 0 otherwise.

This quality function is clearly permutation invariant, scale invariant, rich, and weakly local. Since a consistent change can only remove edges betweenclusters and add edges within clusters, the coco quality function is also mono-tonic.

In fact, all of Kleinberg's axioms (reformulated in terms of graphs) also hold Ccoco, which seems to refute their impossibility result. However, the im- possibility proof can not be directly transfered to graphs, because it involves amultiplication and division by a maximum distance. In the graph setting this Chapter 3. Axioms for hard clustering would be multiplication and division by a minimum edge weight, which can bezero.

Still, despite connected components satisfying all previously defined proper- ties (except for strong locality), it is not a very useful quality function. In manyreal-world graphs, most nodes are part of one giant connected component (Bol-lobás, We would also like the clustering to be influenced by the weightof edges, not just by their existence. A natural way to rule out such degeneratequality functions is to require continuity.

**Definition 3.9**(Continuity)

**.**A quality function Q is continuous if a small change

in the graph leads to a small change in the quality. Formally, Q is continuous if for

every > 0 and every graph G = (V, A) there exists a δ > 0 such that for all graphs

G0 = (V, A0), if aij − δ < a0]ij < a

ij + δ for all nodes i and j, then Q(G 0, C) − < Q(G, C) < Q(G0, C) + for all clusterings C of G.

Connected components clustering is not continuous, because adding an edge with a small weight δ between clusters changes the connected components, andhence dramatically changes the quality.

Continuous quality functions have an important property in practice, in that they provide a degree of robustness to noise. A clustering that is optimal withregard to a continuous quality function will still be close to optimal after a smallchange to the graph.

**Summary of axioms**

We propose to consider the following six properties as axioms for graph cluster-ing quality functions, 1. Permutation invariance (Definition 2. Scale invariance (Definition 3. Richness (Definition 4. Monotonicity (Definition 5. Weak locality (Definition and 6. Continuity (Definition As mentioned previously, for families of quality functions we replace scale in-variance by scale invariance for families (Definition In the next section we will show that this set of axioms is consistent by defining a quality function and a family of quality functions that satisfies all of them. Additionally, the fact that there are quality functions that satisfy onlysome of the axioms shows that they are (at least partially) independent.

For graph clustering one of the most popular quality functions is modularity(Newman and Girvan, despite its limitations (Good, de Montjoye, andClauset, Traag, Van Dooren, and Nesterov, Qmodularity(G, C) = It is easy to see that modularity is permutation invariant, scale invariant and

**Theorem 3.10.**Modularity is rich.

The proof of Theorem is in Appendix An important aspect of modularity is that volume and within weight are normalized with respect to the total volume of the graph. This ensures that thequality function is scale invariant, but it also means that the quality can changein unexpected ways when the total volume of the graph changes. This leads usto Theorem

**Theorem 3.11.**Modularity is not weakly local.

Proof. Consider the graphs which have a common subgraph Gs with nodes Vs = {a, b} that has a consistentneighborhood in G1 and G2. Note that we draw the graphs as directed graphs,to make it clear that each undirected edge is counted twice for the purposes ofvolume and within cluster weight. Now take the clusterings Cs = {{a}, {b}} andDs = {{a, b}} of Vs; C1 = {} of V1 Vs; and C2 = {{c}} of V2 Vs. Then Qmodularity(G1, Cs ∪ C1) = 1/6 > 0 = Qmodularity(G1, Da ∪ C1), Qmodularity(G2, Cs ∪ C2) = 23/50 < 24/50 = Qmodularity(G2, Da ∪ C2).

This counterexample shows that modularity is not weakly local.

Chapter 3. Axioms for hard clustering Even without changing the node set, changes in the total volume can be problematic, as shown by the following theorem.

**Theorem 3.12.**Modularity is not monotonic.

Proof. Consider the graphs and the clustering C = {{a}, {b}, {c}}. G0 is a C-consistent improvement of G, be-cause the weight of a between-cluster edge is decreased. The modularity of C inG is Qmodularity(G, C) = 1/8, while the modularity of C in G0 is Qmodularity(G0, C) =0. So modularity can decrease with a consistent change of a graph, and hence itis not a monotonic quality function.

Monotonicity might be too strong a condition. When the goal is to find a clustering of a single graph, we are not actually interested in the absolute valueof a quality function. Rather, what is of interest is the optimal clustering, andwhich changes to the graph preserve this optimum. At a smaller scaler, we canlook at the relation between two clusterings. If C is better then D on a graph G,then on what other graphs is C better then D? We therefore define a relative version of monotonicity, in the hopes that mod- ularity does satisfy this weaker version.

**Definition 3.13**(Relative monotonicity)

**.**A quality function Q is relatively mono-

tonic if for all graphs G and G0 and clusterings C and D, if G0 is a C-consistent

improvement of G and G is a D-consistent improvement of G0 and Q(G, C) > Q(G, D)

then Q(G0, C) > Q(G0, D).

**Theorem 3.14.**Modularity is not relatively monotonic.

Proof. Take the graphs and the clusterings C = {{a, b, c}, {d}} and D = {{a}, {b}, {c, d}}. G0 is a C-consistentimprovement of G, because the weight of a within cluster edge is increased. G isa D-consistent improvement of G0, because the weight of a between cluster edge 3.6. Adaptive scale modularity is decreased. However Qmodularity(G, C) = 20/121 > 16/121 = Qmodularity(G, D)while Qmodularity(G0, C) = 24/169 < 28/121 = Qmodularity(G0, D). This counterex-ample shows that modularity is not relatively monotonic.

**Adaptive scale modularity**

The problems with modularity stem from the fact that the total volume canchange when changes are made to the graph. It is therefore natural to look at avariant of modularity where the total volume is replaced by a constant M, This quality function is obviously weakly local. It is also a scale invariant familyparameterized by M. However, this fixed scale modularity quality function isnot scale invariant for any fixed scale M > 0.

We might hope that fixed scale modularity would be monotonic, because it doesn't suffer from the problem where changes in the edge weights affectthe total volume. Unfortunately, fixed scale modularity has problems when thevolume of a cluster starts to exceed M/2. In that case, increasing the weight ofwithin cluster edges starts to decrease the fixed scale modularity. Looking at acluster c with volume vc = wc + bc, ∂QM-fixed(G, C) This derivative is negative when 2vc > M, so in that case increasing the weightof a within-cluster edge will decrease the quality. Hence fixed scale modularityis not monotonic.

The above argument also suggests a possible solution: add 2vc to the nor- malization factor M. Or more generally, add γvc with γ > 2, which leads to thequality function This adaptive scale modularity quality function is clearly still permutation in- variant, continuous and weakly local. For M = 0 it is also scale invariant. Sincethe value of M should scale along with the edge weights, adaptive scale modu-larity is a scale invariant family parameterized by M. Additionally, we have thefollowing two theorems:

**Theorem 3.15.**Adaptive scale modularity is rich for all M > 0 and γ > 1.

Chapter 3. Axioms for hard clustering

**Theorem 3.16.**Adaptive scale modularity is monotonic for all M > 0 and γ > 2.

The proofs of these theorems can be found in Appendices and This shows that adaptive scale modularity satisfies all six axioms we have defined for families of graph clustering quality functions, and the six axiomsfor single quality functions when M = 0. This shows that our extended set ofaxioms is consistent.

**Relation to other quality functions**

Interestingly, in the limit as M goes to 0, the adaptive-scale quality functionbecomes similar to normalized cut (Shi and Malik, with an added constant, This 0-adaptive modularity is also scale invariant as a single quality function.

Conversely, when M goes to infinity the quality goes to 0. However, the quality function approaches unnormalized cut in behavior: lim M · QM,γ(G, C) = This expression is similar to the Constant Potts model (CPM) by Traag, Van Dooren, and Nesterov wc − γn2c .

In contrast to the quality functions discussed thus far, CPM uses the numberof nodes instead of volume to control the size of clusters. Like adaptive scalemodularity, the constant Potts model satisfies all six axioms (as a family).

As stated before, the fixed scale and adaptive scale modularity quality func- tions are a scale invariant family; they are not scale invariant for a fixed valueof M (except for M = 0). This is not a large problem in practice, since scale in-variance is often sacrificed to overcome the resolution limit of modularity (For-tunato and Barthélemy, In fact, fixed scale modularity is proportional tothe quality function introduced by Reichardt and Bornholdt with M = vV /γRB.

3.6. Adaptive scale modularity An illustration of the possible outcomes when clustering a two- clique network. Clusters are indicated by circles. In outcome (3), the verticaledges each have weight w/4, while the horizontal and diagonal ones have weightb/4.

**Parameter dependence analysis**

There has been a lot of interest in the so called resolution limit of modularity.

This problem can be illustrated with a simple graph that consists of a ring of cliques, where each clique is connected to the next one with a single edge.

We would like the clusters in the optimal clustering to correspond to the cliquesin the ring. It was observed by Fortunato and Barthélemy that, as thenumber of cliques in the ring increases, at some point the clustering with thehighest modularity will have multiple cliques per cluster.

This resolution problem stems from the fact that the behavior of modularity depends on the total volume of the graph. Both the fixed scale and adaptivescale modularity quality functions instead have a parameter M, and hence donot suffer from this problem. In fact, any weakly local quality function willnot have a resolution limit in the sense of Fortunato and Barthélemy. A similarobservation was made by Traag, Van Dooren, and Nesterov in the contextof modularity like quality functions.

In real situations graphs are not uniform as in the ring-of-cliques model.

But we can still take simple uniform problems as a building block for largerand more complex graphs, since for weakly local quality functions the rest ofthe network doesn't matter beyond the immediate neighborhood. Therefore wewill look at a simple problem with two subgraphs of varying sizes connectedby a varying number of edges. More precisely, we take two cliques each withwithin weight w, connected by edges with weight b. The total volume of this(sub)graph is then 2w + 2b.

There are three possible outcomes when clustering such a two-clique net- work: (1) the optimal solution has a single cluster; (2) the optimal solution hastwo clusters, corresponding to the two cliques; (3) the optimal solution has morethan two clusters, splitting the cliques apart. See Figure for an illustration.

Which of these outcomes is desirable depends on the circumstances.

Chapter 3. Axioms for hard clustering Another heterogeneous resolution limit model was proposed by Lancichinetti and Fortunato In this situation there are two cliques of equal size con-nected by a single edge, and a random subgraph. Now the ideal solution wouldbe to find three clusters, one for each clique and one for the random subgraph.

The optimal split of the random subgraph will roughly cut it in half, with a fixedfraction of the volume being between the two clusters (Reichardt and Bornholdt,So this model can be considered as a combination of two instances of oursimpler problem, one for the two cliques and one for the random Hence, we want outcome (2) for the cliques, and outcome (1) for the randomsubgraph.

In Figure we show which graphs give which outcomes for adaptive scale modularity with various parameter settings. The first column, γ = 0, is ofparticular interest, since it corresponds to fixed scale modularity and hence alsoto QRB and to modularity in certain graphs. In the third row we can see thatwhen 2v = 2w + 2b > M = 100 the cliques are split apart. This is precisely theregion in which monotonicity no longer holds. Overall, the parameter M hasthe effect of determining the scale; each row in this figure is merely the previousrow magnified by a factor 10. Increasing M has the effect of merging smallclusters. On the other hand, the γ parameter controls the slope of the boundarybetween outcomes (1) and (2), i.e. the fraction of edges that should be within acluster. This is most clearly seen when M = 0, while otherwise the effect of Mdominates for small clusters.

**Conclusion and open questions**

In this chapter we presented an axiomatic framework for graph clustering qual-ity functions consisting of six properties. We showed that modularity does notsatisfy the monotonicity property. This motivated the derivation of a new fam-ily of quality functions, adaptive scale modularity, that satisfies all propertiesand has standard graph clustering quality functions as special cases. Results ofan experimental parameter dependence analysis showed the high flexibility ofadaptive scale modularity. However, adaptive scale modularity should not beconsidered the solution to all the problems of modularity, but rather an exampleof how axioms can be used in practice.

An overview of the discussed axioms and quality functions can be found in Table Many more quality functions have been proposed in the literature, sothis list is by no means exhaustive. An interesting topic for future research is to 1Lancichinetti and Fortunato include edges between the cliques and the random subgraph to ensure that the entire network is connected, these edges are not relevant to the problem 3.7. Conclusion and open questions The behavior of QM,γ for varying parameter values. The graph consists of two subgraphs with w internal weight each, connected by an edgewith weigh b. Hence the volume of the total graph is 2w + 2b. In region (1)the optimal clustering has a single cluster, In region (2) (light blue) the optimalclustering separates the subgraphs. In region (3) (red, hatched) the subgraphsthemselves will be split apart.

make a survey of which existing quality functions satisfy which of the proposedproperties.

We also investigated resolution-limit-free quality functions as defined in (Traag, Van Dooren, and Nesterov, As illustrated in Section adaptive scalemodularity allows to perform clustering at various resolutions, by varying thevalues of its two parameters. However it is not resolution-limit-free.

This chapter did not address questions such as finding a best quality function Chapter 3. Axioms for hard clustering Connected components Reichardt and Bornholdt Fixed scale modularity Adaptive scale modularity Constant Potts Model Table 3.1: Overview of quality functions discussed in this chapter and the prop-erties they satisfy.

(Almeida et al., or selecting a significant resolution scale (Traag, Krings,and Dooren, The aim was to provide necessary conditions about what agood quality function is, in order to rule out and/or to improve quality func-tions. The proposed axioms and the introduction of adaptive scale modularityare an effort in this direction.

We also did not address the question of finding a clustering with the high- est quality. Finding the optimal value of quality functions such as modularityis NP-hard (Brandes et al., but several heuristic and approximation algo-rithms have been developed. One class of algorithms uses a divisive approach,see for instance Newman and Ruan and Zhang For such a tactic tobe valid, an optimal or close to optimal clustering of a subgraph should also bea near optimal clustering of the entire graph. This is ensured by weak locality.

Recently Dinh and Thai proposed polynomial-time approximation algo-rithms for the modularity maximization in the context of scale free networks. Itwould be interesting to investigate the suitability of these algorithms for adap-tive scale modularity maximization.

In this work we have only looked at non-negative weights, undirected graphs, and only at hard partitioning. An extension to graphs with negative weights, todirected graphs and to overlapping clusters remains to be investigated. Anotheropen problem is how to use these axioms for reasoning about quality functions 3.A. Proof of Theorem modularity is rich and clustering algorithms.

**Proof of Theorem**

**modularity is rich**

The proofs of richness rely on clique graphs,

**Definition 3.17**(Clique graph)

**.**Let V be a set of nodes, C be a partition of V, and

k be a positive constant. The clique graph of C with edge weight k is defined as

G = (V, A) where aij = k if i ∼C j and aij = 0 otherwise.

Proof. Let V be a set of nodes and C 6= {V} be a clustering of V. Let G = (V, A)be a clique graph of C with edge weight 1. Note that aii = 1, so any possiblecluster will have a positive volume. Let D be a clustering of G with maximalmodularity.

Suppose that there is a cluster d ∈ D that contains i, j ∈ d with i 6∼C j. Then we can split the cluster into d1 = {k ∈ d k ∼C i} and d2 = {k ∈ d k 6∼C i}.

Because there are no edges between nodes in d1 and nodes in d2, it is the casethat wd = wd + w . Both d 1 and d2 are non-empty and have a positive volume, + v )2 < v2 + v2 . Therefore Q modularity(G, D) < Qmodularity(G, D {d} ∪ {d1, d2}). So D does not have maximal modularity, which is a contradiction.

Suppose, on the other hand that all clusters d ∈ D are a subset of some cluster in C, i.e. D is a refinement of C. Then either D = C, or there are twoclusters d1, d2 ∈ D that are both a subset of the same cluster c ∈ C. In the lattercase we can combine the two clusters into d = d1 ∪ d2. The within weight of thiscombined cluster is wd = d 2 = wd + w 1 d2 . The squared volume of 1 d2 c 2. So this changes increases the modularity by Qmodularity(G, D {d1, d2} ∪ {d}) − Qmodularity(G, D) = 2 d1 d2 (vV − c 2)/v2 > which contradicts the assumption that D has maximal modularity. Thereforethe only optimal clustering of G is C. Note that the above inequality only holdswhen c 2 = vc < vV , which is the case because C 6= {V}.

When C = {V}, a clique graph will not work; because both {V} and the clustering that assigns half the nodes to one cluster, and half to another havemodularity equal to 0. In this case, instead define G = (V, A) by aij = 1 if i 6= jand 0 if i = j. Then the modularity for C is q(G, {V}) = 0. Any cluster d ina clustering D will have vd = d ( V − 1) and wd = d ( d − 1). Therefore thecontribution of this cluster to the total quality is − d ( V − d )/( V 2( V − 1)), Chapter 3. Axioms for hard clustering which is negative when d < V . So the modularity of any clustering other than{V} will be negative, hence {V} is the only optimal clustering.

Since for every C we can construct a graph where C is the only optimal clustering, modularity is rich.

**Proof of Theorem**

**adaptive scale modularity is rich**

Denote by fC(d) the largest fraction of any cluster from C that is contained in acluster d.

For any clustering D we have that And since fC(d) 6 1 for all clusters d, we also have that X fC(d) 6 D .

**Lemma 3.18.**For a clique graph of C it is the case that wd/vd 6 fC(d).

Proof. Given a cluster d and a clique graph G of C with weight k > 0, the volumeof d is and the within cluster weight is k c ∩ d 2.

k c ∩ d c fC(d) = vdfC(d).

And hence wd/vd 6 fC(d).

**Lemma 3.19.**Let G be the clique graph of a clustering C with weight k, and let 0 <

β < 1 be a constant. Then d/vd − β) = (1 − β) C if D = C, while d/vd − β) < (1 − β) C − if D 6= C, where = min(β, 1 − β, 1/ V )/2.

3.B. Proof of Theorem adaptive scale modularity is rich Proof. Suppose that D = C, then for every cluster c ∈ C, wc = vc = k c 2, and so − β = (1 − β) C .

Otherwise, D 6= C. Assume by contradiction that − β > (1 − β) C − min(β, 1/ V )/2.

< C − β C − 6 C − β D .

Since β > 0, this implies that D < C + 1.

Additionally, since fC(d) 6 1 for all clusters d ∈ D, (1 − β)( C − 1) <(1 − β) C − Since β < 1, this implies that D > C − 1. Hence D = C .

Suppose that fC(d) < 1 for some d ∈ D, which implies that c ∩ d < c .

Because edges are discrete, this can only happen when c ∩ d 6 c − 1 for allclusters c. And the size of clusters is bounded by c 6 V . Hence fC(d) 6( V − 1)/ V = 1 − 1/ V . And since for all other clusters d0, fC(d0) 6 1, we then Chapter 3. Axioms for hard clustering <(1 − β) C − which is a contradiction. Hence, it must be the case that fC(d) = 1 for allclusters d ∈ D. By the definition of fC this means that for every d there is acluster c ∈ C such that c ∩ d = c , and therefore c ⊆ d. Since the clusters aredisjoint and D = C , this implies that D = C. Which is a contradiction, so d/vd − β) < (1 − β) C − .

When M = 0, the adaptive scale modularity reduces to wd/(γvd) − D /γ2, and the above lemma is enough to prove richness. For non-zero values of M, wecan get ‘close enough' by choosing large enough edge weights. This is formal-ized in the following lemma.

**Lemma 3.20.**Let d be a cluster in a clustering of a clique graph of C with weight k.

Then

− β − βM/k 6 q(d)/β 6 − β + 2β2M/k, denotes the contribution of d to the M-adaptive modularity.

Proof. Since clusters are non-empty, and in a clique graph aii = k, it follows that 3.B. Proof of Theorem adaptive scale modularity is rich vd > wd > k. So βMwd + vdwd − βv2 β2M(βM + 2vd) − β2M2wd/vd − βMwd (βM + vd)(βM + 2vd) And since wd 6 vd, β2M(βM + 2vd) − β2M2wd/vd − βMwd Combining these lemmas yields the proof of the general theorem: Proof. Given a clustering C. Define β = 1/γ. If γ > 1 then 0 < β < 1. Pickk > 3 V β2M/ where is defined as in Lemma Let G be the clique graph of C with weight k. Let D 6= C be a clustering of Chapter 3. Axioms for hard clustering G. Then by Lemmas and (wd/vd − β + 2β3M/k) 6(1 − β) C + 2 D β3M/k − 6(1 − β) C + 2 V β2M/k − <(1 − β) C − V β2M/k 6(1 − β) C − C β2M/k (wc/vc − β + β2M/k) Hence the quality is maximal for C. Since there is a clique graph and k for everyclustering, adaptive scale modularity is rich.

**Proof of Theorem**

**adaptive scale modularity is**

monotonic

monotonic

Proof. Given a constants M > 0 and γ > 2, a graph G and a clustering C ofG. Let c ∈ C be any cluster. Writing the volume of c as vc = wc + bc, thecontribution of this cluster to the quality of G is q(wc, bc) where The partial derivatives of q are M2 + (γ − 2)M(w + b) + γb(M + γw + γb) γwM + (w + b)(M + γ2w) This means that q is a monotonically non-decreasing function in w and a non-increasing function in b.

For any graph G0 that is a C-consistent change of G, it holds that w0c > So q(w0c, b0c > q(wc, bc). And therefore QM,γ(G0, C) > QM,γ(G, C). So adaptive scale modularity is monotonic.

**Comparing quality functions for**

**network clustering: The resolution**

**bias matters most**

Results of a recent comparative experimental assessment of methods for network commu-nity detection applied to benchmark graphs indicate that the two best methods use dif-ferent objective functions but a similar local search-based optimization procedure, calledthe Louvain method.

This observation motivates the following research question: given the Louvain op- timization procedure, how much does the choice of the objective function influence theresults, and in what way? In this chapter we address this question empirically in abroad graph clustering context, that is, when graphs are either given as such or are k-nearest neighbor graphs generated from a given dataset. We consider normalized cut,modularity, and infomap; as well as two new objective functions. We show that all theseobjective functions have a resolution bias, that is, they tend to prefer either small or largeclusters. When removing this bias, by forcing the objective function to generate a givennumber of clusters, the Louvain method achieves similar performance across the con-sidered objective functions on benchmark networks with built-in community structure.

These results indicate that the resolution bias is the most important difference betweenobjective functions in graph clustering with the Louvain method.

Spectral clustering is an alternative to local search optimization, and has been used to optimize the popular normalized cut and modularity objectives. We show experimen- This chapter is based on van Laarhoven and Marchiori "Graph clustering with local search optimization: The resolution bias of the objective function matters most", published in Physi-cal Review E. The software used in this chapter is available on Chapter 4. Comparing quality functions for network clustering tally that the Louvain method often achieves superior performance compare to spectralclustering on various benchmark, real-life and k-nearest neighbor graphs. These results,the flexibility of the Louvain method and its efficiency, provide arguments in favor ofthis optimization method.

Due to the intrinsic difficulty of the problem, graph clustering has been tack-led by many researchers, yielding a vast amount of heuristic and approximatemethods as well as interesting experimental and theoretical results. We refer thereader to the surveys on this topic, e.g., (Fortunato, von Luxburg, Schaeffer, Many methods for graph clustering are based on optimizing aglobal objective or quality function. The ‘optimal' clustering is then the one thatmaximizes the quality (throughout this chapter we will use maximizations; someobjective functions are traditionally minimized, in those cases we negate them).

This discrete optimization problem is computationally intractable (at least forthe quality functions for which hardness is known, see for instance (Brandeset al., Fortunato, Schaeffer, Therefore all effective and scalablemethods are based on heuristic and approximate techniques.

One can distinguish two main classes of heuristic for optimizing clustering quality functions. The first one is based on relaxing the discrete cluster labels tocontinuous variables, and solving the resulting problem with spectral methods.

To convert the continuous clustering to a discrete one, a separate step is used,usually k-means clustering (see e.g. the review by von Luxburg, Thisprincipled spectral approach is only possible for some quality functions, such asnormalized cut (Shi and Malik, or modularity (Newman, The other class of optimization methods is directly based on (local heuris- tic) discrete optimization. The goal is to find, among all partitions of the dataset, the best one according to a given quality function (see e.g. the review byFortunato, Although heuristic in nature, this latter approach has broaderapplicability since any quality function can be used.

A central issue in network community detection is the resolution limit of quality functions, which has been investigated from multiple perspectives, inparticular for modularity (Arenas, Fernández, and Gómez, Lambiotte,Lancichinetti and Fortunato, Reichardt and Bornholdt, Traag,Van Dooren, and Nesterov, In particular, Fortunato and Barthélemy showed that modularity optimization is unable to detect small clusters; Good,de Montjoye, and Clauset showed that the modularity function exhibitsextreme degeneracies such that the globally maximal modularity partition is 4.1. Introduction typically hidden among an exponential number of structurally dissimilar, highmodularity solutions.

An experimental study by Lancichinetti and Fortunato showed that the common spectral methods are far from optimal for the purpose of graphcommunity detection on benchmark graphs. In their review, the two best meth-ods are those of Blondel et al. and Rosvall and Bergstrom Bothof these methods use a similar local search optimization procedure, here calledthe Louvain method, after the university of Louvain. This method is based onmoving nodes between clusters, and constructing a clustering bottom-up. Inprinciple this procedure can be used with any graph clustering quality function.

Since good results are obtained with at least two different quality functions, thisraises the following questions: In how far does the clustering result of this opti-mization method depend on the quality function that is being optimized? Andin what way does the choice of the quality function influence the results? In order to address these questions we consider five quality functions, namely normalized cut, modularity, infomap, and two novel simple quality functions.

These novel quality functions are designed in such a way that (1) clusterings arebetter if they contain more within cluster edges, and (2) clusters should not betoo small or too large. First, we analyze the resolution bias of these functions, byshowing that their optimum is achieved for clusterings consisting of either rela-tively small or relatively large communities. Next, we apply the Louvain methodto these quality functions on benchmark graphs. Results indicate that diverseperformance is achieved across the different types of quality functions. We in-troduce a procedure to automatically control the resolution bias of a qualityfunction. In this way we force the method to output a fixed number of clustersfor each quality function. Results of experiments show that the resolution biasplays a central role for the difference in performance of the quality functions.

When the resolution bias is ‘removed' by fixing the number of clusters, the per-formance of the Louvain method across these quality functions becomes muchmore similar.

Spectral clustering is a principled alternative to local search based optimiza- tion, and has been used to optimize the popular normalized cut and modularityquality functions (Shi and Malik, Newman, We show experimen-tally that the Louvain method often achieves superior performance comparedto spectral clustering on various benchmark, real-life and k-nearest neighborgraphs. These results confirm the findings reported by Lancichinetti and For-tunato also for k-nearest neighbor graphs. In general these results, theflexibility of the Louvain method and its efficiency, provide arguments in favorof this optimization method.

The chapter is structured as follows. In Section we present the five quality Chapter 4. Comparing quality functions for network clustering functions, analyze their resolution bias, and introduce a procedure for control-ling the size of clusters. In Section we describe the Louvain optimizationmethod. In Section we apply this method to the quality functions. We showthat the resolution bias is the most important difference between the qualityfunctions, and that the Louvain method has difficulties in optimizing specificquality functions. Furthermore, we compare the Louvain method to spectralclustering on benchmark and real-life networks and k-nearest neighbor graphs.

Conclusions are reported in Section In this chapter we use the same notation defined in Section To briefly recap, a clustering C is a partition of nodes V of a graph. That is, a set of disjoint sets of nodes which we call clusters, that together cover all nodes.

So, every node is in exactly one cluster.

A quality function Q assigns a quality to such a clustering of a graph.

To simplify the notation, we will keep the graph argument G = (V, A) im- plicit everywhere, that is, we assume that there is some specific graph underconsideration. We denote the total volume of this graph by M = vV .

As a shorthand, we then define the normalized volume of a cluster as v vc/M, and the normalized within weight as w b c = wc /M.

**The considered quality functions**

Different quality functions have been proposed for graph clustering. Perhapsthe most well known is normalized cut (Shi and Malik, which is definedas Maximizing this quality function minimizes the number of between cluster edges,called the cut size.

Another common quality function is modularity, introduced by Girvan and Newman We already mentioned this quality function in the previouschapter. With the shorthands of normalized volume and normalized withinweight it can be written as 4.2. Quality functions Table 4.1: Quality functions considered in this chapter.

c∈C b c − bc ) c∈C b c(1 − bc) c∈C bc − b c) Both of these quality functions can be written as a sum over all clusters, for some function q. This means that it makes sense to look at the quality of justa subset of the clusters, or of the clustering of just a subset of the nodes.

A notable quality function that does not follow this pattern is infomap (Ros- vall and Bergstrom, This quality function is based on the length of a codefor paths through the graph. In addition to a sum of per-cluster scores, infomapalso include a global term based on the probability of an edge exiting a where h(x) = −x log(x). The original infomap quality function contains anadditional term, Qinfomap(Rosvall and Bergstrom, (C) = Qinfomap(C) + which is needed to make the quality function correspond to a code length. How-ever, since this last term is the same for all clusterings, we don't include it. Inaddition, without this extra term, the quality of the trivial clustering with onecluster is exactly 0.

There are many more possible quality functions that could be used for graph clusterings. Some considerations for designing such functions are that: 1This expression for the infomap quality function is based on the source code of the tools pro- vided by Rosvall et.al., Chapter 4. Comparing quality functions for network clustering 1. All else being equal, clusters are better if they contain more within cluster 2. Clusters should not be too small or too large.

For the first consideration, we can look at the ratio between the volume of acluster and the number of within edges. For the second consideration, we usea weight function f(v bc ) where f(0) = f(1) = 0, while f(x) > 0 for 0 < x < 1.

Combining these ingredients leads to an quality function of the form Two simple functions that fit the criteria for f are the parabola f(x) = x(1 − x)and the function h(x) = −x log(x). They give the quality function Qw-log-v(C) = − b c log(bc ).

Of course, there are infinite other possibilities. We focus on these two becausethe former is similar to modularity, while the latter resembles infomap whilebeing much simpler.

Table lists all the different graph clustering quality functions that we will consider in this chapter. Many other quality functions have been discussed inthe literature, see Fortunato for an overview. Many of them do not applyin our setting, because they assign a score to a single cluster, not to a clustering.

Therefore it is not clear how the cluster scores should be combined into a scorefor a clustering. When the number of clusters is fixed one could use the sumof scores, but when the number of clusters is allowed to vary this will often notgive a good quality function.

It was shown by Fortunato and Barthélemy that modularity has a reso-lution limit, in the sense that it tends to combine small communities into largerones. Specifically, in a network which has L edges, there is a characteristic num-ber of edges, such that communities with less than pL/2 edges are not visible.

Kumpula et al. have generalized this result by showing that the graphclustering framework introduced by Reichardt and Bornholdt also has a 4.2. Quality functions Figure 4.1: Model graph for showing the resolution limits. The circles representstrongly connected ‘modules' with q − r internal edges, while the lines representr edges each.

resolution threshold. The model contains a parameter by which this thresholdcan be tuned, but no a priori principle is known to select the proper value. Theyconclude that single global optimization criteria do not seem capable for detect-ing all communities if their size distribution is broad (Kumpula et al., In the sequel we show that the other clustering quality functions here con- sidered have resolution limits. In fact, these are not just limits, but a generalbias towards certain cluster sizes. For example, the w-log-v quality function hasa resolution limit at smaller cluster sizes, and it always leads to smaller clustersthan modularity.

Consider a graph that has n densely connected modules, which are loosely connected in a ring (Fortunato and Barthélemy, Figure illustratessuch a graph. The modules themselves could be single nodes, cliques or othersubgraphs, we are only interested in their volume. In particular, imagine eachmodule having q − r internal edges, and connected to both of its neighbors withr edges each. The volume of a single module X is then vX = 2q, while thevolume of the entire graph is M = 2nq.

A cluster Xm consisting of m adjacent modules will have normalized volume = m/n. And since all but 2r of the edges will be within the cluster, the normalized within weight will be w = (m − r/q)/n. By symmetry, we would expect all clusters in the optimal clustering C∗ to have the same size (assumingm divides n). There will then be n/m such clusters. So, the total modularity ofthis clustering is Qmodularity(C∗) = This expression has a maximum at m = pnr/q. So, the larger the graph, or the Chapter 4. Comparing quality functions for network clustering The resolution limits of graph clustering quality functions as a function of the graph size. We used r = 1 and q = 46, which corresponds tomodules that are cliques with 10 nodes. The resolution limit of the parabolaquality function is the same as that of modularity.

less dense the connections in each module, the larger the clusters that are found.

The parabola quality function reaches a maximum at the same point.

For the w-log-v quality function Qw-log-v(C∗) = − The optimum is at m = W(enr/q)r/q where W is the Lambert W function.

The normalized and ratio cut quality functions behave differently, QNCut(C∗) = − This expression has no maximum value, it increases as m gets larger. Since thesize of a cluster can not be larger than n, the actual optimum is at m = n, i.e.

when all modules are in a single cluster.

Finally, for infomap there is no analytical expression for the optimum cluster size, but it can be easily calculated numerically. Figure shows the cluster sizeof the optimal clustering as a function of the number of modules. This optimalsize was found by numerical optimization of the quality in terms of the clustersize m. For this figure we have used modules with r = 1 and q = 46, whichcorresponds to 45 internal edges, i.e. cliques with 10 nodes. Other module sizes 4.2. Quality functions give qualitatively similar results. For each of the quality functions, the optimalcluster size depends only on the ratio r/q and the number of modules n.

Note that in this figure, for the w-log-v and infomap quality functions the theoretically optimal clustering always has less than 1 clique per cluster, whichin practice means that the cliques are perfectly clusterable. To actually see theresolution limit in action for these quality functions, the number of cliques mustbe very large. For example for the value of the w-log-v quality function for m = 2overtakes the value for m = 1 when n > 291 ≈ 1027.

The resolution bias discussed in this section reflects preferences towards cer- tain sizes of clusters, in a situation where all vertices are similar. There are otherbiases that come into play when the graph is less uniform and when the sizes ofclusters will differ (Lancichinetti and Fortunato,

**Controlling the size of clusters**

Several generalizations of modularity have been proposed that allow for con-trol over the size of the clusters (Reichardt and Bornholdt, Angelini et al.,Lambiotte, Each of these quality functions has a parameter thatcontrols the trade-off between the size of the clusters and the strength of edgeswithin clusters. For example, the quality function introduced by Reichardt andBornholdt is, in our notation, In this chapter we also consider other quality functions besides modularity, and hence we would like to add similar size-control parameters to them. Inthe previous section we have shown that the size of the clusters depends onthe size of the graph. Often this dependency is implicit, through the use ofthe normalized volume and normalized within weight. This dependence can beused to control the cluster sizes.

The idea is to embed the graph in a larger graph, with total volume αM, and thereby change the optimal clustering. Since quality functions such as modular-ity are a sum over clusters in the form of we can look at the contribution tothe modularity of a clustering C of the original graph. Denote this contributionby b c /α − (bc /α)2 modularity(C) − (α − 1) Chapter 4. Comparing quality functions for network clustering The optimal clustering does not change when the quality function is multi- plied by a constant. Therefore, embedding within a larger graph is equivalentto adding a term to the quality function, Q+(C, β) = Q(C) + β This holds also for the parabola and the w-log-v quality functions.

On the other hand, the normalized cut quality function does not depend on the size of the graph at all. Despite this, we can still use to adjust thatquality function.

In this way we get a family of quality functions parameterized by β for each original quality function. Note also that Q+ is equivalent to Q γ = 1 + β; and it is equivalent to QNL introduced in (Lambiotte, witht = 1/(1 + β).

By adjusting the parameter β, the size of the clusters can be controlled. A negative value of β corresponds to embedding the graph in a larger one, so itwill lead to fewer larger clusters. A positive β will lead to more and smallerclusters. However, we have to be careful with large positive values of β, sincethat punishes within cluster edges, instead of rewarding them.

Since a large part of the difference between quality functions lies in the dif- ferent preferred cluster sizes, this added flexibility might be enough to get rid ofmuch of these differences. Suppose, for example that the number of clusters isknown. Then we can use binary search to look for a value of β that leads to thedesired number of clusters. The resolution bias of the quality function is then nolonger important, since by fixing the number clusters we also fix their averagesize.

**The Louvain method**

The optimization procedure that we use is the local search method developedby Blondel et al. which is usually called the Louvain method, after theuniversity of Louvain. It was initially proposed for optimizing modularity, butthe same method can also be used for any other graph clustering quality func-tion. The method is very fast, and can deal with millions of nodes in seconds.

We will briefly describe the algorithm here.

Initially, each node is assigned to a singleton cluster. Then, iteratively, nodes are moved between clusters as long as the quality improves. For each node,only moves to neighboring cluster are considered; where neighboring clustersare those clusters that contain neighboring nodes. The nodes are visited in arandom order.

The most expensive part of the algorithm is recomputing the value of the quality function. For quality functions that are written as a sum over the clusters,as in this computation can be done efficiently, because only two terms ofthe sum change when a node is moved between clusters.

Because the quality increases with each move, eventually a local optimum will be reached. However, the clusters in this local optimum will often be toosmall. It is just that they can not be improved by moving single nodes. We willcall the clusters found at this point small clusters. The next step is to repeat theoptimization procedure, but this time moving entire small clusters instead ofsingle nodes. Effectively, we are then clustering a condensed graph, where eachnode in the condensed graph is a small cluster.

The step of moving small clusters is again repeated until convergence. The clusters at that point become the new small clusters. At some point no smallclusters will be moved, and then the algorithm stops.

Several variations to this basic recipe are possible. For instance, if the clusters become too large, one could apply the clustering algorithm from scratch to onlythe nodes in a single cluster. This might lead to a better optimum. However,we do not find this step to improve the results in our experiments. Anotherimprovement is to simply run the algorithm several times with different randomseeds, and to pick the best solution.

**Community detection benchmarks**

We consider the LFR graph generator by Lancichinetti, Fortunato, and Radicchiwhich constructs networks with built-in community structure. In thisbenchmark, the size of each cluster is drawn from a power-law distribution; asis the degree of each node. These benchmarks are specifically constructed toclosely resemble real world graphs. Indeed, it has previously been observedthat real world graphs also have such a power-law degree distribution (Clauset,Shalizi, and Newman, The LFR model has several parameters. The most important is the mixing parameter µ, which controls the fraction of edges that are between clusters.

Essentially this is the amount of noise in the graph. If µ = 0 all edges are withincluster edges, if µ = 1 all edges are between nodes in different clusters.

Other parameters control the number of nodes, the distributions of cluster sizes, the distribution of degrees, etc. If something is known about the targetgraph, then these parameters should be chosen to match that graph. However,in this chapter we do not try to match any particular graph. We therefore follow Chapter 4. Comparing quality functions for network clustering modularity (spectral) Mixing parameter µ modularity (spectral) Mixing parameter µ Figure 4.3: Normalized mutual information as a function of the mixing param-eter, for various quality functions; on the Small 1000 and Big 5000 datasets. Theerror bars indicate standard deviation.

the settings used by Lancichinetti and Fortunato They describe fourbenchmarks. Two with ‘small clusters' of between 10 and 50 nodes, and twowith ‘large clusters' of between 20 and 100 nodes. Each graph has either 1000 or5000 nodes in total.

To measure the quality of a clustering, we compare it to the ground truth with the normalized mutual information (NMI) metric (Danon et al., where I is mutual information and H is entropy. Figure shows the normal-ized mutual information as a function of the mixing parameter for the differentclustering quality functions. We used the Louvain method to optimize all qual-ity functions. We did not include normalized cut, since without adjustment thisquality function always leads to a single cluster. We only show the results forthe benchmark with 1000 nodes and small clusters and the benchmark with 5000nodes and large clusters. The results for the other two benchmarks are similar.

For comparison, besides the Louvain method, we also include two spectral clustering methods in our experiments. First of all a simple method that approx-imately maximizes the normalized cut quality by solving a generalized eigen-value problem for the graph Laplacian, and then finds discrete clusters usingk-means (Shi and Malik, Secondly the method of Newman whichuses eigenvectors of the modularity matrix. Based on these eigenvectors a few(two or three) clusters are found, which are then recursively subdivided un-til the optimal modularity is reached. The clusterings are further optimizedby a Kernighan-Lin style algorithm (Kernighan and Lin, which movessingle nodes around in a similar fashion to the first iteration of the Louvainalgorithm. These two methods represent the complete opposite approach toclustering. Whereas the Louvain method uses a greedy search to grow clus-ters from the bottom up, these spectral methods use a smooth approximation torepeatedly subdivide the graph.

Optimizing the w-log-v and infomap quality functions always leads to a perfect recovery of the clustering up to µ = 0.65 on the small dataset and µ = 0.7on the big dataset. For higher µ, infomap suddenly gives a clustering withnormalized mutual information 0. This is the clustering where all nodes are putinto a single cluster. Notice the large standard deviation on the Big 5000 dataset.

In some cases the optimizer finds the trivial clustering, and in other cases it findsa clustering comparable to that found with the w-log-v quality function.

The w-log-v quality function does not have the instability of infomap, and the performance normalized mutual information decreases more gradually. Theother two quality functions, modularity and parabola, perform worse. As weshow next, this mainly due to the failure to recover the right number of clusters.

When the true number of clusters is known, we can adjust the quality func- tion to get the desired number of clusters, as described in Section In thiscase it is also possible to use spectral clustering to optimize the normalized cutquality function, which requires the number of clusters as an input parameter.

Chapter 4. Comparing quality functions for network clustering ncut (spectral)modularity (spectral) Mixing parameter µ ncut (spectral)modularity (spectral) Mixing parameter µ Normalized mutual information as a function of the mixing pa- rameter, when the number of clusters has been fixed to the actual number ofclusters. We show results for the Small 1000 and Big 5000 datasets. The errorbars indicate standard deviation.

Figure shows the results on the same benchmark graphs when forcing thenumber of clusters to be equal to the number of clusters in the ground truth.

The behavior of the different quality functions is now very similar. However, with the normalized cut quality function we are still unable to find the rightclustering. This is due only to the optimizer, because when normalized cut is optimized with spectral clustering, the correct clustering is found.

**Quality functions versus optimization**

One might wonder in how far the results of these experiments depend on thequality function, and how much they depend on how that function is being opti-mized. To distinguish between the two, we compare the quality of the clusteringfound by the Louvain algorithm to the quality of the ground truth clustering.

If the quality is higher at the ground truth clustering, then this indicates thatoptimizer has failed to find a good enough clustering. On the other hand, if thequality of the ground truth clustering is lower, then the optimizer has found aclustering that is ‘better than the ground truth' according to the quality function.

That means that this quality function is unsuitable in this situation. As a base-line, we also compare with a clustering obtained on a randomly rewired graphwith the same degree distribution.

Figure shows the value of the quality functions for different mixing parameters. We can see that the quality of the ground truth crosses that ofthe randomly rewired graph at different points for different quality functions.

Beyond this point, there is little hope of recovering the true clustering, sincethe graph has then no more cluster structure than a random one. We can alsosee cases where the optimizer fails, such as with the w-log-v quality function atµ = 0.75. Here the ground truth has a higher quality than the clustering foundby the optimizer. Repeating the optimization 20 times leads to a slightly betteroptimum, but not yet to the ground truth. Even more repetitions can furtherimprove performance, but only slightly.

The parabola quality function shows a different picture. The clustering found by the optimizer has a lower quality than the ground truth in many cases. Thismeans that the Louvain method often fails to find the optimal clustering orone close to it. The clustering that is found instead has too few clusters. InSection we showed that the parabola function has the same resolution biasas modularity, while optimizing modularity with the Louvain method does notgive a clustering with a lower quality than the ground truth. This means that theresolution bias does not tell the whole story. Another important aspect of theresults seem to be how easy the quality function is to optimize with the Louvainmethod.

With the infomap quality function, the clustering found for the randomly rewired graph always has quality 0. This corresponds to a clustering where allnodes belong to the same cluster. This clustering is always a possible one, butit is not always found by the optimizer. For example, at µ = 0.7, the optimizersometimes finds an infomap quality that is greater than 0. In these cases the Chapter 4. Comparing quality functions for network clustering Mixing parameter µ Mixing parameter µ Mixing parameter µ Mixing parameter µ Figure 4.5: The quality of clustering, as a function of the mixing parameter onthe Small 1000 dataset. ‘random' is the quality reached by the optimizer on arandomly rewired graph, ‘truth' is the quality of the ground truth clustering.

Louvain is the value reached by the optimizer, and Louvain×20 is the best qual-ity out of 20 restarts. The quality functions shown are: modularity (top left),parabola (top right), w-log-v (bottom left), and infomap (bottom-right). Notethat the quality values are on an essentially arbitrary scale, and it makes nosense to compare values for different quality functions.

Table 4.2: The normalized mutual information for optimizing the various qual-ity functions on real world networks. The best results are indicated in boldface.

The Louvain method is used for optimization except for the results marked with(sp.), where spectral clustering is used. In the first part of the table the numberof clusters is forced equal to the true number, which is shown in the ‘clusters'column. In the second part the number of clusters is left free, in parenthesis isthe number of clusters that are found for each method.

modularity (spectral)

**58.8%**(4)

**56.0%**(5)

**58.8%**(4)

**92.4%**(12)

**92.4%**(12)

modularity (spectral)

**58.8%**(4)

**52.2%**(2)

optimizer is stuck in a local maximum that is not globally optimal.

**Real world community graphs**

We next applied the optimizer to several small real-world networks. We onlylooked at networks for which some kind of ground truth clustering is know.

For other networks, often only a modularity score is reported in the literature.

But since we use several different quality functions, this makes no sense in ourcontext. The networks we considered are: • Zachary's karate club (Zachary, • Football: A network of American college football games (Girvan and New- • Political books: A network of books about US politics (Krebs, The clusters are left-wing, right-wing and neutral books.

Chapter 4. Comparing quality functions for network clustering infomapncut (spectral)mod. (spectral) Noise variance σ2 Figure 4.6: Accuracy of the clustering methods on the two moons dataset, as afunction of the variance of the noise.

• Political blogs: Hyperlinks between weblogs on US politics (Adamic and In each case, we force the number of clusters found by the methods to be the same as the number of clusters in the dataset. The first part of Table givesthe results of these experiments. In most experiments the spectral methods givethe best results. We believe that this is due to the small number of clusters.

The difference is especially large for the Political Blogs dataset, which is thelargest dataset in this experiment. Since the spectral methods start with a singlelarge cluster, the final solution with just two or three clusters is a relatively closeto this starting point. In contrast, the Louvain method starts from singletonclusters, which are gradually merged.

The second part of the table gives results when the number of clusters is not fixed. In these experiments the results are more varied. Observe that for thefootball dataset the modularity and parabola quality functions no longer findthe same clustering as the other quality functions, instead giving fewer clusters.

This is due to the biases of these quality functions.

**Artificial nearest neighbor data**

We now consider the applicability of the Louvain method to clustering nearestneighbor graphs. We follow the setup from Bühler and Hein First we ran experiments on the two moons dataset. This dataset consists of points on two half-circles, that are offset from each other, embedded in a d dimensional space, and have added Gaussian noise.

For each point xi in the dataset we add edges to its k-nearest neighbors with exp(−4kxi − xjk2/kxi − xki where xk is the k-nearest neighbor of x i. To make the graph symmetric, we take the maximum weight over the two edge directions, aij = max(a0 , a0 ).

In our experiments we used n = 2000 points of dimension d = 100, and k = 10 neighbors of each point. The optimizer is restricted to finding 2 clusters.

We evaluate the performance by looking at the leave-one-out accuracy. That is,the fraction of points that have the same label as the majority of the other nodesin the same cluster. Figure shows the accuracy as a function of the varianceof the noise.

The results are in some ways opposite to those on the LFR benchmark. For these K nearest neighbor graphs, the modularity and parabola quality functionsoutperform w-log-v and infomap. We conjecture that this has to do with the res-olution bias of the methods. In the LFR benchmarks we search for more clusters,around 40, compared to only 2 in this experiment. Thus, the quality functionsthat are biased towards larger clusters will perform better here. However, at themoment we have no proof or additional evidence to support this conjecture.

**Real world nearest neighbor datasets**

We used the same construction of a nearest neighbor graph outlined in the pre-vious paragraph also on real-world and UCI datasets. In each case, we forcethe number of clusters to be the same as the number of classes in the dataset.

Table contains the results of these experiments. Since this is a classificationtask, we have also measured the performance with leave one out accuracy in-stead of normalized mutual information. The LOO accuracy is the fraction ofpoints that would be correctly classified if the most common label among allother points in the same cluster is used as that cluster's label. Table containsthe LOO accuracy results.

On the iris dataset all methods except spectral modularity optimization achieve the same high accuracy. This can be explained by the small size of the datasetand the relatively easy classification task. The iris dataset was previously usedin a comparison of different multi-resolution methods (Granell, Gómez, andArenas, the accuracy reported in that paper is the same 96% that wefound. On the MNIST and USPS datasets, the Louvain method significantlyoutperforms spectral clustering. These datasets have many classes, many fea-tures and are not completely balanced. On the other hand, on the ringnormdataset spectral normalized cut optimization perform much better than other Chapter 4. Comparing quality functions for network clustering Table 4.3: The normalized mutual information of various methods on real worlddatasets. The best results are indicated in boldface. In the first part of the tablethe number of clusters is forced equal to the true number, which is shown inthe ‘clusters' column. In the second part the number of clusters is left free, inparenthesis is the number of clusters that are found for each method.

modularity (sp.) 22.7% modularity (sp.) (382) methods.Overall, as on the two-moons dataset, the parabola quality functiongives the best results.

The second part of the Table shows that, when the number of clusters is not fixed, the w-log-v quality function has the highest or close to the highest ac-curacy in all cases. But this is merely because the w-log-v quality function has anoptimum with the most clusters, and the accuracy is nearly always higher withsuch a more fine grained clustering. On the other hand, the normalized mutualinformation is higher when the number of clusters is closer to the true numberof clusters. In this regard, the modularity and parabola quality functions givethe best results.

The results of our investigation show that the choice of quality function mattersfor graph clustering with the Louvain method. The quality function has two Table 4.4: The classification accuracy of various methods on real world datasets.

The best results are indicated in boldface. In the first part of the table the numberof clusters is forced equal to the true number, which is shown in the ‘clusters'column. In the second part the number of clusters is left free, in parenthesis isthe number of clusters that are found for each method.

modularity (sp.) 31.0% modularity (sp.) 79.1% important main effects.

First of all we have shown that different graph clustering quality functions have different resolution biases. These form the largest difference between thedifferent quality functions. Our experiments show that on benchmark networkgraphs with built-in community structure, when controlling the number of clus-ters, the clusterings found with different quality functions are very similar.

However, when the number of clusters is not fixed, the resolution bias has alarge influence on the performance of the method.

Secondly, some quality functions are easier to optimize with the Louvain method than others. For example, in the experiments on the LFR benchmarks,the clustering found with the parabola quality function is often not optimal forthat quality function. In addition, optimizing other quality functions such asnormalized cut turns out to be very hard. For that quality function spectralmethods are more suitable.

For nearest neighbor graphs, the Louvain method often significantly outper- forms spectral clustering, while never performing significantly worse. Whenthe number of clusters is fixed to a small number, the modularity and parabolaquality functions give the best results. When the number of clusters is not fixed,the w-log-v quality function finds the most clusters, and has the best accuracy.

Chapter 4. Comparing quality functions for network clustering But the modularity and parabola quality functions stay close to the true numberof clusters, and they give the best NMI.

The way we adjust the number of clusters, by embedding the graph in a larger one, works best when we want to decrease the number of clusters. Forsome quality functions, in particular normalized cut, we instead wish to increasethe number of clusters. Other adjustments to the quality function might workbetter in that case, for example adding a term to directly reward the number ofclusters. Such an adjustment will lead to another family of quality functions,perhaps with different resolution bias characteristics.

The parabola and modularity quality functions have the same resolution bias, pnr/q. However, the behavior of the two quality functions on the LFR bench-marks differ significantly. This is in part due to the inability of the Louvainmethod to find the optimal clustering for the parabola quality function, but itseems that the quality functions also differ in other ways. An avenue for fu-ture work is to improve the resolution bias model to show how these qualityfunctions differ.

In this chapter we have only considered undirected graphs. Each of the quality functions can be adapted to directed graphs by using a variation of thevolume that is based on the indegree or outdegree of nodes, or on a combinationof the two. In (Rosvall and Bergstrom, the infomap quality function isdefined based on the outdegree and on edges leaving a cluster. It is not clearwhat the advantages are of directly using undirected graphs for clustering, orhow the clustering of a graph should differ from the clustering of its transpose.

**Finding Protein Complexes, an**

**application of network clustering**

Unraveling the community structure of real-world networks is an important and chal-lenging problem. Recently, it has been shown that methods based on optimizing a clus-tering measure, in particular modularity, have a resolution bias, e.g. communities withsizes below some threshold remain unresolved. This problem has been tackled by incorpo-rating a parameter in the method which influences the size of the communities. Methodsincorporating this type of parameter are also called multi-resolution methods. In thischapter we consider fast greedy local search optimization of a clustering quality functionwith two different quality functions incorporating a resolution parameter: modularityand a function we introduced in a recent work, called w-log-v. We analyze experimen-tally the performance of the resulting algorithms when applied to protein-protein inter-action (PPI) networks. Specifically, publicly available yeast protein networks from paststudies, as well as the present BioGRID database, are considered. Furthermore, to testrobustness of the methods, various types of randomly perturbed networks obtained fromthe BioGRID data are also considered. Results of extensive experiments show improvedor competitive performance over MCL, a state-of-the-art algorithm for complex detec-tion in PPI networks, in particular on BioGRID data, where w-log-v obtains excellentaccuracy and robustness performance.

This chapter is based on van Laarhoven and Marchiori "Robust community detection methods with resolution parameter for complex detection in protein protein interaction networks",which received the best paper award at the 7th IAPR international conference on Pattern Recognitionin Bioinformatics.

Chapter 5. Finding Protein Complexes, an application of network clustering The development of advanced high-throughput technologies and mass spec-trometry has boosted the generation of experimental data on protein-protein in-teraction and shifted the study of protein interaction to a global, network level.

In particular, it has been shown that groups of proteins interacting more witheach other than with other proteins, often participate in similar biological pro-cesses and often form protein complexes performing specific tasks in the cell.

Detecting protein complexes, consisting of proteins sharing a common function,is important, for instance for predicting a biological function of uncharacter-ized proteins. To this aim protein-protein interaction (PPI) networks have beenused as a convenient graph-based representation for the comparative analysisand detection of (putative) protein complexes (X. Li et al., A PPI networkis a graph where nodes are proteins and edges represent interactions betweenproteins.

Detecting protein complexes in a PPI network can be formalized as a graph- clustering problem. Clustering amounts to divide data objects into groups (clus-ters) in such a way that objects in the same cluster are more similar to eachother than to objects in the other clusters. Since clustering is an ill-posed andcomputationally intractable problem, many methods have been introduced, inparticular for graph-clustering (see e.g. the recent review by Fortunato, Effective methods for graph-clustering contain a parameter whose tuning affectsthe community structure at multiple resolution scales. These methods are alsocalled multi-resolution methods (see e.g. Lambiotte, The resolution pa-rameter(s) can be used in two main ways: as a parameter to be tuned; or asa way to generate clusterings at multiple resolution scales, which can then beused to analyze the clustering behavior of objects across multiple resolutions(Lewis et al., or to ensemble the results to produce a consensus clustering(Ronhovde and Nussinov, In Chapter the resolution bias of state-of-the-art community detection methods has been analyzed, and a simple yet effective quality function wasintroduced. Results indicated that the Louvain method, based on greedy localsearch optimization, is robust to the choice of the clustering quality function,when a multi-resolution parameter is added to the quality function.

The goal of this chapter is to investigate experimentally the performance of such multi-resolution methods when applied to PPI networks, with respect todata generated from different laboratory technologies as well as with respect torandom removal or shuffling of edges in the network. This latter investigationis motivated by the fact that PPI data are still not fully reliable, with the poten-tial inclusion of both false positive and false negative interactions (see e.g. the 5.1. Introduction discussion in X. Li et al., Specifically, we consider the Louvain methodfor optimizing two different quality functions incorporating a resolution param-eter: modularity (Girvan and Newman, and a the w-log-v introduced inChapter To analyze their performance we consider the yeast Saccharomyces cere- visiae, which is a well studied model organism for higher eukaryotes withseveral protein interaction data generated from diverse laboratory technologies.

Specifically, we consider six PPI networks from past studies and the present Bio-GRID curated database of protein interactions (Stark et al., In order toassess robustness with respect to random perturbations of the graph, we gener-ate a large collection of networks using the BioGRID data, by either removing orby adding a percentage of randomly selected edges, or by randomly shufflingedges while keeping the original degree of each node.

Results of the experiments indicate improved performance of modularity and w-log-v over MCL (the Markov Cluster Algorithm) (Van Dongen, a state-of-the art method for community detection in PPI networks based on stochasticflow in graphs. MCL was found to achieve best overall performance in yeastPPI networks (Brohee and van Helden, and competitive performance withmethods for overlapping community detection in PPI networks (Nepusz, Yu,and Paccanaro, In particular best performance is achieved by w-log-v on the BioGRID data, and excellent robustness on randomly perturbed versions of this network. SincePPI networks are known to be noisy with respect to the presence of both falsepositive and false negative interactions, the high robustness shown by the pro-posed algorithm substantiates its effectiveness on this type of data.

**Related work**

A vast literature on protein complex detection with PPI networks exists (seee.g. the review X. Li et al., Previous related works on multi resolutionalgorithms for clustering PPI networks either apply an algorithm multiple timeswith different values of the resolution parameter in order to investigate howproteins cluster at different resolution scales, e.g. Lewis et al. or tune theresolution parameter in order to choose a best setting for the considered type ofnetworks, e.g. Brohee and van Helden and Nepusz, Yu, and PaccanaroHere we aim at investigating thoroughly effectiveness and robustness oftwo such algorithms by means of an extensive experimental analysis.

In Brohee and van Helden a comparative assessment of clustering algorithms for PPI networks was conducted. In particular, robustness was ana-lyzed, with respect to alterations (addition and/or removal of randomly selected Chapter 5. Finding Protein Complexes, an application of network clustering edges) of a test graph which was constructed using a number of yeast complexesannotated in the MIPS database, by linking each pair of proteins belonging tothe same complex. The considered algorithms with parameters tuned on thetest graph, were then applied to various yeast datasets. Results showed thatMCL with inflation (resolution) parameter value equal to 1.8 was performingbest on the considered datasets. Robustness of MCL when applied to yeast PPInetworks has previously also been analyzed in Pu et al. According totheir results, MCL is rather robust across different networks and with respectto missing or noisy information on protein-protein associations. Here we showthat greedy local search optimization of a clustering quality function (e.g. w-log-v) incorporating a resolution parameter achieves improved robustness (andaccuracy) on the BioGRID data.

In Chapter we showed that the Louvain method works well for finding clustersin networks. This optimization method is independent of the quality functionthat is optimized. Hence we are essentially free to choose the quality functionto best fit the application. In this chapter we will limit ourselves to two of thequality functions that were discussed in the previous chapter. The first is thepopular modularity (Girvan and Newman, Here C denotes a clustering, i.e. a set of clusters. For a particular cluster c ∈ C, its volume is vc = ij, i.e. the sum of the weight of edges incident to nodes in c, which is equivalent to the sum of degrees. Based on the volume wedefine the normalized volume as v = vc/vV , where V is the set of all nodes.

ij is the within cluster volume, and b c = wc/vV is its normalized variant.

The second quality function we consider is w-log-v, which was introduced in Chapter An advantage of this quality function over modularity is that itallows more diverse cluster sizes. Because the sizes of protein complexes candiffer widely, we believe that this is a useful property. The w-log-v qualityfunction is defined as b c log(bc ).

Using either of the above quality functions directly for the task of clustering a PPI network is not advisable. Both quality functions were designed for commu-nity detection; and communities are usually relatively large, much larger than protein complexes. Therefore, optimizing these quality functions will lead toa clustering with a small number of large clusters. This inability to find smallclusters is termed the resolution limit of clustering (Fortunato and Barthélemy, To overcome the resolution limit, we add a parameter to the quality functions modularity (C) = modularity(C) − α w-log-v (C) = w-log-v(C) − α By increasing the parameter α, the clustering is punished for within clusteredges, and hence the optimal clustering will have smaller clusters. Alternatively,by decreasing the parameter α, the clustering is rewarded for within clusteredges, so the optimal clustering will then have larger clusters.

In Section we showed that, because the overall scale of the quality func- tion is irrelevant for optimization, the modification is equivalent to assumingthat the overall volume of the graph is different. For modularity, the adjust-ment corresponds to assuming that the graph has volume (1 − α)v(V), while forw-log-v it corresponds to assuming the volume is e−αv(V). This equivalent in-terpretation provides some intuition for the resolution parameter: when we tuneα to find smaller clusters, the quality function is equivalent to that for findingthe clusters in a smaller graph.

**PPI networks**

We downloaded a set of protein interactions from version of the BioGRIDdatabase (Stark et al., This database contains a collection of protein inter-actions from different sources, and discovered with different methods. In thiswork we only consider interactions found by physical experiments, not thosebased on genetics.

The BioGRID also contains in full several datasets from high throughput experimental studies, including Uetz et al. Ho et al. Gavin, Bosche,et al. Gavin, Aloy, et al. Krogan et al. and Collins et al.

We consider these subnetworks as separate datasets in our experiments.

These datasets are generated with different experimental techniques: the Collins 1This version was released on April 25th, 2012 Chapter 5. Finding Protein Complexes, an application of network clustering Table 5.1: Sizes of the different datasets. The last column lists the number ofMIPS complexes that are (partially) contained in each dataset.

Gavin, Bosche, et al. Gavin, Aloy, et al. BioGRID, all physical (Collins et al., Krogan (Krogan et al., and Gavin (Gavin, Aloy, et al.,datasets include the results of TAP tagging experiments only, while theBioGRID dataset contains a mixture of TAP tagging, Y2H and low-throughputexperimental results (Nepusz, Yu, and Paccanaro, Table lists the sizesof these datasets.

For validation, we compare clusters with the complexes from the MIPS database(Mewes et al., which we take as the gold standard. MIPS specified a hier-archy of complexes and subcomplexes. Since we deal only with non-overlappingclustering, we only include (sub)complexes at the bottom of this hierarchy. Andto avoid degenerate cases, we include only complexes with at least 3 proteins.

We also exclude the complexes in category 550, since these are unconfirmed.

They were found with computational clustering methods, using as input thesame high throughput datasets that we consider.

In addition to the complexes from MIPS, we also use a set of complexes derived from the Gene Ontology annotations of the Saccharomyces GenomeDatabase (Cherry et al., This dataset was created and also used by (Ne-pusz, Yu, and Paccanaro, To compare clusters found by a method to either of these sets of gold stan- dard complexes, we use the overlap score (Brohee and van Helden, ω(A, B) = A B .

We consider a cluster to match a complex if their overlap score is at least 0.25.

This threshold is also used in other works, e.g. (Nepusz, Yu, and Paccanaro, 2We used the latest version at the time of writing, which was released on May 18th, 2006 Chua et al., When the cluster and complex have the same size, amatch then corresponds to the intersection containing at least half of the nodesin the complex and cluster.

Based on this matching we define precision as the fraction of clusters that are matched to any complex. Conversely, we define recall as the fraction ofcomplexes that are matched to any cluster. Note that we use the terminologyfrom other works such as Chua et al. These notions differ from the morestandard definitions of precision and recall, because a cluster can match morethan one complex and vice versa.

It is clearly possible to achieve a high precision or a high recall with a de- generate clustering. For example, by returning just a single easy to find clusterthat matches a complex, the precision will be 1 at the cost of a low recall. Andby returning all possible (overlapping) clusters, the recall will be 1 at the cost ofa low precision. We therefore use the F1 score, which is the harmonic mean ofprecision and recall, as a trade-off between the two scores.

For each of the methods, we include only clusters that contain at least 3 proteins. As a result, not all proteins will be in a cluster. We call the fraction ofproteins that are in a cluster the coverage of a clustering.

The precision and recall as defined above depend heavily on the chosen threshold; and when few complexes are matched, the scores are very sensi-tive to noise. Therefore, we also look at the positive predictive value (PPV) andcluster-wise sensitivity scores (Brohee and van Helden, which are baseddirectly on the size of the intersection between complexes and clusters, maxB∈C∗ A ∩ B B∈C∗ maxA∈C A ∩ B where C is the set of predicted clusters and C∗ is the set of gold standard com-plexes. Note that the asymmetry between the denominators is to account for thecase of overlapping clusters.

**Precision vs. recall**

We took the BioGRID all physical dataset, and computed the precision and recallfor a wide range of settings of the resolution control parameter α. These resultsare shown in Figure (left). For comparison we also include results with theMCL algorithm for different settings of the inflation parameter. For readabilitywe have applied smoothing in the form of merging points that are very closetogether.

The first thing that we observe is that despite smoothing, the figure is very noisy in some places. This is not very surprising considering how precision and Chapter 5. Finding Protein Complexes, an application of network clustering (a) Precision vs. Recall (b) Sensitivity vs. PPV Precision vs. Recall (left) and Sensitivity vs. PPV (right) on the BioGRID dataset.

recall are calculated. Consider a small change in the clustering, such as removinga protein from a cluster. This change might cause the cluster to no longer matcha particular complex. If there are no other clusters that matched that complex,then the recall goes down, otherwise it stays the same. Similarly, if this areno other complexes matching the cluster, then the precision goes down. Whileobviously the change in the two scores is related, the relation is not monotonic,one can change while the other does not.

As the resolution control parameter α goes up, the methods find more clus- ters; and as a result the recall goes up while the precision goes down. However,after a certain point many of the clusters will become too small, and they will beremoved before matching. This decreased coverage causes the recall to go downagain.

To get a less noisy picture, we have also plotted the PPV and sensitivity scores, in Figure (right). The overall trend in this plot is the same as for theprecision and recall: the w-log-v method slightly dominates modularity opti-mization, which in turn has significantly better results than MCL.

The best parameter settings according to the F1 score are α = 2.8 for w-log-v, α = 0.97 for modularity, and inflation 2.7 for MCL. We will use these settingsfor the remainder of the experiments. As discussed in Section the parameterα corresponds to assuming a different volume of the graph. The optimal set-ting for w-log-v corresponds to considering a graph with 16 times fewer edges,while the optimal setting for modularity corresponds to 33 times fewer edges.

The difference between the two quality functions comes from their inherent res- olution bias, by default w-log-v has a bias towards smaller clusters compared tomodularity, and therefore the quality function needs less adjustment.

**Networks from Single Studies**

We next compare the scores on the subnetworks from single studies. The resultsof this experiment are shown in Table The "MIPS method" is based on thegold standard complexes, but including only proteins that occur in the datasetunder investigation. It represents the best possible scores.

On most datasets w-log-v has the best recall and F1 score, except on the datasets from Gavin et al. Gavin, Bosche, et al. and Gavin, Aloy, et al.

where MCL performs significantly better. The precision of modularityoptimization is often slightly better than that for w-log-v optimization. This isdue to the fact that with the settings chosen in the previous paragraph, we findmore clusters with w-log-v optimization. Hence, in general recall will be higherat the cost of lower precision.

**Randomly perturbed graphs**

To further test the robustness of the methods, we applied them to randomlyperturbed networks. We performed three different experiments, all starting fromthe BioGRID network.

1. Removing a randomly chosen subset of the interactions.

2. Randomly adding new spurious interactions between pairs of proteins.

3. Randomly rewire a subset of the edges, while maintaining the degree of each node. Note that such a move both removes an observed interactionand adds a new spurious one.

We varied the amount of edges affected by each type of perturbation. Each experiment was repeated 10 times with different seeds for the random num-ber generator, we calculated the mean and standard deviation of the F1 scoreacross these repetitions. The results are shown in Figure When edges areremoved, the performance of all methods degrades similarly. On the other hand,the Louvain method results are much more robust to the addition of extra edgesthan MCL. Also note that the standard deviation is much larger with the MCLmethod. That means that for some rewired graphs the method gives reasonablygood results, while for others the result is very bad. Unsurprisingly, the experi-ment with rewired edges sits somewhere in between the two other experiments.

Chapter 5. Finding Protein Complexes, an application of network clustering Table 5.2: Results of applying the different methods to subnetworks for singlestudies. The best result for each dataset is highlighted in bold.

Uetz et al. (2000) Gavin et al. (2002) Gavin et al. (2006) Krogan et al. (2006) Collins et al. (2007) BioGRID, all physical Removed edges (%) Removed edges (%) Rewired edges (%) Rewired edges (%) F1 score when a fraction of the edges is added (top), removed (middle) or rewired (bottom) at random in the BioGRID dataset. Error barsindicate standard deviation, measured over 10 runs. The left plots use MIPScomplexes as the gold standard, the right plots use SGD complexes.

Chapter 5. Finding Protein Complexes, an application of network clustering Because of the incompleteness of both the PPI data and of knowledge on truecomplexes, care must be taken in the interpretation of the results. The "MIPSmethod", that is, the best possible method based on the MIPS complexes, coversonly a small part of the proteins present in each of the datasets. Conversely, notall MIPS complexes are covered by the datasets, so the recall is always smallerthan the precision. In general, results show that for each dataset, the majority ofclusters induced by the intersection of that dataset with the complexes in MIPS,match a complex; with a percentage varying between 85% and 98%. Thesevalues provide upper bounds on the maximum precision and recall achievableon the considered dataset.

On all datasets the algorithms obtain precision smaller than recall: the differ- ence of these values provides information on the fraction of clusters matchingmore than one complex. For instance on the Uetz dataset, there is almost aone-to-one correspondence between clusters and matched complexes (e.g. 4.6%precision and 6.2% recall for w-log-v), while on the BioGRID this relation isclearly one-to-many (e.g. 5.9% precision and 22.6% recall for w-log-v).

There are complexes that are matched by only one method: specifically, 6 complexes are matched only by w-log-v, 5 only by modularity, and 4 only byMCL. Comparing w-log-v and MCL, there are 18 complexes found only by w-log-v and 4 found only by MCL. This is not too surprising, since MCL has arather low recall. An example of a complex detected by w-log-v and not byMCL is the Signal recognition particle (SRP) complex, consisting of six proteins,one of the complexes involved in Transcription and/or in the Nucleus.

The improved performance of w-log-v on the BioGRID data appears mainly due to its capability to generate a large number of clusters matching multiplecomplexes (high recall). Nevertheless, Figure shows that the precision vs.

recall curve of w-log-v dominates the curve of the other two methods.

Robustness of a community detection method is and important issue also in the context of PPI networks, since they are known to contain a high amountof false negative and false positive interactions. Indeed, limitations of experi-mental techniques as well as the dynamic nature of protein interaction are re-sponsible for the high rate of false-positives and false-negatives generated byhigh-throughput methods. For instance, Y2H screens have false negative ratesin the range from 43% to 71% and TAP has false negative rates of 15%-50%, andfalse positive rates for Y2H could be as high as 64% and for TAP experimentsthey could be as high as 77% (Edwards et al., Results show that w-log-vachieves best robustness the under random addition, removal and rewiring of apercentage of edges in the BioGRID network. Such high robustness substantiates the effectiveness of w-log-v on this type of data.

This chapter analyzed the performance of two fast algorithms on PPI networksthat optimize in a greedy way a clustering quality function with resolution pa-rameter. An extensive experimental analysis was conducted on PPI data fromprevious studies as well as on the present BioGRID database. Results indi-cated improved performance of the considered algorithms over a state-of-the-artmethod for complex detection in PPI networks, in particular with respect torobustness. These results indicate that the considered algorithms provide anefficient, robust and effective approach for protein complex discovery with PPInetworks. Interesting issues for future work include the assessment of the al-gorithms' robustness with respect to tailored models of false positive and falsenegative interactions which are present in data generated by specific technolo-gies, as well as the extension of the considered methods to detect overlappingclusters of high quality.

**Axioms for soft clustering and**

**Non-negative Matrix Factorization**

Many graph clustering quality functions suffer from a resolution limit, the inability tofind small clusters in large graphs. So called resolution-limit-free quality functions donot have this limit. This property was previously introduced for hard clustering, that is,graph partitioning.

We investigate the resolution-limit-free property in the context of Non-negative Ma- trix Factorization (NMF) for hard and soft graph clustering. To use NMF in the hardclustering setting, a common approach is to assign each node to its highest membershipcluster. We show that in this case symmetric NMF is not resolution-limit-free, but thatit becomes so when hardness constraints are used as part of the optimization. In softclustering, nodes can belong to more than one cluster, with varying degrees of member-ship. In this setting resolution-limit-free turns out to be too strong a property. Thereforewe introduce locality, which roughly states that changing one part of the graph doesnot affect the clustering of other parts of the graph. We argue that this is a desirableproperty, provide conditions under which NMF quality functions are local, and proposea novel class of local probabilistic NMF quality functions for soft graph clustering.

Graph clustering, also known as network community detection, is an importantproblem with real-life applications in diverse disciplines such as life and socialsciences (Schaeffer, Fortunato, Graph clustering is often performedby optimizing a quality function, which is a function that assigns a score to a This chapter is based on"Local quality functions for graph clustering with Non-negative Matrix Factorization", accepted for publication in Physical Review E.

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization clustering. During the last decades, many such functions (and algorithms tooptimize them) have been proposed. However, relatively little effort has beendevoted to the theoretical foundation of graph clustering quality functions, e.g.

Ackerman and Ben-David, In this chapter we try to provide a contribu-tion in this direction by studying desirable locality properties of NMF qualityfunctions for hard and soft graph clustering.

We focus on the resolution-limit-free property, a property of hard graph clus- tering, recently introduced by Traag, Van Dooren, and Nesterov Infor-mally this property states that a subset of an optimal clustering in the originalgraph should also be an optimal clustering in the induced subgraph containingonly the nodes in the subset of clusters. As the name suggests, resolution-limit-free quality functions do not suffer from the so-called resolution limit, thatis, the inability to find small clusters in large graphs. In the seminal work byFortunato and Barthélemy it was shown that modularity (Newman andGirvan, a popular quality function used for network community detec-tion, has a resolution limit, in the sense that it may not detect clusters smallerthan a scale which depends on the total size of the network and on the degreeof interconnectedness of the clusters.

Our goal is to investigate resolution-limit-freeness and other locality proper- ties of Non-negative Matrix Factorization (NMF) graph clustering quality func-tions. NMF (Paatero and Tapper, Lee and Seung, is a popular ma-chine learning method initially used to learn the parts of objects, like humanfaces and text documents. It finds two non-negative matrices whose productprovides a good approximation to the input matrix. The non-negative con-straints lead to a parts-based representation because they allow only additive,not subtractive, combinations. Recently NMF formulations have been proposedas quality functions for graph clustering, see for instance the surveys Wang et al.

and T. Li and C. H. Q. Ding We consider symmetric and asymmetric NMF formulations based on Eu- clidean loss and a Bayesian NMF quality function recently proposed by Psorakiset al. which can automatically determine the number of clusters.

The resolution-limit-free property is stated in the setting of hard clustering, where a clustering is a partition of the nodes. In contrast, NMF produces a softclustering. Nodes have varying degrees of memberships of each clusters, andthe clusters can overlap. To use NMF in the hard clustering setting, a commonapproach is to assign each node to its highest membership cluster.

In Section we show that hard clustering based on NMF in this way is, in general, not resolution-limit-free. For symmetric NMF we show that resolution-limit-freeness can be obtained by using orthogonality constraints as part of theoptimization, and that the resulting function is strongly linked to the Constant 6.1. Introduction Potts Model (CPM). CPM was introduced by Traag, Van Dooren, and Nesterovas the simplest formulation of a (non-trivial) resolution-limit-free method. It isa variant of the Potts model by Reichardt and Bornholdt We argue in Section that in the soft clustering setting, resolution-limit- freeness is a too strong property and propose an alternative desirable localityproperty for soft graph clustering. We characterize an interesting class of localquality functions and show that symmetric and asymmetric NMF belong to thisclass. We show that Bayesian NMF is not local in general and that it suffers froma resolution limit. In Section we introduce a novel class of probabilistic NMFquality functions that are local, and hence do not suffer from a resolution limit.

**Related work**

The notion of resolution limit was introduced in Fortunato and Barthélemywhich detected a limitation of modularity, considered a state-of-the-artmethod for community detection. In Chapter we showed empirically thatthe resolution limit is the most important difference between quality functionsin graph clustering optimized using a fast local search algorithm, the Louvainmethod (Blondel et al., Traag, Van Dooren, and Nesterov intro-duced the notion of resolution-limit-free objective functions, which provides themotivation of this study.

Other local properties of quality functions for clustering have been consid- ered in theoretical studies but mainly in the hard setting, for distance basedclustering in Ackerman, Ben-David, and Loker and for graph cluster-ing in Chapter of this thesis. Locality as defined in Ackerman, Ben-David,and Loker is a property of clustering functions, therein defined as func-tions mapping a data set and a positive integer k to a partition of the data intok clusters. This notion of locality was used together with other properties tocharacterize linkage based clustering. The weak locality property consideredin Chapter is part of an axiomatic study of quality functions for hard graphclustering. It states that local changes to a graph should have only local conse-quences to a clustering. It is slightly weaker than the locality property consid-ered in this study, which corresponds more closely to the property there calledstrong locality.

**Definitions and Notation**

In this chapter we reuse the definition of graphs defined in Section But thenotion of clustering will be different.

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization As before, a (weighted) graph is a pair (V, A) of a finite set V of nodes and a function A : V × V → R>0 of edge weights; and we use the abbreviation aij =A(i, j). Edges with larger weights represent stronger connections, so aij = 0means that there is no edge between nodes i and j.

all i, j ∈ V 0.

Different clustering methods use different notions of a ‘cluster' and of a ‘clus- tering'. For instance, in symmetric NMF a clustering is a matrix of membershipcoefficients; while in non-symmetric NMF there are two such matrices. Somemethods also have additional parameters for each cluster. In this chapter weallow different types of ‘cluster' for different methods, but we use a commondefinition of ‘clustering'.

Formally, each of these types of clusters can be specified by an injective function C from sets of nodes to sets of things which we call clusters. Fora set of nodes s, for every cluster c ∈ C(s) we call s the support of c, writ-ten as supp(c) = s. The set of all clusters with support on a subset of V isC∗(V) = S C(s). In this paper we consider four types of clusters, which will be introduced in the next section.

A clustering of V is a multiset of clusters with support on a subset of V.

Note that we use multisets instead of sets, to allow a clustering to contain twoidentical copies of the same cluster. For brevity, we also say that C is a clusteringof a graph G if C is a clustering of the nodes of G. If, in a slight abuse of notation,we define the support of a clustering as the union of the support of all clustersin that clustering, then the clusterings of V are those multisets of clusters forwhich the support is a subset of V.

Note that this general definition implies that for certain clusterings the clus- ters can overlap, and some nodes can be in no cluster at all. We believe thatthis is a reasonable definition, because if we allow nodes to be in more than onecluster, there is little reason to not also allow them to be in less than one cluster.

Additionally, if C and D are clusterings of G, then their multiset sum C ] D is also a clustering of as is any subclustering (submultiset) of C. And if G isa subgraph of G0, then C and D are also clusterings of G0.

The symmetric difference of two clusterings is denoted C 4 D, and is defined as the symmetric difference of multisets, that is C 4 D = (C D) ∪ (D C).

As in the rest of this thesis, graph clustering is cast as an optimization prob- lem. The objective that is being optimized is the clustering quality function, whichis a function from graphs G and clusterings of G to real numbers. In this thesiswe take the convention that the quality is maximized.

1C ] D denotes multiset sum, the multiplicity of c in C ] D is the sum of multiplicities of c in C and of c in D.

6.2. Non-negative Matrix Factorization Given a clustering quality function Q, and a clustering C of some graph G.

We say that C is Q-optimal if Q(G, C) > Q(G, C0) for all clusterings C0 of G.

**Non-negative Matrix Factorization**

At its core, Non-negative Matrix Factorization (Paatero and Tapper, Leeand Seung, decomposes a matrix A as a product A ≈ WHT , where allentries in W and H are non-negative. For graph clustering the matrix A is theadjacency matrix of a graph. For undirected graphs the adjacency matrix issymmetric, in which case it makes sense to decompose it as A ≈ HHT . Note thatsuch a symmetric factorization has to be enforced explicitly, since the optimalnon-symmetric factorization of a symmetric matrix does not necessarily haveW = H (Catral et al., The columns of W and H can be interpreted as clusters. To fit with the definitions from the previous paragraph we need to take a slightly differentview. In the case of symmetric NMF, a cluster with support s is a functionthat assigns a positive real number to each node in s, so C SymNMF(s) = R>0.

Equivalently, for a fixed set of nodes, we can represent a cluster as a vector ofnon-negative numbers with an entry for each node in V, such that the entriesfor the nodes not in s are zero, that is, C∗ R>0. For a cluster c we denote this vector as hc, and a multiset of such vectors can be seen as amatrix H. The support of c then coincides with the standard notion of supportof the vector hc, that is, the set s of nodes for which the entry is non-zero. Thisrepresentation of clusters in terms of a non-negative vector hc is more standardand more convenient than the one in terms of a function from s to positive realnumbers, and we use it in the rest of the paper.

For non-symmetric NMF, a cluster is a tuple c = (wc, hc) of two such vectors.

R>0, with supp((wc, hc)) = supp(wc) ∪ supp(hc).

For Bayesian NMF (Psorakis et al., each cluster also contains a βc parame-ter. That is, C∗ A common notion to all NMF methods is that they predict a value for each edge. For symmetric NMF with per cluster membership vector hc this predic- tion can be written as ˆ For asymmetric NMF with cluster memberships wc and hc we can write ˆ The optimization problem then tries to ensure that ˆ aij ≈ aij. Different meth- ods can have different interpretations of the ‘≈' symbol, and they impose dif-ferent regularizations and possibly additional constraints. Perhaps the simplestNMF quality function for undirected graphs uses Euclidean distance and no Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization QSymNMF(G, C) = −

**Resolution-limit-free functions for hard clustering**

Before we investigate the resolution limits of NMF, we will first look at tradi-tional ‘hard' clustering, where each node belongs to exactly one cluster. In thissetting a cluster is simply a subset of the nodes, and its support is the clusteritself. That is, Chard(s) = s. There is the additional non-overlapping or orthogo-nality constraint on clusters: In a valid hard clustering C of V, each node i ∈ Vis in exactly one cluster ci ∈ C. For symmetric NMF we may formulate theseconstraints as for all c, d ∈ C, c 6= d, and for all i ∈ V.

Traag, Van Dooren, and Nesterov introduced a locality property of clustering quality functions, and called the functions that satisfy this propertyresolution-limit-free. Their definition is as follows.

**Definition 6.1**(Resolution-limit-free)

**.**Let C be a Q-optimal clustering of a graph

G1. Then the quality function Q is called resolution-limit-free if for each subgraph G2

induced by D ⊂ C, the partition D is a Q-optimal clustering of G2.

Thus in the setting of hard clustering, a quality function is resolution-limit- free if any subset of clusters from an optimal clustering is also an optimal clus-tering on the graph that contains only the nodes and edges in those clusters.

NMF has been extended with a post-processing step to yield a hard cluster- ing. This is done by assigning each node to the cluster with the largest member-ship coefficient.

We can now ask if NMF with this post-processing is resolution-limit-free. In Figure we give a counterexample that answers this question negatively forthe NMF based methods of Psorakis et al. and C. Ding, He, and Simon This counterexample consists of two cliques and one almost-clique. Addi- tionally there is a node with unclear membership. When the entire graph isconsidered, its membership of one cluster is slightly higher, when one cliqueand its incident edges are removed, its membership of another cluster is slightly 6.3. Resolution-limit-free functions for hard clustering Figure 6.1: A counterexample that shows that NMF quality functions are notresolution limit free. When considering the entire graph, the first (solid blue)clustering is optimal. When considering only the gray nodes, the second (dashedred) clustering is optimal. The membership of the middle node is very unclear,it belongs to two clusters to almost the same degree. When another part of acluster changes this can tip the balance one way or the other.

higher. This difference is very small. For example, with C. Ding, He, and Si-mon's method in the optimal clustering of the large graph, the disputed nodebelongs to the second and third cluster with membership coefficients 0.2306 and0.2311 respectively; while in the smaller subgraph the membership coefficientsare 0.2284 and 0.2607.

Traag, Van Dooren, and Nesterov showed that the Constant Potts model (CPM) is the simplest formulation of any (non-trivial) resolution-limit-free method. The CPM quality function Qcpm(G, C) can be formulated as (aij − γ)1[ci = cj], where 1[ci = cj] is 1 if nodes i and j belong to the same cluster, and 0 otherwise.

Symmetric NMF and CPM are closely related. This can be shown with a tech- nique similar to that used by C. Ding, He, and Simon to link symmetricNMF and spectral clustering.

**Theorem 6.2.**Symmetric NMF is an instance of CPM with γ = 1/2 and orthogonality

constraints relaxed.

The proof of this theorem can be found in Appendix The CPM is resolution-limit-free. Therefore in order to perform hard clus- tering using symmetric NMF it is preferable to act on the quality function, forinstance by enforcing orthogonality as done in (C. Ding, He, and Simon, C. Ding, T. Li, et al., instead of assigning each node to the cluster with thehighest membership coefficient.

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization Three cliques sharing two nodes each. The obvious clustering consists of three overlapping clusters, with the three central nodes in two clusterseach. The white nodes are not in the support of the solid blue clusters.

**Resolution-limit-free functions for soft clustering**

We could still try to directly adapt Definition to the soft clustering setting,by defining what a graph induced by a subclustering is. The obvious idea is toinclude all nodes in the support of the subclustering. So for a clustering C of G,the graph G0 induced by D ⊆ C would contain only the nodes which are in atleast one cluster in D, that is, V 0 = supp(D), and all edges between these nodesfrom the original graph.

However, in contrast to the hard clustering case, an optimal soft clustering might have clusters in C D that overlap with clusters in D. This makes thenotion of resolution-limit-free too restrictive, since it effectively disallows anyinteresting uses of overlapping clusters.

Consider the graph with three overlapping 5-cliques shown in Figure In an NMF style method such as (C. Ding, He, and Simon, the optimalclustering of this graph will have three overlapping clusters, corresponding tothe three cliques. The subgraph introduced by the support of the solid blueclusters includes just the dark nodes, but neither cluster covers both nodes inci-dent to the dashed edge. Therefore, with these two clusters the prediction ˆ this edge will be 0. But the optimal clustering of this subgraph would have anon-zero prediction for this edge. In other words, the optimal clustering for theinduced subgraph is not the same as the solid blue clustering, even the supportof the clusters is not the same. Hence no NMF method is resolution-limit-free inthis sense.

An alternative approach is to only consider subclusterings with disjoint sup- port in the definition of resolution-limit-free. That is, with supp(D) ∩ supp(C D) = ∅. Unfortunately this variant has the opposite problem: the condition 6.4. Resolution-limit-free functions for soft clustering almost never holds. So, many quality functions would trivially satisfy this vari-ant of resolution-limit-freeness. For example, the optimal clusterings in NMFmethods based on a Poisson likelihood will always have overlapping clusterscovering every edge, so the disjointness condition only holds when the graphhas multiple connected components.

Clearly we need a compromise.

The resolution-limit-free property looks at the behavior of a clustering qualityfunction on graphs of different sizes. Intuitively a quality function suffers froma resolution limit if optimal clusterings at a small scale depend on the size of theentire graph.

As shown in the previous paragraph we can not just zoom in to the scale of any subclustering D by discarding the rest of the graph.

But if we let go of only considering the optimal clustering, it does become possible to zoom in only partially, leaving the part of the graph covered byclusters that overlap clusters in D intact. If D is an optimal clustering of theoriginal graph, then it should be a ‘locally optimal' clustering of the smallergraph in some sense.

We take this to mean that if a clustering D is better than some other clustering D0 on the original graph, then the same holds on the smaller graph, as long asD and D0 induce the same zoomed in graph.

It then makes sense to not only consider zooming in by discarding the rest of the graph, but also consider arbitrary changes to the rest of the graph, as wellas arbitrary changes to clusters not overlapping with D or D0.

More precisely, if one subclustering D is better than another subclustering D0 on a subgraph GS of some graph G1, and one changes the graph to G2 in sucha way that the changes to the graph and to the clustering are disjoint from thissubgraph GS, then D will stay a better clustering than D0. This idea is illustratedin Figure To formalize this idea we introduce the notion of agreement. We say that two clusterings C1 of G1 and C2 of G2 agree on a common subgraph GS = (VS, AS) ofG1 and G2 if supp(C1 4 C2) ∩ VS = ∅. Note that this subgraph can be the small-est subgraph containing supp(D) and supp(D0). This leads to the followingdefinition.

**Definition 6.3**(Locality)

**.**A clustering quality function Q is local if for all graphs

G1, G2, and common subgraphs GS of G1 and G2, for all clusterings C1 of G1 and C2 of

G2 that agree on GS, and clusterings D, D0 of GS, it is the case that Q(G1, C1 ] D) >

Q(G1, C1 ] D0) if and only if Q(G2, C2 ] D) > Q(G2, C2 ] D0).

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization Figure 6.3: An example illustrating locality. Between the left and right side, thedashed part of the clustering and the dashed part of the graph changes. The topand bottom clusterings differ only on the constant part (red/blue), and thesedifferences don't overlap with changing clusters (dashed). Therefore if the topclustering has a higher quality than the bottom clustering on the left graph, thenthe same must hold on the right graph. Formally, the dark gray nodes are inthe common subgraph GS, the light gray nodes are in supp(C1 ∩ C2). The blueclustering is D, the red clustering D0, the solid black clusters are in both C1and C2 and the dashed clusters are in only one of C1 and C2. Since the dashedclusters don't cover the dark gray nodes, the black clusterings agree on the darkgray subgraph.

Locality as defined in Ackerman, Ben-David, and Loker differs from our definition because it is a property of clustering functions. The weak local-ity property considered in Chapter differs from the definition in this chapterbecause it also enforces that the graphs agree ‘on the neighborhood' of the com-mon subgraph. Here we instead require agreement between overlapping clus-ters. There we also briefly discussed and dismissed a ‘strong locality' property,which is the hard clustering equivalent of Definition Even in the case of hard clustering locality and resolution-limit-free are not equivalent. For hard clustering, locality implies resolution-limit-freeness, butthe converse is not true.

**Theorem 6.4.**If a hard clustering quality function is local, then it is resolution-limit-

free.

**Theorem 6.5.**If a hard clustering quality function is resolution-limit-free then it is not

necessarily local.

The proofs of these theorems are in Appendix and 6.4. Resolution-limit-free functions for soft clustering

**Characterizing local quality functions**

Many quality functions can be written as a sum with a term for each edge,characterizing a goodness of fit, a term for each node, controlling the amount ofoverlap, and a term for each cluster, indicating some kind of complexity penalty.

There might also be a constant term not actually depending on the clustering,and so not affecting the optimum. We call such quality functions additive.

**Definition 6.6.**A qualify function is additive if it can be written as

Q(G, C) = qgraph(G) + qnode {c ∈ C i ∈ supp(c)} qedge aij, {c ∈ C i, j ∈ supp(c)} for some functions qgraph, qclus, qnode, qedge.

Note that qnode can depend on all clusters that contain node i, and qedge can depend on all clusters that contain the edge ij.

**Theorem 6.7.**If a quality function is additive, then it is local.

The proof of this theorem can be found in Appendix The converse of Theorem does not hold; not all local quality functions are additive. Forexample, any monotonic function of a local quality function is also local.

Another example are quality functions that use higher order interactions, that is, it includes terms not only for nodes and edges, but also for triangles andlarger structures. For instance, the clique percolation method (Palla et al., finds clusters which are cliques. That method is local, but it is not additive. Wecould imagine including higher-order terms in the definition of additivity, qtriangle(aij, aik, ajk, {c ∈ C i, j, k ∈ supp(c)}), and so on. But for most purposes the edge term is sufficient; and the localquality functions that we consider in this chapter are all additive in the sense ofDefinition Additivity provides additional insight into how quality functions behave: the quality is composed of the goodness-of-fit of a the clustering to nodes and edges(and perhaps larger structures), together with a cost term for each cluster. ByTheorem it also gives us a convenient way to prove that a certain quality Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization function is local, while locality can more convenient if we want to reason aboutthe behavior of a quality function.

For symmetric NMF, ˆ aij can be written as a sum over clusters that contain c∈C s.t. i,j∈supp(c) As a consequence, NMF quality functions without regularization, such as QSymNMF,are additive. Therefore these quality functions are local.

Many regularization terms can also be encoded in an additive quality func- tion. For example the L2 term h2 is a sum over clusters and inde- pendent of the graph, and so it fits in qclus.

**Fixed number of clusters**

The question of automatically finding the right number of clusters is still notfully solved. Therefore in most NMF based clustering methods the number ofclusters k is specified by the user.

For most quality functions, if they are optimized directly without taking this restriction into account, then the number of clusters will tend to infinity. So wesomehow need to fix the number of clusters.

The most direct way to incorporate this restriction of a fixed number of clusters is by adding it as a constraint to the quality function. That is, useQ(G, C, k) = Q(G, C) + 1[ C = k]∞. Strictly speaking this is not a function tothe real numbers. But we never need the fact that Q is such a function, all weneed is that the quality of different clusterings can be compared. Unfortunately,encoding a fixed k restriction in the quality function violates locality.

Take two clusterings C and D of a graph G, with a different number of clusters. Let C0, D0 and G0 be copies of C, D and G on a disjoint set of nodes,and let k be C + D . Then the quality Q(G ∪ G0, D ] C0, k) is finite, whileQ(G ∪ G0, D ] D0, k) is infinite. On the other hand, Q(G ∪ G0, C ] C0, k) is infinite,while Q(G ∪ G0, C ] D0, k) is finite. This contradicts locality.

Instead, we need to consider the restriction on the number of clusters as separate from the quality function. In that case the definition of locality can beused unchanged.

Equivalently, if we call a clustering consisting of k clusters a k-clustering, then we can extend the definitions of locality to take the restricted number ofclusters into account. This approach is also used by Ackerman, Ben-David, andLoker 6.4. Resolution-limit-free functions for soft clustering If we call a function Q(G, C, k) for graphs G, clusterings C and number of clusters k a fixed-size quality function, then this leads to the following fixed-sizevariant of locality.

**Definition 6.8**(Fixed size locality)

**.**A fixed-size quality function Q is fixed-size

local if for all graphs G1, G2 and a common subgraph GS, for all k1-clusterings C1 of

G1 and k2-clusterings C2 of G2 that agree on GS, and m-clustering D of GS and m0-

clusterings D0 of GS, it is the case that Q(G1, C1]D, k1+m) > Q(G1, C1]D0, k1+m0)

if and only if Q(G2, C2 ] D, k2 + m) > Q(G2, C2 ] D0, k2 + m0).

Every local quality function that does not depend on k is fixed-size local when combined with a constraint that the number of clusters must be k. Andso NMF with a fixed number of clusters is fixed-size local.

**Varying number of clusters**

Psorakis et al. formulated a Bayesian formulation of NMF for overlap-ping community detection that uses automatic relevance determination (ARD)(V. Y. F. Tan and C. Févotte, to determine the number of clusters. Theirquality functions can be written as QBayNMF(G, C) = − cb − (a − 1) log βc where each cluster is a triple c = (wc, hc, βc) of two vectors and a scalar, and κis a constant. ARD works by fixing the number of clusters to some upper bound.

In the optimal clustering many of these clusters c will be empty, that is, havesupp(c) = ∅.

This quality function is not additive, for two reasons. First of all, there is the term 2 V log βc for each cluster, which stems from the half-normal priors onW and H. This term depends on the number of nodes. Secondly, the κ termactually depends on the number of clusters and the number of nodes, since itcontains the normalizing constants for the hyperprior on β, as well as constantfactors for the half-normal priors. For a fixed graph and fixed number of clustersthe κ term can be ignored, however.

As a result, Psorakis et al.'s method is also not local, as the following coun- terexample shows: Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization

**Theorem 6.9.**QBayNMF is not local.

Proof. Consider a graph G1, consisting of a ring of n = 10 cliques, where eachclique has m = 5 nodes, and two edges connecting it to the adjacent cliques.

We follow Psorakis et al., and use hyperparameters a = 5 and b = 2. This choice is not essential, similar counterexamples exist for other hyperparametervalues. As might be hoped, the QBayNMF-optimal clustering C1 of this graph thenputs each clique in a separate cluster, with a small membership for the directlyconnected nodes in adjacent cliques.

This clustering is certainly better than the clustering C2 with 5 clusters each consisting of two cliques, and 5 empty clusters.

However, on a larger graph with two disjoint copies of G1, the clustering with two copies of C2 is better than the clustering with two copies of C1.

But by locality we would have QBayNMF(G1 ∪ G0 > QBayNMF(G1 ∪ as well as QBayNMF(G1 ∪ G01, C2 ] C01 > QBayNMF(G1 ∪ G01, C2 ] C02 , where the primed variables indicate copies with disjoint nodes. So QBayNMF isnot local.

In the above counterexample things don't change if one uses a ring of 20 cliques instead of two disjoint rings of 10 cliques. This is closer to the originalcharacterization of the resolution limit by Fortunato and Barthélemy Ina ring of 20 cliques, the solution with 10 clusters is better than the solution with20 clusters. But it is harder to show that this violates locality.

**NMF as a probabilistic model**

NMF can be seen as a maximum likelihood fit of a generative probabilisticmodel. The quality function that is optimized is then the log likelihood of themodel conditioned on the observed graph, Q(C, G) = log P(C G).

One assumes that there is some underlying hidden cluster structure, and the edges in the graph depend on this structure. The clustering structure inturn depends on the nodes under consideration. So, by Bayes rule we maydecompose P(C G) as P(C V, A) = P(A C, V)P(C V)P(V)/P(V, A).

The terms P(V) and P(V, A) are constant given the graph, so the quality functionbecomes Q(C, G) = log P(A C, V) + log P(C V) + κ, 6.5. NMF as a probabilistic model where κ = log P(V) − log P(V, A) is a constant. The first term is the likelihoodof the edges given the clustering, the second factor is the prior probability of aclustering for a certain set of nodes.

To make the above general formulation into an NMF model, one assumes that the edge weights are distributed independently, depending on the productof the membership matrices. Then a prior is imposed on the membership coef-ficients. Usually a conjugate prior is used, which for Gaussian likelihood has ahalf-normal distribution, and for Poisson likelihood has a gamma distribution.

So the simplest symmetric Gaussian NMF method would be aij ∼ N( ˆaij, 1) Which leads to the quality function This is a regularized variant of symmetric NMF discussed previously.

Such a model implicitly assumes a fixed number of clusters; and the corre- sponding quality function will not be local if the number of clusters is not fixed.

Intuitively, this happens because the model has to ‘pay' the normalizing con-stant of the prior distribution for each hci, the number of which is proportionalto the number of clusters.

The method of Psorakis et al. also stems from a probabilistic model. They use a Poisson likelihood and a half-normal prior. Note that these are not conjugate.

For finding the maximum likelihood solution conjugacy is not important. Us-ing a conjugate prior becomes important only when doing variational Bayesianinference or Gibbs sampling (Cemgil, To determine the number of clusters, Psorakis et al. put a gamma hyperprior on the inverse variance β. This allows a sharply peaked distribution on wc and Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization hc when the support of a cluster is empty. The model is aij ∼ Poisson( ˆaij) hci ∼ HN(1/ β ) wci ∼ HN(1/ β ) βc ∼ Gamma(a, b) As shown in Section the corresponding quality function is not local.

The problems stem from the priors on W, H and β, which depend on the numberof nodes and clusters. We will next try to find a different prior that is local.

**A local prior**

To get a local quality function from a probabilistic model, that doesn't assume afixed number of clusters, we clearly need a different prior. The approach we takewill be to construct an additive quality function, which is local by Theorem First assume as above that the likelihoods of the edges are independent and depending on the product of membership degrees, that is P(A C, V) = aij). This fits nicely into the fourth term, qedge, of an additive quality Without loss of generality we can split the prior into two parts. First the sup- port of each cluster is determined, and based on this support the membershipcoefficients are chosen. If we define S = {supp(c) c ∈ C}, then this means that P(C V) = P(C V, S)P(S V).

Just like C, S should be seen as a multiset, since multiple clusters can have thesame support. A reasonable choice for the first term P(C V, S) is to assume thatthe clusters are independent, and that the membership coefficients inside eachcluster are also independent, so where δ is the Kronecker delta, which forces hci to be zero for nodes not ins. The logarithm of P(C V, S) is a sum of terms that depend only on a singlecluster, so it can be encoded in the qclus term of an additive quality function.

6.5. NMF as a probabilistic model Now consider P(S V). If we know nothing about the nodes, then the two simplest aspects of S we can look at are: (1) how many clusters cover each node,and (2) how many nodes are in each cluster. The only local choice for (1) isto take the number of clusters that cover node i, ni = #{s ∈ S i ∈ s}, beindependent and identically distributed according to some f(ni). While for (2),the probability of a cluster s ∈ S must be independent of the other clusters. Andsince we have no information about the nodes, the only property of s we can useis its size. This suggests a prior of the form where ni = {s ∈ S i ∈ s} is the number of clusters covering node i. The termf(ni) is local to each node, and can be encoded in qnode. The term g( s ) is localto each cluster, and can therefore be encoded in qclus. The normalizing constantZ depends only on V, and so it can be encoded in qgraph.

If we take f(ni) = 1[ni = 1] and g( s ) = ( s − 1)!, then the prior on S is exactly a Chinese Restaurant Process (Pitman, If we relax f, then we geta generalization where nodes can belong to multiple clusters. Another choiceis f(ni) = 1[ni = 1] and g( s ) = 1. Then the prior on S is the flat prior overpartitions, which is commonly used for hard clustering.

Yet another choice is to put a Poisson prior on either the number of clusters per node or the number of nodes per cluster. That is, take f(ni) = λni /(ni!)e−λfor some constant λ, or do the same for g. This parameter allows the user totune the number or size of clusters that are expected a-priori.

To summarize, we obtain a local quality function of the form log f( {c ∈ C i ∈ supp(c)} ) c∈C i∈supp(c) which has four independent parts: a score for a node being in a certain num-ber of clusters, a score for the size of each cluster, a prior for each non-zeromembership coefficient, and the likelihood of an edge aij given the ˆ Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization Figure 6.4: Two possible clusterings in a subgraph of a ring of cliques. In thefirst clustering (D1, blue), the two cliques are in separate clusters, and there isa third cluster for the edge between them. In the second clustering (D2, red)two cliques are put into a single cluster. A third possibility is to include themiddle edge in a cluster together with one of the two cliques. A clustering ofthis entire subgraph will also include two clusters covering the connecting edges(C, dotted).

The discrete nature of this quality function makes it harder to optimize. It is not clear if the multiplicative gradient algorithm that is commonly employed forNMF (Lee and Seung, can be adapted to deal with a prior on the supportof clusters. On the other hand, it might become possible to use discrete opti-mization methods, such as the successful Louvain method used for modularitymaximization.

**Analysis of the quality functions on two types of graphs**

We will now investigate the local quality function proposed in the previoussection.

First consider the original resolution limit model (Fortunato and Barthélemy, which consists of a ring of cliques. Two possible clusterings of a part ofsuch a ring are illustrated in Figure If a quality function is local, then we know that if D1 ] C is a better clustering than D2 ] C in this subgraph, then D1 will also be better than D2 as part of alarger graph. In other words, if the cliques are clustered correctly in a smallring, than this is true regardless of the number of cliques in the ring (unless aclustering with very large clusters is suddenly better).

We have performed experiments with the prior from the previous section, to 6.5. NMF as a probabilistic model see what the optimal clustering will be in practice. We use a Poisson likelihood,a half normal prior on the supported membership coefficients (with precisionβ = 1), a Poisson prior on the number of clusters-per-node (with λ = 1) and aflat prior on the number of nodes per cluster. To find the optimal clustering weuse a general purpose optimization method, combined with a search over thepossible supports of the clusters.

The left part of Figure shows that, as expected, the optimal solution is always to have one cluster per clique when using the local quality function.

For comparison we also looked at the simpler non-local NMF method withouta prior on the support. In that case the optimal solution depends strongly onthe prior on membership coefficients β. If β is small, then there is a penaltyfor every zero in the membership matrix, and hence a penalty on the numberof clusters that increases with the number of nodes. If β is large enough, thenthe probability density p(0) > 1, and this penalty becomes a ‘bonus'. In thatcase adding even an empty cluster would improve the quality, and the optimalclustering has an infinite number of clusters.

The method of Psorakis et al. has the same resolution limit problem, but to an even larger extent. To automatically determine the number of clusters, thismethod keeps the actual number of clusters fixed to a large upper bound, forwhich the authors take the number of nodes. This means that there are verymany clusters which will be empty in the optimal solution. For these emptyclusters, the parameter βc becomes very large. And as said in the previousparagraph, this results in a bonus for empty clusters. Hence the method willtend to maximize the number of empty clusters, which results in a few largeclusters actually containing the nodes. For this experiment we used the priorβc Gamma(5, 2), as is also done in the code provided by Psorakis et al. Note thatthe jaggedness in the plot is due to the fact a ring of n cliques can not alwaysbe divided evenly into m clusters of equal size. Between 24 and 50 cliques, theoptimal number of clusters is always 8 or 9.

The right part of Figure shows the influence of the parameter λ of the Poisson prior that we put on the number of clusters per node. When λ becomessmaller, it becomes a priori more likely for a node to be in only a single cluster,or in fact, to be in no cluster at all. It actually requires a quite strong prior toget two cliques to merge into one cluster, when using 5-cliques, we need λ to besmaller than approximately 10−5.

A ring of cliques is not a realistic model of real world graphs, since on most graphs the clustering is not as clear cut as it is there. The clustering problemcan be made harder by removing edges inside the cliques, which are then nolonger cliques, and better called modules; or by adding more edges between themodules.

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization 10−10 10−20 10−30 10−40 10−50 Number of cliques Optimal cluster size (average number of cliques per cluster) in a ring of n 5-cliques. Left: varying the number of cliques. Right: varying the λparameter of the Poisson prior on the number of clusters per node. The numberof cliques in the ring doesn't matter because of locality.

We consider such a generalization, where there are two modules connected by zero or more edges. We then generated random modules and random be-tween module edges. The two modules are either clustered together in one bigcluster, or separated. In Figure we show simulation results of such a morerealistic situation. As we can see, as the number of between module edges in-creases, or the number of within module edges decreases, it becomes more likelyto combine the two modules into one cluster. At the threshold between the twosituations, the number of between module edges is roughly equal to the numberof within module edges. This matches the notion of a strong community, whichis defined by Radicchi et al. as a set of nodes having more edges insidethe cluster than edges leaving the cluster. A theoretical justification of theseempirical results is beyond the scope of this work.

To our knowledge, this work is the first to investigate resolution-limit-free andlocal NMF quality functions for graph clustering. We gave a characterization ofa class of good (i.e. local) additive quality functions for graph clustering thatprovides a modular interpretation of NMF for graph clustering. The definitionsof locality and of additive quality functions are general, and can also be appliedto other soft clustering methods. We proposed the class of local probabilisticNMF quality functions. The design and assessment of efficient algorithms for 6.A. Proof of Theorem Within module edges Varying the number of within and between module edges. The modules each have 10 nodes. A Poisson prior on the number of clusters pernode (λ = 1) was used. We consider two possible clusterings: (a) A solutionwith three clusters, two clusters for the two modules and one cluster for thebetween module edges. And (b) the solution with a single cluster containingall nodes. The color in the plot indicates which clustering has a higher quality.

In the dark region, the clustering (a) with three clusters is better. In the lightregion, the solution (b) with a single cluster is better. Results are the averageover 10 random graphs with the given number of edges.

optimizing these quality functions remains to be investigated.

Results of this chapter provide novel insights on NMF for hard clustering, on the resolution limit of Bayesian NMF for soft clustering, and on the beneficialrole of a local prior in probabilistic formulations of NMF.

**Proof of Theorem**

Proof. Recall that in symmetric NMF, ˆa is defined as ˆaij = orthogonality constraints, any two nodes i and j are either in the same cluster,in which case ˆ aij = 1, or they are in different clusters, in which case ˆ Symmetric NMF is given by the optimization problem SymNMF(G, C) = − 2 Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization Expanding the square shows that this is equivalent to With orthogonality constraints this is equivalent to which is the CPM objective with γ = 1/2.

**Proof of Theorem**

Proof. Let Q be a local hard cluster quality function, and C be a Q-optimalclustering of a graph G1 = (V1, A1). Consider the subgraph G2 induced byD ⊂ C.

Let C1 = C D and C2 = ∅, and let GS = G2. Because C is a partition of V1, we have that supp(C1) is disjoint from GS, and so C1 and C2 agree on GS.

Then for each clustering D0 of G2 we have Q(G1, C1 ] D) > Q(G1, C1 ] D0) because C1 ∪ D = C is an optimal clustering of G1. By locality it follows thatQ(G2, C2 ] D) > Q(G2, C2 ] D0).

So D is a Q-optimal clustering of G2.

**Proof of Theorem**

Proof. Consider the following hard clustering quality function Q(G, C) = max c + min c .

For each graph G = (V, A), the clustering C = {V} is the single Q-optimal cluster-ing, with quality 2 V . Since there are no strict subsets of C the quality functionQ is trivially resolution-limit-free.

Now consider the graphs G1 with nodes {1, 2, . . , 7} and G2 with nodes {1, 2, . . , 6}, both with no edges. These graphs have a common subgraph GS withnodes {1, 2, . . , 6}. Take the clusterings D = {{1, 2, 3, 4}, {5}, {6}}, D0 = {{1, 2, 3}, {4, 5, 6}},C1 = {{7}} and C2 = {}. Then Q(G1, C1 ] D) = 5 > 4 = Q(G1, C1 ] D0), whileQ(G2, C2 ] D) = 5 < 6 = Q(G2, C2 ] D0).

So Q is not local.

This counterexample is illustrated in Figure 6.D. Proof of Theorem additive quality functions are local Q(G1, C1 ] D) = 5 Q(G1, C1 ] D0) = 4 Q(G2, C2 ] D) = 5 Q(G2, C2 ] D0) = 6 Figure 6.7: The counterexample from the proof of Theorem

**Proof of Theorem**

**additive quality functions are local**

Proof. Let Q be an additive quality function. Let G1 G2 and GS = (V, A) begraphs such that GS is a subgraph of both G1 and G2.

Let C1 be a clustering of G1, C2 a clustering of G2 and, D, D0 clusterings of GS such that C1 and C2 agree on GS.

Let E = C1 ∩ C2. Then for every node i ∈ supp(C1 C2), we have i / ∈ supp(D) and i / {c ∈ C1 ] D i ∈ supp(c)} = {c ∈ C1 ] D0 i ∈ supp(c)} = {c ∈ C1 i ∈ supp(c)}.

Conversely, for every node i / ∈ supp(C1 C2), we have {c ∈ C1 ] D i ∈ supp(c)} = {c ∈ E ] D i ∈ supp(c)}.

Q(G1, C1 ] D) − Q(G1, C1 ] D0) qnode {c ∈ E ] D i ∈ supp(c)} qnode {c ∈ E ] D0 i ∈ supp(c)} qedge aij, {c ∈ E ] D i, j ∈ supp(c)} qedge aij, {c ∈ E ] D0 i, j ∈ supp(c)}, and similarly for G2 and C2 in place of the G1 and C1.

Chapter 6. Axioms for soft clustering and Non-negative Matrix Factorization Which implies that Q(G1, C1 ] D) − Q(G1, C1 ] D0) = Q(G2, C2 ] D) − Q(G2, C2 ] D0).

Q(G1, C1 ] D) > Q(G1, C1 ] D0) Q(G2, C2 ] D) > Q(G2, C2 ] D0).

In other words, Q is local.

**Link prediction in biological**

**The link prediction problem**

In the second part of this thesis we turn our attention to another machine learn-ing problem on networks. In particular, we will look at link prediction in bipar-tite networks.

Bipartite networks are a class of networks where the nodes fall into two groups, and edges only appear between nodes of different groups. While meth-ods for general networks can be used in this setting, more specialized methodswill have an advantage.

In addition, other data is often available in practice. For instance one type of nodes might be proteins, for which the protein sequence is known. By usingthis side information, better predictions are possible. A general method canoften not take advantage of this side information, but a specialized method can.

Previously, in Chapter we looked at a protein–protein interaction network.

Several other networks exist that describe interactions between two types ofentities. For instance between proteins and drug compounds (Yamanishi, Araki,et al., between proteins and ligands (Jacob and J.-P. Vert, betweenenzymes and metabolites (Faulon et al., and so on. Of these, we will focuson the interaction between drug compounds and target proteins. We call thisthe drug–target interaction network.

The in silico prediction of interaction between drugs and target proteins is a core step in the drug discovery process for identifying new drugs or noveltargets for existing drugs, in order to guide and speed up the laborious andcostly experimental determination of drug–target interaction (Haggarty et al., Drug–target interaction data are available for many classes of pharmaceuti- cally useful target proteins including enzymes, ion channels, GPCRs and nuclearreceptors (Hopkins and Groom, Several publicly available databases have Chapter 7. The link prediction problem been built and maintained, such as KEGG BRITE (Kanehisa et al., Drug-Bank (Wishart et al., GLIDA (Okuno et al., SuperTarget and Mata-dor (Günther et al., and BRENDA (Schomburg et al., and ChEMBL(Overington, containing drug–target interaction and other related sourcesof information, like chemical and genomic data.

A property of the current drug–target interaction databases is that they con- tain a rather small number of drug–target pairs which are experimentally vali-dated interactions. This motivates the need for developing methods that predicttrue interacting pairs with high accuracy.

Recently, machine learning methods have been introduced to tackle this problem. They can be viewed as instances of the more general link predictionproblem, see Lü and Zhou for a recent survey of this topic. These methodsare motivated by the observation that similar drugs tend to target similar pro-teins (Schuffenhauer et al., Klabunde, This property was shown forinstance for chemical (Martin, Kofron, and Traphagen, and side effect sim-ilarity (Campillos et al., and motivated the development of an integratedapproach for drug–target interaction prediction (Jaroch and Weinmann, A desirable property of this approach is that it does not require the 3D structureinformation of the target proteins, which is needed in traditional methods basedon docking simulations (Cheng et al., The current state-of-the-art for the in silico prediction of drug–target inter- action is formed by methods that employ similarity measures for drugs andfor targets in the form by kernel functions, like Bleakley and Yamanishi Jacob, Hoffmann, et al. Wassermann, Geppert, and Bajorath Ya-manishi, Araki, et al. and Yamanishi, Kotera, et al. Kernels providea way to abstract from the details of the side information. A method for link pre-diction does not have to know what information the kernel represents, just thatit gives similarities between nodes of the same class. By using kernels, multiplesources of information can also be easily combined for performing prediction(Schölkopf, Tsuda, and J. P. Vert,

**Drug–target interaction data**

We used four drug–target interaction networks in humans involving enzymes,ion channels, G-protein-coupled receptors (GPCRs) and nuclear receptors; firstanalyzed by Yamanishi, Araki, et al. We worked with the datasets pro-vided by these authors, in order to facilitate benchmark comparisons with thecurrent state-of-the-art algorithms that do the same. We use these datasets as 7.1. Drug–target interaction data Table 7.1: The number of drugs and target proteins, their ratio, and the numberof interactions in the drug–target datasets from Yamanishi, Araki, et al. they are without adding new interactions from source Table listssome properties of the datasets.

Aside from an interaction network, the datasets each also include two ker- nels. One for the drug compounds and one for the target proteins.

Amino acid sequences of the target (human) proteins were obtained from the KEGG GENES database (Kanehisa et al., Sequence similarity betweenproteins was computed using a normalized version of Smith–Waterman score(Smith and Waterman, This score is computed by sequence alignment;sequences that are more similar when aligned get a higher score. We denote theresulting kernel by Kgenomic,d, for genomic similarity on drugs.

Drug–target interaction information was retrieved from the KEGG BRITE (Kanehisa et al., BRENDA (Schomburg et al., SuperTarget (Güntheret al., and DrugBank (Wishart et al., databases. Chemical structuresof the compounds was derived from the DRUG and COMPOUND sections inthe KEGG LIGAND database (Kanehisa et al., The chemical structuresimilarity between compounds was computed using SIMCOMP (Hattori et al.,SIMCOMP tries to compute the similarity of two small molecules bygraph matching. Since graph matching is an NP hard problem, SIMCOMP usesan approximation based on a randomized search, which leads to the problemthat the resulting matrix is neither symmetric nor positive definite. So it isnot a kerIf we call the matrix produced by SIMCOMP S, then we canmake it symmetric with Ssym = (S + ST )/2, and add a small multiple of theidentity matrix to enforce the positive definite property. This resulted in a kernelKchemical,t, for the chemical similarity of targets.

1These datasets are publicly available at 2This similarity matrix is the one available online. The preprocessing to turn it into a kernel is Chapter 7. The link prediction problem One can distinguish between prediction for ‘known' drug compounds or targets,for which at least one interaction is present in the training set; and predictionfor ‘unseen' or ‘new' drug compounds or targets, for which no interaction isavailable in the training set. This results in four possible settings for predict-ing drug-target interaction, depending on whether the drug compounds and/ortargets are known or unseen (Yamanishi, Araki, et al., That is, 1. To predict whether an unseen drug compound will interact with an unseen target protein. We call this the ‘unseen' setting.

2. To train a model to predict with which known drugs a previously unseen target protein will interact. We call this the ‘unseen target' setting.

3. To train a model to predict with which known targets a previously unseen drug will interact. We call this the ‘unseen drug' setting.

4. To find new putative interactions between drugs and targets already in the dataset. We call this the ‘pairs' setting.

Formally we are given a set D = {d1, d2, . . , dn } of drug compounds and a set T = {t1, t2, . . , tn } of target proteins. There is also a set of known interactions between drugs and targets. If we consider these interactions as edges, thenthey form a bipartite network. We can characterize this network by the nd × ntadjacency matrix Y. That is, yij = 1 if drug di interacts with target tj, andyij = 0 otherwise. A method can then use the matrix Y of known interactions,and the kernels defined previously.

Instead of looking at binary classification, it is more useful to look at a rank- ing of all possible interactions. So the task is to rank all drug–target pairs (di, tj)such that highest ranked pairs are the most likely to interact.

**Outline of this part of the thesis**

In

**Chapter**we show that a simple machine learning method that uses the

drug–target network as the only source of information is capable of predicting

true interaction pairs with high accuracy. Specifically, we introduce interaction

profiles of drugs (and of targets) in a network, which are binary vectors speci-

fying the presence or absence of interaction with every target (drug) in that net-

work. We define a kernel on these profiles, called the Gaussian Interaction Pro-

file (GIP) kernel, and use a simple classifier, (kernel) Regularized Least Squares

(RLS), for prediction drug–target interactions. We test the effectiveness of RLS

7.3. Outline of this part of the thesis with the GIP kernel on the four drug–target interaction networks discussed inthis chapter. The proposed algorithm achieves area under the precision-recallcurve (AUPR) up to 92.7, significantly improving over results of state-of-the-artmethods. Moreover, we show that also using the kernels based on chemical andgenomic information further increases accuracy, with a neat improvement onsmall datasets. These results substantiate the relevance of the network topol-ogy (in the form of interaction profiles) as source of information for predictingdrug–target interactions.

**Chapter**extends this method to predict interactions with ‘unseen' drug

compounds or target proteins. That is, drug compounds (or target proteins)for which there are no experimentally validated interactions in the dataset. Weshow that a simple weighted nearest neighbor procedure is highly effective forthis task. We integrate this procedure into a recent machine learning method fordrug–target interaction we developed in previous work. Results of experimentsindicate that the resulting method predicts true interactions with high accuracyalso for new drug compounds and achieves results comparable or better thanthose of recent state-of-the-art algorithms.

**Chapter**focuses on the crucial issue of data bias. We show that the four

datasets used in this part contain a bias because of the way they have been con-structed: all drug compounds and target proteins have at least one interactionand some of them have only a single interaction. We show that this bias can beexploited by prediction methods to achieve an optimistic generalization perfor-mance as estimated by cross-validation procedures, in particular leave-one-outcross validation. We discuss possible ways to mitigate the effect of this bias, bychanging the validation procedure or the datasets. In general, results indicatethat the data bias should be taken into account when assessing the general-ization performance of machine learning methods for the in silico prediction ofdrug–target interactions.

**Drug–target interaction prediction**

**with Gaussian Interaction Profiles**

The in silico prediction of potential interactions between drugs and target proteins is ofcore importance for the identification of new drugs or novel targets for existing drugs.

However, only a tiny portion of all drug–target pairs in current datasets are experi-mentally validated interactions. This motivates the need for developing computationalmethods that predict true interaction pairs with high accuracy.

We show that a simple machine learning method that uses the drug–target network as the only source of information is capable of predicting true interaction pairs withhigh accuracy. Specifically, we introduce interaction profiles of drugs (and of targets)in a network, which are binary vectors specifying the presence or absence of interactionwith every target (drug) in that network. We define a kernel on these profiles, called theGaussian Interaction Profile (GIP) kernel, and use a simple classifier, (kernel) Regular-ized Least Squares (RLS), for prediction drug–target interactions. We test comparativelythe effectiveness of RLS with the GIP kernel on four drug–target interaction networksused in previous studies. The proposed algorithm achieves area under the precision-recall curve (AUPR) up to 92.7, significantly improving over results of state-of-the-artmethods. Moreover, we show that using also kernels based on chemical and genomic in-formation further increases accuracy, with a neat improvement on small datasets. Theseresults substantiate the relevance of the network topology (in the form of interactionprofiles) as source of information for predicting drug–target interactions.

This chapter is based on van Laarhoven, Nabuurs, and Marchiori "Gaussian inter- action profile kernels for predicting drug–target interaction", published in Bioinformatics. Thesoftware, datasets and supplementary results for this chapter are available on Chapter 8. Interaction prediction with Gaussian Interaction Profiles In the previous chapter we have introduced the drug–target interaction predic-tion problem. Several settings were discussed. In this chapter we focus on the‘pairs' setting, where both the drugs and targets are known. That is, we useknown interactions for predicting novel ones.

We want to analyze the relevance of the topology of drug–target interaction networks as source of information for predicting interactions. We do this byintroducing a kernel that captures the topological information. Using a simplemachine learning method we then compare this kernel to kernels based on othersources of information.

Specifically, we start from the assumption that two drugs that interact in a similar way with the targets in a known drug–target interaction network, willalso interact in a similar way with new targets. We formalize this propertyby describing each drug with an interaction profile, a binary vector describingthe presence or absence of interaction with every target in that network. Theinteraction profile of a target is defined in a similar way. From these profiles weconstruct the Gaussian Interaction Profile kernel.

We show that interaction profiling can be effectively used for accurate pre- diction of drug–target interaction. Specifically, we propose a simple regular-ized least square algorithm incorporating a product of kernels constructed fromdrug and target interaction profiles. We test the predictive performance of thismethod on four drug–target interaction networks in humans involving enzymes,ion channels, GPCRs and nuclear receptors. These experiments show that us-ing only information on the topology of the drug–target interaction, in the formof interaction profiles, excellent results are achieved as measured by the areaunder the precision-recall curve (AUPR) (Davis and Goadrich, In partic-ular, on three of the four considered datasets the performance is superior to thebest results of current state-of-the-art methods which use multiple sources ofinformation.

We further show that the proposed method can be easily extended to also use other sources of information in the form of suitable kernels. Results ofexperiments where also chemical and genomic information on drugs and targetsis included show excellent performance, with AUPR score of 91.5, 94.3, 79.0 and68.4 on the four datasets, achieving an improvement of 7.4, 13.0, 12.3 and 7.2 overthe best results reported in Bleakley and Yamanishi A thorough analysisof the results enable us to detect several new putative drug–target interactions,see 8.2. Gaussian Interaction Profile Kernel An illustration of the construction of interaction profiles from a drug–target interaction network. Circles are drugs, squares are targets. In thisexample the interaction profile of target t1 indicates that it interacts with drugsd1 and d2, but not with d3, d4 or d5.

**Gaussian Interaction Profile Kernel**

Our method is based on the assumption that drugs exhibiting a similar patternof interaction and non-interaction with the targets of a drug–target interactionnetwork are likely to show similar interaction behavior with respect to new tar-gets. We use a similar assumption on targets. We therefore introduce the (target)interaction profile yd of a drug d i to be the binary vector encoding the presence or absence of interaction with every target in the considered drug–target network.

This is nothing more than row i of the adjacency matrix Y. Similarly, the (drug)interaction profile yTt of a target protein tj is a vector specifying the presence or absence of interaction with every drug in the considered drug–target network.

The interaction profiles generated from a drug–target interaction network canbe used as feature vectors for a classifier. Figure illustrates the constructionof interaction profiles.

Following the current state-of-the-art for the drug–target interaction predic- tion problem, we will use kernel methods, and hence construct a kernel fromthe interaction profiles. This kernel does not include any information beyondthe topology of the drug–target network.

One of the most popular choices for constructing a kernel from a feature vector is the Gaussian kernel, also known as the Radial Basis Function (RBF) Chapter 8. Interaction prediction with Gaussian Interaction Profiles kernel. This kernel is, for drugs di and dj, KGIP,d(di, dj) = exp(−βdkyd − y k2).

A kernel for the similarities between target proteins, KGIP,t, can be defined anal-ogously. We call these kernels Gaussian Interaction Profile (GIP) kernels.

The parameter βd controls the kernel bandwidth. We set That is, we normalize the parameter by dividing it by the average number ofinteractions per drug. With this choice the kernel values become independentof the size of the dataset. In principle the new bandwidth parameters ˜βd and ˜βtcould be set with cross-validation, but in this chapter we simply use ˜βd = ˜βt = 1.

There are other ways to construct a kernel from interaction profiles. For ex- ample, Basilico and Hofmann propose using the correlation of interactionprofiles. We have performed brief experiments with these other kernels, whichshow that Gaussian Interaction Profile kernels consistently outperform kernelsbased on correlation or inner products. The detailed results of these experimentsare included in Supplementary Table S1.

**Integrating Chemical and Genomic Information**

To combine the interaction profile kernel with the chemical and genomic kernels,we use a simple weighted average, Kd = αdKchemical,d + (1 − αd)KGIP,d Kt = αtKgenomic,t + (1 − αt)KGIP,t.

For the reported results of our evaluation we use simply the unweighted average, for both drugs and targets, i.e. αd = αt = 0.5. In Section wefurther analyze the effect of these parameters on the predictive performance ofthe method.

**The RLS-avg classifier**

In principle we could use the Gaussian Interaction Profile kernels with any ker-nel based classification or ranking algorithm. We choose to use a very basicclassifier, the (kernel) Regularized Least Squares (RLS) classifier. While LeastSquares is primarily used for regression, when a good kernel is used it has 8.5. RLS-Kron classifier classification accuracy similar to that of Support Vector Machines (Rifkin andKlautau, Our own experiments confirm this finding. In the RLS classifier,the predicted values ˆy with a given kernel K have a simple closed form solution, ˆy = K(K + σI)−1y, where σ is a regularization parameter. Higher values of σ give a smoother result,while for σ = 0 we get ˆy = y, and hence no generalization at all. The value ˆy isa real valued score, which we can interpret as a confidence.

The RLS classifier is sensitive to the encoding used for y. Here we use 1 for encoding interacting pairs and 0 for non-interacting ones. Brief experimentshave shown that the classifier is not sensitive to this choice, as long as the valueused for non-interactions is close to 0. Using a value very different from 0,like −1, would place too much weight on non-interactions. The classifier wouldthen try to avoid predicting pairs that look like non-interactions, rather thanpredicting pairs that look like interactions.

In the previous sections we defined kernels on drugs and kernels on target proteins. There are several ways in which we can use kernels in both of thesedimensions. Following other works, like Bleakley and Yamanishi andXia et al. a simple and effective approach is to apply the classifier foreach drug independently using, only the target kernel; and also for each targetindependently using only the drug kernel. Then the final score for a drug–targetpair is a combination of the two outputs.

Here we use the average of the output values, and denote the resulting method by RLS-avg. Observe that in the formulation of the RLS classifier weuse, performing independent prediction amounts to replacing the vector y withthe matrix Y, hence the prediction of RLS-avg is d(Kd + σI)−1Y + 2 t(Kt + σI)−1YT T .

Note this model is slightly different from using the Kronecker sum kernel (Kashima, Oyama, et al., Because regularization is performed for drugsand targets separately in the RLS-avg method, rather than jointly.

A better alternative is to combine the kernels into a larger kernel that directlyrelates drug–target pairs. This is done with the Kronecker product kernel (Basil-ico and Hofmann, Ben-Hur and Noble, Oyama and Manning, Hue and J.-P. Vert, The Kronecker product Kd ⊗ Kt of the drug and targetkernels is K((di, tj), (dk, tl)) = Kd(di, dk)Kt(tj, tl).

Chapter 8. Interaction prediction with Gaussian Interaction Profiles With this kernel we can make predictions for all pairs at once, YT ) = K(K + σI)−1 vec(YT ), where vec(YT ) is the a vector of all interaction pairs, created by stacking thecolumns of YT . We call this method RLS-Kron.

Using the Kronecker product kernel directly would involve calculating the inverse of an ndnt × ndnt matrix, which would take O((ndnt)3) operations, andwould also require too much memory. We use a more efficient implementationbased on eigendecompositions, previously presented in Raymond and Kashima and Kt = VtΛtVt be the eigendecompositions of the two kernel matrices. Since the eigenvalues(vectors) of a Kronecker product are theKronecker product of eigenvalues(vectors), for our Kronecker product kernelwe have simply K = Kd ⊗ Kt = VΛVT , where Λ = Λd ⊗ Λt and V = Vd ⊗ Vt.

The matrix that we want to invert, K + σI has these same eigenvectors V, andeigenvalues Λ + σI. Hence K(K + σI)−1 = VΛ(Λ + σI)−1VT .

To efficiently multiply this matrix with vec(YT ) we can use a further property ofthe Kronecker product, namely that (A ⊗ B) vec(X) = vec(BXAT ). Combiningthese facts we get that the RLS prediction is d ⊗ Λt)(Λd ⊗ Λt + σI)−1 vec(Vt YT Vd).

So, to make a RLS prediction using the Kronecker product kernel we onlyneed to perform the two eigendecompositions and some matrix multiplications,bringing the runtime down to O(n3 + n3 t ). The efficiency of this computation could be further improved yielding a quadratic computational complexity byapplying recent techniques for large scale kernel methods for computing thetwo kernel decompositions (Kashima, Idé, et al., Wu et al., In order to assess globally the performance of our method, we compare it againstcurrent state-of-the-art algorithms. To the best of our knowledge, the best resultson these datasets obtained so far are those reported by Bleakley and Yamanishiwhere the Bipartite Local Models (BLM) approach was introduced. These results were achieved by combining the output scores of the Kernel RegressionMethod (KRM) (Yamanishi, Araki, et al., and BLM by taking their maxi-mum value. We briefly recall these methods here.

In the KRM method, drugs and targets are embedded into a unified space called the ‘pharmacological space'. A regression model is learned between thechemical structure (respectively, genomic sequence) similarity space and thispharmacological space. Then new potential drugs and targets are mapped intothe pharmacological space using this regression model. Finally, new drug–targetinteractions are predicted by connecting drugs and target proteins that are closerthan a threshold in the pharmacological space.

The BLM method is similar to our RLS-avg method. In the BLM method, the presence or absence of a drug–target interaction is predicted as follows. First,the target is excluded, and a training set is constructed consisting of two classes:all other known targets of the drug in question, and the targets not known tointeract with that drug. Second, a Support Vector Machine that discriminatesbetween the two classes is constructed, using the available genomic kernel forthe targets. This model is then used to predict the label of the target, and hencethe interaction or non-interaction of the considered drug–target pair. A similarprocedure is applied with the roles of drugs and targets reversed, using thechemical structure kernel instead. These two results are combined by taking themaximum value.

In order to compare the performance of the methods, we performed systematicexperiments simulating the process of bipartite network inference from biologi-cal data on four drug–target interaction networks. These experiments are doneby full leave-one-out cross-validation (LOOCV) as follows. In each run of themethod, one drug–target pair (interacting or non-interacting) is left out by set-ting its entry in the Y matrix to 0. Then we try to recover its true label using theremaining data.

Note that when leaving out a drug–target pair the Y matrix changes, and therefore the GIP kernel has to be recomputed.

We also performed a variation of these experiments using 5 trials of 10-fold cross-validation. We recomputed the GIP kernels for each fold, also for 10-foldcross-validation. So no information about the removed interactions was leakedin this way.

The results can be found in Supplementary Table S2; we observed no large differences compared to the results obtained using LOOCV.

Chapter 8. Interaction prediction with Gaussian Interaction Profiles Table 8.1: Results on the drug target interaction datasets. The AUC and AUPRscores are normalized to 100. For each dataset, the the highest AUC/AUPRscore is marked in boldface.

(d) Nuclear Receptor Precision–recall curves for the RLS-Kron method. The red dotted line corresponds to using only the chemical and genomic kernels. The greendashed line corresponds to using only the GIP kernels. The blue solid linecorresponds to the average of the two types of kernels. On all datasets theaverage kernel shows a small improvement over either other kernel type alone.

In all experiments we have chosen the values for the parameters in an unin- formative way. In particular, we set the regularization parameter σ = 1 for bothRLS methods; and as stated before, we set the kernel bandwidths ˜βd = ˜βt = 1for both the drug and target interaction profile kernels.

We assessed the performance of the methods with the following two quality measures generally used in this type of studies: AUC and AUPR. Specifically,we computed the ROC curve of true positives as a function of false positives,and considered the area under the ROC curve (AUC) as quality measure (see Chapter 8. Interaction prediction with Gaussian Interaction Profiles for instance Fawcett, Furthermore, we considered the the precision–recallcurve (Raghavan, Jung, and Bollmann, that is, the plot of the ratio oftrue positives among all positive predictions for each given recall rate. Thearea under this curve (AUPR) provides a quantitative assessment of how well,on average, predicted scores of true interactions are separated from predictedscores of true non-interactions. For this task, because there are few true drug–target interactions, the AUPR is a more significant quality measure than theAUC, as it punishes much more the existence of false positive examples foundamong the best ranked prediction scores (Davis and Goadrich, Table contains the results for the two RLS-based classifiers, RLS-avg and RLS-Kron, each with three different kernel combinations: Using only the Gaussian Interaction Profile kernels, i.e. Kd =KGIP,d and Kt = KGIP,t, corresponding to αd = αt = 1.

Using only the chemical structure and genomic sequence sim-ilarity, so Kd = Kchemical,d and Kt = Kgenomic,t, corresponding toαd = αt = 0.

Using the average of the two types of kernels, correspondingto αd = αt = 0.5.

For comparison, we have also included in the table as BY09 (auc) and BY09 (aupr) the best results from the combined BML and KRM methods from Bleak-ley and Yamanishi For the GPCR and Nuclear Receptor datasets, themethod with the highest AUC is the same as the one with the highest AUPR,therefore it is included only once, as BY09.

Using only the GIP kernel, our Kronecker product RLS method has AUPR scoresof 88.5, 92.7, 71.3 and 61.0 on the Enzyme, Ion Channel, GPCR and NuclearReceptor datasets respectively. These results are superior to the results fromusing only the chemical and genomic kernels.

Overall the RLS-Kron and RLS-avg methods have comparable AUC scores.

However, the RLS-Kron has a better AUPR when using the GIP kernel, and aworse AUPR when using the chemical and genomic kernels. We believe that thisproblem is due to the poor quality of the chemical similarity kernel, to whichthe RLS-Kron method is more sensitive.

Note also that the RLS-avg method is comparable to Bleakley and Yaman- ishi's bipartite local model (BLM) approach. The differences are that whereaswe use a RLS classifier, they use Support Vector Machines; and whereas we usethe average to combine results, they use the maximum value. It is therefore not surprising that when using the chemical and genomic kernels the results of theRLS-avg method are very similar to their results.

In all cases the best results are obtained when the GIP kernels are combined with the chemical and genomic kernels. With the RLS-Kron method we thenobtain AUPR scores of 91.5, 94.3, 79.0 and 68.4 on the four datasets, which is animprovement of 7.4, 13.0, 12.3 and 7.2 over the best results reported by Bleakleyand Yamanishi Figure shows the precision–recall curves for the RLS-Kron method. Compared to other methods, the RLS-Kron method with theaverage kernels achieves a good precision also at higher recall values, especiallyon the larger datasets (Enzyme and Ion Channel).

In the previous section we have shown that using a mix of the GIP kernels andthe chemical and genomic kernels gives results superior to either type of kernelalone. In order to determine the relative importance of the network topologycompared to chemical and sequence similarity, we have investigated the changein prediction performance when varying the parameters αd and αt between 0(chemical/genomic kernels only) and 1 (interaction profiles kernels only). Forcomputational reasons we have used 10-fold cross-validation instead of leave-one-out.

In Figure we have plotted the AUPR and AUC scores on the GPCR dataset for the different parameter values. Lighter colors correspond to higher values.

Because of space limitations, plots for the other datasets are included in Supple-mentary Figures S1 and S2. For all datasets the optimal AUPR is obtained using Figure 8.3: AUPR and AUC scores for the GPCR dataset with different weight-ings of the kernels. Lighter colors are better. For all datasets αd = αt = 0.5 givesnear optimal results.

Chapter 8. Interaction prediction with Gaussian Interaction Profiles a mix of the drug and target kernels. Using the parameters αd = αt = 0.5, as wedid in the previous section, seems to be a good choice across the datasets. Alsonote that the choice of αd is more important than the choice of αt. This seemsto indicate that the sequence similarity for targets is more informative than thechemical similarity for drugs. A similar observation was also made in Bleakleyand Yamanishi The poor performance of the RLS-Kron method when us-ing only chemical and genomic kernels that we observed in the previous sectionappears to be due entirely to this uninformative chemical similarity.

On the larger datasets (Enzyme and Ion Channel) the optimal AUC is ob- tained with αd = 1, while that choice gives the worst results on the smallerdatasets. This can be explained by noting that when there are few drugs, thereis less information available for each entry of GIP target kernel, and hence thiskernel will be of a lower quality. We have confirmed this hypothesis by test-ing different sized subsets of the Ion Channel dataset, where we observe thesame effect on small subsets. The full results of that experiment are available inSupplementary Figure S3.

**New predicted interactions**

In order to analyze the practical relevance of the method for predicting noveldrug–target interactions, we conducted an experiment similar to that describedby Bleakley and Yamanishi We ranked the non-interacting pairs accord-ing to the scores computed for leave-one-out cross-validation experiments. Weestimate the most highly ranked drug–target pairs as most likely to be putativeinteractions. A list of the top 20 new interactions predicted for each of the fourdata sets can be found in Supplementary Tables S3–S6.

Table lists the top 10 new interactions predicted for the GPCR dataset.

We have looked up these predicted interactions in ChEMBL (Overington, (version 9), DrugBank (Wishart et al., and the latest online version of KEGGDRUG (Kanehisa et al., A significant fraction of the predictions (4 out of10) is found in one or more of these databases. One should bear in mind that alarge fraction of the interactions in these databases are already included in thetraining data, and hence are not counted as new interactions. Moreover thesedatabases are incomplete, so if a predicted interaction is not present in one of theused databases, this does not necessarily mean it does not exist. For this dataset,we started with only 635 known drug–target interactions and 20550 drug–targetpairs not known to interact. Of these 20550 we selected 10 as putative drug–target interaction, and found that at least 4 of them are experimentally verified.

These findings support the practical relevance of the proposed method.

We compared the newly predicted interactions generated by RLS-Kron-avg Table 8.2: The top 10 new interactions predicted in the GPCR dataset, 4 havebeen confirmed. Interactions that appear in the ChEMBL database are markedwith "[C]", interactions in Drugbank are marked with "[D]", and interactions inKegg are marked with "[K]". The NN column gives the similarity to the nearestdrug interacting with the same target, and to the nearest target interacting withthe same drug.

**DRD3: dopamine receptor D3**

**ADRB2: beta-2 adrenergic receptor**

Clonidine hydrochloride ADRA1B: alpha-1B adrenergic receptor GRM4: glutamate receptor, metabotropic 4 ADRA2C: alpha-2C adrenergic receptor

**ADRB2: beta-2 adrenergic receptor**

GRM7: glutamate receptor, metabotropic 7

**DRD1: dopamine receptor D1**

DRD5: dopamine receptor D5 Carboprost tromethamine PTGIR: prostaglandin I2 receptor (IP) and those generated by Bleakley and Yamanishi here referred to as BY09.

Specifically, given a dataset, for each method we extracted from its top x newpredictions those that have been experimentally validated (that is, that could befound in ChEMBL, DrugBank or KEGG DRUG). Table contains a summaryof the results for x = 20, 50, 80. Looking at the top 20 predictions it seems thatthe two methods perform best on different datasets. For the top 50 and top80 predictions, the results indicate the capability of RLS-Kron-avg to predictsuccessfully more new interactions than BY09.

We then compared the resulting two sets of confirmed new predictions among the top 50, by looking at common predictions and at interactions uniquely pre-dicted by only one of the two methods. The results for the four datasets can befound in Supplementary Tables S7–S10.

Chapter 8. Interaction prediction with Gaussian Interaction Profiles Table 8.3: The number of highly ranked new interactions that are found inat least one of the three considered databases (ChEMBL, DrugBank or KEGGDRUG).

On the Enzyme dataset BY09 and RLS-Kron-avg both successfully predicted 15 new interactions, with 10 common predictions. On the Ion Channel dataset,BY09 and RLS-Kron-avg successfully predicted 14 and 12 new interactions, re-spectively, of which only 1 interaction was predicted by both methods. AlthoughBY09 found slightly more confirmed interactions they were less diverse, since11 of them involve interactions between (different types of) the voltage-gatedsodium channel alpha subunit target and only 2 drugs: Prilocaine and Tocainide.

On the other hand, RLS-Kron-avg found interactions 4 different classes of tar-gets and 10 different drugs. On the GPCR dataset, BY09 and RLS-Kron-avgsuccessfully predicted 22 and 28 new interactions, respectively, with 14 commonpredictions. Finally, on the Nuclear Receptor dataset, BY09 and RLS-Kron-avgsuccessfully predicted 15 and 20 new interactions, respectively. Among them,13 were in common.

In general, the two methods seem to differ in the type of new predictions made. While there is always an overlap of new interactions between the twomethods, there is also always a subset of new interactions which RLS-Kron-avgcan successfully predict but BY09 fails to predict and vice-versa. Moreover, thereseems to be a slight tendency of BY09 to generate new successful predictionsthat are less diverse than those generated by RLS-Kron-avg. However, we werenot able to identify any differential biological bias of the methods towards thedetection of specific types of interactions.

A closer inspection shows that many of the predicted interactions are not verysurprising. For example, the GPCR dataset contains the interaction betweenClozapineand Dopamine receptor D1. The drug Loxapineis very similar toClozapine, and it is therefore to be expected that our method also predictsLoxapineto interact with Dopamine receptor D1. An analogous thing happenswith very similar target proteins. In order to provide a quantitative measureof how surprising these predictions are, we computed the similarity of a thedrug and target in an interaction pair to their Nearest Neighbor (NN), that is,the most similar drug (with respect to chemical structure similarity) and target(with respect to sequence similarity) in the training set, respectively. These sim-ilarities, which we call surprise scores, are listed in the NN column of Table An inspection of the surprise scores shows that the majority of the drug–targetpairs predicted by our method consist of a drug and a target very similar to adrug and a target already known to interact, and therefore they are not verysurprising. This phenomenon is common to any computational approach thatuses similarity between objects for inferring interaction.

To assess the ability of our method to also predict more surprising interac- tions, we have looked specifically at the predicted interactions where there is nosimilar drug interacting with the same target or similar target interacting withthe same drug in the dataset. We pick a threshold value and consider drugs(targets) to be dissimilar if their chemical (genomic) similarity is less than thisthreshold. We have used the threshold 0.5 for the chemical similarity and 0.25for the genomic similarity.

When only these ‘surprising' pairs are considered, we find, as expected, that fewer of them are present in the ChEMBL, DrugBank and KEGG databases.

But we still find more interactions among the highly ranked ‘surprising' pairscompared to to those that are ranked lower. For example, on the GPCR dataset,89 of the 500 highest ranked pairs were surprising, and 10 of them (11%) werefound in one of the databases. See the online Suplementary Material for details.

We have presented a new kernel that leads to good predictive performance asmeasured by AUPR on the task of predicting interactions between drugs andtarget proteins. An interesting aspect of our Gaussian Interaction Profile kernelis that it uses no properties beyond the interactions themselves. This means thatknowing the sequence of proteins and chemical structure of drugs is perhapsnot as important for this task as previously thought. For example, on the Ion Chapter 8. Interaction prediction with Gaussian Interaction Profiles Channel dataset our method with only the GIP kernel has an AUC score of 98.6and an AUPR score of 92.7, which improves upon the state-of-the-art, whileusing less prior information.

Besides the GIP kernel we have also introduced the RLS-Kron algorithm that combines a kernel on drugs and a kernel on targets using the Kronecker product.

Compared to previous methods that do prediction with the two kernels inde-pendently and then combine the results, this new method represents a small butconsistent improvement.

By combining the GIP kernel with chemical and genomic information we get a method with excellent performance. This method has AUPR scores of91.5, 94.3, 79.0 and 68.4 on four datasets of drug–target interaction networksin humans, representing an average improvement of 10 points over previousresults. The AUPR is a particularly relevant metric for this problem, becauseit is very sensitive to the correctness of the highest ranked predictions. Thelarge improvement in AUPR suggests that the top ranked putative drug–targetinteractions found by our method are more likely to be correct than those foundwith previous methods.

A limitation of all machine learning methods for finding new drug–target interactions is that they are sensitive to inherent biases contained in the trainingdata. It would be interesting to try and analyze the bias of existing datasets ofdrug–target interaction, but this is out of the scope of the present chapter.

Note also that the datasets by Yamanishi, Araki, et al. used in this chapter do not include any singletons: each drug interacts with at least onetarget, and each target interacts with at least one drug. This property could affectthe cross-validation results, by allowing a limited form of cheating. However,the experiments in Section show that our method also works when testedin other ways. In Chapter we further investigate this bias in the dataset.

A further limitation of the approach used in this chapter is that it can only be applied to detect new interactions for a target or a drug for which at leastone interaction has already been established. Therefore, biologists can use themethod as guidance for extending their knowledge about the interaction of adrug or of a target, not for discovering interactions of a new drug or target (thatis, one for which no interaction is known). In particular, our method is usefulfor experimentalist to aid in experimental design and interpretation, especiallyin solving problems related to drug-target selectivity and polypharmacology(Metz and Hajduk, Merino et al., In the next chapter (Chapter weaddress this issue, and extend the method to work for drugs and target proteinswith no known interactions.

There are several ways in which the result might further be improved. So far we have used uninformative choices of the parameters: ˜β = 1, σ = 1 and α = 0.5. Of these choices we have only investigated the last one. Perhaps withtuning of the other parameters better predictions are possible, although one hasto be careful not to over-fit them to the data.

Another avenue for improvement is in using more information about drugs and targets. Since combining the GIP kernel with chemical and genomic kernelsleads to a better predictive performance, perhaps adding different information inthe form of additional kernels would yield further improvements. These kernelscould be interaction profile kernels based on other types data, such as protein–protein interaction networks. Similarly, for each pair of interacting drug andtarget more information is known beyond the fact they interact. For example,the type of interaction, the binding strength, the mechanism of discovery andits uncertainty might all be known. In this chapter we have made no use of thisadditional information, nor did we attempt to predict the type or strength ofinteractions.

**Interaction prediction for new drugs**

**with weighted nearest neighbor**

In silico discovery of interactions between drug compounds and target proteins is ofcore importance for improving the efficiency of the laborious and costly experimentaldetermination of drug–target interaction. Drug–target interaction data are available formany classes of pharmaceutically useful target proteins including enzymes, ion chan-nels, GPCRs and nuclear receptors. However, current drug–target interaction databasescontain a small number of drug–target pairs which are experimentally validated in-teractions. In particular, for some drug compounds (or targets) there is no availableinteraction. This motivates the need for developing methods that predict interactingpairs with high accuracy also for these ‘unseen' drug compounds (or targets). We showthat a simple weighted nearest neighbor procedure is highly effective for this task. Weintegrate this procedure into a recent machine learning method for drug–target interac-tion we developed in previous work. Results of experiments indicate that the resultingmethod predicts true interactions with high accuracy also for unseen drug compoundsand achieves results comparable or better than those of recent state-of-the-art algorithms.

In this chapter we generalize the applicability of the method introduced in theprevious chapter to so-called unseen drug compounds, that is, drug compoundsfor which no interactions are known. The method, hereafter referred to as just This chapter is based on van Laarhoven and Marchiori "Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile", publishedin PLoS ONE. The software and datasets used in this chapter is available at Chapter 9. Interaction prediction for new drugs GIP, uses known interactions of a drug for predicting novel ones by means ofa regularized least square algorithm incorporating a product of kernels con-structed from drug compound and target interaction profiles. We propose asimple weighted nearest neighbor algorithm, called WNN, for constructing aninteraction score profile for a new drug compound using chemical and interac-tion information about known compounds in the dataset. The WNN methodcan be used as a stand-alone algorithm for predicting interactions for unseendrug compounds. It can also be directly incorporated into the GIP method forhandling unseen drug compounds. We call the resulting combination WNN-GIP. The methods can be directly adapted to handle also unknown targets orboth unknown drug compounds and targets.

We test the predictive performance of WNN and WNN-GIP on four drug– target interaction networks in humans involving enzymes, ion channels, GPCRsand nuclear receptors. Results as measured by the area under the curve (AUC)and area under the precision-recall curve (AUPR) (Davis and Goadrich, show that the weighted nearest neighbor profile algorithm and its incorporationinto the GIP method are capable to predict true interactions for unseen drugcompounds with satisfactory accuracy. The algorithms achieve competitive orbetter results than the recent state-of-the-art algorithms KBMF2K (Gönen, and BLM-NII (Mei et al., KBMF2K is based on a fully probabilistic ap-proach to model drug-target interaction, which can be applied to discover target(respectively drug compound) interactions for unseen drug compounds (respec-tively target proteins). Results in Gönen indicate improved accuracy overthe method introduced in Yamanishi, Kotera, et al. BLM-NII is an ex-tension of the BLM method (Bleakley and Yamanishi, to deal with unseendrug compounds (or targets). In BLM-NII a drug-target interaction for a unseendrug compound is inferred by constructing an estimated interaction profile fromthe drug compounds in the training data. The resulting profile is then used aslabel information to learn an interaction model for that drug compound withthe BLM method.

**The GIP method**

Machine learning methods for tackling the drug–target interaction problem aremainly based on the assumption that drug compounds exhibiting a similar pat-tern of interaction and non-interaction with the target proteins in a drug–targetinteraction network are also likely to show similar interaction behavior with re-spect to unseen targets. And a similar assumption on the target proteins is also made. Here we use the method introduced in Chapter It is based on theso-called (target) interaction profile yd of a drug compound d i, which is defined to be row i of the adjacency matrix Y; and on the (drug compound) interactionprofile yTt of a target protein tj, defined to be column j of Y. The interaction profiles generated from a drug–target interaction network are used as featurevectors for a classifier. A kernel from the interaction profiles is constructed us-ing topology of the drug–target network, defined for drug compounds di anddj as follows: KGIP,d(di, dj) = exp(−βdkyd − y k2).

A kernel KGIP,t for the similarities between target proteins is defined analo- gously. Moreover, the kernels Kchemical,d and Kgenomic,t are considered, contain-ing information about the chemical and genomic space. They are constructedfrom the chemical and genomic similarity matrices Sd and Sg between drugcompounds and between targets, by applying a simple transformation to makethem symmetric and positive definite. The interaction profile kernel can be eas-ily combined with these kernels using a weighted average.

The kernel for drug compounds and the kernel for target proteins can be combined using the Kronecker product Kd ⊗ Kt, such that for drug-target pairs(di, ti) and (dj, tj) K((di, ti), (dj, tj)) = Kd(di, dj)Kt(ti, tj).

For each drug compound with at least one known interaction in the training data, a score interaction profile ˆy is computed from its interaction profile y andthe kernel matrix K, using the Regularized Least Squared (RLS) classifier. Thisis achieved by means of the simple closed form solution ˆy = K(K + σI)−1y, where σ is a regularization parameter.

We refer the reader to Chapter for a more detailed description and analysis of this method.

For simplicity, in the rest of this chapter we call the method that uses the the RLS algorithm and the Kronecker product kernel ‘GIP'.

Chapter 9. Interaction prediction for new drugs

**Weighted nearest neighbor for unseen drug compounds**

We want to extend GIP to unseen drug compounds, that is, compounds forwhich no interaction is known. To this aim, we propose a simple weighted near-est neighbor procedure. For an unseen drug compound, its chemical similaritywith other known drug compounds and their corresponding profiles are usedin order to infer a score interaction profile for that drug compound.

Specifically, the score interaction profile yd WNN of an unseen drug compound d is the weighted sum of the profiles of the drug compounds in the training data,where a higher weight is assigned to profiles of those drug compounds moresimilar to d. Let y1, . . , yn be the interaction profiles of the other compounds in the dataset (that is, the rows of Y), listed in decreasing order with respect totheir chemical similarity to d. Then where the weights wi's are computed using a given decay value T 6 1 aswi = T i−1. For computational reasons we only sum over drug compoundswith weight at least 10−4. In our experiments we choose the decay rate T with 5fold cross-validation to maximize AUC. We call the resulting procedure WNN.

An extension of GIP to handle unseen drug compounds using WNN, here- after called WNN-GIP, can be directly formulated: for each new drug compoundd, add yd WNN as new row to the matrix Y and apply GIP to predict the score in- teraction profile ˆy of d.

**A method to show the bias of a LOOCV procedure**

In a recent paper (Mei et al., the BLM-NII algorithm is introduced andassessed using the following leave-one-out cross validation (LOOCV) procedure.

Each compound with only one interaction in Y is treated as a ‘new candidate'in the cross validation and the BLM-NII procedure is applied to it. We observethat in this way a strong prior is implicitly used in the cross validation, namelythe fact that the considered compound had at least one interaction.

To illustrate how this prior introduces a bias on the results, we consider the following simple procedure, called Const. Const constructs an all ‘1's profile forthe drug compounds or target proteins with only one interaction.

We can incorporate Const into GIP in the same way as WNN, giving the Const-GIP method. With this method all possible interactions for drug/targetswith only one interaction will be ranked before interactions with drugs/targetsthat also have other interactions. Essentially, for such interactions the method only has to do half the work, since the fact that the drug/target is correct can beknown with certainty. In real world situations there are also drug compoundsthat interact with none of the target under consideration, and vice versa, whichwould invalidate the Const-GIP method.

We perform a comparative experimental analysis of the proposed algorithmsand two recently published methods, Gönen and Mei et al.

We follow the experimental procedure adopted in Yamanishi, Kotera, et al.

and Gönen Specifically, for each dataset, drug compounds aresplit into five subsets of roughly equal size. Each subset is then used in turn asthe test set and training is performed on the data consisting of the remainingfour subsets. This procedure is repeated five times.

Results are assessed using the AUC and AUPR quality measures, generally used in this type of studies. Specifically, the ROC curve of true positives as afunction of false positives is computed, and the area under the ROC curve (AUC)is considered as quality measure (see for instance Fawcett, Furthermore,the precision–recall curve is computed, that is, the plot of the ratio of true pos-itives among all positive predictions for each given recall rate. The area underthis curve (AUPR) provides a quantitative assessment of how well, on average,predicted scores of true interactions are separated from predicted scores of truenon-interactions. Since there are few true drug–target interactions, the AUPR isa more informative quality measure than the AUC, as it punishes much morethe existence of false positive examples found among the top ranked predictionscores (Davis and Goadrich, Average AUC and AUPR results and standard deviations are reported in Table They indicate that a WNN-GIP has slightly better (average) AUCon all datasets except Enzyme. However, WNN has slightly better AUPR thanWNN-GIP. By itself the GIP method does not work well in this setting, which isto be expected, since it was not designed to handle unseen drugs.

To estimate the statistical significance of the AUC results we used the method described in E. R. DeLong, D. M. DeLong, and Clarke-Pearson To deter-mine significance of the AUPR results we used bootstrapping.

The last column of Table lists the average value of the decay rate T over the folds and repetitions. In general, the larger dataset have a higher (slower)decay rate, which means that more neighbors are taken into account.

Chapter 9. Interaction prediction for new drugs Table 9.1: Results of 5 fold cross validation: average AUC and AUPR over 5runs. Standard deviation is reported between parentheses. The best AUC andAUPR results are indicated in bold, results that are not significantly differentfrom the best (at α = 0.05) are indicated in italic.

**Comparison with other methods**

We consider the two following recent methods: KBMF2K (Gönen, andBLM-NII (Mei et al., KBMF2K is based on a Bayesian formulation that combines dimensionality reduction, matrix factorization and binary classification for predicting drug–target interaction networks using only chemical similarity between drug com-pounds and genomic similarity between target proteins.

In BLM-NII a drug-target interaction for an unseen drug compound d is inferred by constructing an estimated interaction profile for d as follows. Foreach target, an entry of the profile for d is defined as the sum of the similarityvalues of d and each of the drug compounds interacting with that target. Theresulting profile is then used as label information to learn an interaction modelfor d by means of the BLM method.

**Comparison with KBMF2K**

To compare results of WNN and WNN-GIP with those reported in Gönen we follow the experimental procedure therein used (described in the previoussection). Table also includes the AUC and AUPR for the KBMF2K method.

They indicate similar performance of KBMF2K and the simpler WNN algorithm,and slightly better overall results achieved by WNN-GIP, except on the Ion Chan-nel dataset.

We could test the prediction capability of the proposed methods on unknown drug–target interactions of the given network using the procedure adopted inGönen Therein, the complete interaction network for each dataset is usedas training data, and the predictions on non-interacting pairs in the trainingset are ranked with respect to their interaction scores. However, since eachdrug compound or target in the training set has at least one interaction, wedo not need to use WNN and the results are those of GIP. We report the topfive predicted interactions for each dataset in Table and Table The fulllists of all predicted interactions ranked by interaction score can be found in

**Comparison with BLM-NII**

Table shows the results of the LOOCV experiments. As expected, both Const-GIP and BLM-NII achieve very good results, with comparable AUC, and slightlybetter AUPR performance achieved by Const-GIP. To asses the statistical signifi-cance of these differences we used an upper bound on the variance of the AUCand AUPR for BLM-NII, because the actual variance is unknown. With thisbound the differences in AUC scores are not statistically significant.

In general, these results indicate that cross validation should be applied and interpreted with care. Note that the cross validation procedure used in thecomparison with KBMF2K is also positively biased, since we know that each‘unseen' drug compound has at least one interaction, but there the bias is muchsmaller.

In this work, we proposed a simple yet effective procedure to predict interac-tion profiles for unknown drug compounds and show how it can be directlyintegrated into a recent machine learning algorithm for the in-silico predictionof drug–target interactions. The novelty of our approach comes in the use ofa weighted nearest neighbor procedure for inferring a profile for a drug com-pound by using interaction profiles of the compounds in the training data, where Chapter 9. Interaction prediction for new drugs Table 9.2: Highest ranked predicted new interactions for each of the datasets.

Interactions found in ChEMBL, Matador, DrugBank and KEGG are indicated initalic and marked as C, M, D and K respectively.

cytochrome P450, family 21, subfamily A, polypeptide 2 cytochrome P450, family 2, subfamily E, polypeptide 1 cytochrome P450, family 1, subfamily A, polypeptide 1 cytochrome P450, family 11, subfamily B, polypeptide 2 cytochrome P450, family 2, subfamily C, polypeptide 9 calcium channel, voltage-dependent, L type, alpha 1S sub-unit cholinergic receptor, nicotinic, alpha 5 (neuronal) cholinergic receptor, nicotinic, alpha 4 (neuronal) KCNK5: potassium channel, subfamily K, member 5 sodium channel, voltage-gated, type V, alpha subunit Table 9.3: Continuation of Table dopamine receptor D3 adrenoceptor beta 2, surface Clonidine hydrochloride adrenoceptor alpha 1B glutamate receptor, metabotropic 4 adrenoceptor alpha 2C RAR-related orphan receptor B estrogen receptor 1 retinoic acid receptor, beta RAR-related orphan receptor C retinoic acid receptor, gamma each profile is weighted using information about chemical similarity betweendrug compounds integrated with a simple decay scheme. The method can bedirectly modified to predict interaction scores of unknown targets (or of bothunknown targets and drug compounds).

We performed a comparative assessment of the proposed methods on four different drug–target interaction networks from humans involving enzymes, ionchannels, GPCRs and nuclear receptors. Results indicated that WNN is com-petitive in predicting interaction for unknown drug compounds with more in-volved machine learning methods recently proposed, notably a fully probabilis-tic method based on a Bayesian formulation that combines kernel-based non-linear dimensionality reduction, matrix factorization and binary classification.

Chapter 9. Interaction prediction for new drugs Table 9.4: Results of LOOCV on pairs. Results of BLM-NNII are from Mei et al.

The best AUC and AUPR results are indicated in bold, results that arenot significantly different from the best (at α = 0.05) are indicated in italic, seethe main text for details.

Furthermore we showed that the direct integration of WNN in a recent kernelbased machine learning method provides a general and powerful tool for find-ing drug-target interactions.

The computational complexity of WNN is O(n2 + n2 t ), while the computa- tional complexity of WNN-GIP is dominated by the RLS prediction using theKronecker product kernel, which is O(n3 + n3 t ) as implemented in Chapter but can be further improved yielding a quadratic computational complexity byapplying recent techniques for large-scale kernel methods for computing thetwo kernel decompositions, e.g. Kashima, Idé, et al. Therefore WNN-GIPis more efficient than KBMF2K, since the total time complexity of each iterationin the variational approximation method used in KBMF2K is O(Rn3 + Rn3 where R is the subspace dimensionality used in the method.

A limitation of our approach is that it does not make a difference between an inactive target and a target that has not been measured for a compound.

Compounds with a higher mutual chemical similarity also have a higher chance of having the same bioactivity. This information could be considered byWNN by determining directly the weights from the similarity, instead of usingthe proposed ranking-based decay mechanism. In this way all the compoundswith high similarity would be considered with a high weight and all the com-pounds with low similarity would only have a minor contribution to the finalpredicted profile. On the same reasoning there is also a similarity thresholdfrom where the chance is so low that two compounds have the same profilethat it would be better not to predict something in the first place. In particularfor new screening data from very large screening libraries chances are high thatnone of the references are really similar to the screening hits, which would mostlikely have a detrimental effect in the overall prediction performance, if predic-tions would be made for all such compounds. Many published target predictionalgorithms apply such "applicability domain" or confidence estimations for theirpredictions. WNN could be modified to address this issue for instance by in-cluding a binary annotation based on a similarity threshold, or a more advancedprocedure based on the similarities of all compounds considered for the gener-ation of the profile.

**Biases in drug–target interaction data**

Network based prediction of interaction between drug compounds and target proteins isa core step in the drug discovery process. The availability of drug–target interaction datahas boosted the development of machine learning methods for the in silico prediction ofdrug–target interactions. In this chapter we focus on the crucial issue of data bias.

We show that four popular datasets contain a bias because of the way they have been constructed: all drug compounds and target proteins have at least one interaction andsome of them have only a single interaction. We show that this bias can be exploitedby prediction methods to achieve an optimistic generalization performance as estimatedby cross-validation procedures, in particular leave-one-out cross validation. We discusspossible ways to mitigate the effect of this bias, in particular by adapting the validationprocedure. In general, results indicate that the data bias should be taken into accountwhen assessing the generalization performance of machine learning methods for the insilico prediction of drug–target interactions.

The datasets and source code for this article are available at In our recent work on predicting drug–target interactions described in Chap-ter we discovered that a positive bias was implicitly introduced in a publishedmethod. This motivated the two main research questions we will address in thischapter: This chapter is based on van Laarhoven and Marchiori "Biases of drug–target in- teraction network data", published in the proceedings of the 9th international conference on Pat-tern Recognition in Bioinformatics. The datasets and source code for this chapter are available at Chapter 10. Biases in drug–target interaction data 1. How does data bias affect the results of procedures used to estimate the generalization performance of a method? 2. Can we quantify and avoid such bias? Cross-validation (CV) (Kohavi, is typically used to assess the general- ization performance of methods in the above mentioned settings. The datasetis repeatedly partitioned into two disjoint parts, a training set and a hold-outset. For each partition, the training set is used to construct the predictor and thehold-out set is used for testing. Popular variants are 10-fold CV, where the datais partitioned into ten folds, and each fold is used once as the hold-out set, andleave-one-out cross-validation (LOOCV), where each example constitutes onehold-out set. In the context of drug–target interaction various cross-validationsettings can be defined, depending on what is considered an example (e.g. asingle drug–target pair or all interactions with a single drug compound) and onthe selected CV procedure.

As before, we consider the four popular drug–target interaction datasets in humans involving enzymes, ion channels, G-protein-coupled receptors (GPCRs)and nuclear receptors from Yamanishi, Araki, et al. These data have beenused as benchmark datasets in many recent works, e.g. Bleakley and YamanishiChen, Liu, and Yan Gönen van Laarhoven, Nabuurs, andMarchiori van Laarhoven and Marchiori Mei et al. andWassermann, Geppert, and Bajorath In this chapter we show experimentally that these datasets contain a bias which may lead to optimistic CV generalization results. Furthermore, the extentto which this bias affects the results can differ for different methods. As a result,it is unclear whether a method with better CV results on these datasets will alsohave better performance in real applications.

Specifically, these datasets have been constructed in such a way that each drug compound and target protein has at least one interaction. Furthermore,some drug compound and/or targets have only a single interaction.

We show how this bias can be incorporated into a baseline prediction method in such a way that it significantly increases the LOOCV generalization perfor-mance. We investigate how this bias can be reduced and quantified. We showexperimentally that 5- or 10-fold CV reduces (but does not eliminate) the bias.

Furthermore, the presence of this bias can be quantified by separating the per-formance metrics for drug compounds and targets with just one interaction fromthat for other drug–target interaction pairs. This provides an alternative proce-dure to assess the generalization performance of a method by highlighting theeffect of the data bias.

In general, our results provide a contribution towards the understanding of CV procedures in the presence of data bias in the context of drug-target interac-tion networks.

**Related work**

Dataset bias has been investigated in different domains, e.g. in ligand based vir-tual screening (Baumann and Rohrer, where local clustering and globalspread of the considered benchmark datasets were identified influencing valida-tion results, and in object recognition (Torralba and Efros, where currentstate of recognition datasets have been comparatively analyzed and evaluatedbased on criteria including relative data bias and cross-dataset generalization.

To the best of our knowledge, this is the first time that drug–target interactionnetwork data bias is analyzed.

The dangers of CV have been studied by the machine learning community in various contexts. For instance, in Isaksson et al. CV and bootstrappingin small sample classification are investigated. A fundamental problem is thatthe uncertainty in a point estimate obtained with these procedures is unknownand may be quite large. The authors therefore suggest that the final classifi-cation performance should be reported in the form of a Bayesian confidenceinterval or using some other method that yields conservative measures of theuncertainty. Furthermore, in Rao and Fung it was empirically shown thatwhen the number of algorithms is large, LOOCV is not an effective estimate ofgeneralization performance for the algorithm that has the best cross-validationperformance. The authors showed that this behavior worsens as the sample sizedecreases, and as the dimensionality and number of algorithms increase. Thephenomenon of under-estimating cross validation error was also demonstratedon some benchmark data sets, and was shown to be worse for datasets withhigher dimensionality.

In Yamanishi, Araki, et al. datasets were introduced for the drug–targetprediction problem. These datasets are based on four different domains: en-zymes, ion channels, GPCRs and nuclear receptors. The datasets are constructedin such a way that only the proteins that have an interacting drug are included,and for each domain only the drugs that interact with at least one protein areincluded. It turns out that this property introduces problems for validation.

Chapter 10. Biases in drug–target interaction data Table 10.1: The number of drug compounds, the number of target proteins, thenumber of interactions and the number of unique interaction pairs (interactionswhich are the only one for a drug or target) in the drug–target datasets fromYamanishi, Araki, et al. In Table we give an overview of the four datasets as they are used in re- cent As can be seen in the last column, a large fraction of the drugcompounds and target have just one interaction in the dataset. Or equivalently,there are many interactions which are the only interaction for a drug–target. Wecall such interacting pairs unique.

As in previous chapters, the interactions in a dataset are encoded in a matrix ydt, such that ydt = 1 if drug compound d interacts with target protein t,and ydt = 0 otherwise. Besides this interaction information, there is also otherinformation available on the drugs and targets themselves, encoded in kernelmatrices. We will not use the kernels in this chapter.

As discussed in Chapter there are four ways in which these datasets of in-teractions can be used by machine learning methods. Of these the two mostcommon ones are: 1. The ‘unseen drug' setting, that is, to train a model to predict with which targets a previously unseen drug will interact.

2. The ‘pairs' setting, where the goal is to find new putative interactions be- tween drugs and targets already in the dataset.

An overview of the prediction setting and type of CV used in state-of-the-art methods applied to these datasets are shown in Table In this work we focusprimarily on the ‘pairs' setting, which is used by most of the methods listed inthe table. We will briefly discuss the ‘unseen drug' setting in our experimentsas well.

1This table is the same as Table but here we include information on the unique interactions.

Table 10.2: A list of papers that used the interaction data in Table showingthe type of prediction setting (‘unseen drug' or ‘pairs') and type of CV procedureused.

Yamanishi et al., Bleakley et al., LOOCV, 10-fold CV van Laarhoven et al., LOOCV, 10-fold CV LOOCV, 10-fold CV van Laarhoven et al., Usually methods are compared by looking at the ranking of interactions they produce in a cross-validation setting. That is, each drug–target pair is assigneda score by each method, where only other interacting pairs are shown to themethod. Then the pairs are ranked based on these scores and the quality of theranking is compared using AUC, AUPR or other summary statistics. Specifically,the ROC curve of true positives as a function of false positives is computed, andthe area under the ROC curve (AUC) is considered as quality measure, see forinstance Fawcett Furthermore, the precision–recall curve is computed,that is, the plot of the ratio of true positives among all positive predictions foreach given recall rate. The area under this curve (AUPR) is a more informativequality measure than the AUC, as it punishes much more the existence of falsepositive examples found among the top ranked prediction scores (Davis andGoadrich, Suppose that a method is tested using LOOCV. Then if a unique interaction (d, t)is left out, the method will see a row (or column) of zeros in the matrix. But weknow that the dataset does not have such rows or columns, since each drug andtarget has at least one interaction. We can therefore know with certainty thatthis pair interacts. This process is illustrated in Figure Consider a simple baseline method, that ranks drug–target pairs by the num- ber of adjacent pairs that are known to interact, where two drug–target pairs are 2Chapter of this thesis.

3Chapter of this thesis.

Chapter 10. Biases in drug–target interaction data Figure 10.1: In the LOOCV procedure, the task is to predict a single unknowndrug–target interaction, assuming all other interactions are known. This is indi-cated by x in the matrix of drug–target interactions. Because of the constructionof the dataset, we can know with certainty that in the second matrix x = 1,otherwise this drug compound would not be included in the dataset.

adjacent if they share a drug or a target. This number of adjacent interactingpairs for the pair (d, t) is At first glance we would expect drugs or targets that already have many knowninteractions to be more promiscuous, and therefore also more likely to interactwith other drugs and targets. But as explained in the previous paragraph that isnot the case when LOOCV is used.

To test this effect, in Figure we have plotted the fraction of pairs that interact against adt. This is an empirical estimate ˆP(ydt adt) of the probabilitythat d and t interact given the number of adjacent pairs for (d, t). Overall there isindeed a trend for larger adt to correspond to a higher probability of interacting.

But for very low adt we see the bias in action: the probability of such pairsinteracting is very high, since many of them are unique interactions.

A method can exploit this knowledge as follows. Consider the biased variant of the baseline method, which is the same as the baseline, except that it ranks thepairs with no observed adjacent pairs sharing a drug or with no pairs sharinga target before all other drug–target pairs. More precisely, instead of rankingpairs by adt, they are ranked by In Table we compare the LOOCV performance of this biased method to theunbiased baseline.

10.5. Avoiding the bias Figure 10.2: Probability of a drug–target pair interacting given the number ofadjacent interactions. The first plot shows this probability for LOOCV in theGPCR dataset, the second plot for 10-fold cross validation. The shaded areaindicates a 95% confidence interval based on a uniform prior.

To estimate the statistical significance of the AUC results we used the method described in E. R. DeLong, D. M. DeLong, and Clarke-Pearson To deter-mine significance of the AUPR results we used bootstrapping.

The difference between the unbiased and the biased methods is purely due to the unique interactions. In Table we also show the AUC and AUPR splitup for just the unique and non-unique interactions. With the unbiased baselinemethod, the AUC for unique interactions is barely above random chance level,while the biased baseline method achieves a perfect AUC. The overall AUC is aweighted average of the AUCs for unique and non-unique interactions, wherethe weight corresponds to the fraction of unique interactions. For example, forthe GPCR dataset, 79% · 0.863 + 21% · 1.000 = 0.891. Such a relation does nothold for AUPR scores, but the overall picture is similar.

**Avoiding the bias**

It seems that the biased results stem from the use of LOOCV. And so one wouldhope to avoid this problem by using 10 fold CV instead. As the right partof Figure shows, this indeed reduces the bias, but it does not completelyeliminate it.

We have repeated the experiment from the previous section with 10-fold CV instead of LOOCV. This is the setting used by, for instance Yamanishi, Araki, Chapter 10. Biases in drug–target interaction data Table 10.3: Performance of the unbiased baseline method and the biased variantwhen tested with LOOCV. The best results for each dataset are indicated in bold.

et al. As seen in the Table exploiting the data bias still improvesthe AUC and AUPR scores for unique interactions, but this comes at the costof the performance for non-unique interactions. In general, with k-fold cross-validation on a dataset with n drugs/targets, for each unique interacting pair,there are on the order of n/k non-interacting pairs that will be excluded in thesame fold. These pairs will appear similar to the unique interaction ones. Asthe dataset becomes larger, there will be more such pairs.

However, it is still possible to beat the baseline method by making a trade-off between the increased performance on unique interactions and decreased per-formance on other interactions. For example, one can introduce the ‘slight bias'method that ranks pairs which appear to be unique as if they have k adjacentinteractions. So it ranks pairs by unique7→k for some k < ∞. By tuning this pa- rameter k we can tune the trade-off. In our experiments we chose k with crossvalidation. As shown in Table this method achieves best AUC and AUPRon all but the smallest dataset; and in all cases shows a significant improvementover the baseline method.

So far we have considered the bias in the pairs setting. Results suggest that perhaps this validation setting should not be used. An alternative is the unseendrug setting, where one or more rows are left out in their entirety from thedrug–target interaction matrix. This means that it becomes impossible to see if a 10.5. Avoiding the bias Table 10.4: Performance of the unbiased baseline method and the biased vari-ants when tested with 10 fold CV. The best results for each dataset are indicatedin bold, results in italic do not differ significantly from the best (at α = 0.05).

pair is unique for a certain drug. But there are still interactions that are uniquefor a target. As shown in Table this bias can still be exploited for improvingCV performance, even when using 5- or 10-fold cross-validation.

Another option is to separate the unique interactions from the non-unique interactions when doing validation. As shown in our experiments, the non-unique interactions are not sensitive to the same bias. A good solution wouldbe to only consider the AUC and AUPR scores for the non-unique interactionswhen comparing different methods. This still introduces a bias of a differentkind, however, since some drug compounds and targets will be unnecessarilyexcluded.

A different way to validate a method is to seek confirmation of the pre- dictions in other datasets. This is done by for instance Yamanishi, Kotera, etal. van Laarhoven, Nabuurs, and Marchiori and Gönen where the 10 highest rank predictions are looked up in the literature, and innewer versions of the KEGG BRITE, DrugBank Chembl, SuperTarget and Mata- Chapter 10. Biases in drug–target interaction data Table 10.5: Performance of the baseline method and biased variants in the un-seen drug setting, when validated with 5-fold CV. The best results for eachdataset are indicated in bold, results in italic do not differ significantly fromthe best (at α = 0.05).

dor databases. A problem with such validation is that it is hard to quantifythe performance, because only a few interactions are verified, and because thesedatabases are extended between the publication of different papers.

Perhaps the most principled way of avoiding biases in validation is to act on the data and construct more realistic datasets. For this problem, that means thatthe dataset should also include compounds that interact with none of the targets,or targets for which there is no known interacting compound. The question thenbecomes which other drug compounds and proteins to include in the dataset.

This possibility remains to be explored.

We have shown that popular benchmark data for the drug–target interactionproblem are biased because they include only drug compounds and target pro- 10.6. Conclusions teins with at least one interaction. This bias can be quantified by looking atthe CV performance on these unique interactions separately from non-uniqueinteractions. The bias is the largest with leave-one-out cross-validation in thepairs setting. But even with 5- or 10-fold cross-validation and in the unseendrugs setting there is still a significant bias. Our analysis indicates that resultsof CV procedures to assess the predictive performance of methods for drug–target interaction networks should be interpreted with care because they couldbe possibly positively affected by bias contained in the considered datasets.

The baseline method discussed in this chapter does not use the similarity in- formation of drug compounds or target proteins at all. Hence, the performanceis far below the state of the art. However, the effects of the bias carry over toother methods. For any ranking method rdt we can define a variant rdt that exploits the dataset bias and thereby boosts the performance on uniqueinteracting pairs.

We have not performed an empirical study of the prevalence of biases in published methods. Of course none of the methods in Table exploit the biasin quite such a blatant way as our ‘biased baseline' method. Still, there couldbe methods that inadvertently take more advantage of the bias than others, forexample in the choice of parameter values or in the way they handle specifictypes of drug–target pairs.

In this work we have focused on a single group of datasets, with a spe- cific type of interaction, drug–target interaction. It remains to be investigatedwhether other datasets for the drug–target interaction prediction problem anddatasets for other similar problems have the same bias. It would also be inter-esting to consider other interaction datasets, such as the drug–target, enzyme–motabolite and protein–ligand datasets from (Keiser, Roth, et al., Campil-los et al., Faulon et al., Jacob and J.-P. Vert, Keiser, Setola, et al., In this chapter we will look back at questions that were asked at the beginningof this thesis, and at possible new questions that arise.

**So, what is clustering?**

In the first part of this thesis I asked the question: "what is network clustering?".

In chapters and I have made an attempt at answering that question, by givingproperties that clustering methods should satisfy. These properties form a usefulset of axioms for reasoning about clustering.

We know that the set of axioms is sound, that is, there is a quality function that satisfies all of them. We may wonder if the set of axioms is also complete,i.e. if it completely fixes the quality function or clustering method. If that werethe case we would have completely defined what clustering is. Though I haveno proof, I believe this not to be the case.

The only work I am aware of that gives a complete characterization of a form of clustering is (Zadeh and Ben-David, where the authors give axioms forsingle linkage hierarchical clustering. But a large risk of trying to completelycharacterize a certain clustering method is that the axiomatization becomes triv-ial. That is, the axioms are essentially saying "clusters should be the clustersfound by method X".

As said above, the axioms allow us to reason about clustering. But we do not yet have a useful calculus for clustering, nor do we know what exactly it wouldlook like. One area where these axioms are useful is in the design of algorithms.

For example, locality means that a change to one part of a clustering doesn'taffect the optimal clustering on unrelated parts of the network. This allowsthe optimization to be parallelized, with different machines solving clustering Chapter 11. Discussion optimization problems on parts of the network. Another property useful foroptimization, which was not mentioned in Chapter is the ability to zoom out.

That is, to combine multiple nodes into a larger ‘group' node. This property isused by the Louvain optimization algorithm.

In Chapter we saw how locality can also be defined for soft clustering. But besides locality, Chapter contains another important axiom for hard clustering,monotonicity. It is not yet clear what monotonicity would mean for soft andoverlapping clusters. Edges are then no longer completely inside or outside acluster, rather they are in some clusters to some degree. In fact, with Poissonlikelihood NMF clustering, every edge will be inside some cluster, otherwise thequality is −∞.

**Robustness of clustering**

Real data will be noisy. We would like this noise to not affect results too much.

This is an important motivation of several of the axioms for network clustering.

None of the axioms directly guarantees that the clustering is robust to noise, and this is impossible to guarantee in general. After all, at some point a tinychange in the graph must lead to a change in the discrete clustering. But theaxioms do allow us to make some limited statements about robustness. With amethod that satisfies locality, noise in one part of the graph will not affect theclustering in another part of the graph. And with monotonicity additive noiseinside clusters will not affect the quality of the clustering.

In our experiments on protein–protein interaction data we have shown that some clustering methods are much more robust to noise than others. There areways to improve the robustness of a clustering method, which we did not in-vestigate there. A promising idea is to use ensemble methods, where severaldifferent near optimal clusterings are combined. How to combine these cluster-ings is still an open problem, though.

**The relevance of locality**

A motivation for the locality axiom is the fact the observed network may just bea subsample of all objects and relations in the real world. But there are actuallytwo ways in which a graph with a clustering structure may be subsampled.

First of all, we may select nodes and their edges independently. Suppose, for example, that we randomly select 0.1% of all nodes. Then with high confidence,we can assume that the sample actually has nodes from all clusters, assumingthat they are not too small. More concretely, when selecting 0.1% of the peopleon earth, you can be confident that you get someone from every country (except 11.2. Data and validation perhaps Vatican city). In this setting, locality is not a relevant property of theclustering method, and we may even specify the number of clusters in advance.

On the other hand, what if sampling depends on the cluster or relations? You may sample the network of co-authorships, starting with me, my coauthors andso on. Then it becomes much more likely that the sample completely missesparts of the network, say those concerning English literature. In this settinglocality becomes important if we want to draw conclusions of the entire networkbased on just the sample.

In practice the situation is of course somewhere in between. We don't really know how the protein interaction network was sampled. There might be a focuson disease related proteins, or on proteins that are easily isolated in the lab. Inthis case requiring locality is the safe choice.

**Data and validation**

Data is always an important issue when doing machine learning. No matterhow good your methods are, you can't make good inferences with bad data.

There are large network datasets with millions of nodes, to which network clustering can be applied. But for these datasets only the nodes and edges areknown. We don't know the cluster structure, or even if the data has a clusterstructure at all. So the question becomes how to know if the clusters foundby a method are actually sensible, and how to compare different methods. Forthis reason we have used smaller networks where a ground truth clustering isknown, together with synthetic data.

With the drug–target network used in the second part of this thesis we can at least use cross-validation to validate our predictions. But even this cross-validation is still problematic in several ways. First of all, only the positivesare known, so all validation is based on the assumption that if no interactionis observed then no interaction exists. Secondly, in Chapter we saw that thedrug–target interaction network only includes drugs and targets with at leastone known interaction.

Note that we know of these two problems because of meta information. That is, we know how the dataset was constructed. Without this meta information wemight still be able to detect the second problem based on the degree distributionof the network, but the first problem is not detectable.

A third problem with the drug–target dataset is that it is unweighted and unlabeled: edges only indicate that a drug and target interact, not the strengthof the interaction, or what kind of interaction it is. Having this extra informationmight make better predictions possible.

Chapter 11. Discussion Still, despite these problems, it is possible to predict new interactions. An important ingredient in these successful predictions are the kernels for drugcompounds and target proteins. Besides augmenting them with informationfrom the network, we have not looked at the kernels at all. This is an area whereimprovements might be possible, but it would require more domain specificknowledge, and likely methods very different from the research in this thesis.

What if your data does not take the form of a network? One option is to use a k nearest neighbor graph of a dataset. These kNN graphs have a specialstructure. They are directed graphs, where each node has out-degree k. It mightbe possible to take advantage of this structure in a clustering algorithm. Forinstance, normalization of edge weights by node degree might not be needed ifall nodes have a similar degree.

It is also not clear if the kNN graph is the only way of converting a general dataset to a graph. Nor do we know what parameters to use, what value of k,how to weight the edges, and how to make the graph symmetric. If the graphis later used as input for a certain clustering algorithm, it might be possible todetermine what parameters should then be used to generate the graph.

The interaction profile kernel works in the other direction. It can be used to apply kernel methods to network data. Again, this kernel has a special structure,and it might be possible to harness it to get better results.

A large part of this thesis is focused on theory instead of algorithms, and soseveral algorithmic questions remain unanswered. The most important missingpiece is a general method for soft clustering. There are methods for non-negativematrix factorization, but these only work for a small set of quality functions. Inaddition, their running time is linear in the number of clusters, which can be aproblem in large graphs. The experiments in Chapter were performed with ageneral purpose optimization tool. This works fine for toy problems, but doesn'tscale up to larger networks. Compare this to the situation for hard clustering. InChapter we used the Louvain method, which can efficiently find a good (hard)clustering of very large networks. It would be very useful if this method can beextended to soft clustering.

**Table of notations**

Notations used in Edge weight function or adjacency matrix of a graph.

A(i, j); The weight of the edge between i and j, or 0 if thereis no edge.

Predicted cluster membership in NMF: an element of theproduct of membership matrices.

A clustering: a set of clusters.

The set of all clusters with support equal to s.

The set of all clusters with support a subset of s.

A graph: a tuple (V, A).

The ‘entropy' function: h(x) = −x log x.

The membership coefficient for node i in cluster c.

The half-normal distribution with parameter σ.

The normal distribution with mean µ and standard devia-tion σ.

A quality function: a function from graphs and clusteringsto real numbers.

The set of non-negative real numbers.

The set of positive real numbers.

The strength of node i in graph G.

Support of a cluster: the set of nodes with non-zero mem-bership, or equivalently the set s for which c ∈ C(s).

The set of vertices in a graph.

The volume of cluster or set of nodes c in graph G.

Normalized volume of cluster c: vc(G)/vV (G).

Table of notations The within cluster weight of cluster c in graph G.

wc(G)/vV (G).

The second membership coefficients next to hci in asym-metric NMF.

A per cluster hyper parameter in the method of Psorakiset al.

Parameter of the Poisson prior on the number of clustersper node.

Mixing parameter for the LFR graph model: fraction ofedges that are between clusters.

The indicator function: 1 if p is holds, 0 otherwise.

Nodes i and j are in the same cluster in clustering C.

Symmetric difference of two clusterings: the set of clustersin exactly one of C and D.

The sum of multisets, the multiplicity of c in C ] D is thesum of the multiplicities of c in C and of c in D.

Refinement of clusterings.

Notations used in The number of adjacent interacting pairs for the pair (d, t).

Set of all drugs compounds under consideration.

Kernel for drugs based on chemical similarity.

Kernel for proteins based on sequence similarity.

Kernels for drugs and targets based on Gaussian Interac-tion Profiles.

Combined kernel for drugs compounds.

Combined kernel for target proteins.

The number of drug compounds: nd = D .

The number of target proteins: nt = T .

Set of all target proteins under consideration.

View a matrix as a vector by stacking its columns.

Known interaction between drug di and target ti, 1 if theyinteract, 0 if no interaction is known.

Predicted interaction scores.

Interaction Profile of drug i: the ith row of Y.

Interaction Profile of target j: the jth column of Y.

Weight for combining different drug (or target) kernels Bandwidth of the GIP kernel.

Size independent bandwidth parameter of the GIP kernel.

Kronecker product of two matrices.

**M. Ackerman, S. Ben-David, D. Loker, and S. Sabato (2013)**. "Clustering Oli-

garchies". In: Proceedings of the International Conference on Artificial Intelligenceand Statistics (AISTATS). Vol. 31. JMLR Workshop and Conference Proceed-ings, pp. 66–74 (cited on page

**M. Ackerman and S. Ben-David (2008)**. "Measures of Clustering Quality: A

Working Set of Axioms for Clustering". In: NIPS. Ed. by D. Koller, D. Schu-urmans, Y. Bengio, and L. Bottou. Curran Associates, Inc., pp. 121–128 (citedon pages

**M. Ackerman, S. Ben-David, S. Brânzei, and D. Loker (2012)**. "Weighted Clus-

tering". In: AAAI. Ed. by J. Hoffmann and B. Selman. AAAI Press (cited onpage

**M. Ackerman, S. Ben-David, and D. Loker (2010a)**. "Characterization of Linkage-

based Clustering". In: COLT 2010: The 23rd Conference on Learning Theory,pp. 270–281 (cited on pages

**M. Ackerman, S. Ben-David, and D. Loker (2010b)**. "Towards Property-Based

Classification of Clustering Paradigms". In: NIPS. Ed. by J. D. Lafferty, C. K. I.

Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta. Curran Associates,Inc., pp. 10–18 (cited on page

**L. Adamic and N. Glance (2005)**. "The Political Blogosphere and the 2004 U.S.

Election: Divided They Blog". In: LinkKDD '05: Proceedings of the 3rd interna-tional workshop on Link discovery, pp. 36–43 (cited on page

**H. Almeida, D. Guedes, W. M. Jr., and M. J. Zaki (2011)**. "Is There a Best

Quality Metric for Graph Clusters?" In: Machine Learning and Knowledge Dis-covery in Databases. Ed. by D. Gunopulos, T. Hofmann, D. Malerba, and M.

Vazirgiannis. Vol. 6911. Lecture Notes in Computer Science. Springer BerlinHeidelberg, pp. 44–59. isbn: 978-3-642-23779-9 (cited on page

**L. Angelini, D. Marinazzo, M. Pellicoro, and S. Stramaglia (2007)**. "Natural

clustering: the modularity approach". In: Journal of Statistical Mechanics: The-ory and Experiment 2007.08, p. L08001 (cited on page

**A. Arenas, A. Fernández, and S. Gómez (2008)**. "Analysis of the structure of

complex networks at different resolution levels". In: New Journal of Physics10.5, p. 053039 (cited on page

**J. Basilico and T. Hofmann (2004)**. "Unifying collaborative and content-based

filtering". In: ICML '04: Proceedings of the 21st International Conference on Ma-chine learning. ACM, pp. 65–72 (cited on pages

**K. Baumann and S. Rohrer (2008)**. "Exploring benchmark dataset bias in ligand

based virtual screening". In: Chemistry Central Journal 2.Suppl 1, P1. issn:1752-153X (cited on page

**A. Ben-Hur and W. S. Noble (2005)**. "Kernel methods for predicting protein–

protein interactions". In: Bioinformatics 21.suppl. 1, pp. i38–i46 (cited on page

**K. Bleakley and Y. Yamanishi (2009)**. "Supervised prediction of drug-target

interactions using bipartite local models." In: Bioinformatics 25.18, pp. 2397–2403 (cited on pages

**V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre (2008)**. "Fast un-

folding of communities in large networks". In: Journal of Statistical Mechanics:Theory and Experiment 2008.10, P10008. issn: 1742-5468 (cited on pages

**B. Bollobás (2001)**. "The Evolution of Random Graphs – the Giant Component".

In: Cambridge University Press, pp. 130–159. isbn: 9780521797221 (cited onpage

**U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D.**

**Wagner (2008)**. "On Modularity Clustering". In: IEEE Transactions on Knowl-

edge and Data Engineering 20.2, pp. 172–188. issn: 1041-4347 (cited on pages

**M. Brito, E. Chávez, A. Quiroz, and J. Yukich (1997)**. "Connectivity of the mu-

tual k-nearest-neighbor graph in clustering and outlier detection". In: Statis-tics & Probability Letters 35.1, pp. 33–42. issn: 0167-7152 (cited on pages

**S. Brohee and J. van Helden (2006)**. "Evaluation of clustering algorithms for

protein-protein interaction networks". In: BMC Bioinformatics 7.1, p. 488 (citedon pages

**S. Bubeck and U. von Luxburg (2009)**. "Nearest Neighbor Clustering: A Baseline

Method for Consistent Clustering with Arbitrary Objective Functions". In:Journal of Machine Learning Research 10, pp. 657–698. issn: 1532-4435 (cited onpage

**T. Bühler and M. Hein (2009)**. "Spectral clustering based on the graph p-Laplacian".

In: Proceedings of the 26th Annual International Conference on Machine Learning.

ICML '09. Montreal, Quebec, Canada: ACM, pp. 81–88. isbn: 978-1-60558-516-1 (cited on page

**M. Campillos, M. Kuhn, A.-C. Gavin, L. J. Jensen, and P. Bork (2008)**. "Drug

target identification using side-effect similarity". In: Science 321.5886, pp. 263–266 (cited on pages

**G. Carlsson, F. Mémoli, A. Ribeiro, and S. Segarra (2013)**. "Axiomatic Con-

struction of Hierarchical Clustering in Asymmetric Networks". In: ICASSP2013: IEEE International Conference on Acoustics, Speech and Signal Processing,pp. 5219–5223 (cited on pages

**M. Catral, L. Han, M. Neumann, and R. J. Plemmons (2004)**. "On reduced rank

nonnegative matrix factorizations for symmetric matrices". In: Special Issueon Positivity in Linear Algebra, pp. 107–126 (cited on page

**A. T. Cemgil (2009)**. "Bayesian Inference for Nonnegative Matrix Factorisation

Models". In: Computational Intelligence and Neuroscience 2009, 4:1–4:17. issn:1687-5265 (cited on page

**X. Chen, M.-X. Liu, and G.-Y. Yan (2012)**. "Drug-target interaction prediction by

random walk on the heterogeneous network". In: Mol Biosyst. 8.7, pp. 1970–1978 (cited on pages

**A. C. Cheng, R. G. Coleman, K. T. Smyth, Q. Cao, P. Soulard, D. R. Caffrey,**

**A. C. Salzberg, and E. S. Huang (2007)**. "Structure-based maximal affinity

model predicts small-molecule druggability". In: Nature Biotechnology 25.1,

pp. 71–75 (cited on page

**J. M. Cherry et al. (2011)**. "Saccharomyces Genome Database: the genomics re-

source of budding yeast". In: Nucleic Acids Research (cited on page

**H. N. Chua, K. Ning, W. K. Sung, H. W. Leong, and L. Wong (2008)**. "Using

indirect protein-protein interactions for protein complex prediction". In: Jour-nal of Bioinformatics and Computational Biology 6.3, pp. 435–466. issn: 0219-7200(cited on page

**A. Clauset, C. R. Shalizi, and M. E. J. Newman (2009)**. "Power-Law Distribu-

tions in Empirical Data". In: SIAM Rev. 51 (4), pp. 661–703. issn: 0036-1445(cited on page

**S. Collins, P. Kemmeren, X.-C. Zhao, J. Greenblatt, F. Spencer, F. Holstege, J.**

**Weissman, and N. Krogan (2007)**. "Towards a comprehensive atlas of the

physical interactome of Saccharomyces cerevisiae". In: Molecular Cellular Pro-

teomics, pp. 600381–600200 (cited on pages

**L. Danon, J. Duch, A. Arenas, and A. Díaz-Guilera (2005)**. "Comparing com-

munity structure identification". In: Journal of Statistical Mechanics: Theory andExperiment 2005.09, P09008 (cited on page

**J. Davis and M. Goadrich (2006)**. "The relationship between Precision-Recall

and ROC curves". In: ICML '06: Proceedings of the 23rd International Conferenceon Machine learning. ACM, pp. 233–240 (cited on pages

**E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson (1988)**. "Comparing

the Areas under Two or More Correlated Receiver Operating CharacteristicCurves: A Nonparametric Approach". In: Biometrics 44.3, pp. 837–845. issn:0006-341X (cited on pages

**C. Ding, X. He, and H. D. Simon (2005)**. "On the equivalence of nonnegative

matrix factorization and spectral clustering". In: SIAM International Confer-ence on Data Mining (cited on pages

**C. Ding, T. Li, W. Peng, and H. Park (2006)**. "Orthogonal Nonnegative Ma-

trix T-factorizations for Clustering". In: Proceedings of the 12th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining. KDD '06.

Philadelphia, PA, USA: ACM, pp. 126–135. isbn: 1-59593-339-5 (cited onpage

**T. N. Dinh and M. T. Thai (2013)**. "Community Detection in Scale-Free Net-

works: Approximation Algorithms for Maximizing Modularity". In: IEEEJournal on Selected Areas in Communications 31.6, pp. 997–1006 (cited on page

**A. M. Edwards, B. Kus, R. Jansen, D. Greenbaum, J. Greenblatt, and M. Ger-**

**stein (2002)**. "Bridging structural biology and genomics: assessing protein

interaction data with known complexes". In: Trends in Genetics 18.10, pp. 529–

536. issn: 01689525 (cited on page

**J.-L. Faulon, M. Misra, S. Martin, K. Sale, and R. Sapra (2008)**. "Genome scale

enzyme–metabolite and drug–target interaction predictions using the sig- nature molecular descriptor". In: Bioinformatics 24.2, pp. 225–233 (cited onpages

**T. Fawcett (2006)**. "An introduction to ROC analysis". In: Pattern Recognition

Letters 27.8, pp. 861–874 (cited on pages

**S. Fortunato (2010)**. "Community detection in graphs". In: Physics Reports 486,

pp. 75–174 (cited on pages

**S. Fortunato and M. Barthélemy (2007)**. "Resolution limit in community detec-

tion". In: Proceedings of the National Academy of Sciences of the United States ofAmerica 104.1, pp. 36–41 (cited on pages

**P. Franti, O. Virmajoki, and V. Hautamaki (2006)**. "Fast Agglomerative Cluster-

ing Using a k-Nearest Neighbor Graph". In: IEEE Trans. Pattern Anal. Mach.

Intell. 28.11, pp. 1875–1881. issn: 0162-8828 (cited on pages

**A.-C. Gavin, P. Aloy, et al. (2006)**. "Proteome survey reveals modularity of the

yeast cell machinery". In: Nature 440.7084, pp. 631–636 (cited on pages

**A.-C. Gavin, M. Bosche, et al. (2002)**. "Functional organization of the yeast

proteome by systematic analysis of protein complexes." In: Nature 415.6868,pp. 141–7 (cited on pages

**M. Girvan and M. E. J. Newman (2002)**. "Community structure in social and

biological networks". In: Proceedings of the National Academy of Sciences of theUnited States of America 99.12, pp. 7821–7826 (cited on pages

**S. Gollapudi and A. Sharma (2009)**. "An axiomatic approach for result diversi-

fication". In: Proceedings of the 18th international conference on World wide web,pp. 381–390 (cited on page

**M. Gönen (2012)**. "Predicting drug-target interactions from chemical and ge-

nomic kernels using Bayesian matrix factorization". In: Bioinformatics 28.18,pp. 2304–2310 (cited on pages

**B. H. Good, Y. A. de Montjoye, and A. Clauset (2010)**. "Performance of modu-

larity maximization in practical contexts". In: Physical Review E 81.4, p. 046106(cited on pages

**C. Granell, S. Gómez, and A. Arenas (2012)**. "Unsupervised Clustering Anal-

ysis: a Multiscale Complex Networks Approach". In: Internat. J. Bifur. ChaosAppl. Sci. Engrg. 22.7 (cited on page

**S. Günther et al. (2008)**. "SuperTarget and Matador: resources for exploring

drug-target relationships." In: Nucleic Acids Research 36.Database issue, pp. D919–D922 (cited on pages

**S. J. Haggarty, K. M. Koeller, J. C. Wong, R. A. Butcher, and S. L. Schreiber**

**(2003)**. "Multidimensional Chemical Genetic Analysis of Diversity-Oriented

Synthesis-Derived Deacetylase Inhibitors Using Cell-Based Assays". In: Chem-

istry & Biology 10.5, pp. 383–396 (cited on page

**M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa (2003)**. "Development of a

chemical structure comparison method for integrated analysis of chemicaland genomic information in the metabolic pathways". In: Journal of the Amer-ican Chemical Society 125.39, pp. 11853–65 (cited on page

**Y. Ho et al. (2002)**. "Systematic identification of protein complexes in Saccha-

romyces cerevisiae by mass spectrometry." In: Nature 415.6868, pp. 180–183(cited on pages

**A. L. Hopkins and C. R. Groom (2002)**. "The druggable genome." In: Nature

reviews. Drug discovery 1.9, pp. 727–730 (cited on page

**M. Hue and J.-P. Vert (2010)**. "On learning with kernels for unordered pairs". In:

ICML '10: Proceedings of the 27th International Conference on Machine Learning.

Ed. by J. Fürnkranz and T. Joachims. Haifa, Israel: Omnipress, pp. 463–470(cited on page

**A. Isaksson, M. Wallman, H. Göransson, and M. G. Gustafsson (2008)**. "Cross-

validation and bootstrapping are unreliable in small sample classification".

In: Pattern Recognition Letters 29.14, pp. 1960–1965 (cited on page

**L. Jacob, B. Hoffmann, B. Stoven, and J.-P. Vert (2008)**. "Virtual screening of

GPCRs: an in silico chemogenomics approach". In: BMC Bioinformatics 9,p. 363 (cited on page

**L. Jacob and J.-P. Vert (2008)**. "Protein-ligand interaction prediction: an im-

proved chemogenomics approach". In: Bioinformatics 24.19, pp. 2149–2156(cited on pages

**S. E. Jaroch and H. Weinmann, eds. (2006)**. Chemical Genomics: Small Molecule

Probes to Study Cellular Function. Ed. by S. E. Jaroch and H. Weinmann. ErnstSchering Research Foundation Workshop. Berlin: Springer (cited on page

**M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima,**

**T. Katayama, M. Araki, and M. Hirakawa (2006)**. "From genomics to chemi-

cal genomics: new developments in KEGG." In: Nucleic Acids Research 34.Database

issue, pp. D354–357 (cited on pages

**H. Kashima, T. Idé, T. Kato, and M. Sugiyama (2009)**. "Recent Advances and

Trends in Large-Scale Kernel Methods". In: IEICE Transactions 92-D.7, pp. 1338–1353 (cited on pages

**H. Kashima, S. Oyama, Y. Yamanishi, and K. Tsuda (2009)**. "On Pairwise Ker-

nels: An Efficient Alternative and Generalization Analysis". In: PAKDD '09:Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and DataMining. Springer, pp. 1030–1037 (cited on page

**M. J. Keiser, B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin, and B. K.**

**Shoichet (2007)**. "Relating protein pharmacology by ligand chemistry". In:

Nat. Biotechnol. 25.2, pp. 197–206 (cited on page

**M. J. Keiser, V. Setola, et al. (2009)**. "Predicting new molecular targets for

known drugs". In: Nature 462.7270, pp. 175–181 (cited on page

**B. W. Kernighan and S. Lin (1970)**. "An Efficient Heuristic Procedure for Par-

titioning Graphs". In: Bell System Technical Journal 49.1, pp. 291–307 (cited onpage

**T. Klabunde (2007)**. "Chemogenomic approaches to drug discovery: similar re-

ceptors bind similar ligands." In: British Journal of Pharmacology 152, pp. 5–7(cited on page

**J. M. Kleinberg (2002)**. "An Impossibility Theorem for Clustering". In: NIPS.

Ed. by S. Becker, S. Thrun, and K. Obermayer. MIT Press, pp. 446–453. isbn:0-262-02550-7 (cited on pages

**R. Kohavi (1995)**. "A study of cross-validation and bootstrap for accuracy esti-

mation and model selection". In: Proceedings of the 14th international joint con-ference on Artificial intelligence - Volume 2. IJCAI'95. Montreal, Quebec, Canada:Morgan Kaufmann Publishers Inc., pp. 1137–1143. isbn: 1-55860-363-8 (citedon page

**V. Krebs (2004)**. New Political Patterns. Editorial (cited on page

**N. J. Krogan et al. (2006)**. "Global landscape of protein complexes in the yeast

Saccharomyces cerevisiae". In: Nature 440.7084, pp. 637–643 (cited on pages

**J. M. Kumpula, J. Saramaki, K. Kaski, and J. Kertesz (2007)**. "Limited resolution

in complex network community detection with Potts model approach". In:The European Physical Journal B 56, p. 41 (cited on pages

**T. van Laarhoven and E. Marchiori (2012)**. "Robust community detection meth-

ods with resolution parameter for complex detection in protein protein in- teraction networks". In: Proceedings of the 7th IAPR international conference onPattern Recognition in Bioinformatics. PRIB'12. Berlin, Heidelberg: Springer-Verlag, pp. 1–13. isbn: 978-3-642-34122-9 (cited on page

**T. van Laarhoven and E. Marchiori (2013a)**. "Graph clustering with local search

optimization: The resolution bias of the objective function matters most". In:Physical Review E 87 (1), p. 012812 (cited on page

**T. van Laarhoven and E. Marchiori (2013b)**. "Predicting Drug-Target Interac-

tions for New Drug Compounds Using a Weighted Nearest Neighbor Pro-file". In: PLoS ONE 8.6, e66952 (cited on pages

**T. van Laarhoven and E. Marchiori (2014a)**. "Axioms for Graph Clustering Qual-

ity Functions". In: Journal of Machine Learning Research 15, pp. 193–215 (citedon page

**T. van Laarhoven and E. Marchiori (2014b)**. "Biases of drug–target interaction

network data". In: Proceedings of the 9th IAPR international conference on PatternRecognition in Bioinformatics. PRIB'14. Springer-Verlag. isbn: 978-3-319-09191-4 (cited on page

**T. van Laarhoven, S. B. Nabuurs, and E. Marchiori (2011)**. "Gaussian interac-

tion profile kernels for predicting drug–target interaction". In: Bioinformatics27.21, pp. 3036–3043 (cited on pages

**R. Lambiotte (2010)**. "Multi-scale modularity in complex networks". In: Proceed-

ings of the 8th Int. Symp. on Modeling and Optimization in Mobile, Ad Hoc andWireless Networks, pp. 546–553 (cited on pages

**A. Lancichinetti and S. Fortunato (2009)**. "Community detection algorithms:

A comparative analysis". In: Physical Review E 80 (5), p. 056117 (cited onpages

**A. Lancichinetti, S. Fortunato, and F. Radicchi (2008)**. "Benchmark graphs for

testing community detection algorithms". In: Physical Review E 78 (4), p. 046110(cited on page

**A. Lancichinetti and S. Fortunato (2011)**. "Limits of modularity maximization in

community detection". In: Physical Review E 84, p. 066122 (cited on pages

**D. D. Lee and H. S. Seung (1999)**. "Learning the parts of objects by non-negative

matrix factorization". In: Nature 401.6755, pp. 788–791. issn: 0028-0836 (citedon pages

**D. D. Lee and H. S. Seung (2000)**. "Algorithms for Non-negative Matrix Factor-

ization". In: 14th Annual Conference on Neural Information Processing Systems.

NIPS 2000. MIT Press, pp. 556–562 (cited on page

**A. Lewis, N. Jones, M. Porter, and C. Deane (2010)**. "The function of commu-

nities in protein interaction networks at multiple scales". In: BMC Syst Biol.

4:100 (cited on pages

**T. Li and C. H. Q. Ding (2013)**. "Nonnegative Matrix Factorizations for Clus-

tering: A Survey". In: Data Clustering: Algorithms and Applications. Chapmanand Hall/CRC, pp. 149–176. isbn: 9781466558212 (cited on page

**X. Li, M. Wu, C. K. Kwoh, and S. K. Ng (2010)**. "Computational approaches for

detecting protein complexes from protein interaction networks: a survey".

In: BMC Genomics 11.Suppl 1, S3 (cited on pages

**L. Lü and T. Zhou (2011)**. "Link prediction in complex networks: A survey". In:

Physica A: Statistical Mechanics and its Applications 390.6, pp. 1150–1170. issn:0378-4371 (cited on page

**U. von Luxburg (2007)**. "A tutorial on spectral clustering". In: Statistics and Com-

puting 17.4, pp. 395–416. issn: 0960-3174 (cited on pages

**Y. C. Martin, J. L. Kofron, and L. M. Traphagen (2002)**. "Do structurally similar

molecules have similar biological activity?" In: Journal of Medicinal Chemistry45.19, pp. 4350–4358 (cited on page

**J.-P. Mei, C.-K. Kwoh, P. Yang, X. Li, and J. Zheng (2013)**. "Drug-target in-

teraction prediction by learning from local information and neighbors". In:Bioinformatics 29.2, pp. 238–245 (cited on pages

**M. Meila (2005)**. "Comparing clusterings: an axiomatic view". In: Proceedings of

the 22nd international conference on Machine learning. ACM, pp. 577–584 (citedon page

**A. Merino, A. K. Bronowska, D. B. Jackson, and D. J. Cahill (2010)**. "Drug

profiling: knowing where it hits". In: Drug Discovery Today 15.17-18, pp. 749–756. issn: 1359-6446 (cited on page

**J. T. Metz and P. J. Hajduk (2010)**. "Rational approaches to targeted polyphar-

macology: creating and navigating protein-ligand interaction networks". In:Current Opinion in Chemical Biology 14.4. Next Generation Therapeutics, pp. 498–504. issn: 1367-5931 (cited on page

**H. W. Mewes, D. Frishman, U. Gildener, G. Mannhaupt, K. Mayer, M. Mokrejs,**

**B. Morgenstern, M. Minsterötter, S. Rudd, and B. Weil (2002)**. "MIPS: a

database for genomes and protein sequences". In: Nucleic Acids Res 30, pp. 31–34 (cited on page

**T. Nepusz, H. Yu, and A. Paccanaro (2012)**. "Detecting overlapping protein

complexes in protein-protein interaction networks". In: Nature Methods 9 (5),pp. 471–472. issn: 1548-7091 (cited on pages

**M. E. J. Newman (2006)**. "Finding community structure in networks using the

eigenvectors of matrices". In: Physical Review E 74 (3), p. 036104 (cited onpages

**M. E. J. Newman and M. Girvan (2004)**. "Finding and evaluating commu-

nity structure in networks". In: Physical Review E 69 (2), p. 026113 (cited onpages

**Y. Okuno, A. Tamon, H. Yabuuchi, S. Niijima, Y. Minowa, K. Tonomura, R.**

**Kunimoto, and C. Feng (2008)**. "GLIDA: GPCR—ligand database for chem-

ical genomics drug discovery—database and tools update". In: Nucleic Acids

Research 36.suppl 1, pp. D907–D912 (cited on page

**J. Overington (2009)**. "ChEMBL. An interview with John Overington, team leader,

chemogenomics at the European Bioinformatics Institute Outstation of theEuropean Molecular Biology Laboratory (EMBL-EBI). Interview by WendyA. Warr." In: Journal of Computer-Aided Molecular Design 23.4, pp. 195–198.

issn: 0920-654X (cited on pages

**S. Oyama and C. D. Manning (2004)**. "Using Feature Conjunctions Across Ex-

amples for Learning Pairwise Classifiers". In: Proceedings of the 15th EuropeanConference on Machine Learning. Vol. 3201. ECML'04. Springer, pp. 322–333(cited on page

**P. Paatero and U. Tapper (1994)**. "Positive matrix factorization: A non-negative

factor model with optimal utilization of error estimates of data values". In:Environmetrics 5.2, pp. 111–126 (cited on pages

**G. Palla, I. Derenyi, I. Farkas, and T. Vicsek (2005)**. "Uncovering the overlap-

ping community structure of complex networks in nature and society". In:Nature 435 (7043), pp. 814–818 (cited on page

**J. Pitman (2002)**. "Combinatorial Stochastic Processes". In: Lecture Notes for St.

Flour Summer School (cited on page

**I. Psorakis, S. Roberts, M. Ebden, and B. Sheldon (2011)**. "Overlapping com-

munity detection using Bayesian non-negative matrix factorization". In: Phys-ical Review E 83.6, pp. 066114+ (cited on pages

**S. Pu, J. Vlasblom, A. Emili, J. Greenblatt, and S. J. Wodak (2007)**. "Identifying

functional modules in the physical interactome of Saccharomyces cerevisiae."In: Proteomics 7.6, pp. 944–960 (cited on page

**J. Puzicha, T. Hofmann, and J. M. Buhmann (1999)**. "A Theory of Proximity

Based Clustering: Structure Detection by Optimization". In: Pattern Recogni-tion 33, pp. 617–634 (cited on pages

**F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi (2004)**. "Defin-

ing and identifying communities in networks". In: Proceedings of the NationalAcademy of Sciences of the United States of America 101.9, pp. 2658–2663 (citedon page

**V. V. Raghavan, G. S. Jung, and P. Bollmann (1989)**. "A Critical Investigation

of Recall and Precision as Measures of Retrieval System Performance". In:ACM Transactions on Information Systems 7.3, pp. 205–229 (cited on page

**R. B. Rao and G. Fung (2008)**. "On the Dangers of Cross-Validation. An Experi-

mental Evaluation". In: SDM. SIAM, pp. 588–596 (cited on page

**R. Raymond and H. Kashima (2010)**. "Fast and scalable algorithms for semi-

supervised link prediction on static and dynamic graphs". In: Proceedingsof the 2010 European conference on Machine learning and knowledge discovery indatabases: Part III. ECML PKDD'10. Barcelona, Spain: Springer-Verlag, pp. 131–147. isbn: 3-642-15938-9, 978-3-642-15938-1 (cited on page

**J. Reichardt and S. Bornholdt (2004)**. "Detecting Fuzzy Community Structures

in Complex Networks with a Potts Model". In: Physical Review Letters 93 (21),p. 218701 (cited on pages

**J. Reichardt and S. Bornholdt (2006)**. "Statistical mechanics of community de-

tection". In: Physical Review E 74 (1), p. 016110 (cited on page

**J. Reichardt and S. Bornholdt (2007)**. "Partitioning and modularity of graphs

with arbitrary degree distribution". In: Physical Review E 76 (1), p. 015102(cited on page

**R. Rifkin and A. Klautau (2004)**. "In defense of one-vs-all classification". In:

Journal of Machine Learning Research 5, pp. 101–141 (cited on page

**P. Ronhovde and Z. Nussinov (2009)**. "Multiresolution community detection for

megascale networks by information-based replica correlations". In: PhysicalReview E 80.1, pp. 016109+ (cited on page

**M. Rosvall and C. T. Bergstrom (2008)**. "Maps of random walks on complex net-

works reveal community structure." In: Proceedings of the National Academy of Sciences of the United States of America 105.4, pp. 1118–1123 (cited on pages

**J. Ruan and W. Zhang (2008)**. "Identifying network communities with a high

resolution". In: Physical Review E 77 (1), p. 016104 (cited on page

**S. E. Schaeffer (2007)**. "Graph clustering". In: Computer Science Review 1.1, pp. 27–

64. issn: 15740137 (cited on pages

**B. Schölkopf, K. Tsuda, and J. P. Vert, eds. (2004)**. Kernel Methods in Computa-

tional Biology. Ed. by B. Schölkopf, K. Tsuda, and J. P. Vert. MIT Press. isbn:9780262195096s (cited on page

**I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D.**

**Schomburg (2004)**. "BRENDA, the enzyme database: updates and major new

developments". In: Nucleic Acids Research 32.suppl. 1, pp. D431–433 (cited on

pages

**A. Schuffenhauer, P. Floersheim, P. Acklin, and E. Jacoby (2003)**. "Similarity

metrics for ligands reflecting the similarity of the target proteins". In: Jour-nal of Chemical Information and Computer Sciences 43.2, pp. 391–405 (cited onpage

**J. Shi and J. Malik (2000)**. "Normalized Cuts and Image Segmentation". In:

vol. 22. 8. Washington, DC, USA: IEEE Computer Society, pp. 888–905 (citedon pages

**T. F. Smith and M. S. Waterman (1981)**. "Identification of common molecular

subsequences." In: Journal of Molecular Biology 147.1, pp. 195–197 (cited onpage

**C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers**

**(2006)**. "BioGRID: a general repository for interaction datasets." In: Nucleic

Acids Research 34.Database-Issue, pp. 535–539 (cited on pages

**A. Torralba and A. A. Efros (2011)**. "Unbiased look at dataset bias". In: Proceed-

ings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition.

CVPR '11. Washington, DC, USA: IEEE Computer Society, pp. 1521–1528.

isbn: 978-1-4577-0394-2 (cited on page

**V. A. Traag, G. Krings, and P. V. Dooren (2013)**. "Significant Scales in Commu-

nity Structure". In: Scientific Reports 3 (cited on page

**V. A. Traag, P. Van Dooren, and Y. E. Nesterov (2011)**. "Narrow scope for

resolution-limit-free community detection". In: Physical Review E 84 (1), p. 016114(cited on pages

**P. Uetz et al. (2000)**. "A comprehensive analysis of protein-protein interactions in

Saccharomyces cerevisiae". In: Nature 403.6770, pp. 623–627 (cited on pages

**V. Y. F. Tan and C. Févotte (2009)**. "Automatic relevance determination in non-

negative matrix factorization". In: Proc. Workshop on Signal Processing withAdaptative Sparse Structured Representations (SPARS). St-Malo, France (citedon page

**S. Van Dongen (2008)**. "Graph Clustering Via a Discrete Uncoupling Process".

In: SIAM Journal on Matrix Analysis and Applications 30.1, pp. 121–141 (citedon page

**F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding (2011)**. "Community Discovery

Using Nonnegative Matrix Factorization". In: Data Min. Knowl. Discov. 22.3,pp. 493–521. issn: 1384-5810 (cited on page

**A. M. Wassermann, H. Geppert, and J. Bajorath (2009)**. "Ligand prediction

for orphan targets using support vector machines and various target-ligandkernels is dominated by nearest neighbor effects." eng. In: Journal of ChemicalInformation and Modeling 49, pp. 2155–2167 (cited on pages

**D. S. Wishart, C. Knox, A. C. C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B.**

**Gautam, and M. Hassanali (2008)**. "DrugBank: a knowledgebase for drugs,

drug actions and drug targets." In: Nucleic Acids Research 36.Database issue,

pp. D901–906 (cited on pages

**G. Wu, E. Chang, Y. K. Chen, and C. Hughes (2006)**. "Incremental approximate

matrix factorization for speeding up support vector machines". In: KDD '06:Proceedings of the 12th ACM SIGKDD international conference on Knowledge dis-covery and data mining. ACM, pp. 760–766 (cited on page

**Z. Xia, L.-Y. Wu, X. Zhou, and S. T. Wong (2010)**. "Semi-supervised drug-protein

interaction prediction from heterogeneous biological spaces". In: BMC Sys-tems Biology 4.Suppl 2, S6 (cited on page

**Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa (2008)**.

"Prediction of drug-target interaction networks from the integration of chem-ical and genomic spaces". In: Bioinformatics 24 (13), pp. i232–i240 (cited onpages

**Y. Yamanishi, M. Kotera, M. Kanehisa, and S. Goto (2010)**. "Drug-target in-

teraction prediction from chemical, genomic and pharmacological data inan integrated framework." In: Bioinformatics 26.12, pp. i246–i254 (cited onpages

**W. W. Zachary (1977)**. "An information flow model for conflict and fission in

small groups". In: Journal of Anthropological Research 33, pp. 452–473 (cited onpages

**R. B. Zadeh and S. Ben-David (2009)**. "A uniqueness theorem for clustering".

In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.

UAI '09. Montreal, Quebec, Canada: AUAI Press, pp. 639–646. isbn: 978-0-9749039-5-8 (cited on pages This thesis focuses on two different machine learning problems in the context ofnetworks.

In part one of this thesis we look at clustering, the problem of finding groups of nodes in networks. There is no precise definition of clustering, but a commonapproach is to say that a ‘good' clustering is one with a high quality, accordingto some quality function. But still, there are many possible choices of qualityfunctions, and is not clear which are the right ones for a certain problem.

Therefore, in

**Chapter**we give an axiomatic formalization of network clus-

tering quality functions. We discussed six desirable properties that we call ax-ioms, and show that some popular quality functions do not satisfy all of them.

This leads us to introduce a novel and flexible quality function that does satisfyall the discussed properties.

**Chapter**considers the same problem of comparing quality functions for

network clustering from a more empirical perspective. We compare the perfor-mance of the different quality functions on progressively harder artificial bench-marks, as well as real world datasets. The main difference between the qualityfunctions is in the number of clusters that are found, and if that number is keptfixed, most quality functions behave similarly. We also introduce several newquality functions.

In

**Chapter**we look at a concrete application of clustering, to protein–

protein interaction networks. We show how the method introduced in the pre-vious chapter can be used to find protein complexes, and experimentally vali-date the efficacy of the method for this purpose. An important advantage of outmethod compared to others is also that it is very robust to errors in the network.

In the real world, clusters are often not as crisp as we made them out to be so far. In

**Chapter**we therefore look at soft clustering, where the nodes can belong

to more than one cluster, and the memberships need not be binary. We again

take an axiomatic approach, and look at locality properties in particular. We

argue that locality is a desirable property because it allows local reasoning about

parts of a clustering. This leads us to introduce a new prior for probabilisticNon-negative Matrix Factorization, a popular way of finding overlapping or softclusterings that is otherwise not local.

In part two of this thesis we lookat another machine learning problem on networks, that of link prediction. In particular, the prediction of new interactionsbetween drug compounds and target proteins.

In

**Chapter**we introduce a kernel method for this interaction prediction

problem. Stated simply, the idea is that two drugs are similar if they interactwith the same or a similar set of target proteins. We construct a kernel based onthis profile of known interactions, and show that together with a simple rankingmethod it can lead to state-of-the-art performance.

This kernel is only useful for predicting new interactions between drugs and targets for which some interactions are already known. In

**Chapter**we address

this issue, and use a simple weighted nearest neighbor scheme to extend the

kernel also to drugs and targets with no known interactions.

In

**Chapter**we investigated the biases of the drug–target interaction datasets

used in our experiments, and also in several similar works. The problem is thatthe dataset only contains drug compounds and target proteins with at least oneknown interaction. This bias can be easily exploited to make a method lookvery good in cross-validation experiments. We then propose ways in which theevaluation procedure can be made robust to this problem and how the datasetcould be corrected.

Dit proefschrift behandeld twee verschillende Machine Learning onderwerpendie beide te maken hebben met netwerken.

In het eerste deel kijken we naar clustering, waar het doel is om groepen van knopen te vinden in een netwerk. Er is geen preciese definitie van clustering,maar een veel gebruikte aanpak is om te zeggen dat een clustering goed is alsdeze een hoge ‘score' heeft volgens een bepaalde scorefunctie. Dit verschuiftslechts het probleem, want er zijn veel verschilende scorefuncties voor netwerkclustering te vinden in de literatuur, en het is niet duidelijk welke de beste isvoor een bepaald probleem.

Daarom geven we in

**hoofdstuk**een axiomatische formalisatie van netwerk

clustering scorefuncties. We beschrijven zes eigenschappen die elke goede score-functie naar onze mening zou moeten hebben. Vervolgens laten we van een aan-tal populaire scorefuncties zien dat ze niet aan al deze eigenschappen voldoen.

Dit motiveert de introductie van een nieuwe en flexibele scorefunctie, die welaan alle beschreven eigenschappen voldoet.

**Hoofdstuk**bekijkt probleem van het kiezen van een goede scorefunctie

van een meer empirische kant. We vergelijken verschillende scorefuncties doorze uit te testen op steeds moeilijkere artificiële netwerken, en ook op een aan-tal echte datasets. Het belangrijkste verschil tussen de scorefuncties blijkt hetaantal clusters dat wordt gevonden. Als dat aantal wordt vastgezet dan gedra-gen de scorefuncties zich vergelijkbaar. Ook introduceren we een aantal nieuwescorefuncties.

In

**hoofdstuk**richten we ons op een concrete toepassing van clustering. We

gebruiken clustering voor het vinden van eiwitcomplexen in een eiwit–eiwit in-teractie netwerk. Hiervoor gebruiken we de methoden uit het vorige hoofdstuk.

Middels een aantal experimenten laten we zien dat de methode zeer effectief isvoor dit doel, met als belangrijk voordeel boven andere methoden het feit dathet veel robuuster is voor kleine veranderingen of fouten in het netwerk.

In het echte leven zijn clusters vaak niet zo vastomlijnd zijn als waar we tot nu toe van uit zijn gegaan. Daarom kijken we in

**hoofdstuk**naar ‘zachte' en

overlappende clustering. Hier kunnen knopen deel uit maken van meer dan een

cluster, en hun lidmaatschap kan partieel zijn in plaats van een harde ‘ja' of ‘nee'.

We gebruiken opnieuw een axiomatische aanpak, en in kijken in het bijzonder

naar lokaliteit. Lokaliteit is een belangrijke eigenschap, die het mogelijk maakt

om lokaal te redeneren over delen van een clustering. Vervolgens introduceren

we een nieuwe prior voor probabilistische Niet-negatieve Matrix Factorisatie,

een veel gebruikte manier om dergelijke zachte clusterings te vinden, die zonder

deze prior niet lokaal is.

Het tweede deel van dit proefschrift gaat over een ander Machine Learn- ing onderwerp: het voorspellen van links in een netwerk. In het bijzonder hetvoorspellen van nieuwe interacties tussen geneesmiddelen en eiwitten.

**Hoofdstuk**beschrijft een kernel methode voor het voorspellen van dergeli-

jke interacties. Kort samengevat is het achterliggende idee dat twee geneesmid-delen op elkaar lijken als ze interacties hebben met dezelfde eiwitten.

bouwen een kernel op basis van de bekende interacties, en laten zien dat samenmet een simpele regressie methode deze kernel state-of-the-art resultaten be-haald.

Deze kernel is echter alleen nuttig voor het voorspellen van nieuwe interac- ties tussen geneesmiddelen en eiwitten waarvoor al een aantal interacties bek-

end is. In

**hoofdstuk**lossen we dat gebrek op met een simpele gewogen

Nearest Neighbor model. Hiermee kunnen we de kernel uit te breiden naar

geneesmiddelen en eiwitten zonder bekende interacties.

In

**hoofdstuk**onderzoeken we hoe de eigenschappen van de door ons en

anderen gebruikte datasets een vertekend beeld kunnen geven bij experimentelevalidatie. Het probleem is dat in de dataset alleen geneesmidellen en eiwittenzitten met tenminste één bekende interactie. De selectie is dus niet volledigwillikeurig, en die eigenschap kan eenvoudig gebruikt worden om een meth-ode goed te laten lijken bij cross-validatie experimenten. We laten zien hoe devalidatieprocedure kan worden aangepast om dit probleem te voorkomen.

Getting a PhD always seemed like this obvious thing to do after finishing mymasters degree. And indeed research is a lot of fun. But some of the aspectsthat come with it, such as writing down your findings, are a bit less so. Becauseof that, completing my PhD thesis was harder than I initially thought. And Icertainly couldn't have done it on my own.

I would therefore first and foremost like to thank my supervisor, Elena Mar- chiori, for motivating me to finish what I started and for providing directionsduring four years of research. And want to thank Tom Heskes for his role as mypromotor, which included always being there to answer questions, even thoughI hardly had any.

I'd also like to thank my other colleagues at the Intelligent Systems section for providing interesting conversations about various research topics, as well asother topic during lunch breaks and Friday afternoon drinks; and for providinga nice work atmosphere in general. This includes not just the Machine Learningpeople, but also everyone from the foundations group, with whom I have spendquite a bit of time, even though they work on completely different (but stillinteresting) topics. So Adriana, Ali, Bram, Carst, Daniel, Dimitris, Elena S.,Fabio, Freek, Helle, Herman, Jelle, Jonce, Joris, Josef, Maya, Max, Nicole, Pavol,Ridho, Robbert, Saiden, Saskia, Simone, Suzan, Thijs, Tjeerd, Tom C., Wout,thank you for the interesting times.

Besides these people at the computer science institute, this work would not have been possible without the support of my other friends and my family.

Finally, I should thank the Netherlands Organization for Scientific Research (NWO) for funding project 612.066.927, which made my PhD project possible inthe first place.

SIKS Dissertatiereeks 2009-01 Rasa Jurgelenaite (RUN) Symmetric Causal Independence Models 2009-02 Willem Robert van Hage (VU) Evaluating Ontology-Alignment Techniques 2009-03 Hans Stol (UvT) A Framework for Evidence-based Policy Making Using IT 2009-04 Josephine Nabukenya (RUN) Improving the Quality of Organisational Policy Making using Collaboration Engineering 2009-05 Sietse Overbeek (RUN) Bridging Supply and Demand for Knowledge Intensive Tasks - Based on Knowledge, Cognition, and Quality 2009-06 Muhammad Subianto (UU) 2009-07 Ronald Poppe (UT) Discriminative Vision-Based Recovery and Recognition of Human Motion 2009-08 Volker Nannen (VU) Evolutionary Agent-Based Policy Analysis in Dynamic Environments 2009-09 Benjamin Kanagwa (RUN) Design, Discovery and Construction of Service-oriented Systems 2009-10 Jan Wielemaker (UVA) Logic programming for knowledge-intensive interactive applications 2009-11 Alexander Boer (UVA) Legal Theory, Sources of Law & the Semantic Web 2009-12 Peter Massuthe (TUE, Humboldt-Universitaet zu Berlin) Operating Guidelines for Services 2009-13 Steven de Jong (UM) Fairness in Multi-Agent Systems 2009-14 Maksym Korotkiy (VU) From ontology-enabled services to service-enabled ontologies (making ontologies work in e-science with ONTO-SOA) 2009-15 Rinke Hoekstra (UVA) Ontology Representation - Design Patterns and Ontologies that Make Sense 2009-16 Fritz Reul (UvT) New Architectures in Computer Chess 2009-17 Laurens van der Maaten (UvT) Feature Extraction from Visual Data 2009-18 Fabian Groffen (CWI) Armada, An Evolving Database System 2009-19 Valentin Robu (CWI) Modeling Preferences, Strategic Reasoning and Collaboration in Agent-Mediated Electronic Markets 2009-20 Bob van der Vecht (UU) Adjustable Autonomy: Controling Influences on Decision Making 2009-21 Stijn Vanderlooy (UM) Ranking and Reliable Classification 2009-22 Pavel Serdyukov (UT) Search For Expertise: Going beyond direct evidence 2009-23 Peter Hofgesang (VU) Modelling Web Usage in a Changing Environment 2009-24 Annerieke Heuvelink (VUA) Cognitive Models for Training Simulations 2009-25 Alex van Ballegooij (CWI) "RAM: Array Database Management through Relational Mapping" 2009-26 Fernando Koch (UU) An Agent-Based Model for the Development of Intelligent Mobile Services 2009-27 Christian Glahn (OU) Contextual Support of social Engagement and Reflection on the Web 2009-28 Sander Evers (UT) Sensor Data Management with Probabilistic Models 2009-29 Stanislav Pokraev (UT) Model-Driven Semantic Integration of Service-Oriented Applications 2009-30 Marcin Zukowski (CWI) Balancing vectorized query execution with bandwidth-optimized storage 2009-31 Sofiya Katrenko (UVA) A Closer Look at Learning Relations from Text 2009-32 Rik Farenhorst (VU) and Remco de Boer (VU) Architectural Knowledge Management: Supporting Architects and Auditors 2009-33 Khiet Truong (UT) How Does Real Affect Affect Affect Recognition In Speech? 2009-34 Inge van de Weerd (UU) Advancing in Software Product Management: An Incremental Method Engineering Approach 2009-35 Wouter Koelewijn (UL) Privacy en Politiegegevens; Over geautomatiseerde normatieve informatie-uitwisseling 2009-36 Marco Kalz (OUN) Placement Support for Learners in Learning Networks 2009-37 Hendrik Drachsler (OUN) Navigation Support for Learners in Informal Learning Networks 2009-38 Riina Vuorikari (OU) Tags and self-organisation: a metadata ecology for learning resources in a multilingual context 2009-39 Christian Stahl (TUE, Humboldt-Universitaet zu Berlin) Service Substitution – A Behavioral Approach Based on Petri Nets 2009-40 Stephan Raaijmakers (UvT) Multinomial Language Learning: Investigations into the Geometry of Language 2009-41 Igor Berezhnyy (UvT) Digital Analysis of Paintings 2009-42 Toine Bogers Recommender Systems for Social Bookmarking 2009-43 Virginia Nunes Leal Franqueira (UT) Finding Multi-step Attacks in Computer Networks using Heuristic Search and Mobile Ambients 2009-44 Roberto Santana Tapia (UT) Assessing Business-IT Alignment in Networked Organizations 2009-45 Jilles Vreeken (UU) Making Pattern Mining Useful 2009-46 Loredana Afanasiev (UvA) Querying XML: Benchmarks and Recursion 2010-01 Matthijs van Leeuwen (UU) Patterns that Matter 2010-02 Ingo Wassink (UT) Work flows in Life Science 2010-03 Joost Geurts (CWI) A Document Engineering Model and Processing Framework for Multimedia documents 2010-04 Olga Kulyk (UT) Do You Know What I Know? Situational Awareness of Co-located Teams in Multidisplay Environments 2010-05 Claudia Hauff (UT) Predicting the Effectiveness of Queries and Retrieval Systems 2010-06 Sander Bakkes (UvT) Rapid Adaptation of Video Game AI 2010-07 Wim Fikkert (UT) Gesture interaction at a Distance 2010-08 Krzysztof Siewicz (UL) Towards an Improved Regulatory Framework of Free Software. Protecting user freedoms in a world of software communities and eGovernments 2010-09 Hugo Kielman (UL) A Politiele gegevensverwerking en Privacy, Naar een effectieve waarborging 2010-10 Rebecca Ong (UL) Mobile Communication and Protection of Children 2010-11 Adriaan Ter Mors (TUD) The world according to MARP: Multi-Agent Route Planning 2010-12 Susan van den Braak (UU) Sensemaking software for crime analysis 2010-13 Gianluigi Folino (RUN) High Performance Data Mining using Bio-inspired techniques 2010-14 Sander van Splunter (VU) Automated Web Service Reconfiguration 2010-15 Lianne Bodenstaff (UT) Managing Dependency Relations in Inter-Organizational Models 2010-16 Sicco Verwer (TUD) Efficient Identification of Timed Automata, theory and practice 2010-17 Spyros Kotoulas (VU) Scalable Discovery of Networked Resources: Algorithms, Infrastructure, Applications 2010-18 Charlotte Gerritsen (VU) Caught in the Act: Investigating Crime by Agent-Based Simulation 2010-19 Henriette Cramer (UvA) People's Responses to Autonomous and Adaptive Systems 2010-20 Ivo Swartjes (UT) Whose Story Is It Anyway? How Improv Informs Agency and Authorship of Emergent Narrative 2010-21 Harold van Heerde (UT) Privacy-aware data management by means of data degradation 2010-22 Michiel Hildebrand (CWI) End-user Support for Access to Heterogeneous Linked Data 2010-23 Bas Steunebrink (UU) The Logical Structure of Emotions 2010-24 Dmytro Tykhonov Designing Generic and Efficient Negotiation Strategies 2010-25 Zulfiqar Ali Memon (VU) Modelling Human-Awareness for Ambient Agents: A Human Mindreading Perspective 2010-26 Ying Zhang (CWI) XRPC: Efficient Distributed Query Processing on Heterogeneous XQuery Engines 2010-27 Marten Voulon (UL) Automatisch contracteren 2010-28 Arne Koopman (UU) Characteristic Relational Patterns 2010-29 Stratos Idreos(CWI) Database Cracking: Towards Auto-tuning Database Kernels 2010-30 Marieke van Erp (UvT) Accessing Natural History - Discoveries in data cleaning, structuring, and retrieval 2010-31 Victor de Boer (UVA) Ontology Enrichment from Heterogeneous Sources on the Web 2010-32 Marcel Hiel (UvT) An Adaptive Service Oriented Architecture: Automatically solving Interoperability Problems 2010-33 Robin Aly (UT) Modeling Representation Uncertainty in Concept-Based Multimedia Retrieval 2010-34 Teduh Dirgahayu (UT) Interaction Design in Service Compositions 2010-35 Dolf Trieschnigg (UT) Proof of Concept: Concept-based Biomedical Information Retrieval 2010-36 Jose Janssen (OU) Paving the Way for Lifelong Learning; Facilitating competence development through a learning path specification 2010-37 Niels Lohmann (TUE) Correctness of services and their composition 2010-38 Dirk Fahland (TUE) From Scenarios to components 2010-39 Ghazanfar Farooq Siddiqui (VU) Integrative modeling of emotions in virtual agents 2010-40 Mark van Assem (VU) Converting and Integrating Vocabularies for the Semantic Web 2010-41 Guillaume Chaslot (UM) Monte-Carlo Tree Search 2010-42 Sybren de Kinderen (VU) Needs-driven service bundling in a multi-supplier setting - the computational e3-service approach 2010-43 Peter van Kranenburg (UU) A Computational Approach to Content-Based Retrieval of Folk Song Melodies 2010-44 Pieter Bellekens (TUE) An Approach towards Context-sensitive and User-adapted Access to Heterogeneous Data Sources, Illustrated in the Television Domain 2010-45 Vasilios Andrikopoulos (UvT) A theory and model for the evolution of software services 2010-46 Vincent Pijpers (VU) e3alignment: Exploring Inter-Organizational Business-ICT Alignment 2010-47 Chen Li (UT) Mining Process Model Variants: Challenges, Techniques, Examples 2010-48 Milan Lovric (EUR) Behavioral Finance and Agent-Based Artificial Markets 2010-49 Jahn-Takeshi Saito (UM) Solving difficult game positions 2010-50 Bouke Huurnink (UVA) Search in Audiovisual Broadcast Archives 2010-51 Alia Khairia Amin (CWI) Understanding and supporting information seeking tasks in multiple sources 2010-52 Peter-Paul van Maanen (VU) Adaptive Support for Human-Computer Teams: Exploring the Use of Cognitive Models of Trust and Attention 2010-53 Edgar Meij (UVA) Combining Concepts and Language Models for Information Access 2011-01 Botond Cseke (RUN) Variational Algorithms for Bayesian Inference in Latent Gaussian Models 2011-02 Nick Tinnemeier(UU) Organizing Agent Organizations. Syntax and Operational Semantics of an Organization-Oriented Programming Language 2011-03Jan Martijn van der Werf (TUE) Compositional Design and Verification of Component-Based Information Systems 2011-04 Hado van Hasselt (UU) Insights in Reinforcement Learning; Formal analysis and empirical evaluation of temporal-difference learning algorithms 2011-05 Base van der Raadt (VU) Enterprise Architecture Coming of Age - Increasing the Performance of an Emerging Discipline.

2011-06 Yiwen Wang (TUE) Semantically-Enhanced Recommendations in Cultural Heritage 2011-07 Yujia Cao (UT) Multimodal Information Presentation for High Load Human Computer Interaction 2011-08 Nieske Vergunst (UU) BDI-based Generation of Robust Task-Oriented Dialogues 2011-09 Tim de Jong (OU) Contextualised Mobile Media for Learning 2011-10 Bart Bogaert (UvT) Cloud Content Contention 2011-11 Dhaval Vyas (UT) Designing for Awareness: An Experience-focused HCI Perspective 2011-12 Carmen Bratosin (TUE) Grid Architecture for Distributed Process Mining 2011-13 Xiaoyu Mao (UvT) Airport under Control. Multiagent Scheduling for Airport Ground Handling 2011-14 Milan Lovric (EUR) Behavioral Finance and Agent-Based Artificial Markets 2011-15 Marijn Koolen (UvA) The Meaning of Structure: the Value of Link Evidence for Information Retrieval 2011-16 Maarten Schadd (UM) Selective Search in Games of Different Complexity 2011-17 Jiyin He (UVA) Exploring Topic Structure: Coherence, Diversity and Relatedness 2011-18 Mark Ponsen (UM) Strategic Decision-Making in complex games 2011-19 Ellen Rusman (OU) The Mind ' s Eye on Personal Profiles 2011-20 Qing Gu (VU) Guiding service-oriented software engineering - A view-based approach 2011-21 Linda Terlouw (TUD) Modularization and Specification of Service-Oriented Systems 2011-22 Junte Zhang (UVA) System Evaluation of Archival Description and Access 2011-23 Wouter Weerkamp (UVA) Finding People and their Utterances in Social Media 2011-24 Herwin van Welbergen (UT) Behavior Generation for Interpersonal Coordination with Virtual Humans On Specifying, Scheduling and Realizing Multimodal Virtual Human Behavior 2011-25 Syed Waqar ul Qounain Jaffry (VU) Analysis and Validation of Models for Trust Dynamics 2011-26 Matthijs Aart Pontier (VU) Virtual Agents for Human Communication - Emotion Regulation and Involvement-Distance Trade-Offs in Embodied Conversational Agents and Robots 2011-27 Aniel Bhulai (VU) Dynamic website optimization through autonomous management of design patterns 2011-28 Rianne Kaptein(UVA) Effective Focused Retrieval by Exploiting Query Context and Document Structure 2011-29 Faisal Kamiran (TUE) 2011-30 Egon van den Broek (UT) Affective Signal Processing (ASP): Unraveling the mystery of emotions 2011-31 Ludo Waltman (EUR) Computational and Game-Theoretic Approaches for Modeling Bounded Rationality 2011-32 Nees-Jan van Eck (EUR) Methodological Advances in Bibliometric Mapping of Science 2011-33 Tom van der Weide (UU) Arguing to Motivate Decisions 2011-34 Paolo Turrini (UU) Strategic Reasoning in Interdependence: Logical and Game-theoretical Investigations 2011-35 Maaike Harbers (UU) Explaining Agent Behavior in Virtual Training 2011-36 Erik van der Spek (UU) Experiments in serious game design: a cognitive approach 2011-37 Adriana Burlutiu (RUN) Machine Learning for Pairwise Data, Applications for Preference Learning and Supervised Network Inference 2011-38 Nyree Lemmens (UM) Bee-inspired Distributed Optimization 2011-39 Joost Westra (UU) Organizing Adaptation using Agents in Serious Games 2011-40 Viktor Clerc (VU) Architectural Knowledge Management in Global Software Development 2011-41 Luan Ibraimi (UT) Cryptographically Enforced Distributed Data Access Control 2011-42 Michal Sindlar (UU) Explaining Behavior through Mental State Attribution 2011-43 Henk van der Schuur (UU) Process Improvement through Software Operation Knowledge 2011-44 Boris Reuderink (UT) Robust Brain-Computer Interfaces 2011-45 Herman Stehouwer (UvT) Statistical Language Models for Alternative Sequence Selection 2011-46 Beibei Hu (TUD) Towards Contextualized Information Delivery: A Rule-based Architecture for the Domain of Mobile Police Work 2011-47 Azizi Bin Ab Aziz(VU) Exploring Computational Models for Intelligent Support of Persons with Depression 2011-48 Mark Ter Maat (UT) Response Selection and Turn-taking for a Sensitive Artificial Listening Agent 2011-49 Andreea Niculescu (UT) Conversational interfaces for task-oriented spoken dialogues: design aspects influencing interaction quality 2012-01 Terry Kakeeto (UvT) Relationship Marketing for SMEs in Uganda 2012-02 Muhammad Umair(VU) Adaptivity, emotion, and Rationality in Human and Ambient Agent Models 2012-03 Adam Vanya (VU) Supporting Architecture Evolution by Mining Software Repositories 2012-04 Jurriaan Souer (UU) Development of Content Management System-based Web Applications 2012-05 Marijn Plomp (UU) Maturing Interorganisational Information Systems 2012-06 Wolfgang Reinhardt (OU) Awareness Support for Knowledge Workers in Research Networks 2012-07 Rianne van Lambalgen (VU) When the Going Gets Tough: Exploring Agent-based Models of Human Performance under Demanding Conditions 2012-08 Gerben de Vries (UVA) Kernel Methods for Vessel Trajectories 2012-09 Ricardo Neisse (UT) Trust and Privacy Management Support for Context-Aware Service Platforms 2012-10 David Smits (TUE) Towards a Generic Distributed Adaptive Hypermedia Environment 2012-11 J.C.B. Rantham Prabhakara (TUE) Process Mining in the Large: Preprocessing, Discovery, and Diagnostics2012-12 Kees van der Sluijs (TUE) Model Driven Design and Data Integration in Semantic Web Information Systems 2012-13 Suleman Shahid (UvT) Fun and Face: Exploring non-verbal expressions of emotion during playful interactions Generic Adaptation Framework for Unifying Adaptive Web-based Systems 2012-15 Natalie van der Wal (VU) Social Agents. Agent-Based Modelling of Integrated Internal and Social Dynamics of Cognitive and Affective Processes 2012-16 Fiemke Both (VU) Helping people by understanding them - Ambient Agents supporting task execution and depression treatment 2012-17 Amal Elgammal (UvT) Towards a Comprehensive Framework for Business Process Compliance 2012-18 Eltjo Poort (VU) Improving Solution Architecting Practices 2012-19 Helen Schonenberg (TUE) What's Next? Operational Support for Business Process Execution 2012-20 Ali Bahramisharif (RUN) Covert Visual Spatial Attention, a Robust Paradigm for Brain-Computer Interfacing 2012-21 Roberto Cornacchia (TUD) Querying Sparse Matrices for Information Retrieval 2012-22 Thijs Vis (UvT) Intelligence, politie en veiligheidsdienst: verenigbare grootheden? 2012-23 Christian Muehl (UT) Toward Affective Brain-Computer Interfaces: Exploring the Neurophysiology of Affect during Human Media Interaction

Source: http://twanvl.nl/files/thesis-2014-12-01.pdf

### Microsoft powerpoint - tics%20and%20tourette%20syndrome.2003[1] [read-only]

Tics and Tourette Syndrome: A Clinical Child and Adolescent Psychiatrist Imam Hossein Hospital • Tics consist of patterned involuntary (or semi voluntary) movements and vocalizations and can present as either motor or phonic (vocal) tics, or both. • One measure of tic severity is how much effort a person must exert in order to suppress a tic and how successfully he can inhibit them.

### dystonie.at

Allen unseren Mitgliedern, Seele des Menschen unseren Freundinnen und Freunden Wie gleichst du dem Wasser! Schicksal des Menschen, sowie unseren Förderern Wie gleichst du dem Wind! einen schönen Sommer 2015 Johann Wolfgang Goethe Titelbild von Helmut PESCHEL Unser Doppeljubiläum 2015: 25 Jahre DYSTONIE-Selbsthilfe 20 Jahre Österreichische Dystonie Gesellschaft