The Many Truths of Community Detection

You do not need to know the detailed properties of every small part making up a gas: its bulk properties can be derived from very general principles. In the same way, when looking at Facebook data we might be able to identify groups of people who behave in a similar way. Searching for such groups, or clusters, in data is central to many areas of physical and social science. It is often easier to understand the behaviour of a large system by looking at these clusters, which are far fewer in number than the individual elements.

In terms of networks, the clustering is based on the structure (topology) of the network and the groups found are called communities. In this case we might expect a coherent group to be one which has more links between members of the group than it has to nodes outside the group in other clusters. I have done some work on what is called community detection, particularly on methods which assign nodes to the membership of several clusters (e.g. my line graph and clique graph papers referenced below). After all, my social connections are likely to show that I am part of several groups: work colleagues, family relationships, connections made through hobbies or sports.

For some time I have been very wary about the meaning of the clusters found with such methods, and particularly about claims that one method can find “better” communities than another. A recent paper prompted me to think about this again. In Community detection in networks: structural clusters versus ground truth, Hric, Darst, and Fortunato from Aalto University in Finland (a big centre for networks research) asked whether network methods were finding different sorts of clusters from those found using other aspects of the data. Typically, when testing a community detection method, one sets up artificial networks in which each node is assigned to one community. The edges between nodes are then assigned at random, but with a preference for edges between nodes from the same community. I can do all the tests I like on artificial data, but I am always worried that this approach has introduced some hidden bias. Perhaps we end up choosing the methods that ‘work’ on artificial data but which are not so good on real messy data? It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments, but defining what we mean by “the best” clustering is impossible. Even with artificial networks, the “ground truth” is not generally an absolute truth. Typically the “truth” consists of input parameters, and the actual network generated is partly random. So while the resulting artificial network is correlated with the ground truth, it is not designed to be a perfect match, and in almost all cases the “actual truth” will differ from the ground truth. [Note added 5th January 2016. Another more recent paper, which includes some expert evaluation of the communities found as well as a comparison of many different methods, is Šubelj, van Eck and Waltman 2015, Clustering scientific publications based on citation relations: A systematic comparison of different methods.]
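To make the gap between “ground truth” and “actual truth” concrete, here is a minimal sketch of the kind of planted-partition benchmark described above. It is my own toy implementation, not the code from any of the papers discussed: each node gets a planted label, and edges appear with probability `p_in` inside a community and `p_out` between communities. Note that even here a sizeable minority of edges cross community boundaries, so the generated network only correlates with the planted truth.

```python
import random

def planted_partition(n_groups, group_size, p_in, p_out, seed=0):
    """Generate a toy planted-partition graph.

    Each node gets a 'true' community label; an edge appears with
    probability p_in inside a community and p_out between communities.
    Returns the planted labels and the realised edge list.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    labels = [i // group_size for i in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges

labels, edges = planted_partition(4, 25, p_in=0.3, p_out=0.05)
internal = sum(labels[i] == labels[j] for i, j in edges)
print(f"{internal}/{len(edges)} edges fall inside planted communities")
```

With these parameters roughly a third of the realised edges cross community boundaries, so two runs with different seeds give different “actual truths” for the same “ground truth”.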

I also worry about what we do when we run network community detection methods on large real data sets where there is no simple ground truth. When I have done this, I can find a variety of possible answers for communities in the data. Many look reasonable, but none correlate perfectly with each other or with what I know from other sources. This leaves me wondering if the automatic methods are finding one truth and my other information gives another. Alternatively, the automatic methods might be rubbish: good on artificial cases, not so good in reality. There is no simple way of telling.
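The comparisons mentioned above can be quantified. One standard measure of agreement between two community assignments is normalized mutual information (NMI); here is a minimal pure-Python version of my own (not code from any paper referenced here), which scores 1 for identical partitions and 0 for independent ones.

```python
from math import log
from collections import Counter

def nmi(part_a, part_b):
    """Normalized mutual information between two node partitions,
    each given as a list of community labels (one label per node)."""
    n = len(part_a)
    ca, cb = Counter(part_a), Counter(part_b)
    joint = Counter(zip(part_a, part_b))
    # Mutual information: sum over the joint label distribution
    mi = sum(c / n * log(c * n / (ca[a] * cb[b]))
             for (a, b), c in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())  # entropy of A
    hb = -sum(c / n * log(c / n) for c in cb.values())  # entropy of B
    return mi / ((ha * hb) ** 0.5) if ha and hb else 1.0

print(round(nmi([0, 0, 1, 1], [0, 0, 1, 1]), 6))  # identical -> 1.0
print(round(nmi([0, 0, 1, 1], [0, 1, 0, 1]), 6))  # independent -> 0.0
```

Measures like this tell you how far two answers differ, but, as argued above, not which one is “the best”.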

In any case, do real networks have a “ground truth”? Quite often people have data from other sources about real networks, and they use this to construct a “ground truth”. The test is then to see if automatic methods can find this ground truth. However, what if the other data is wrong? People don’t always tell the truth: they can deliberately mislead, or they can misunderstand the question. Children surveyed about their friendships may tell you who they’d like to be friends with (the most popular person in the class) and not who they actually spend time with.

Zachary Karate Club network clustered using clique graph methods

Take the famous Zachary karate club data set, used by many (including myself) as a simple test. This is a network of members of a karate club that split in two during Zachary’s study. Let us accept that Zachary’s professionalism produced data that is a true reflection of the situation, despite the difficulty of measuring associations in social science. If you look at the published paper, it actually gives two truths: one based on which of the two factions the members actually joined, and one based on an automatic community detection method. I suspect most people are unwittingly using the latter as the ground truth when testing their work. Perhaps this is a further example supporting the claim that academics read only 20% of their references. Worse, the data given in the published karate club paper is not internally consistent – the unweighted adjacency matrix is not symmetric. So which truth was used in all those papers using the karate club network?
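Inconsistencies like an asymmetric adjacency matrix are easy to check for mechanically. A quick sketch (on a made-up three-node matrix, not the actual karate club data):

```python
def asymmetric_entries(adj):
    """Return index pairs (i, j) where an adjacency matrix fails to be
    symmetric, i.e. where an undirected network's data is inconsistent."""
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if adj[i][j] != adj[j][i]]

# Hypothetical toy matrix with one inconsistent pair:
adj = [[0, 1, 1],
       [1, 0, 0],
       [0, 0, 0]]   # adj[0][2] = 1 but adj[2][0] = 0
print(asymmetric_entries(adj))  # -> [(0, 2)]
```

Running something like this over a downloaded data set before using it is cheap insurance.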

American College Football network clustered using clique graph clustering methods

Another example comes from some work I did on overlapping community methods. Like many other people, I downloaded a standard data set from Mark Newman’s web site, an extremely useful resource. The American College Football data was created by Girvan and Newman (in Community structure in social and biological networks) and represents the games played between American college football teams in one season. Also provided is the conference membership of each team. Teams play more games against teams from their own conference than against any one other conference. In fact, this data is so well clustered that surely no method should get anything wrong beyond a few independent teams, as my visualisations here illustrate (taken from my clique-based clustering paper). So I looked at the “mistakes” made by my method. After about two afternoons of wading through interminable websites of stats on American college football and Wikipedia pages on the college conference system, I realised that most of the “mistakes” came not from the automatic community detection but from “the ground truth”, that is, from the conferences assigned to teams in the data file. It turns out that the assignments in the original football.gml file are for the 2001 season, while the file records the games played in the 2000 season. For instance, the Big West conference existed for football only until 2000, while the Sun Belt conference started in 2001. There were 11 conferences and 5 independents in 2001 but 10 conferences and 8 independents in 2000. Care is needed, as American college athletic conferences cover many sports, with individual sports joining or dropping out of any one conference from time to time, and teams can switch conferences too. In fact, around 10% of the college teams playing American football at the top level changed conferences around 2000–2001. (Note added 5th January 2016. These errors were also noted in Gschwind et al 2015, Social Network Analysis and Community Detection by Decomposing a Graph into Relaxed Cliques, only the second paper I’ve seen which does this independently of my discussion.)
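This kind of ground-truth error can often be flagged automatically. A sketch of the idea, on a hypothetical mini-league rather than the real football.gml file: flag any team whose labeled conference is not the conference it plays most of its games against.

```python
from collections import Counter

def suspect_labels(games, conference):
    """Flag teams whose labeled conference disagrees with the conference
    they play most games against -- candidates for ground-truth errors."""
    flags = []
    for team in conference:
        opp_confs = Counter()
        for a, b in games:
            if a == team:
                opp_confs[conference[b]] += 1
            elif b == team:
                opp_confs[conference[a]] += 1
        if opp_confs and opp_confs.most_common(1)[0][0] != conference[team]:
            flags.append(team)
    return flags

# Hypothetical mini-league: team "D" plays only conference-1 teams
# but is (mis)labeled as conference 2.
games = [("A", "B"), ("A", "C"), ("B", "C"),
         ("A", "D"), ("B", "D"), ("C", "D"), ("E", "F")]
conference = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 3}
print(suspect_labels(games, conference))  # -> ['D']
```

In the real data the flagged teams would then need checking by hand, as I ended up doing, since a flag can also mean the team genuinely plays mostly out of conference.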

So often the “ground truth” is just another truth, not some absolute truth! The errors in the Zachary karate club and American College Football data do not matter in one sense, as the data sets still provide valid and valuable tests for methods. The conclusions in the hundreds of papers which use these questionable ground truths would not change. Indeed, it highlights one role for automatic methods: where Girvan and Newman’s methods get the “wrong” answer in their original paper (Community structure in social and biological networks), they are in fact highlighting problems with the conference data. Validation of data is a very useful, if boring, job. A final question will always be whether there is a single truth. For instance, I am in the theoretical physics group of the physics department of the Faculty of Natural Sciences at Imperial College London. That top-down hierarchical truth is important when allocating desks and teaching. However, another truth would emerge if you studied my research relationships: those are with staff and students based in other physics research groups, and with colleagues from other departments and even other faculties.

So I was really pleased to see the authors of Community detection in networks: structural clusters versus ground truth questioning the meaning of truth in community detection from a quantitative point of view. Clustering data, finding communities in data, is of tremendous value commercially and for research, but there is still a lot more work to do before we understand these different truths.

References

M. Girvan, M.E.J. Newman, Community structure in social and biological networks, PNAS (2002) 99, 7821-7826

W. Zachary, An Information Flow Model for Conflict and Fission in Small Groups, Journal of Anthropological Research (1977) 33, 452–473

D. Hric, R.K. Darst, and S. Fortunato, Community detection in networks: structural clusters versus ground truth, arXiv:1406.0146

T.S. Evans, American College Football Network Files. figshare. (2012).  http://dx.doi.org/10.6084/m9.figshare.93179

T.S. Evans and R. Lambiotte, Line Graphs, Link Partitions and Overlapping Communities, Phys. Rev. E (2009) 80, 016105 [arXiv:0903.2181]

T.S. Evans, Clique Graphs and Overlapping Communities, J. Stat. Mech. (2010) P12037, http://dx.doi.org/10.1088/1742-5468/2010/12/P12037 [arXiv:1009.0638]

L. Šubelj, N.J. van Eck, L. Waltman, Clustering scientific publications based on citation relations: A systematic comparison of different methods, arXiv:1512.09023

Power Laws

Why are power laws still in the news? Michael Stumpf and Mason Porter recently published a commentary on the topic in Science (“Critical Truths About Power Laws”). Yet the nature of long-tailed distributions has been debated in physics since the work on critical phenomena in the 70s. The debate continued in the context of fractals in the 80s, self-organised criticality in the 90s, and now in terms of complex networks. So the issues should be familiar to theoretical physicists and applied mathematicians working in these areas. However, the field of complexity is multidisciplinary, and there is a real need to reach out to researchers from other backgrounds to explain what we have learnt in all these debates about power laws. Often they have their own experiences to add – financial mathematicians have long been interested in long-tailed distributions. Physicists may understand the subtext behind attention-grabbing headlines, but there is no reason to think others know how large a pinch of salt to add to these claims. That has certainly been my impression when hearing experts from other fields refer to some of the early claims about complex networks. Perhaps the most worrying aspect is that many of the points raised in this age-old debate are still not addressed in many modern publications. This is part of the frustration I hear when reading the Stumpf and Porter article. To me this is a good example of a repeated failure in the quality control provided by the current refereeing system (but that is a debate for another time).

One of the key points made by Stumpf and Porter, though I’ve heard it many times before, is the lack of statistical support behind many claims for a power law. Identifying power-law behaviour is not trivial. I always remember Kim Christensen at Imperial recommending to me, after his experiences with power laws, that four decades of data were needed, while Stumpf and Porter suggest two. Many examples fail to pass even this lower limit.
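The “decades of data” criterion is simple to check for any positive-valued data set; the sketch below is my own illustration of it.

```python
from math import log10

def decades(data):
    """Number of decades (powers of ten) spanned by positive data,
    i.e. log10 of the ratio of the largest to the smallest value."""
    return log10(max(data) / min(data))

# A sample spanning 1 to 10,000 covers four decades:
print(round(decades([1, 5, 80, 700, 10_000]), 6))  # -> 4.0
```

Any claimed power law fitted over less than two of these decades should, on Stumpf and Porter’s criterion, be treated with suspicion.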

Aaron Clauset, Cosma Shalizi and Mark Newman provide one popular statistical recipe and code that addresses most issues (“Power-law distributions in empirical data” – and check out the blogs on the arXiv trackback). The approach will produce much more useful results than I’ve seen in several published papers. It might not be perfect, though: I think I’ve seen warnings against the use of cumulative distributions, as problems in the early part of the data will affect many more data points than issues in later data points.
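One ingredient of that recipe is the maximum-likelihood estimate of the exponent, which replaces the usual straight-line fit on a log-log plot. The sketch below shows only this estimator for the continuous case with a given `xmin`; the full Clauset–Shalizi–Newman method also selects `xmin` and tests goodness of fit, which I don’t attempt here.

```python
from math import log
import random

def mle_alpha(xs, xmin):
    """Continuous maximum-likelihood estimate of a power-law exponent,
    alpha = 1 + n / sum(ln(x / xmin)), using only samples x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(log(x / xmin) for x in tail)

# Sanity check: inverse-transform sampling from p(x) ~ x^(-2.5), x >= 1,
# so the estimate should come out close to 2.5.
rng = random.Random(42)
samples = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(50_000)]
print(round(mle_alpha(samples, 1.0), 2))  # close to 2.5
```

Even this simplified version avoids the known biases of least-squares fits to binned or cumulative log-log data.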

In fact I would go further. I often reject papers with no error estimates in measured quantities and no uncertainties in data points – I mark our own physics students down for these failings. Why show points and a best fit on a log-log plot except as an illustration? A plot of residuals (differences between data and fit) is far more informative scientifically. Many refereed papers appear to let these things go, limiting the impact of their message.
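Residuals are cheap to compute. A minimal sketch of my own: fit a straight line in log-log space by least squares and return the residuals, which are exactly zero only for a perfect power law.

```python
from math import log10

def loglog_residuals(xs, ys):
    """Least-squares line fit in log-log space; returns the slope and
    the residuals (data minus fit), whose systematic structure reveals
    deviations that a points-plus-best-fit plot hides."""
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(lx, ly)]
    return slope, residuals

# An exact power law y = x^-2 gives slope -2 and vanishing residuals:
xs = [1, 10, 100, 1000]
slope, res = loglog_residuals(xs, [x ** -2 for x in xs])
print(round(slope, 6), [round(abs(r), 6) for r in res])
```

For real data, plotting `residuals` against `log10(x)` makes any curvature or finite-size effects immediately visible.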

Another part of the debate, and a key message in Stumpf and Porter, was the meaning of a power law. Most researchers I know realised in the early noughties that any hope of universality in the powers seen in network data was misplaced. For me this was one of my first questions, and as I did not see it answered in the literature, it led me to think about the formation of scale-free networks in the real world. I wrote this up as a paper with Jari Saramäki from Finland (“Realistic models for the formation of scale-free networks”). Existing models just didn’t explain this at the time – precisely what Stumpf and Porter say is still missing in many discussions about power laws even today.

Power laws in a scale-free network formed using random walks. Even with a theoretical model I fell just short of four decades of data. Note the right-hand plot shows non-trivial finite-size effects hiding in the tail. This was all for networks with one million vertices. Fig 2 of Evans and Saramäki, arXiv:cond-mat/0411390.

There was a technical point in the Stumpf and Porter article that did catch my eye but in the end left me disappointed. They highlighted that for random numbers drawn from any long-tailed distribution there is a generalisation of the central limit theorem. It tells us that sums of many random numbers drawn from such distributions will themselves lie in distributions with a power-law tail (you have to define all this carefully, though). However, this is not as powerful as it sounds: the definition of a long-tailed distribution in this context is one with a power-law tail, which seems circular to me.
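One way to see why ordinary central-limit behaviour fails for such distributions is that a single large term can dominate the whole sum. A quick simulation of my own (an illustration of the heavy-tail regime, not a derivation of the generalised theorem): for power-law tails p(x) ~ x^-(alpha+1) with alpha below 1, the largest of many terms carries a large fraction of the sum, while for a lighter tail it becomes negligible.

```python
import random

def mean_max_share(alpha, n_terms, trials=2000, seed=1):
    """Average fraction of a sum contributed by its single largest term,
    for samples with survival function P(X > x) = x^-alpha, x >= 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Inverse-transform sampling from the power-law tail
        xs = [(1.0 - rng.random()) ** (-1.0 / alpha)
              for _ in range(n_terms)]
        total += max(xs) / sum(xs)
    return total / trials

print(round(mean_max_share(0.5, 100), 2))  # heavy tail: one term dominates
print(round(mean_max_share(5.0, 100), 2))  # light tail: no term dominates
```

With no single term negligible, the usual averaging argument behind the central limit theorem breaks down, which is where the stable-law generalisation comes in.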

Less seriously, I thought the sketch provided in the Stumpf and Porter Science article was a bad example to set in an article about bad representations of data. One axis is a qualitative judgement while the other is an unspecified statistical measure. The ‘data points’ on the plot are mostly unspecified. The one marked Zipf’s law is presumably the classic data on cities, though which of the many data sets on this topic I’m not sure. What I think was intended was a sketch of the authors’ subjective judgements of the power laws suggested in different fields, in terms of their two main themes: the statistical support for power laws, and the modelling and interpretation of results. In the end the sketch didn’t convey anything useful to me.

Still, these are minor quibbles. If an article can keep the Imperial Complexity and Networks programme’s weekly meeting talking for an extra half an hour – a discussion led by Gunnar Pruessner from Imperial – it has got to be doing something good.