The Many Truths of Community Detection

The Many Truths of Community Detection

You do not need to know the detailed properties of every small part making up a gas, it turns out the bulk properties of a gas can be derived from very general principles. In the same way when looking at Facebook data, we might be able to identify groups of people who behave in a similar way. Searching for these groups or clusters in data is central in many areas of physical and social science. It is often easier to understand the behaviour of a large system by looking at these clusters, which are much fewer in number.

In terms of networks, the clustering is based on the structure (topology) of the network and the groups found are called Communities. In this case we might expect a coherent group to be one which has more links between members of the group than it ha to nodes outside the group in other clusters. I have done some work on what is called Community Detection, particularly in methods which assign nodes to the membership of several clusters (e.g. my line graph and clique graph papers referenced below).  After all, my social connections are likely to show that I am part of several groups: work colleagues, family relationships, connections made through hobbies or sports.

For some time I have been very wary about the meaning of the clusters found with such methods and particular about claims of one method being able to find “better” communities than another.  A recent paper prompted me to think about this again. In Community detection in networks: structural clusters versus ground truthHricDarst, and Fortunato from Aalto University in Finland (a big centre for networks research) asked if the network methods were finding different sorts of clusters from those found using other aspects of the data. Typically when testing a community detection method, one sets up artificial networks in which each node is assigned to one community. The edges between nodes are then assigned at random but with a preference for edges to be between nodes from the same community.  I can do all the tests I like on artificial data but I am always worried that this approach has introduced some hidden bias. Perhaps we end up choosing the methods that ‘work’ on artificial data but which are perhaps not so good on real messy data? It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments but defining what we mean by “the best” clustering is impossible. Even with artificial networks, the “ground truth” is not generally an absolute truth.  Typically the “truth” are input parameters and the actual network generated is partly random. So while the resulting artificial network is correlated with the ground truth it is not designed to be a perfect match. So in this case the “actual truth” will, in almost most cases, be different from the ground truth. [Note added 5th January 2016. Another more recent paper which includes some expert evaluation of communities found as well as a comparison of many different methods is Šubelj, van Eck and Waltman 2015 Clustering scientific publications based on citation relations: A systematic comparison of different methods.]

I also worry about what we do when we run network community detection methods on large real data sets where there is no simple ground truth.  When I have done this, I can find a variety of possible answers for communities in the data.  Many look reasonable but none correlate perfectly with each other or with what I know from other sources. This leaves me wondering if the automatic methods are finding one truth and my other information gives another. Alternatively the automatic methods might be rubbish, good on artificial cases, not so good in reality. There is no simple way of telling.

In any case do real networks have a “ground truth”?  Quite often people have data from other sources about real networks and they use this to construct a “ground truth”.  The test is then to see if automatic methods can find this ground truth. However what if the other data is wrong? People don’t always tell the truth, they can deliberately mislead or they can misunderstand the problem. Children surveyed about their friendships may tell you who they’d like to be friends with (the most popular person in the class) and not who they actually spend time with.

Zachary Karate Club network clustered using clique graph methods

Zachary Karate Club network clustered using clique graph methods

Take the famous Zachary karate club data set used by many (including myself) as a simple test. This is a network of members of a karate club that split in two during the sociologist’s study. Let us accept that the professionalism of Zachary has produced data that is a true reflection of the situation despite the difficulty of measuring associations in social science. If you look at the published paper it actually gives two truths. One is based on which of two factions the members actually joined, and one based on an automatic community detection method.  I suspect most people are using the latter as the ground truth (unwittingly) when testing their work.  Perhaps this is a further example supporting the claim that academics only read 20% of their references. Worse the data given in the published karate club paper is not consistent – the unweighted adjacency matrix is not symmetric.  So which truth was used for all those papers using the Karate club network?

American College Football network

American College Football network clustered using clique graph clustering methods

Another example comes from some work I did on overlapping community methods. Like many other people I downloaded a standard data set from Mark Newman’s web site, an extremely useful resource. The American College Football data was created by Girvan and Newman (in Community structure in social and biological networks) and represents the games played between American College Football teams in one season.  Also provided are the conference membership of each team.  Teams play more games against teams from the their own conference than from any one other conference. In fact this data is so well clustered that surely no method should get anything wrong beyond a few independent teams as my visualisations here illustrate (taken from my clique based clustering paper). So I looked at the “mistakes” made by my method. After about two afternoons of wading through interminable web sites of stats on American College football and Wikipedia pages on the College Conference system, I realised that in fact most of the “mistakes” were not from the automatic community detection but lay in “the ground truth”, that is in the conferences assigned to teams in the data file. It turns out that the assignments in the original football.gml file are for the 2001 season while the file records information about the games played for the 2000 season. For instance the Big West conference existed for football till 2000 while the Sun Belt conference was only started in 2001. There were 11 conferences and 5 independents in 2001 but 10 conferences and 8 independents in 2000.   Care is needed as American College athletic conferences cover many sports, with some sports joining or dropped from any one conference time to time. Teams can also switch conferences too. In fact around 10% of the college teams playing American football at the top level changed conferences around 2000-2001. (Note added 5th January 2016.  These errors were also noted in Gschwind et al 2015, Social Network Analysis and Community Detection by Decomposing a Graph into Relaxed Cliques, only the second paper I’ve seen which does this independent from my discussion).

So often the “ground truth” is just another truth not some absolute truth! The errors in the Zachary Karate club and American College Football data do not matter in one sense as they still provide valid and valuable tests for methods.   The conclusions in the hundreds of papers using these data sets and which use these questionable ground truths would not change. Indeed it highlights one role for automatic methods.   You can see that where Girvan and Newman’s methods get the “wrong” answer in their original paper (Community structure in social and biological networks) they are in fact highlighting problems with their conference data. Validation of data is a very useful if boring job. A final question will always be if there is a single truth. For instance I am in the theoretical physics group of the physics department of the Faculty of Natural Sciences at Imperial College London.  That top-down hierarchical truth is important when allocating desks and teaching.  However another truth would emerge if you studied my research relationships.  Those are with staff and students based in other physics research groups and with colleagues from other departments and even other faculties.

So I was really pleased to see that Community detection in networks: structural clusters versus ground truth were questioning the meaning of truth in community detection from a quantitative point of view. Clustering of data, finding communities in data is of tremendous value commercially and for research, but there is still a lot more work to do before we understand these different truths.

References

M. Girvan, M.E.J. Newman, Community structure in social and biological networks, PNAS (2002) 99, 7821-782

W. Zachary, Information-Flow Model For Conflict And Fission In Small-Groups Journal Of Anthropological Research, (1977) 33  452–473

D. Hric, R.K. Darst,and S.Fortunato,  Community detection in networks: structural clusters versus ground truth,  arXiv:1406.0146

T.S. Evans, American College Football Network Files. figshare. (2012).  http://dx.doi.org/10.6084/m9.figshare.93179

T.S. Evans, and R. Lambiotte, Line Graphs, Link Partitions and Overlapping Communities Phys.Rev.E, 2009, 80, 016105 [arXiv:0903.2181].

T.S. Evans, Clique Graphs and Overlapping Communities J. Stat. Mech. (2010) P12037  http://dx.doi.org/10.1088/1742-5468/2010/12/P12037
[arXiv:1009.0638]

L. Šubelj, N.J. van Eck, L. Waltman, Clustering scientific publications based on citation relations: A systematic comparison of different methodsarXiv:1512.09023 .

Who’s #1?: The Science of Rating and Ranking

I’ve recently been reading Langville and Meyer’s book, Who’s #1?: The Science of Rating and Ranking.  This is a neat book filled with nice ideas, which are clearly explained.  This isn’t Langville and Meyer’s first foray into helping popularise ranking, having previously written: Google’s PageRank and Beyond: The Science of Search Engine Rankings.  While that book concentrated more specifically on the ideas behind Google’s PageRank algorithm, the current offering covers a slightly wider range of material.  However, as with the first book, the material is interesting, and well explained.  I think that it would be useful for anyone wanting to teach a course on ranking, and covers some excellent areas that could be used for special topics lectures or guest lectures.  The material is well presented and would make good reading for undergraduates and some advanced high school students.  Having had fun with football ranking problems a few years ago, I found this book to be nice summary of the area.  Great read!

http://press.princeton.edu/titles/9661.html