Networks, Geometry and Clustering

Networks, Geometry and Clustering

Clustering is a vital tool when handling data making it a central part of data science. By grouping similar objects together, it helps us find what we are looking for. I don’t go to a bakery to find a book. Clustering is part of a wider idea in science as we are always faced with thousands of potential or actual measurements but we need to focus on the few which are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measuring temperature, pressure and volume do throw a lot of light on that problem. In whatever branch of science we are working in, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use will need a measure of distance and it is most natural to call upon the everyday distance as defined by any ruler – formally the Euclidean distances d where for example d2 =  x2  +  y2  +  z2  for the distance between the origin and a point at (x,y,z) in 3-dimensions.

However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity. James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them - Directed Acyclic Graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex and then any edges present always point in one direction in time, say from the more recent vertex to an older one. Typically the time is a real physical time but for these types of network one can always construct an effective if artificial time coordinate.

There are many types of data with a directed acyclic graph structure. Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document which always reference older documents - our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv  paper repository, US Supreme court judgements, and patents. My blog on citation network modelling gives some more background and how I think about citation networks in general.

Combining these two concepts  James Clough and I have adapted a well known clustering method, MDS (Multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2016b). Traditional MDS is usually applied to data sets where you have a matrix of distances between each object. For a network, this would usually be the length of the shortest path between each node. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way to take account of the fact that two papers on exactly the same topic can be published at the same time in different places. They are clearly ‘close’ together in any common sense definition of close yet there is no direct connection through their citation network. Our method will show that these papers are similar just from the pattern of their citations. Indeed the text could be fairly different (perhaps with two documents on networks, one uses the terms node, link, network while the second uses vertex, edge, graph for the same concepts) but the way these two documents are used by others later, or the way the to documents were based on the same material, indicates they are likely to be working on the same ideas.

Once you have the coordinates of each document in the citation network there are many other standard geometric tools you can use to do other jobs. For instance to recommend similar papers to one you are reading, you just look for other documents close in a geometric sense given the coordinates we have calculated. In the figure we show the top two hundred papers from the first decade of the hep-th part of the arXiv paper repository (this is dominated by string theory). The visualisation uses coordinates found using our Lorentzian MDS technique.

Top 200 hep-th citation network

A two dimensional embedding of the 200 most cited papers in the hep-th citation network where coordinates are found using our Lorentzian MDS algorithm. From Clough and Evans 2016b.

Our work with Minkowski space fits into broader programme of looking at networks in terms of the geometry of different types of space, what I call Netometry (Networks + Geometry, or perhaps Neteometry is better), as exemplified by Krioukov et al 2009. For instance, a good indication that a low dimensional Minkowski space might be a good representation of many citation networks came from our measurements of dimension (Clough and Evans 2016a).

Bibliography

Clough, J.R. & Evans, T.S., 2016a. What is the dimension of citation space? Physica A (in press) 2016. [ DOI 10.1016/j.physa.2015.12.053
arXiv:1408.1274 ]

Clough, J.R. & Evans, T.S., 2016b. Embedding graphs in Lorentzian spacetime, arXiv:1602.03103

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. and Boguna, M. 2010. Hyperbolic geometry of complex networks. Phys. Rev. E, 82 [ arXiv:1006.5169 ]

 

The Many Truths of Community Detection

The Many Truths of Community Detection

You do not need to know the detailed properties of every small part making up a gas, it turns out the bulk properties of a gas can be derived from very general principles. In the same way when looking at Facebook data, we might be able to identify groups of people who behave in a similar way. Searching for these groups or clusters in data is central in many areas of physical and social science. It is often easier to understand the behaviour of a large system by looking at these clusters, which are much fewer in number.

In terms of networks, the clustering is based on the structure (topology) of the network and the groups found are called Communities. In this case we might expect a coherent group to be one which has more links between members of the group than it ha to nodes outside the group in other clusters. I have done some work on what is called Community Detection, particularly in methods which assign nodes to the membership of several clusters (e.g. my line graph and clique graph papers referenced below).  After all, my social connections are likely to show that I am part of several groups: work colleagues, family relationships, connections made through hobbies or sports.

For some time I have been very wary about the meaning of the clusters found with such methods and particular about claims of one method being able to find “better” communities than another.  A recent paper prompted me to think about this again. In Community detection in networks: structural clusters versus ground truthHricDarst, and Fortunato from Aalto University in Finland (a big centre for networks research) asked if the network methods were finding different sorts of clusters from those found using other aspects of the data. Typically when testing a community detection method, one sets up artificial networks in which each node is assigned to one community. The edges between nodes are then assigned at random but with a preference for edges to be between nodes from the same community.  I can do all the tests I like on artificial data but I am always worried that this approach has introduced some hidden bias. Perhaps we end up choosing the methods that ‘work’ on artificial data but which are perhaps not so good on real messy data? It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments but defining what we mean by “the best” clustering is impossible. Even with artificial networks, the “ground truth” is not generally an absolute truth.  Typically the “truth” are input parameters and the actual network generated is partly random. So while the resulting artificial network is correlated with the ground truth it is not designed to be a perfect match. So in this case the “actual truth” will, in almost most cases, be different from the ground truth. [Note added 5th January 2016. Another more recent paper which includes some expert evaluation of communities found as well as a comparison of many different methods is Šubelj, van Eck and Waltman 2015 Clustering scientific publications based on citation relations: A systematic comparison of different methods.]

I also worry about what we do when we run network community detection methods on large real data sets where there is no simple ground truth.  When I have done this, I can find a variety of possible answers for communities in the data.  Many look reasonable but none correlate perfectly with each other or with what I know from other sources. This leaves me wondering if the automatic methods are finding one truth and my other information gives another. Alternatively the automatic methods might be rubbish, good on artificial cases, not so good in reality. There is no simple way of telling.

In any case do real networks have a “ground truth”?  Quite often people have data from other sources about real networks and they use this to construct a “ground truth”.  The test is then to see if automatic methods can find this ground truth. However what if the other data is wrong? People don’t always tell the truth, they can deliberately mislead or they can misunderstand the problem. Children surveyed about their friendships may tell you who they’d like to be friends with (the most popular person in the class) and not who they actually spend time with.

Zachary Karate Club network clustered using clique graph methods

Zachary Karate Club network clustered using clique graph methods

Take the famous Zachary karate club data set used by many (including myself) as a simple test. This is a network of members of a karate club that split in two during the sociologist’s study. Let us accept that the professionalism of Zachary has produced data that is a true reflection of the situation despite the difficulty of measuring associations in social science. If you look at the published paper it actually gives two truths. One is based on which of two factions the members actually joined, and one based on an automatic community detection method.  I suspect most people are using the latter as the ground truth (unwittingly) when testing their work.  Perhaps this is a further example supporting the claim that academics only read 20% of their references. Worse the data given in the published karate club paper is not consistent – the unweighted adjacency matrix is not symmetric.  So which truth was used for all those papers using the Karate club network?

American College Football network

American College Football network clustered using clique graph clustering methods

Another example comes from some work I did on overlapping community methods. Like many other people I downloaded a standard data set from Mark Newman’s web site, an extremely useful resource. The American College Football data was created by Girvan and Newman (in Community structure in social and biological networks) and represents the games played between American College Football teams in one season.  Also provided are the conference membership of each team.  Teams play more games against teams from their own conference than from any one other conference. In fact, this data is so well clustered that surely no method should get anything wrong beyond a few independent teams as my visualisations here illustrate (taken from my clique based clustering paper). So I looked at the “mistakes” made by my method. After about two afternoons of wading through interminable websites of stats on American College football and Wikipedia pages on the College Conference system, I realised that in fact most of the “mistakes” were not from the automatic community detection but lay in “the ground truth”, that is in the conferences assigned to teams in the data file. It turns out that the assignments in the original football.gml file are for the 2001 season while the file records information about the games played for the 2000 season. For instance, the Big West conference existed for football till 2000 while the Sun Belt conference was only started in 2001. There were 11 conferences and 5 independents in 2001 but 10 conferences and 8 independents in 2000.   Care is needed as American College athletic conferences cover many sports, with some sports joining or dropped from any one conference time to time. Teams can also switch conferences too. In fact, around 10% of the college teams playing American football at the top level changed conferences around 2000-2001. (Note added 5th January 2016.  These errors were also noted in Gschwind et al 2015, Social Network Analysis and Community Detection by Decomposing a Graph into Relaxed Cliques, only the second paper I’ve seen which does this independent from my discussion).

So often the “ground truth” is just another truth not some absolute truth! The errors in the Zachary Karate club and American College Football data do not matter in one sense as they still provide valid and valuable tests for methods.   The conclusions in the hundreds of papers using these data sets and which use these questionable ground truths would not change. Indeed it highlights one role for automatic methods.   You can see that where Girvan and Newman’s methods get the “wrong” answer in their original paper (Community structure in social and biological networks) they are in fact highlighting problems with their conference data. Validation of data is a very useful if boring job. A final question will always be if there is a single truth. For instance I am in the theoretical physics group of the physics department of the Faculty of Natural Sciences at Imperial College London.  That top-down hierarchical truth is important when allocating desks and teaching.  However another truth would emerge if you studied my research relationships.  Those are with staff and students based in other physics research groups and with colleagues from other departments and even other faculties.

So I was really pleased to see that Community detection in networks: structural clusters versus ground truth were questioning the meaning of truth in community detection from a quantitative point of view. Clustering of data, finding communities in data is of tremendous value commercially and for research, but there is still a lot more work to do before we understand these different truths.

References

M. Girvan, M.E.J. Newman, Community structure in social and biological networks, PNAS (2002) 99, 7821-782

W. Zachary, Information-Flow Model For Conflict And Fission In Small-Groups Journal Of Anthropological Research, (1977) 33  452–473

D. Hric, R.K. Darst,and S.Fortunato,  Community detection in networks: structural clusters versus ground truth,  arXiv:1406.0146

T.S. Evans, American College Football Network Files. figshare. (2012).  http://dx.doi.org/10.6084/m9.figshare.93179

T.S. Evans, and R. Lambiotte, Line Graphs, Link Partitions and Overlapping Communities Phys.Rev.E, 2009, 80, 016105 [arXiv:0903.2181].

T.S. Evans, Clique Graphs and Overlapping Communities J. Stat. Mech. (2010) P12037  http://dx.doi.org/10.1088/1742-5468/2010/12/P12037
[arXiv:1009.0638]

L. Šubelj, N.J. van Eck, L. Waltman, Clustering scientific publications based on citation relations: A systematic comparison of different methodsarXiv:1512.09023 .

Myths and Networks

Myths and Networks

I have just read an intriguing paper by Carron and Kenna entitled the  ‘Universal properties of mythological networks‘. In it they analyse the character networks  in three ancient stories, Beowulf , the Iliad and the Irish story Táin Bó Cuailnge.  That is the characters form the nodes of a network and they are connected if they appear together in the same part of the story. It has caused quite a bit of activity.  It has prompted two posts on The Networks Network already and has even sparked activity in the UK newspapers (see John Sutherland writing in the Guardian Wednesday 25 July 2012 and the follow up comment by Ralph Kenna one of the authors).  Well summer is the traditional silly season for newspapers.

However I think it is too easy to dismiss the article. I think Tom Brugmans posting on The Networks Network has it right that  “as an exploratory exercise it would have been fine”.  I disagreed with much in the paper, but it did intrigue me and many papers fail to do even this much.  So overall I think it was a useful publication. I think there are ideas there waiting to be developed further.

I like the general idea that there might be some information in the character networks which would enable one to say if it was based on fact or was pure fiction. That is if the character networks have the same characteristics as a social network it would support the idea that it was based on historical events. I was intrigued by some of the measures suggested as a way to differentiate between different types of literary work.  However like both Tom Brugmans and Marco Büchler, I was unconvinced the authors’ measures really do the job suggested. I’d really like to see a lot more evidence from many more texts before linking a particular measurement to a particular feature in character networks.

For instance Carron and Kenna suggest that in hierarchical networks for every node the degree times the clustering coefficient is a constant, eqn (2).  That is each of your friends is always connected to the same (on average) number of your friends.  By way of contrast, in a classical (Erdos-Reyni) random graph the clustering coefficient is a constant. However I don’t see that as hierarchical but an indication that everyone lives in similar size communities, some sort of fiction character Dunbar number. I’m sure you could have a very flat arrangement of communities and get the same result. Perhaps we mean different things by hierarchical.

Another claim was that in collaboration networks less than 90% of nodes are in the giant component.  The Newman paper referred to is about scientific collaboration derived from coauthorships which is very different from the actual social network of scientists (science is not done in isolation no one is really isolated). I’m not sure the Newman paper tells us anything about character structure in fictional or non-fictional texts.  I can not see why one would introduce any set of characters in any story (fictional or not) who are disconnected from the rest. Perhaps some clever tale with two strands separated in time yet connected in terms other than social relationships (e.g. through geography or action) – David Mitchell’s “Cloud Atlas” comes to my mind  - but these are pretty contrived structures.

I think a real problem in the detail of the paper, as Marco Büchler points out, is that these texts and their networks are just too small.  There is no way one can talk rigorously about power laws, and certainly not to two decimal place accuracy. I thought Michael Stumpf and Mason Porter’s commentary (Critical Truths about Power Laws) was not needed since every one knew the issues by now (I don’t in fact agree with some of the interpretation of mathematical results in Stumpf and Porter).  Perhaps this mythological networks paper shows I was wrong. At best power law forms for small networks (and small to me means under a million nodes in this context) give a reasonable description or summary of fat tailed distributions found here but many other functional forms will do this too.  I see no useful information in the specific forms suggested by Carron and Kenna.

Another point raised in the text was the idea that you could extract subnetworks representing `friendly’ social networks. That is interesting but really they are suggesting we need to do a semantic analysis of the links in the text, indicating where links are positive or negative (if they are that simple of course) and form signed networks (e.g. see Szell et al. on how this might be done on a large scale http://arxiv.org/abs/1003.5137).  I think that is a much harder job to do in these texts than the simple tricks used here suggest but it is an important aspect in such analysis and I take the authors’ point.

Finally I was interested that they mention other character networks derived from five other fictional sources.  I always liked the Marvel comic character example for instance (Alberich et al, http://arxiv.org/abs/cond-mat/0202174) as it showed that while networks were indeed trendy and hyped (everything became a network) there was often something useful hiding underneath and trying to get out in even the most bizarre examples.  However what caught my eye in the five extra examples mentioned by Carron and Kenna was that they treated these five as ‘fictional literature’.  One, Shakespeare’s Richard III, is surely a fictionalised account of real history written much closer to the real events and drawing on `historical’ accounts.  I’d would have expected it to show the same features as they claim for their three chosen texts.

So I was intrigued and in that sense that always makes a paper/talk worthwhile to me.  However while I was interested I’d need to see much more work on the idea.  You might try many different tests and measurements and see if they cumulatively point in one direction or another – I imagine a PCA type plot showing different types of network in tight clusters in some `measurement’ space.  I’d still need convincing on a large number of trial texts.  These do now exist though, so surely there is a digital humanities project here? Or is it already happening somewhere?