Networks, Geometry and Clustering

Networks, Geometry and Clustering

Clustering is a vital tool when handling data making it a central part of data science. By grouping similar objects together, it helps us find what we are looking for. I don’t go to a bakery to find a book. Clustering is part of a wider idea in science as we are always faced with thousands of potential or actual measurements but we need to focus on the few which are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measuring temperature, pressure and volume do throw a lot of light on that problem. In whatever branch of science we are working in, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use will need a measure of distance and it is most natural to call upon the everyday distance as defined by any ruler – formally the Euclidean distances d where for example d2 =  x2  +  y2  +  z2  for the distance between the origin and a point at (x,y,z) in 3-dimensions.

However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity. James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them - Directed Acyclic Graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex and then any edges present always point in one direction in time, say from the more recent vertex to an older one. Typically the time is a real physical time but for these types of network one can always construct an effective if artificial time coordinate.

There are many types of data with a directed acyclic graph structure. Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document which always reference older documents - our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv paper repository, US Supreme court judgements, and patents. My blog on citation network modelling gives some more background and how I think about citation networks in general.

Combining these two concepts  James Clough and I have adapted a well-known clustering method, MDS (Multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2017). Traditional MDS is usually applied to data sets where you have a matrix of distances between each object. For a network, this would usually be the length of the shortest path between each node. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way to take account of the fact that two papers on exactly the same topic can be published at the same time in different places. They are clearly ‘close’ together in any common sense definition of close yet there is no direct connection through their citation network. Our method will show that these papers are similar just from the pattern of their citations. Indeed the text could be fairly different (perhaps with two documents on networks, one uses the terms node, link, network while the second uses vertex, edge, graph for the same concepts) but the way these two documents are used by others later, or the way the two documents were based on the same material, indicates they are likely to be working on the same ideas.

Once you have the coordinates of each document in the citation network there are many other standard geometric tools you can use to do other jobs. For instance, to recommend similar papers to one you are reading, you just look for other documents close in a geometric sense given the coordinates we have calculated. In the figure, we show the top two hundred papers from the first decade of the hep-th part of the arXiv paper repository (this is dominated by string theory). The visualisation uses coordinates found using our Lorentzian MDS technique.

Top 200 hep-th citation network

A two-dimensional embedding of the 200 most cited papers in the hep-th citation network where coordinates are found using our Lorentzian MDS algorithm. From Clough and Evans 2016b.

Our work with Minkowski space fits into a broader programme of looking at networks in terms of the geometry of different types of space, what I call Netometry (Networks + Geometry, or perhaps Neteometry is better), as exemplified by Krioukov et al 2009. For instance, a good indication that a low dimensional Minkowski space might be a good representation of many citation networks came from our measurements of dimension (Clough and Evans 2016a).

Bibliography

Clough, J.R. & Evans, T.S., 2016. What is the dimension of citation space? Physica A 448, 235-247 [ DOI 10.1016/j.physa.2015.12.053
arXiv:1408.1274 ]

Clough, J.R. & Evans, T.S., 2017. Embedding graphs in Lorentzian spacetime, PLoS ONE 12 e0187301 [DOI: 10.1371/journal.pone.0187301 , arXiv:1602.03103]

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. and Boguna, M. 2010. Hyperbolic geometry of complex networks.
Phys. Rev. E, 82 [ arXiv:1006.5169 ]

 

Modelling the Footprints of Innovation: Citation Networks

Modelling the Footprints of Innovation: Citation Networks

When one document refers to another in it’s text, this is called a citation. The pattern of these citations is most naturally represented as a network where the nodes are the documents and the links are the citations between the documents. When documents were physical documents, printed or written on paper, then these citations must (almost always) always point back in time to older documents. This arrow of time is imprinted on these citation networks and it leads to interesting mathematical properties.

One of the most interesting features of citations is that they have been carefully curated, sometimes for hundreds of years. The data I use on the US supreme court judgments goes back to the founding of the USA. So citation data is one of the oldest and continuous ‘big data’ sets to study.

The reason why records of citations have been maintained so carefully is that they record the process of innovation, be it in patents, law or in academic study. When you try to claim a patent you must by law indicate the prior art, earlier patents with relevant (but presumably less advanced) ideas. When a judge makes a judgement in a case in the USA, they draw on earlier cases which have interpreted, and so created, the law needed to come to a conclusion in the current case. And of course, academics can’t discuss the whole of existing science when explaining their own new ideas so they have to refer back to papers where previous researchers have set out a key idea needed in the current work. Citations are therefore a vital part of knowledge transfer, new ideas build on all the work done in earlier judgments, previous patents or older papers. That is why citations have been so carefully recorded. The network formed by documents and their citations show whose giant shoulders we are standing on when innovations are made, to paraphrase Newton.

From a theoretical point of view there are many interesting features in these networks. If you follow citations from one document to another to another and so on, at each step you will always reach an older paper. So you can never come back to the starting point and such paths. There are no cycles in a citation network (it is an example of a directed acyclic graph). If you look at the number of citations each document gets, how many newer documents refer back to one document, then they follow a fat-tailed distribution – a few documents have most of these citations, while most documents have very few citations each. Derek de Solla Price’s 1965 paper for an early discussion of this feature. Moreover, if you look at documents in the same year, you get roughly the same shape for the number of documents with a given number of citations (see Radicchi et al 2008, Evans et al 2012), at least for well cited documents. Since these networks are of such great interest,many other features have been noted too.

One way for a theorist to understand what is happening is to build a model from a few simple rules which captures as many of the features as possible. One of the first was that of Derek de Solla Price (1965) whose theory of “cumulative advantage” suggested that as new documents were created they would cite existing papers in proportion to the number of citations they already had, that is the richer get richer. This follows a principle used in many other models of fat-tails in other data, and indeed was later rediscovered in the context of the number of links to modern web pages – Barabási and Albert’s preferential attachment (1999). One trouble with this simple model is that the oldest documents are always the ‘rich’ ones with the most links. In reality, each year of publication there are a few documents with many citations (relative to the average number for that year) and most have very few. The Price model does not give this as all documents published in the same year will have roughly the same number of citations. To address this problem we (Sophia Goldberg, Hannah Anthony and myself,, Goldberg et al 2015) searched for a simple model which reproduced this behaviour – fat tails for the citation data of papers published in one field and one year (Goldberg et al, 2015).

The simplest model we found works as follows. At each step we add a new document, representing the evolution in time of our citation network.

A new document (red diamond) is added to an existing set of documents (blue circles) and their citations to earlier documents (arrows).

A new document first looks at recently published documents, as it is well known that citations tend to favour more recent documents. What we mean by recent is set by one of our parameters, a time scale τ.

First look at recent documents only.

We choose these recent documents partly at random (fraction (1-p)) and partly with cumulative attachment (fraction p) in which we pick recent papers to cite in proportion to the number of their current citations. This choice of papers is not realistic since it requires the authors to be able to choose from all recent documents while in reality authors only have a limited knowledge. However this stage is meant to capture, statistically at least, the way authors learn about recent developments: a recommendation from a colleague, a talk at a conference, scanning new editions of certain journals and so forth. Sometimes it will be essentially random, sometimes this first choice will reflect the attention papers have or will receive.

Choose to cite a recent document, with probability p use cumulative advantage (preferential attachment) or simply random with probability (1-p).

Once we have chosen these primary documents to cite, our new document then looks at the references within these primary documents.

Look at the papers cited by the primary paper just selected.

Each paper cited in the primary paper is then cited, copied, by the new document with probability q, the third and last parameter of the model.

Each paper cited in the primary paper is selected with probability q.

The new paper cites the selected secondary papers, so on average q references are copied from the primary paper.

This ‘copying’ process is known to be a way of getting cumulative attachment with only local knowledge of the network (see for instance my paper with Jari Saramäki (Evans & Saramäki 2005) and references therein). That is you only need to read the papers you have already cited, already found, to find these secondary documents to reference. There is no need for the model to know about the whole network at this point, reflecting the limited knowledge of actual authors.

It was only when we added the last copying process that we found our model reproduced the fat-tails seen within the citations to documents published in the same year. Nothing else we tried gave a few well cited papers in every year. Comparing with one data set, taken from the hep-th section of the arXiv repository, we found that the best values for our parameters led to a typical paper in our model of hep-th made up as follows:

  • Two primary papers chosen at random from recent papers.
  • Two primary papers chosen in proportion to the number of their citations from recent papers.
  • Eight secondary papers chosen by copying a reference from one of the first four primary papers.

This may seem like a very high level of papers being copied from the primary ones – on average we found 70% of papers were secondary citations, papers already cited in other papers being cited. One has to ask if the more recent paper, the primary one, contained all the information from the earlier ones as well as innovations being built on in the current paper. Did the new documents really derive useful information from the secondary papers cited? Often you see old ‘classic papers’ gathering citations as they are name checked in the introduction to a new paper. Not clear if the classic paper was even read while performing the current research. This feeling that some papers gain attention and acquire citations that does not reflect any direct influence on the current work is supported by at least two other studies. One was a study by Simkin and Roychowdhury (2005) of the way errors in the bibliographies of papers are copied in later papers. They suggest that this meant 80% of citations came from such copying of references. In another approach, James Clough, Tamar Loach, Jamie Gollings and myself (Clough et al, 2015) exploited the special properties of citation networks and this also suggested that 70%-80% of links were unnecessary for the logical structure of academic citation networks.

Of course constructing simple models will never capture the whole story. Models are, though, a good way to see if we have understood the key principles underlying a system.

References

CitNetExplorer – Citation Network Analyser and Visualisation

I have just come across an interesting citation network analyser and visualiser – CitNetExplorer. Looks to be a very professional package that I will certainly be using.

One of the interesting things about citation networks is that their vertices have an order given by their publication date. This is a very strong constraint on the system so when you analyse or visualise such a network you should take the time ordering  into account. The simplest example is that you should not just look at the vertex degree in these networks, but at in-degree (citation count) and out-degree (length of bibliography). Perhaps the most obvious aspect of the constraint comes when you try to visualise the network. You can just put such networks into a standard network package and treat it as a directed network, which it is. However any standard visualisation will undoubtedly place the vertices all over the two-dimensional surface used for display. Standard visualisations pay no attention to the time-ordering of the vertices yet you almost certainly want to show that information when displaying a citation network as it is such a critical part of the definition. So many of the properties will depend on the age of the publication for instance. I have encountered this myself and played around with a few ad-hoc solutions but came to the conclusion I needed to write something myself, adapting a standard layout method to set one dimension of the vertex coordinates while the second dimensions is set by the vertice’s time. Since the same problem is encountered when making diagrams showing the critical paths in a set of tasks (such as Gantt charts) there are packages which will do this. However you will also want to do different types of analysis on a citation network plus they are likely to be much bigger than a normal Gantt chart.

This is where CitNetExplorer comes in. This comes from Nees Jan van Eck and Ludo Waltman at the CWTS (Centre for Science and Technology Studies) in Leiden, so comes from one of the leading institutes in bibliometric research. Its very early days and I have only had a short play but for me its good points are:

  • Free for noncommercial and teaching purposes.
  • Cross platform as written in java.
  • Stable on my Windows 7 machines.
    As it is written in java, it is likely to be stable on other platforms too.
  • Well presented with a reassuring professional feel.
  • Good graphical display.
    The publications are laid out using their publication data for the vertical coordinate and a layout algorithm to place the publications horizontally
  • Good default options.
    I got an instantly readable figure every tine I tried it
  • Good range of graphical output options.
    Vector graphics, especially postscript (eps), is essential for me. Note these are all under the Screenshot menu option.
  • Two basic network format output options.
    A pajek .net and a simple text file format (see below)
  • Various basic analysis tools.
    This includes transitive reduction  which is something I have been very interested in and can throw up some new insights into the citation counts of papers (see arXiv:1310.8224).

The forty most highly cited papers in hep-th (1992-2003) after transitive reduction as an example of output from CitNetExplorer. (click on image for better version)

So this looks to be a really nice package. Of course, I am never satisfied so what would I like to see in future versions:

  • Open source.
    It would be nice to be able to learn from their computational work and to add to this myself. Maybe some type of plug-in could be added to solve the latter problem. I have a few more tricks for citation networks in the pipeline for instance.
  • More input options.
    There are only two and one is tied to Thomson-Reuter’s WoS (Web of Science) database. In the example given by the authors you perform a search on WoS and then save the results in a text file (saverecs.txt).  Note you must select the “Web of Science Core Collection” not the “All Databases” option which the example clearly shows but I didn’t read, otherwise the output file will not include the full citation information needed to construct the citation network.  This file is a simple text file so you should be able to combine them by hand if like me you are limited to 500 records per file.
    The alternative is a pair of relatively simple text files.  These are not as yet explained in the documentation. Basically there are two files.  First is namepub.txt file lists the properties of the publications and the order in this file assigns each publication an index (the publication on line 2 is vertex 1, line 3 defines vertex 3 and so on). The second file is called namecite.txt and is an edge list written in terms of the vertex index. Look at the first few lines of the example data James Clough made from the open source KDD cup arXiv citation network data that we have been using in our recent work. Alternatively if you can produce a file from WoS open it in CitNetExplorer then save it in what is called CitNetExplorer format. These CitNetExplorer files are easy to look at, edit and prepare in a spreadsheet or a basic text editor and appear to be tab separated.
  • Visualisation editing.
    No layout is perfect so it is essential to be able to move the vertices by hand. One of my favourite visualisation packages, visone, shows what you can do in java, and even my own ariadne package built on the jung library gave that functionality automatically.

Rather less seriously, I am not sure about the name.  I would pronounce the “Cit” in “CitNetExplorer” as “sit” or perhaps “chit” so I would have kept the “e” in “cite”, CiteNetExplorer, but its not my product. As I’m getting bored typing it it, I’m sure it will become just CNE in any case.

Links

CitNetExplorer http://www.citnetexplorer.nl/

Van Eck, N.J., & Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. [arXiv:1404.5322]

Van Eck, N.J., & Waltman, L. (2014). Systematic retrieval of scientific literature based on citation relations: Introducing the CitNetExplorer tool. In Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval (BIR 2014), pages 13-20.

James R. Clough, Jamie Gollings, Tamar V. Loach, Tim S. Evans (2013).
Transitive Reduction of Citation Networks. [arXiv:1310.8224]

Clough, James; Evans, Tim; Loach, Tamar (2013). Transitive Reduction of Citation Networks. (data set) figshare
http://dx.doi.org/10.6084/m9.figshare.834935

Clough, James; Evans, Tim (2014). KDD cup arXiv data for CitNetExplorer. figshare fileset.
http://dx.doi.org/10.6084/m9.figshare.1021647