## Networks, Geometry and Clustering

Clustering is a vital tool for handling data, which makes it a central part of data science. By grouping similar objects together, it helps us find what we are looking for: I don’t go to a bakery to find a book. Clustering is part of a wider idea in science. We are always faced with thousands of potential or actual measurements, but we need to focus on the few that are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measuring temperature, pressure and volume does throw a lot of light on that problem. In whatever branch of science we work, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use need a measure of distance, and it is most natural to call upon the everyday distance as defined by any ruler: formally the Euclidean distance $d$, where for example $d^{2} = x^{2} + y^{2} + z^{2}$ for the distance between the origin and a point at $(x, y, z)$ in three dimensions.

However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity. James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them - Directed Acyclic Graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex and then any edges present always point in one direction in time, say from the more recent vertex to an older one. Typically the time is a real physical time but for these types of network one can always construct an effective if artificial time coordinate.
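To make the contrast with the Euclidean formula concrete, here is the standard interval of Minkowski space-time for two events separated by a time difference $\Delta t$ and spatial differences $\Delta x$, $\Delta y$, $\Delta z$. The symbols $s$ and $c$ (the speed of light) and the sign convention, with time negative and space positive, are assumptions made for this sketch; the opposite overall sign is equally common.

$$
s^{2} = -c^{2}\,\Delta t^{2} + \Delta x^{2} + \Delta y^{2} + \Delta z^{2}
$$

Unlike $d^{2}$, the quantity $s^{2}$ can be negative (a time-like separation, where one event can influence the other), zero (light-like) or positive (space-like, where neither event can influence the other). It is this built-in distinction between pairs of points that can and cannot be causally related that makes such a geometry a plausible match for networks whose edges only ever point one way in time.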

There are many types of data with a directed acyclic graph structure. Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document, which always reference older documents: our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv paper repository, US Supreme Court judgements, and patents. My blog on citation network modelling gives some more background and describes how I think about citation networks in general.
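The structure just described, together with the idea of an effective time coordinate mentioned earlier, can be sketched in a few lines of Python. This is a minimal illustration on a made-up four-document network. The names `citations` and `effective_time` are hypothetical, and using the length of the longest citation chain as the time coordinate is just one simple choice, not necessarily the one used in the papers cited here.

```python
# Hypothetical toy citation network: each document maps to the
# documents listed in its bibliography (edges point back in time).
citations = {
    "D": ["B", "C"],
    "C": ["A"],
    "B": ["A"],
    "A": [],
}


def effective_time(citations):
    """Assign each document an artificial time coordinate.

    A document that cites nothing gets time 0. Any other document gets
    one more than the largest time among the documents it cites, i.e.
    the length of its longest citation chain. Every edge then points
    from a larger time to a strictly smaller one.
    """
    times = {}

    def visit(doc, in_progress):
        if doc in times:
            return times[doc]
        if doc in in_progress:
            # A document that (indirectly) cites itself breaks acyclicity.
            raise ValueError(f"cycle found at {doc!r}: not a DAG")
        in_progress.add(doc)
        cited_times = (visit(c, in_progress) for c in citations.get(doc, []))
        t = 1 + max(cited_times, default=-1)
        in_progress.remove(doc)
        times[doc] = t
        return t

    for doc in citations:
        visit(doc, set())
    return times


times = effective_time(citations)
for doc in sorted(times, key=times.get):
    print(doc, times[doc])
```

The recursive form keeps the sketch short but would hit Python's recursion limit on very long citation chains; an iterative topological sort would be the safer choice for a real data set.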

Combining these two concepts, James Clough and I have adapted a well-known clustering method, MDS (multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2016b). Traditional MDS is usually applied to data sets where you have a matrix of distances between every pair of objects. For a network, this would usually be the length of the shortest path between each pair of nodes. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way to take account of the fact that two papers on exactly the same topic can be published at the same time in different places. They are clearly ‘close’ together in any common-sense definition of close, yet there is no direct connection between them in the citation network. Our method will show that these papers are similar just from the pattern of their citations. Indeed the text could be fairly different (perhaps, of two documents on networks, one uses the terms node, link, network while the second uses vertex, edge, graph for the same concepts), but the way these two documents are used by others later, or the way the two documents were based on the same material, indicates they are likely to be working on the same ideas.
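For readers who have not met MDS before, the traditional Euclidean version described above can be sketched as follows. To be clear, this is classical MDS applied to shortest-path lengths on a small undirected graph, not the Lorentzian variant of Clough and Evans (2016b). The function names, the toy path graph and the use of plain power iteration (to avoid any dependency beyond the standard library) are all assumptions made for illustration; in practice a routine such as `numpy.linalg.eigh` would be the usual way to get the eigenvectors.

```python
import math
import random
from collections import deque


def shortest_path_lengths(adj):
    """All-pairs shortest path lengths by breadth-first search.

    `adj` is an adjacency list of an unweighted, connected graph.
    """
    n = len(adj)
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == math.inf:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def matvec(B, v):
    return [sum(b * x for b, x in zip(row, v)) for row in B]


def classical_mds(dist, dim=2, iterations=500, seed=0):
    """Euclidean coordinates whose distances approximate `dist`.

    Assumes the leading eigenvalues (in magnitude) of the centred
    matrix are positive, which holds for near-Euclidean distances.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring turns squared distances into inner products.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dim for _ in range(n)]
    for k in range(dim):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iterations):  # power iteration
            w = matvec(B, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vnorm = math.sqrt(sum(x * x for x in v))
        v = [x / vnorm for x in v]
        eigval = sum(a * b for a, b in zip(v, matvec(B, v)))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        B = [[B[i][j] - eigval * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


# Toy example: the path graph 0 - 1 - 2 - 3, embedded on a line.
adj = [[1], [0, 2], [1, 3], [2]]
dist = shortest_path_lengths(adj)
coords = classical_mds(dist, dim=1)
print(coords)
```

A path graph is a convenient check because its shortest-path lengths are exactly the distances between equally spaced points on a line, so a one-dimensional embedding should reproduce them.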

Once you have the coordinates of each document in the citation network, there are many other standard geometric tools you can use to do other jobs. For instance, to recommend papers similar to the one you are reading, you just look for other documents that are close in a geometric sense, given the coordinates we have calculated. In the figure we show the top two hundred papers from the first decade of the hep-th part of the arXiv paper repository (this is dominated by string theory). The visualisation uses coordinates found using our Lorentzian MDS technique.

Our work with Minkowski space fits into a broader programme of looking at networks in terms of the geometry of different types of space, what I call *Netometry* (Networks + Geometry, or perhaps *Neteometry* is better), as exemplified by Krioukov et al. 2010. For instance, a good indication that a low-dimensional Minkowski space might be a good representation of many citation networks came from our measurements of dimension (Clough and Evans 2016a).

**Bibliography**

Clough, J.R. & Evans, T.S., 2016a. What is the dimension of citation space? Physica A (in press) 2016. [ DOI 10.1016/j.physa.2015.12.053, arXiv:1408.1274 ]

Clough, J.R. & Evans, T.S., 2016b. Embedding graphs in Lorentzian spacetime, arXiv:1602.03103

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. & Boguna, M., 2010. Hyperbolic geometry of complex networks. Phys. Rev. E, 82 [ arXiv:1006.5169 ]

## Citeology

Well, we all know that adding “-ology” to a word makes it a science – geology, biology, scientology – oh, well, perhaps not scientology. The citeology project at Autodesk Research is a wonderful visualisation that shows the temporal relationship between references. The corpus to which the analysis is applied is currently quite small, extending to some 3502 papers in Human Computer Interaction conferences between 1982 and 2010, with 11699 citations tracked. The ensuing diagrams give a compelling visualisation, showing quickly just how many citations have been made to articles in the corpus, which articles are uncited and what the temporal “reach” of an article has been. There is a nice app on the page that allows you to explore the data set. While this works well for smaller data sets, I wonder how this approach could be scaled to work with something of the size of the Web of Science or Scopus data sets?

Evidently, Justin Matejka is the force behind this work; a contact link for him can be found on the page mentioned above. A paper describing the approach by Justin and his colleagues Tovi Grossman and George Fitzmaurice is available at http://autodeskresearch.com/publications/citeology2.