There is a wealth of data on academic publications, going back a hundred years or more, and this is now held electronically. Some is freely available, and some of the rest can be purchased. There is also great interest in the analysis of this data, for both intellectual and commercial reasons, as it represents an important output from academics and a marker of the research and innovation process. The study of this data, a field known as Scientometrics or Bibliometrics, is therefore an ideal area for the application of ideas from Complexity and Networks.
I am interested in this data at various levels. It poses interesting questions in terms of theoretical graph analysis. The data is used to look at the performance of individual researchers or to evaluate whole units such as universities. Trying to improve these methods is an interesting theoretical problem with immediate practical and commercial applications, so any new ideas can be tested and then applied in the real world. This data can also be seen as the best documented footprint of innovation processes in human society, so there are broad and fascinating social science questions which might be studied.
- I have been looking at new ways to analyse citation networks. A citation network is the directed network formed when the vertices are published documents and the links are defined by the references in the bibliography of one document to older, previously published documents. Academic papers, patents and court judgements are some examples of document sets that are used to create large citation networks. My thesis is that any network constrained by time (citations always point backwards in time) needs to be analysed with that constraint taken into account (see my blog on time-constrained networks). A trivial example is that there is no point using total degree for these networks; in-degree (citation count) and out-degree (length of bibliography) need to be treated separately.
Some of these ideas have been developed with James Clough, Jamie Gollings and Tamar Loach, producing the paper Transitive Reduction of Citation Networks [arXiv:1310.8224] with the data we used on figshare.com under Transitive Reduction of Citation Networks [DOI 10.6084/m9.figshare.834935]. This work looked at transitive reduction on citation networks and found that this highlights key differences between academic papers and court judgements on the one hand, and patent citations on the other. We also found that papers could have wildly different citation counts before and after transitive reduction. We conjectured that papers which retained a large number of citations after transitive reduction are more likely to be of interest to a wider audience crossing many topics or disciplines.
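To make the idea concrete, here is a minimal sketch of transitive reduction on a toy citation network. It is an illustration only and not the code used for the paper: the function names and the five-document example are hypothetical, and the brute-force reachability search is only practical for small networks. A citation is removed whenever the cited document can also be reached through a longer chain of citations.

```python
def transitive_reduction(edges):
    """Return the edges kept by transitive reduction of a citation network.

    `edges` is an iterable of (citing, cited) pairs and is assumed to be
    acyclic, which holds when citations always point backwards in time.
    """
    successors = {}
    for citing, cited in edges:
        successors.setdefault(citing, set()).add(cited)

    def reachable_from(start):
        # Every document reachable from `start` by one or more citations.
        seen = set()
        stack = list(successors.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return seen

    reduced = set()
    for citing, cited_set in successors.items():
        # Documents that can be reached via some other directly cited document.
        indirect = set()
        for middle in cited_set:
            indirect |= reachable_from(middle)
        reduced.update(
            (citing, cited) for cited in cited_set if cited not in indirect
        )
    return reduced


def citation_counts(edges):
    """In-degree (citation count) of every cited document."""
    counts = {}
    for _, cited in edges:
        counts[cited] = counts.get(cited, 0) + 1
    return counts


# Hypothetical toy data: A is the oldest document, E the newest.
toy_edges = {
    ("B", "A"),
    ("C", "A"), ("C", "B"),
    ("D", "A"), ("D", "B"), ("D", "C"),
    ("E", "A"),
}
reduced_edges = transitive_reduction(toy_edges)
counts_before = citation_counts(toy_edges)
counts_after = citation_counts(reduced_edges)
print("before:", sorted(counts_before.items()))
print("after: ", sorted(counts_after.items()))
```

In this toy example document A drops from four citations to two. The citations from C and D are removed because A is also reachable through B, while the citation from E survives because E reaches A by no other route.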
James Clough and I have taken the idea of embedding these citation networks in a space-time further. We started by continuing the work from some of the student projects in which we were involved. Our aim was to measure the dimension of such spaces. The time direction is related to the real physical time, but our conjecture is that the space directions of this space capture information about the variety of research ‘directions’, fields or topics, in the data. We found that we could measure the dimension in a consistent and robust manner. Further, different fields show different dimensions. The hep-th section of the arXiv repository (largely string theory) is narrow with just one effective space dimension while the astro particle physics section has a spatial dimension of around 2.5. Our results are in our paper What is the dimension of citation space? [arXiv:1408.1274]. A general discussion of our approach was also presented at the ISSI 2015 meeting in Istanbul and will appear as Time and Citation Networks [arXiv:1507.01388] in the ISSI proceedings. Slides from my ISSI 2015 talk are on my figshare account at http://dx.doi.org/10.6084/m9.figshare.1464980 along with slides from similar talks given elsewhere.
We have now taken this work to its logical conclusion: James and I have found a way to assign every paper in a citation network coordinates in a Minkowski space-time [Embedding graphs in Lorentzian spacetime, arXiv:1602.03103; blog article].
- I have been interested in how we can extract information about academic papers from altmetric data. That is data obtained from modern electronic media such as Twitter, blogs or websites, which typically responds much more quickly to events than is reflected in the bibliographies of papers published in traditional journals. Working with Tamar Loach we looked at how we could use this data to rank journals, an alternative to traditional measures such as Impact Factor. The basic idea is that we consider all the mentions of academic papers coming from one altmetric account (single author or not). We then treat each account as rating the different journals through the choice of papers it mentions. So if the source mentions ten papers from Nature but only one from PNAS we can use this to say this source thinks Nature deserves more attention than PNAS and should be rated more highly in any journal ranking scheme. There are many ways to use these counts to produce such ratings and rankings. One useful analogy is to think of each source as a tournament between different teams, which here are the journals. Taking journals in pairs, the number of different papers from each journal mentioned by this account is that journal's score in a ‘game’ between the two, so in this example we would have a Nature vs. PNAS game with a score of 10-1. There is then a wealth of ways from sports analysis to turn such scores into a ranking of the teams, the journals in this case.
I presented our preliminary results at the ISSI 2015 meeting in Istanbul. Our paper for the proceedings is Ranking Journals Using Altmetrics [arXiv:1507.00451] while the slides from the talk are on figshare as http://dx.doi.org/10.6084/m9.figshare.1461693. We found that most (but not all) of our approaches produce reasonable results, that is, results not too dissimilar from standard rankings such as Impact Factor. There were still considerable differences, however, suggesting that altmetrics do reflect different aspects of the attention that journals receive. Interestingly, blogs seemed to produce results closest to standard measures, but that may be because their choice of journals to follow is heavily influenced by the standard journal rankings.
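The tournament analogy can be sketched in a few lines. This is a hedged illustration rather than any of the ranking methods from our paper: the account names and mention counts are invented, and ranking by the fraction of games won (with a draw counting as half a win) is just one simple choice among the many borrowed from sports analysis.

```python
from collections import defaultdict
from itertools import combinations


def games_from_account(mentions):
    """Yield one 'game' per pair of journals mentioned by a single account.

    `mentions` maps journal name -> number of different papers from that
    journal mentioned by the account. Each game is returned as
    (journal_a, score_a, journal_b, score_b).
    """
    for a, b in combinations(sorted(mentions), 2):
        yield a, mentions[a], b, mentions[b]


def rank_by_win_fraction(accounts):
    """Rank journals by the fraction of games won over all accounts.

    Assumption for this sketch: a win is worth 1, a draw 0.5, a loss 0.
    """
    wins = defaultdict(float)
    played = defaultdict(int)
    for mentions in accounts.values():
        for a, score_a, b, score_b in games_from_account(mentions):
            played[a] += 1
            played[b] += 1
            if score_a > score_b:
                wins[a] += 1.0
            elif score_b > score_a:
                wins[b] += 1.0
            else:
                wins[a] += 0.5
                wins[b] += 0.5
    fractions = {journal: wins[journal] / played[journal] for journal in played}
    # Highest win fraction first; ties broken alphabetically.
    return sorted(fractions.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical mention counts for three altmetric accounts.
toy_accounts = {
    "account_1": {"Nature": 10, "PNAS": 1},
    "account_2": {"Nature": 3, "PNAS": 5, "Science": 4},
    "account_3": {"PNAS": 2, "Science": 2},
}
toy_ranking = rank_by_win_fraction(toy_accounts)
for journal, fraction in toy_ranking:
    print(f"{journal}: {fraction:.3f}")
```

The first account reproduces the Nature vs. PNAS game with a score of 10-1. A win fraction throws away the margin of victory, so a 10-1 result counts the same as 5-4; schemes that use the scores themselves would weight these differently.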
- I have been thinking about simple models which produce realistic citation networks (see my blog on Modelling Citation Networks). The main idea is that people search through bibliographies to find papers to cite in two main ways. One uses ‘local’ information in terms of the citation network. That is, people read one paper, then see that this paper cites something else that seems to be of particular interest, so they go and read this second paper and decide to add it to the bibliography of the new paper they are writing. Alternatively, authors use ‘global’ information which follows connections made outside of the citation network. Such global processes could involve hearing about a paper in a talk at a conference, reading a review in a blog they follow or just browsing the contents of a new edition of a journal. With Sophia Goldberg and Hannah Anthony we described this work in our paper Modelling Citation Networks [arXiv:1408.2970]. I also presented this work at ISSI 2015 as a poster The Formation of the Citation Network from Global and Local Knowledge [http://dx.doi.org/10.6084/m9.figshare.1452953] and a short paper is in the proceedings. We found that we were able to reproduce some key features of the citation network, such as the similar fat-tailed citation distribution found for papers published in the same year and in the same field (see my discussion of the Universality of Performance Indicators based on Citation and Reference Counts below), using a model with only three parameters.
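The two search processes can be illustrated with a toy growth model. This sketch is not the three-parameter model from the paper, whose details are not given on this page. The function name and the parameters `n_global` (papers found by global search) and `p_local` (probability of copying each reference from a paper just read) are assumptions made purely for illustration.

```python
import random


def grow_citation_network(n_papers, n_global, p_local, seed=0):
    """Grow a toy citation network one paper at a time.

    Papers are labelled 0, 1, 2, ... in order of publication. Returns a dict
    mapping each paper to the set of older papers in its bibliography.
    """
    rng = random.Random(seed)
    bibliographies = {0: set()}  # the oldest paper has nothing to cite
    for new in range(1, n_papers):
        cited = set()
        # Global step (assumed): find older papers uniformly at random,
        # standing in for talks, blogs or journal contents pages.
        n_pick = min(n_global, new)
        for source in rng.sample(range(new), n_pick):
            cited.add(source)
            # Local step (assumed): read that paper's bibliography and copy
            # each of its references with probability p_local.
            for reference in sorted(bibliographies[source]):
                if rng.random() < p_local:
                    cited.add(reference)
        bibliographies[new] = cited
    return bibliographies


toy_network = grow_citation_network(n_papers=200, n_global=2, p_local=0.3, seed=1)

# Citation count (in-degree) of every paper in the toy network.
toy_citations = {paper: 0 for paper in toy_network}
for references in toy_network.values():
    for reference in references:
        toy_citations[reference] += 1
print("largest citation count:", max(toy_citations.values()))
```

Because the local step copies references from papers that are themselves cited, older and already well-cited papers tend to pick up further citations. That is the kind of mechanism that can give rise to a fat-tailed citation distribution.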
- Poster entitled Temporal Evolution Of Universal Performance Indicators For Academic Publication presented at ECCS 2012, Brussels, 4th September 2012. This is based on work with N. Hopkins and B. Kaube which appeared as the paper: Universality of Performance Indicators based on Citation and Reference Counts [Scientometrics, 2012, arXiv:1110.3271]. This looks at data from a single institute and from the arXiv electronic repository and finds a log-normal shape for the distribution of citations to papers published in the same field (defined in different ways) and in the same time span (usually one calendar year), provided one normalises with respect to the average number of citations in each group of papers.
- Paper: Community Structure and Patterns of Scientific Collaboration in Business and Management [Scientometrics, 2011, 89, 381-396, arXiv:1006.1788] with Renaud Lambiotte (now Univ. Namur) and Pietro Panzarasa (School of Business and Management, Queen Mary University of London). Looks at data from UK REF publications in the field of Business and Management studies to see how academics collaborate.
- Talk: Scaling and Citations given at the EPSRC Workshop on Scaling in Social Systems, Saïd Business School, Oxford, 1st December 2011.
- With Karen Gurney of Evidence (a division of Thomson Reuters) and Daniel Hook of Symplectic Ltd, Imperial College and Washington University, St. Louis, USA, we have a poster from the INORMS conference entitled “Collaboration Profiling in UK Higher Education”.
- I have longer-term contact with Symplectic Ltd, who produce software making day-to-day information management easier for leading academic institutions across the world. Managing academic publications is an important part of their work. Symplectic are now part of the Digital Science division of Macmillan Publishing, and I have also had good discussions with several people there involved in many other interesting projects such as altmetric.com and figshare.