Networks, Geometry and Clustering

Networks, Geometry and Clustering

Clustering is a vital tool when handling data making it a central part of data science. By grouping similar objects together, it helps us find what we are looking for. I don’t go to a bakery to find a book. Clustering is part of a wider idea in science as we are always faced with thousands of potential or actual measurements but we need to focus on the few which are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measuring temperature, pressure and volume do throw a lot of light on that problem. In whatever branch of science we are working in, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use will need a measure of distance and it is most natural to call upon the everyday distance as defined by any ruler – formally the Euclidean distances d where for example d2 =  x2  +  y2  +  z2  for the distance between the origin and a point at (x,y,z) in 3-dimensions.

However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity. James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them - Directed Acyclic Graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex and then any edges present always point in one direction in time, say from the more recent vertex to an older one. Typically the time is a real physical time but for these types of network one can always construct an effective if artificial time coordinate.

There are many types of data with a directed acyclic graph structure. Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document which always reference older documents - our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv  paper repository, US Supreme court judgements, and patents. My blog on citation network modelling gives some more background and how I think about citation networks in general.

Combining these two concepts  James Clough and I have adapted a well known clustering method, MDS (Multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2016b). Traditional MDS is usually applied to data sets where you have a matrix of distances between each object. For a network, this would usually be the length of the shortest path between each node. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way to take account of the fact that two papers on exactly the same topic can be published at the same time in different places. They are clearly ‘close’ together in any common sense definition of close yet there is no direct connection through their citation network. Our method will show that these papers are similar just from the pattern of their citations. Indeed the text could be fairly different (perhaps with two documents on networks, one uses the terms node, link, network while the second uses vertex, edge, graph for the same concepts) but the way these two documents are used by others later, or the way the to documents were based on the same material, indicates they are likely to be working on the same ideas.

Once you have the coordinates of each document in the citation network there are many other standard geometric tools you can use to do other jobs. For instance to recommend similar papers to one you are reading, you just look for other documents close in a geometric sense given the coordinates we have calculated. In the figure we show the top two hundred papers from the first decade of the hep-th part of the arXiv paper repository (this is dominated by string theory). The visualisation uses coordinates found using our Lorentzian MDS technique.

Top 200 hep-th citation network

A two dimensional embedding of the 200 most cited papers in the hep-th citation network where coordinates are found using our Lorentzian MDS algorithm. From Clough and Evans 2016b.

Our work with Minkowski space fits into broader programme of looking at networks in terms of the geometry of different types of space, what I call Netometry (Networks + Geometry, or perhaps Neteometry is better), as exemplified by Krioukov et al 2009. For instance, a good indication that a low dimensional Minkowski space might be a good representation of many citation networks came from our measurements of dimension (Clough and Evans 2016a).

Bibliography

Clough, J.R. & Evans, T.S., 2016a. What is the dimension of citation space? Physica A (in press) 2016. [ DOI 10.1016/j.physa.2015.12.053
arXiv:1408.1274 ]

Clough, J.R. & Evans, T.S., 2016b. Embedding graphs in Lorentzian spacetime, arXiv:1602.03103

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A. and Boguna, M. 2010. Hyperbolic geometry of complex networks. Phys. Rev. E, 82 [ arXiv:1006.5169 ]

 

Modelling the Footprints of Innovation: Citation Networks

Modelling the Footprints of Innovation: Citation Networks

When one document refers to another in it’s text, this is called a citation. The pattern of these citations is most naturally represented as a network where the nodes are the documents and the links are the citations between the documents. When documents were physical documents, printed or written on paper, then these citations must (almost always) always point back in time to older documents. This arrow of time is imprinted on these citation networks and it leads to interesting mathematical properties.

One of the most interesting features of citations is that they have been carefully curated, sometimes for hundreds of years. The data I use on the US supreme court judgments goes back to the founding of the USA. So citation data is one of the oldest and continuous ‘big data’ sets to study.

The reason why records of citations have been maintained so carefully is that they record the process of innovation, be it in patents, law or in academic study. When you try to claim a patent you must by law indicate the prior art, earlier patents with relevant (but presumably less advanced) ideas. When a judge makes a judgement in a case in the USA, they draw on earlier cases which have interpreted, and so created, the law needed to come to a conclusion in the current case. And of course, academics can’t discuss the whole of existing science when explaining their own new ideas so they have to refer back to papers where previous researchers have set out a key idea needed in the current work. Citations are therefore a vital part of knowledge transfer, new ideas build on all the work done in earlier judgments, previous patents or older papers. That is why citations have been so carefully recorded. The network formed by documents and their citations show whose giant shoulders we are standing on when innovations are made, to paraphrase Newton.

From a theoretical point of view there are many interesting features in these networks. If you follow citations from one document to another to another and so on, at each step you will always reach an older paper. So you can never come back to the starting point and such paths. There are no cycles in a citation network (it is an example of a directed acyclic graph). If you look at the number of citations each document gets, how many newer documents refer back to one document, then they follow a fat-tailed distribution – a few documents have most of these citations, while most documents have very few citations each. Derek de Solla Price’s 1965 paper for an early discussion of this feature. Moreover, if you look at documents in the same year, you get roughly the same shape for the number of documents with a given number of citations (see Radicchi et al 2008, Evans et al 2012), at least for well cited documents. Since these networks are of such great interest,many other features have been noted too.

One way for a theorist to understand what is happening is to build a model from a few simple rules which captures as many of the features as possible. One of the first was that of Derek de Solla Price (1965) whose theory of “cumulative advantage” suggested that as new documents were created they would cite existing papers in proportion to the number of citations they already had, that is the richer get richer. This follows a principle used in many other models of fat-tails in other data, and indeed was later rediscovered in the context of the number of links to modern web pages – Barabási and Albert’s preferential attachment (1999). One trouble with this simple model is that the oldest documents are always the ‘rich’ ones with the most links. In reality, each year of publication there are a few documents with many citations (relative to the average number for that year) and most have very few. The Price model does not give this as all documents published in the same year will have roughly the same number of citations. To address this problem we (Sophia Goldberg, Hannah Anthony and myself,, Goldberg et al 2015) searched for a simple model which reproduced this behaviour – fat tails for the citation data of papers published in one field and one year (Goldberg et al, 2015).

The simplest model we found works as follows. At each step we add a new document, representing the evolution in time of our citation network.

A new document (red diamond) is added to an existing set of documents (blue circles) and their citations to earlier documents (arrows).

A new document first looks at recently published documents, as it is well known that citations tend to favour more recent documents. What we mean by recent is set by one of our parameters, a time scale τ.

First look at recent documents only.

We choose these recent documents partly at random (fraction (1-p)) and partly with cumulative attachment (fraction p) in which we pick recent papers to cite in proportion to the number of their current citations. This choice of papers is not realistic since it requires the authors to be able to choose from all recent documents while in reality authors only have a limited knowledge. However this stage is meant to capture, statistically at least, the way authors learn about recent developments: a recommendation from a colleague, a talk at a conference, scanning new editions of certain journals and so forth. Sometimes it will be essentially random, sometimes this first choice will reflect the attention papers have or will receive.

Choose to cite a recent document, with probability p use cumulative advantage (preferential attachment) or simply random with probability (1-p).

Once we have chosen these primary documents to cite, our new document then looks at the references within these primary documents.

Look at the papers cited by the primary paper just selected.

Each paper cited in the primary paper is then cited, copied, by the new document with probability q, the third and last parameter of the model.

Each paper cited in the primary paper is selected with probability q.

The new paper cites the selected secondary papers, so on average q references are copied from the primary paper.

This ‘copying’ process is known to be a way of getting cumulative attachment with only local knowledge of the network (see for instance my paper with Jari Saramäki (Evans & Saramäki 2005) and references therein). That is you only need to read the papers you have already cited, already found, to find these secondary documents to reference. There is no need for the model to know about the whole network at this point, reflecting the limited knowledge of actual authors.

It was only when we added the last copying process that we found our model reproduced the fat-tails seen within the citations to documents published in the same year. Nothing else we tried gave a few well cited papers in every year. Comparing with one data set, taken from the hep-th section of the arXiv repository, we found that the best values for our parameters led to a typical paper in our model of hep-th made up as follows:

  • Two primary papers chosen at random from recent papers.
  • Two primary papers chosen in proportion to the number of their citations from recent papers.
  • Eight secondary papers chosen by copying a reference from one of the first four primary papers.

This may seem like a very high level of papers being copied from the primary ones – on average we found 70% of papers were secondary citations, papers already cited in other papers being cited. One has to ask if the more recent paper, the primary one, contained all the information from the earlier ones as well as innovations being built on in the current paper. Did the new documents really derive useful information from the secondary papers cited? Often you see old ‘classic papers’ gathering citations as they are name checked in the introduction to a new paper. Not clear if the classic paper was even read while performing the current research. This feeling that some papers gain attention and acquire citations that does not reflect any direct influence on the current work is supported by at least two other studies. One was a study by Simkin and Roychowdhury (2005) of the way errors in the bibliographies of papers are copied in later papers. They suggest that this meant 80% of citations came from such copying of references. In another approach, James Clough, Tamar Loach, Jamie Gollings and myself (Clough et al, 2015) exploited the special properties of citation networks and this also suggested that 70%-80% of links were unnecessary for the logical structure of academic citation networks.

Of course constructing simple models will never capture the whole story. Models are, though, a good way to see if we have understood the key principles underlying a system.

References

Why Google scholar is no worse than anything else

Just read an interesting couple of blogs by Stacy Konkiel from the ImpactStory team, one entitled “4 reasons why Google Scholar isn’t as great as you think it is“ nicely followed up by “7 ways to make your Google Scholar Profile better“ which keeps things constructive.  My immediate response to the first blog is that the problems are not unique to Google scholar and that you could highlight the same issues with most bibliometric sources.  Certainly the criticisms also apply to the two other commercial bibliometric data sources, Scopus and Web of Science. The four points were as follows.

1) Google Scholar Profiles include dirty data

What is “dirty data”? A site like Impactstory pulling in data from a wide range of non-traditional sources ought to be a little more careful about throwing this term about! One person’s dirty citation is another’s useful lead. It does seem Google scholar is more open to gaming but at least it is easy to spot this using Google scholar if you see some anomalous data. Scopus and Web of Science make their decisions behind closed doors about what to include and what not; how many ‘weak’ journals and obscure conference proceedings are included there, how many book citations are excluded? I’ve heard at least one story about the way bibliometric data was used as a pawn in a dispute between two companies over other commercial interests. I just have no idea how much manipulation of data goes on inside a commercial company. On the altmetrics side of the story, most departments still regard any social media counts as dirty.

2) Google Scholar Profiles may not last

Surely a problem with anything, commercial or not. Your institution may switch subscription and cut off your access even if it is still out there.  Google certainly has poor reputation on this front.  In so many ways, we always gamble when we invest time in a computer product – not sure my PL/1 programming knowledge is much use these days.

3) Google Scholar Profiles won’t allow itself to be improved upon

Scopus and Web of Science also carefully control what you can do with their data. In any case you need a subscription before you can start to do anything. So surely this is a criticism of all closed data systems.

4) Google Scholar Profiles only measure a narrow kind of scholarly impact

Again, I don’t see Scopus and Web of Science producing much more than bare citation counts and h-indices. The UK 2012 research assessment procedure (REF) only quoted bare citation counts from Scopus. This is a problem of education. Until more people understand how use bibliometric data nothing much will happen and I know h-indices still get thrown about during promotion discussions at my institution (again see an Impactstory blog about why people should stop using the h-index).

My Approach

I tend to think of all sources as data.  Your interpretation should vary as you take each into account.  Like all data and measurements, results derived from bibliometric information needs to be checked and validated using several independent sources and alternative methods.

For instance I have access to these three commercial sources, and they tend to give citations counts which differ.  Web of Science is generally the most conservative,  Scopus is in the middle and Google scholar leads the counts.  So I can use all three to give a balanced view and to weed out any problems. They also have different strengths and weaknesses.  Google is ahead of the curve and shows where the other two will go a year or two later.  My work on archaeology has a large component in books which  Google scholar  reflects but the other two fail to capture.  Scopus is very weak on my early Quantum Field Theory work, while both Web of Science and Google scholar are equally strong in this area and time period.

The tips discussed in “7 ways to make your Google Scholar Profile better” are very useful but many apply to all data sources. For instance Scopus just added two papers by another T.S.Evans working near London to my profile, even though its in a completely different research field, the address is completely different (not even in London) and there is basically zero overlap between these papers and my work.  Makes you worry about the quality of the automatic detection software used in commercial bibliometric firms. I can’t fix this myself, I have to email  Scopus while I can tidy up Google scholar myself whenever I want. Currently I also feel that the Google scholar recommendations are the most useful source of targeted information on papers I should look at but I am always looking for improvements.

Overall, I feel you need to keep a very balanced approached.  Never trust a statistic until you’ve found at least two other ways to back it up independently.

Can you game google scholar?

The answer appears to be yes, according to a recent paper entitled Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting  by Emilio Delgado López-CózarNicolás Robinson-García, and Daniel Torres-Salinas from the Universities of Granada and Nararra in Spain.  I thought their experiment was illuminating and, while it is an obvious one to try, the results seemed pretty clear to me. The rest of the paper is generally informative and useful too. For instance there is a list of other studies which have looked at how to manipulate bibliographic indices.

For the experiment, six false “papers” were created, with authorship assigned to imaginary author.  Each of the six documents cited the same set of 129 genuine papers.  The cited papers all had at least one coauthor from the same EC3 research group as the authors of the study.  This could generate 774 (=129 x 6) new but artificial citations to members of the EC3 group (more if some papers had more than  one EC3 group member but this is not noted) . These six fake documents were then placed on web pages in an academic domain, much as many academics can do freely. Twenty five days later, these were picked up by google scholar and large increases in citations to the papers of the authors of this study are shown.

The basic conclusion does seem clear.  As its stands it is easy for many academic authors to boost their google scholar counts using false documents.  In that sense, as things stand, it seems one should not use these google scholar counts for any serious analysis without at least some checks on the citations themselves.  Of course, google scholar makes that easy to do and free.

However I do not feel we should rush to dismiss Google Scholar too quickly.  Any system can be gamed.  Useful references are given in the paper to other examples and studies of bibliometric manipulation, in both human edited/refereed sources and in uncontrolled electronic cases.  A major point of the paper is to point out that it is possible in both cases, just that it is much easier to do for web pages and google scholar.  What is less clear from the paper is that the solutions may be similar to those employed by traditional indices of refereed sources.  As the authors point out, the manipulation of the refereed/edited literature can and is spotted – journals are excluded from traditional bibliographic databases if they are caught manipulating indices.  The easiest way to do it is to look for sudden and unexpected increases in statistics.  One should always treat statistics with care and there needs to be some sort of assurance that the numbers are sound.  Looking for unusual behaviour, studying outliers should always be done as a check whenever statistics are being used.  The authors themselves present the very data that should be able to flag a problem in their case.  As they point out, their indices under google scholar went up by amazing amounts in a short time.  Given this indicator of an issue, it would be trivial to discover the source of the problem as google makes it trivial to find the source of the new citations.  Then of course, if such manipulation was being used for an important process, e.g. promotion or getting another job,  it becomes fraud and the research community and society at large already has severe sanctions to deal with such situations.  It may be easy to do but the sanctions may be enough to limit the problem.

So to my mind the main message of this paper is not so much that google can be manipulated easily, but that currently there are no simple tools to spot such issues.  The timeline for the citations to a set of papers, be they for a person, research group or journal, can not be obtained easily.  One can get the raw citation lists themselves, but you would have to construct the time line yourself, not an easy job.

However the same is also true of traditional paper based citation counts.  It is harder to manipulate them perhaps, but it is also hard to check on a person’s performance over time.  I imagine that checks like this will be done and the information to perform such checks will be provided in future for all such alt-metric measures based on information where there is little if any editorial control.

However there is another approach to this problem.  The authors of this paper reflect the focus of google scholar and most other bibliometric sites on the crudest of indices, citation counts and h-index.  Indeed too many academics quote these.  The UK’s REF procedure, which is used to assign research funding, will produce raw citation counts for individual papers for many fields  (Panel criteria and working methods, January 2012, page 8, para 51). This will be based on Elsevier’s SCOPUS data (Process for gathering citation information for REF 2014, November 2011), except for Computer Science where interestingly they claim google scholar will be used in a “systematic way” (Panel criteria and working methods, January 2012, page 45, para 57 and 61).  Yet it is well known that raw citation counts and the h-index these are badly flawed measures, almost useless for any comparison (of people, institutes or subjects) which is inevitably what they are used for.  Indeed where the REF document says citation data will be used, it specifically lists many of the obvious problems in interpreting the data they provide (Panel criteria and working methods, January 2012, page 8, para 51) so I am sure I can hear the sounds of hands being washed at this point.

One solution to the weakness of google scholar citation counts, or indeed counts derived from other sources, is to look for better measures.  For example in this study the six dummy papers will never gain any citations.  An index based on a weighted citation count, such as PageRank, would assign little or no value to a citation from an uncited paper.

Of course any index can be gamed.  PageRank was the original basis of google’s web index and people have been gaming this for as long as it has existed: google bombs, where many false web pages all point to the page being boosted, is the equivalent for web pages of the google scholar experiment performed in this paper .  It is equally well known that google strives to detect this and will exclude pages from its lists if people are found to be cheating.  So google has developed mechanisms to detect and counter artificial boosting of a web page’s rank. There is no reason (except perhaps a commercial one) why similar techniques could not be used on academic indexes.

My google Scholar citation count for the second part of 2012

As few other points struck me as worth noting.  The authors waited for  25 days for google to index their false papers, yet only allowed 17 days for google to remove them.  Slightly odd as data was valid up to the date on the paper, 29th May 2012, yet arXiv submission was made 6 months later.  Pity this information was not updated. There is a much wider debate here on who owns data and if individuals can or should be able to delete personal data e.g. from Facebook.  What exactly does google do if documents disappear   Monitoring my own google scholar counts, there was a massive rise then fall in my counts over a period of about a month in September/October 2012, before the count settled down to pretty much the same trend as it had been earlier in 2012.  It does seem that google Scholar is monitoring and changing its inputs.

As with many experiments of this kind, the ethics are a little unclear.  Interesting to note that authors reported that other researchers, who were not part of the team performing this experiment, noted changes in their citation counts coming from the six fake papers. Would this have been allowed by an ethics committee?  Should it have been run past an ethics committee?  Am I allowed to say what kind of documents are allowed to cite my own papers?  My colleagues in the bibliometrics business suggest there is no legal bar to anyone citing a paper, even if there is no automatic right to use the content of that paper.

And finally, surely the increase in citation counts reported for the authors of this paper should be divisible by 6, as the authors imply the same set of 129 papers was used as the bibliography in each of the six fake papers. Yet the results reported in their Figure 2 are not all divisible by 6.

This paper seems to be telling us that google Scholar is still at an early stage of its development.  However given the resources and experience of google at indices derived from open sources like the web, I would not be surprised if it was soon much harder to manipulate these indices.

Note added: There is a lot of other work on google scholar including other studies of how it might be tricked.  A good source of papers on google scholar, with references to other studies, are the papers of Peter Jasco.

Emilio Delgado López-Cózar, Nicolás Robinson-García, & Daniel Torres-Salinas (2012). Manipulating Google Scholar Citations and Google Scholar Metrics:
simple, easy and tempting Delgado Lopez-Cozar, Emilio; Robinson-Garcia, Nicolas; Torres Salinas, Daniel (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. EC3 Working Papers 6: 29 May, 2012 arXiv: 1212.0638v1

ResearchBlogging.org