Can you game Google Scholar?

The answer appears to be yes, according to a recent paper entitled “Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting” by Emilio Delgado López-Cózar, Nicolás Robinson-García, and Daniel Torres-Salinas from the Universities of Granada and Navarra in Spain. I thought their experiment was illuminating and, while it is an obvious one to try, the results seemed pretty clear to me. The rest of the paper is generally informative and useful too. For instance, there is a list of other studies which have looked at how to manipulate bibliographic indices.

For the experiment, six false “papers” were created, with authorship assigned to an imaginary author. Each of the six documents cited the same set of 129 genuine papers. The cited papers all had at least one coauthor from the same EC3 research group as the authors of the study. This could generate 774 (= 129 × 6) new but artificial citations to members of the EC3 group (more if some papers had more than one EC3 group member, but this is not noted). These six fake documents were then placed on web pages in an academic domain, much as many academics can do freely. Twenty-five days later, they had been picked up by Google Scholar, and the authors show large increases in citations to their own papers.

The basic conclusion does seem clear. As it stands, it is easy for many academic authors to boost their Google Scholar counts using false documents. In that sense, one should not use these Google Scholar counts for any serious analysis without at least some checks on the citations themselves. Of course, Google Scholar makes that easy to do and free.

However, I do not feel we should rush to dismiss Google Scholar too quickly. Any system can be gamed. Useful references are given in the paper to other examples and studies of bibliometric manipulation, both in human edited/refereed sources and in uncontrolled electronic cases. A major point of the paper is that manipulation is possible in both cases, just that it is much easier to do for web pages and Google Scholar. What is less clear from the paper is that the solutions may be similar to those employed by traditional indices of refereed sources. As the authors point out, manipulation of the refereed/edited literature can be and is spotted: journals are excluded from traditional bibliographic databases if they are caught manipulating indices.

The easiest way to spot manipulation is to look for sudden and unexpected increases in statistics. One should always treat statistics with care, and there needs to be some sort of assurance that the numbers are sound. Looking for unusual behaviour and studying outliers should always be done as a check whenever statistics are being used. The authors themselves present the very data that should be able to flag a problem in their case. As they point out, their indices under Google Scholar went up by amazing amounts in a short time. Given this indicator of an issue, it would be trivial to discover the source of the problem, as Google makes it trivial to find the source of the new citations. Then of course, if such manipulation were being used for an important process, e.g. promotion or getting another job, it becomes fraud, and the research community and society at large already have severe sanctions to deal with such situations. It may be easy to do, but the sanctions may be enough to limit the problem.
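
To make the idea of looking for sudden increases concrete, here is a minimal sketch of one possible check. It is my own illustration, not anything from the paper or from Google: the function name `flag_spikes`, the threshold factor of 5 and the monthly numbers are all invented, and a real check would need tuning for each field and career stage.

```python
from statistics import median

def flag_spikes(monthly_new_citations, factor=5.0, min_history=3):
    """Return the indices of months whose new-citation count is more than
    `factor` times the median of all earlier months.

    Hypothetical helper: the rule and its defaults are assumptions.
    """
    flagged = []
    for i in range(min_history, len(monthly_new_citations)):
        baseline = median(monthly_new_citations[:i])
        # Use a floor of 1 so rarely cited authors do not trigger on tiny counts.
        threshold = factor * max(baseline, 1)
        if monthly_new_citations[i] > threshold:
            flagged.append(i)
    return flagged

# Invented numbers: a steady trickle of citations, then one abrupt jump.
counts = [4, 6, 5, 7, 5, 130, 6]
print(flag_spikes(counts))  # prints [5]
```

A flagged month is only a prompt to go and read the citing documents, not proof of manipulation; a genuinely influential paper can produce a jump too.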

So to my mind the main message of this paper is not so much that Google Scholar can be manipulated easily, but that currently there are no simple tools to spot such issues. The timeline for the citations to a set of papers, be they for a person, research group or journal, cannot be obtained easily. One can get the raw citation lists themselves, but you would have to construct the timeline yourself, which is not an easy job.
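
As a rough sketch of what constructing such a timeline involves, suppose you had already collected, for each citing document, the date on which it first appeared in the index. That is an assumption: gathering those dated records by hand is the laborious part, and the record format, document identifiers and function name below are invented for illustration.

```python
from collections import Counter
from datetime import date

def citation_timeline(records):
    """Turn (citing_doc_id, date_first_seen) pairs into a list of
    ((year, month), new_citations, running_total) rows, oldest first.

    Months with no new citations are simply absent from the result.
    """
    per_month = Counter((seen.year, seen.month) for _doc_id, seen in records)
    timeline = []
    running_total = 0
    for month in sorted(per_month):
        running_total += per_month[month]
        timeline.append((month, per_month[month], running_total))
    return timeline

# Invented records for illustration only.
records = [
    ("doc-a", date(2012, 3, 14)),
    ("doc-b", date(2012, 3, 30)),
    ("doc-c", date(2012, 5, 2)),
]
for row in citation_timeline(records):
    print(row)
```

The same aggregation works for a person, a research group or a journal; only the set of records fed in changes.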

However, the same is also true of traditional paper-based citation counts. It is harder to manipulate them perhaps, but it is also hard to check on a person’s performance over time. I imagine that checks like this will be done, and that the information needed to perform them will be provided in future for all such alt-metric measures based on information where there is little if any editorial control.

However, there is another approach to this problem. The authors of this paper reflect the focus of Google Scholar and most other bibliometric sites on the crudest of indices: citation counts and the h-index. Indeed, too many academics quote these. The UK’s REF procedure, which is used to assign research funding, will produce raw citation counts for individual papers for many fields (Panel criteria and working methods, January 2012, page 8, para 51). This will be based on Elsevier’s SCOPUS data (Process for gathering citation information for REF 2014, November 2011), except for Computer Science where, interestingly, they claim Google Scholar will be used in a “systematic way” (Panel criteria and working methods, January 2012, page 45, para 57 and 61). Yet it is well known that raw citation counts and the h-index are badly flawed measures, almost useless for any comparison (of people, institutes or subjects), which is inevitably what they are used for. Indeed, where the REF document says citation data will be used, it specifically lists many of the obvious problems in interpreting the data they provide (Panel criteria and working methods, January 2012, page 8, para 51), so I am sure I can hear the sounds of hands being washed at this point.

One solution to the weakness of Google Scholar citation counts, or indeed counts derived from other sources, is to look for better measures. For example, in this study the six dummy papers will never gain any citations. An index based on a weighted citation count, such as PageRank, would assign little or no value to a citation from an uncited paper.
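
To illustrate the point about weighted counts, here is a toy sketch using a plain textbook PageRank with a damping factor of 0.85. It is not Google's actual ranking and not a method from the paper; the graph, the paper names and the helper `pagerank` are all invented. Six uncited fake documents cite a paper called "boosted", while a paper called "solid" is cited by only two reviews that are themselves cited by other papers.

```python
def pagerank(references, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    `references` maps each paper to the list of papers it cites.
    Papers that cite nothing in the graph spread their weight evenly.
    """
    papers = set(references)
    for cited_list in references.values():
        papers.update(cited_list)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in papers if not references.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in papers}
        for citing, cited_list in references.items():
            if not cited_list:
                continue
            # A citing paper splits its own weight across its bibliography.
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        rank = new_rank
    return rank

# Invented toy citation graph.
references = {}
for i in range(6):
    references[f"fake{i}"] = ["boosted", "other1", "other2"]
for i in range(8):
    references[f"reader{i}"] = ["review1" if i < 4 else "review2"]
references["review1"] = ["solid", "background1", "background2"]
references["review2"] = ["solid", "background1", "background2"]

raw_counts = {}
for cited_list in references.values():
    for cited in cited_list:
        raw_counts[cited] = raw_counts.get(cited, 0) + 1

ranks = pagerank(references)
for paper in ("boosted", "solid"):
    print(paper, raw_counts[paper], round(ranks[paper], 4))
```

In this toy graph "boosted" has three times the raw citations of "solid" yet ends up with the lower rank, because its citations all come from documents nobody cites. The outcome depends on the shape of the graph, so this shows the mechanism rather than a guarantee.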

Of course, any index can be gamed. PageRank was the original basis of Google’s web index and people have been gaming this for as long as it has existed: Google bombs, where many false web pages all point to the page being boosted, are the web-page equivalent of the Google Scholar experiment performed in this paper. It is equally well known that Google strives to detect this and will exclude pages from its lists if people are found to be cheating. So Google has developed mechanisms to detect and counter artificial boosting of a web page’s rank. There is no reason (except perhaps a commercial one) why similar techniques could not be used on academic indexes.

[Figure: My Google Scholar citation count for the second part of 2012.]

A few other points struck me as worth noting. The authors waited 25 days for Google to index their false papers, yet only allowed 17 days for Google to remove them. This is slightly odd, as the data was valid up to the date on the paper, 29th May 2012, yet the arXiv submission was made six months later. It is a pity this information was not updated. There is a much wider debate here on who owns data and whether individuals can or should be able to delete personal data, e.g. from Facebook. What exactly does Google do if documents disappear? Monitoring my own Google Scholar counts, I saw a massive rise then fall over a period of about a month in September/October 2012, before the count settled down to pretty much the same trend as it had been earlier in 2012. It does seem that Google Scholar is monitoring and changing its inputs.

As with many experiments of this kind, the ethics are a little unclear. It is interesting to note that the authors reported that other researchers, who were not part of the team performing this experiment, noticed changes in their citation counts coming from the six fake papers. Would this have been allowed by an ethics committee? Should it have been run past an ethics committee? Am I allowed to say what kind of documents are allowed to cite my own papers? My colleagues in the bibliometrics business suggest there is no legal bar to anyone citing a paper, even if there is no automatic right to use the content of that paper.

And finally, surely the increase in citation counts reported for the authors of this paper should be divisible by 6, as the authors imply the same set of 129 papers was used as the bibliography in each of the six fake papers. Yet the results reported in their Figure 2 are not all divisible by 6.

This paper seems to be telling us that Google Scholar is still at an early stage of its development. However, given Google’s resources and experience with indices derived from open sources like the web, I would not be surprised if it soon became much harder to manipulate these indices.

Note added: There is a lot of other work on Google Scholar, including other studies of how it might be tricked. A good source of papers on Google Scholar, with references to other studies, is the work of Péter Jacsó.

Delgado López-Cózar, Emilio; Robinson-García, Nicolás; Torres-Salinas, Daniel (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. EC3 Working Papers 6, 29 May 2012. arXiv:1212.0638v1
