Why Google Scholar is no worse than anything else

Just read an interesting couple of blogs by Stacy Konkiel from the ImpactStory team, one entitled “4 reasons why Google Scholar isn’t as great as you think it is“, nicely followed up by “7 ways to make your Google Scholar Profile better“, which keeps things constructive.  My immediate response to the first blog is that the problems are not unique to Google Scholar, and that you could highlight the same issues with most bibliometric sources.  Certainly the criticisms also apply to the two other main commercial bibliometric data sources, Scopus and Web of Science. The four points were as follows.

1) Google Scholar Profiles include dirty data

What is “dirty data”? A site like Impactstory, pulling in data from a wide range of non-traditional sources, ought to be a little more careful about throwing this term about! One person’s dirty citation is another’s useful lead. Google Scholar does seem more open to gaming, but at least it is easy to spot this using Google Scholar itself if you see some anomalous data. Scopus and Web of Science make their decisions about what to include and what to exclude behind closed doors: how many ‘weak’ journals and obscure conference proceedings are included there, and how many book citations are excluded? I’ve heard at least one story about the way bibliometric data was used as a pawn in a dispute between two companies over other commercial interests. I just have no idea how much manipulation of data goes on inside a commercial company. On the altmetrics side of the story, most departments still regard any social media counts as dirty.

2) Google Scholar Profiles may not last

Surely a problem with anything, commercial or not. Your institution may switch subscriptions and cut off your access even if the service is still out there.  Google certainly has a poor reputation on this front.  In so many ways, we always gamble when we invest time in a computer product – I’m not sure my PL/1 programming knowledge is much use these days.

3) Google Scholar Profiles won’t allow itself to be improved upon

Scopus and Web of Science also carefully control what you can do with their data. In any case, you need a subscription before you can start to do anything. So surely this is a criticism of all closed data systems.

4) Google Scholar Profiles only measure a narrow kind of scholarly impact

Again, I don’t see Scopus and Web of Science producing much more than bare citation counts and h-indices. The UK 2012 research assessment procedure (REF) only quoted bare citation counts from Scopus. This is a problem of education. Until more people understand how to use bibliometric data, nothing much will happen, and I know h-indices still get thrown about during promotion discussions at my institution (again, see an Impactstory blog about why people should stop using the h-index).

My Approach

I tend to think of all sources as data.  Your interpretation should vary as you take each into account.  Like all data and measurements, results derived from bibliometric information need to be checked and validated using several independent sources and alternative methods.

For instance, I have access to these three commercial sources, and they tend to give citation counts which differ.  Web of Science is generally the most conservative, Scopus is in the middle, and Google Scholar leads the counts.  So I can use all three to give a balanced view and to weed out any problems. They also have different strengths and weaknesses.  Google is ahead of the curve and shows where the other two will go a year or two later.  My work on archaeology has a large component in books, which Google Scholar reflects but the other two fail to capture.  Scopus is very weak on my early Quantum Field Theory work, while both Web of Science and Google Scholar are equally strong in this area and time period.

The tips discussed in “7 ways to make your Google Scholar Profile better” are very useful, but many apply to all data sources. For instance, Scopus just added two papers by another T.S.Evans working near London to my profile, even though it is a completely different research field, the address is completely different (not even in London), and there is basically zero overlap between these papers and my work.  It makes you worry about the quality of the automatic detection software used in commercial bibliometric firms. I can’t fix this myself; I have to email Scopus. In contrast, I can tidy up Google Scholar myself whenever I want. Currently I also feel that the Google Scholar recommendations are the most useful source of targeted information on papers I should look at, but I am always looking for improvements.

Overall, I feel you need to keep a very balanced approach.  Never trust a statistic until you’ve found at least two other ways to back it up independently.

Can you game Google Scholar?

The answer appears to be yes, according to a recent paper entitled Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting by Emilio Delgado López-Cózar, Nicolás Robinson-García, and Daniel Torres-Salinas from the Universities of Granada and Navarra in Spain.  I thought their experiment was illuminating and, while it is an obvious one to try, the results seemed pretty clear to me. The rest of the paper is generally informative and useful too. For instance, there is a list of other studies which have looked at how to manipulate bibliographic indices.

For the experiment, six false “papers” were created, with authorship assigned to an imaginary author.  Each of the six documents cited the same set of 129 genuine papers.  The cited papers all had at least one coauthor from the same EC3 research group as the authors of the study.  This could generate 774 (= 129 × 6) new but artificial citations to members of the EC3 group (more if some papers had more than one EC3 group member, but this is not noted). These six fake documents were then placed on web pages in an academic domain, much as many academics can do freely. Twenty-five days later, they were picked up by Google Scholar, and large increases in citations to the papers of the authors of this study were observed.

The basic conclusion does seem clear.  As it stands, it is easy for many academic authors to boost their Google Scholar counts using false documents.  In that sense, as things stand, it seems one should not use these Google Scholar counts for any serious analysis without at least some checks on the citations themselves.  Of course, Google Scholar makes that easy to do, and free.

However, I do not feel we should rush to dismiss Google Scholar too quickly.  Any system can be gamed.  Useful references are given in the paper to other examples and studies of bibliometric manipulation, in both human edited/refereed sources and in uncontrolled electronic cases.  A major point of the paper is that manipulation is possible in both cases, just that it is much easier to do for web pages and Google Scholar.  What is less clear from the paper is that the solutions may be similar to those employed by traditional indices of refereed sources.  As the authors point out, the manipulation of the refereed/edited literature can be, and is, spotted – journals are excluded from traditional bibliographic databases if they are caught manipulating indices.

The easiest way to do this is to look for sudden and unexpected increases in statistics.  One should always treat statistics with care, and there needs to be some sort of assurance that the numbers are sound.  Looking for unusual behaviour and studying outliers should always be done as a check whenever statistics are being used.  The authors themselves present the very data that should be able to flag a problem in their case.  As they point out, their indices under Google Scholar went up by amazing amounts in a short time.  Given this indicator of an issue, it would be trivial to discover the source of the problem, as Google Scholar makes it trivial to find the source of the new citations.  Then, of course, if such manipulation were being used for an important process, e.g. promotion or getting another job, it would become fraud, and the research community and society at large already have severe sanctions to deal with such situations.  It may be easy to do, but the sanctions may be enough to limit the problem.
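This kind of outlier check is simple to sketch. The snippet below (with invented citation counts and a hypothetical `flag_jumps` helper, purely for illustration) flags any month whose citation gain is far above the typical month-to-month change – exactly the "sudden and unexpected increase" that would have exposed the experiment:

```python
# Flag months whose citation gain is far above the typical monthly change.
# The counts below are invented purely for illustration.
monthly_counts = [100, 104, 107, 112, 115, 119, 410, 415, 420]

def flag_jumps(counts, factor=5.0):
    """Return indices where the monthly gain exceeds `factor` times the median gain."""
    gains = [b - a for a, b in zip(counts, counts[1:])]
    median_gain = sorted(gains)[len(gains) // 2]
    return [i + 1 for i, g in enumerate(gains) if g > factor * max(median_gain, 1)]

print(flag_jumps(monthly_counts))  # flags index 6, the jump from 119 to 410
```

A real tool would need the citation timeline itself, which, as noted below, is exactly what is hard to obtain from current sources.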

So to my mind, the main message of this paper is not so much that Google Scholar can be manipulated easily, but that currently there are no simple tools to spot such issues.  The timeline for the citations to a set of papers, be they for a person, research group or journal, cannot be obtained easily.  One can get the raw citation lists themselves, but you would have to construct the timeline yourself, not an easy job.

However, the same is also true of traditional paper-based citation counts.  It is perhaps harder to manipulate them, but it is also hard to check on a person’s performance over time.  I imagine that checks like this will be done, and the information to perform such checks will be provided in future for all such alt-metric measures based on information where there is little if any editorial control.

However, there is another approach to this problem.  The authors of this paper reflect the focus of Google Scholar and most other bibliometric sites on the crudest of indices: citation counts and the h-index.  Indeed, too many academics quote these.  The UK’s REF procedure, which is used to assign research funding, will produce raw citation counts for individual papers in many fields (Panel criteria and working methods, January 2012, page 8, para 51). This will be based on Elsevier’s SCOPUS data (Process for gathering citation information for REF 2014, November 2011), except for Computer Science, where, interestingly, they claim Google Scholar will be used in a “systematic way” (Panel criteria and working methods, January 2012, page 45, para 57 and 61).  Yet it is well known that raw citation counts and the h-index are badly flawed measures, almost useless for any comparison (of people, institutes or subjects), which is inevitably what they are used for.  Indeed, where the REF document says citation data will be used, it specifically lists many of the obvious problems in interpreting the data they provide (Panel criteria and working methods, January 2012, page 8, para 51), so I am sure I can hear the sounds of hands being washed at this point.

One solution to the weakness of Google Scholar citation counts, or indeed counts derived from other sources, is to look for better measures.  For example, in this study the six dummy papers will never gain any citations.  An index based on a weighted citation count, such as PageRank, would assign little or no value to a citation from an uncited paper.
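To make this concrete, here is a minimal PageRank sketch over a small, invented citation graph (the graph and the `pagerank` helper are mine, not from the paper). A citation from a paper that is itself uncited passes on far less weight than one from a well-cited paper, which is exactly why the six dummy papers would add almost nothing under such a measure:

```python
# Toy PageRank over a citation graph: an edge u -> v means paper u cites paper v.
def pagerank(links, n, damping=0.85, iters=100):
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for u, targets in links.items():
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        # Papers with no outgoing citations spread their rank evenly (dangling nodes).
        dangling = damping * sum(rank[u] for u in range(n) if u not in links) / n
        rank = [r + dangling for r in new]
    return rank

# Invented example: papers 0-2 all cite paper 3; paper 4 (cited by nobody) cites paper 5.
links = {0: [3], 1: [3], 2: [3], 4: [5]}
ranks = pagerank(links, 6)
# Paper 3, cited by three papers, ends up with a much higher rank than paper 5,
# whose only citation comes from the uncited paper 4.
```

The design choice is the point: weighting a citation by the standing of the citing document makes the cheap attack in this paper far less effective, at the cost of a more complex and less transparent index.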

Of course, any index can be gamed.  PageRank was the original basis of Google’s web index, and people have been gaming it for as long as it has existed: Google bombs, where many false web pages all point to the page being boosted, are the equivalent for web pages of the Google Scholar experiment performed in this paper.  It is equally well known that Google strives to detect this and will exclude pages from its lists if people are found to be cheating.  So Google has developed mechanisms to detect and counter artificial boosting of a web page’s rank. There is no reason (except perhaps a commercial one) why similar techniques could not be used on academic indexes.

My Google Scholar citation count for the second part of 2012

A few other points struck me as worth noting.  The authors waited 25 days for Google to index their false papers, yet only allowed 17 days for Google to remove them.  It is slightly odd that the data was only valid up to the date on the paper, 29th May 2012, yet the arXiv submission was made six months later.  It is a pity this information was not updated. There is a much wider debate here on who owns data and whether individuals can or should be able to delete personal data, e.g. from Facebook.  What exactly does Google Scholar do if documents disappear?  Monitoring my own Google Scholar counts, there was a massive rise then fall in my counts over a period of about a month in September/October 2012, before the count settled down to pretty much the same trend as it had shown earlier in 2012.  It does seem that Google Scholar is monitoring and changing its inputs.

As with many experiments of this kind, the ethics are a little unclear.  It is interesting to note that the authors reported that other researchers, who were not part of the team performing this experiment, noticed changes in their citation counts coming from the six fake papers. Would this have been allowed by an ethics committee?  Should it have been run past an ethics committee?  Am I allowed to say what kind of documents are allowed to cite my own papers?  My colleagues in the bibliometrics business suggest there is no legal bar to anyone citing a paper, even if there is no automatic right to use the content of that paper.

And finally, surely the increase in citation counts reported for the authors of this paper should be divisible by 6, as the authors imply the same set of 129 papers was used as the bibliography in each of the six fake papers. Yet the results reported in their Figure 2 are not all divisible by 6.
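The arithmetic behind this objection is simple enough to spell out (the per-author paper counts below are invented for illustration):

```python
# Each of the 6 fake papers cites the same 129 real papers, so every cited
# paper gains exactly 6 citations.  An author with k papers in that shared
# bibliography should therefore gain exactly 6 * k citations in total.
FAKE_PAPERS = 6
SHARED_BIBLIOGRAPHY = 129

total_new_citations = FAKE_PAPERS * SHARED_BIBLIOGRAPHY
print(total_new_citations)  # 774, as quoted in the paper

# Invented example: authors with 10, 25 and 40 papers in the shared bibliography.
for k in (10, 25, 40):
    gain = FAKE_PAPERS * k
    assert gain % FAKE_PAPERS == 0  # any reported per-author gain must be divisible by 6
```

So any per-author increase not divisible by 6 needs some other explanation, such as extra citations arriving from unrelated sources during the same period.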

This paper seems to be telling us that Google Scholar is still at an early stage of its development.  However, given the resources and experience of Google at building indices derived from open sources like the web, I would not be surprised if it soon became much harder to manipulate these indices.

Note added: There is a lot of other work on Google Scholar, including other studies of how it might be tricked.  A good source of papers on Google Scholar, with references to other studies, are the papers of Péter Jacsó.

Emilio Delgado López-Cózar, Nicolás Robinson-García, & Daniel Torres-Salinas (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. EC3 Working Papers 6: 29 May 2012. arXiv:1212.0638v1
