Time Constrained Networks

One of the key features of complex networks is that they capture interactions which have no limitations.  In most electronic systems, be they Facebook, emails or web pages, we can make connections across the world with little if any cost.

However, what if there are constraints on the links made in a network? Surely we should change the way we study networks if space, time or some other constraint is having a significant effect on the formation or use of the network. This has been a major interest of mine over the last few years. Space is one obvious limitation, as in some cases long-distance links are less likely to be made. There has been a lot of work in this area over many decades but I will leave this constraint for another blog.

It is only more recently that the role of time in networks has begun to receive more attention. A lot of this recent interest is in how to deal with networks where the connections are made at one specific time. That is because most communication networks, emails, phone calls and so forth, are of this type. The recent review by Holme and Saramäki (2012) covers such temporal edge networks.

Yet networks are made of two parts: vertices and edges. My recent work has focussed on the case where it is the vertices, not the edges, which are created at a definite time. In such temporal vertex networks, causality forces the interactions between nodes to always point in one direction. For example, consider a citation network formed by academic papers. The nodes in our network are the academic papers and the links are formed by their bibliographies. So if paper A refers to another paper B then we can be (almost) sure that A was written after B. Information can therefore flow only from B to A. In fact any set of documents can only refer to older ones, and such networks are common. In law, judges refer to previous judgments to support their arguments. When registering a patent, prior art needs to be cited, that is other previously granted work which may have some relevance to the current claim.

The same types of structure occur in several other situations. Any situation where there is a logical flow has the same causal structure. If we have a project where the nodes represent individual tasks, then an edge from task S to task T could represent the fact that task T requires task S to have been completed before task T is started. This has been the focus of work on temporal vertex networks in computer science. The logical flow of a mathematical argument or an Excel spreadsheet shows the same properties. These networks define what is called a partially ordered set, or poset, and it is under this title that you will find relevant work coming from mathematicians. A final example comes from the Causal Sets approach to quantum gravity (see Dowker 2006 for a review). Here space-time is discrete, not continuous, and these discrete points are the nodes of the network. The nodes are connected by edges only if they are causally connected, and causality again gives these edges a common direction.

All of these temporal vertex networks have a key property: they contain no loops if you always follow the direction on the edges. You cannot go backwards in time. Thus the traditional name for such a network is a directed acyclic graph, or DAG for short.

So the question is: how can we adapt traditional network measures to deal with the fact that these networks, DAGs, are constrained by causality? Are there new measures we should employ which give more insight into such networks?

I’ve been looking at these problems with several students (undergraduates in their final year projects and some MSc students), one of whom, James Clough, is now working for his PhD on this topic.

Paths in networks are always important. However, one feature of a DAG we have been exploiting is that if we always follow the direction of the arrows, the direction of time, then not all nodes are connected. If we like, we could add an edge whenever there is a path connecting a later node to an earlier one, a process known as transitive completion. On the other hand, we could remove as many edges as we can while leaving the causal relationships intact, a process known as transitive reduction. That is, if there is a path between two nodes in the network before transitive reduction, then there will still be a path afterwards.
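
If you want to play with these two operations yourself, NetworkX provides both off the shelf. The snippet below is only a minimal sketch on a toy citation-style DAG (it is not the code used in our paper), but it shows how the citation count of a node can change dramatically under transitive reduction.

    import networkx as nx

    # Toy citation-style DAG: an edge X -> Y means "X cites the older paper Y".
    # The papers and edges here are purely illustrative.
    G = nx.DiGraph([("C", "B"), ("B", "A"), ("C", "A"), ("D", "B"), ("D", "A")])
    assert nx.is_directed_acyclic_graph(G)

    TR = nx.transitive_reduction(G)   # drop edges implied by longer paths (C->A, D->A)
    TC = nx.transitive_closure(G)     # add an edge for every causal path

    for name, H in [("original", G), ("reduction", TR), ("completion", TC)]:
        print(name, sorted(H.edges()))

    # Citation counts (in-degrees) before and after transitive reduction.
    print("citations before:", dict(G.in_degree()))
    print("citations after :", dict(TR.in_degree()))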

Example of the transitive reduction (left) and the transitive completion (right) of the directed acyclic graph in the centre.

What we have done (in Transitive Reduction of Citation Networks) is look at how real data from citation networks behaves after transitive reduction. What we find is that different types of citation network behave very differently. The citation network formed from academic papers taken from the arXiv repository and the network of US Supreme Court judgments both show that about 80% of the edges are not needed to retain all the causal relationships. On the other hand, the patents network shows the opposite behaviour, with all but 15% of edges being essential. The edges removed tend to be the citations to older papers. One interpretation is that academics and judges may be citing well known early papers and judgments even though their current work is only directly related to more recent documents. Perhaps some of these citations do not indicate that the early work was needed but reflect other motivations, such as simple copying of popular papers or reviews in the field which at best have only general relevance. In this sense many citations may be unnecessary; for academic papers this interpretation is supported by the work of Simkin and Roychowdhury (Read before you cite!).

The number of citations to a document after transitive reduction certainly gives us a different view of the importance of different documents. For instance, paper hep-th/9802109 on the arXiv (Gauge Theory Correlators from Non-Critical String Theory by Gubser et al.) was cited by 1641 papers in the network, but only three citations remained after transitive reduction! On the other hand, paper hep-th/9905111 (Large N Field Theories, String Theory and Gravity by Aharony et al.) also has a large number of citations in the raw data, 806, yet after transitive reduction it retains 77, a far larger fraction of its original citations. Perhaps the information in the second paper was used more diversely.

We can find similar examples in the US Supreme Court citation network. The case Schneider vs. New Jersey (1939) has 144 citations in the original data but this drops to just one after transitive reduction. Stromberg vs. California (1931) also falls from 132 citations to just one. Conversely, the case Heller vs. New York (1973) only shows a slight fall after transitive reduction, falling from 68 to 48 citations, and has the most citations in our reduced network. The second most cited case after transitive reduction is Hamling vs. United States, which drops from 68 to 38 citations. Wikipedia lists hundreds of Supreme Court cases but the last two are not famous enough to make the Wikipedia list. Our analysis suggests they may have more importance than a simple citation count would suggest. At the very least it might be worth checking out documents that remain highly cited after transitive reduction.

Another way to look at citation networks is to see if we can define a dimension for the network. That is, we can try to quantify how much variation there is in the citation process. A low dimension means that there are few directions, few distinct themes relevant for citation in a document. A high dimension indicates that there is a wide range of relevant but distinct directions from which a document will draw inspiration. What James Clough and I found (in What is the dimension of citation space?) is that we were often able to assign an interesting value for the dimension of our citation data.

For academic papers, we found that different fields of research have different dimensions. For papers in the hep-th arXiv section (largely string theory) we found a low dimension of around 2, while for theoretical papers closely linked to particle physics experiments (the hep-ph section) we found more variation, as indicated by a higher dimension of 3. The quant-ph section was also around 3, while the astro-ph section had a slightly higher dimension of around 3.5. So despite similarities in the raw data using standard measures, our time-aware dimension measures show clear differences in the citation behaviour of different areas. String theory in particular seems to be a tightly knit collection of work, with each paper largely dependent on all the others; few independent directions can be pursued.

The US Supreme Court judgments were more complicated. Small samples (usually from modern judgments) showed a dimension of around 2.5 to 3, but larger samples, typically ranging from modern back to the earliest judgments, had lower dimensions, closer to 2. We interpreted this as reflecting the fact that there were few early judgments compared to the number produced today, so that the further back we traced in time to find the influence of judgments on recent ones, the smaller the variation. Perhaps that is not so surprising and we might expect a similar shape if we could follow scientific papers back to the 18th century! Patents, on the other hand, showed a much higher dimension, though again the results were more involved.
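
To give a flavour of how such a dimension can be estimated, here is a rough sketch in the spirit of the Myrheim-Meyer estimator from the causal set literature: compare the observed fraction of causally related pairs with the fraction expected for points sprinkled into an interval of d-dimensional Minkowski space, then solve for d. The closed form below is my own reconstruction (it gives 1 at d=1 and 1/2 at d=2, as it should), so treat the whole snippet as an illustration of the idea rather than the estimator we actually used in the paper.

    import math
    import networkx as nx
    from scipy.optimize import brentq

    def ordering_fraction(G):
        """Fraction of node pairs in the DAG G joined by a directed path."""
        n = G.number_of_nodes()
        return nx.transitive_closure(G).number_of_edges() / (n * (n - 1) / 2)

    def expected_fraction(d):
        """Expected ordering fraction for a uniform sprinkling into an interval
        of d-dimensional Minkowski space (equals 1 at d=1 and 1/2 at d=2)."""
        return math.gamma(d + 1) * math.gamma(d / 2) / (2 * math.gamma(3 * d / 2))

    def mm_dimension(G, d_min=1.001, d_max=10.0):
        """Solve expected_fraction(d) = observed fraction for d; assumes the
        observed value lies between expected_fraction(d_max) and (d_min)."""
        r = ordering_fraction(G)
        return brentq(lambda d: expected_fraction(d) - r, d_min, d_max)

    # Usage on a citation DAG G:  print(mm_dimension(G))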

It is clear from just the few studies we have made that time makes a crucial difference to the structure of a network. We have tried a few new measures adapted to take account of time and in doing so we have thrown up some intriguing features in real data.  There is surely much more to find when networks are embedded in time.

References

Clough, J.R. & Evans, T.S. What is the dimension of citation space?, arXiv:1408.1274

Clough, J. R.; Gollings, J.; Loach, T. V. & Evans, T. S., Transitive Reduction of Citation Networks, J.Complex Networks to appear 2014, arXiv:1310.8224 DOI: 10.1093/comnet/cnu039 (open access)

Dowker, F. Causal sets as discrete spacetime, 2006. Contemporary Physics, 47, 1-9

Holme, P. & Saramäki, J. 2012. Temporal Networks Physics Reports, 519, 97-125

Simkin M.V. and Roychowdhury V.P., 2003. Read before you cite! Complex Systems 14 269-274

The Many Truths of Community Detection


You do not need to know the detailed properties of every small part making up a gas; it turns out the bulk properties of a gas can be derived from very general principles. In the same way, when looking at Facebook data we might be able to identify groups of people who behave in a similar way. Searching for these groups or clusters in data is central in many areas of physical and social science. It is often easier to understand the behaviour of a large system by looking at these clusters, which are much fewer in number.

In terms of networks, the clustering is based on the structure (topology) of the network and the groups found are called communities. In this case we might expect a coherent group to be one which has more links between members of the group than it has to nodes outside the group in other clusters. I have done some work on what is called community detection, particularly on methods which assign nodes to the membership of several clusters (e.g. my line graph and clique graph papers referenced below). After all, my social connections are likely to show that I am part of several groups: work colleagues, family relationships, connections made through hobbies or sports.

For some time I have been very wary about the meaning of the clusters found with such methods, and in particular about claims of one method being able to find “better” communities than another. A recent paper prompted me to think about this again. In Community detection in networks: structural clusters versus ground truth, Hric, Darst, and Fortunato from Aalto University in Finland (a big centre for networks research) asked if the network methods were finding different sorts of clusters from those found using other aspects of the data.

Typically, when testing a community detection method, one sets up artificial networks in which each node is assigned to one community. The edges between nodes are then assigned at random but with a preference for edges between nodes from the same community. I can do all the tests I like on artificial data but I am always worried that this approach has introduced some hidden bias. Perhaps we end up choosing the methods that ‘work’ on artificial data but which are perhaps not so good on real messy data? It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments, but defining what we mean by “the best” clustering is impossible. Even with artificial networks, the “ground truth” is not generally an absolute truth. Typically the “truth” is a set of input parameters and the actual network generated is partly random. So while the resulting artificial network is correlated with the ground truth it is not designed to be a perfect match. In this case the “actual truth” will, in almost all cases, be different from the ground truth.
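
As an illustration of the gap between the planted parameters and the realised network, here is a small sketch using NetworkX's planted partition generator and one standard detection method (greedy modularity maximisation, chosen only because it ships with NetworkX), compared via normalised mutual information from scikit-learn. All the parameter values are arbitrary.

    import networkx as nx
    from networkx.algorithms import community
    from sklearn.metrics import normalized_mutual_info_score

    # Planted partition benchmark: 4 groups of 32 nodes, dense inside, sparse between.
    G = nx.planted_partition_graph(l=4, k=32, p_in=0.3, p_out=0.03, seed=42)

    # The generator records the planted groups as a graph attribute (a list of
    # node sets); equivalently, node i belongs to planted group i // 32 here.
    planted = {n: i for i, block in enumerate(G.graph["partition"]) for n in block}

    # Recover communities from the topology alone (greedy modularity maximisation).
    found = {n: i
             for i, block in enumerate(community.greedy_modularity_communities(G))
             for n in block}

    nodes = sorted(G.nodes())
    nmi = normalized_mutual_info_score([planted[n] for n in nodes],
                                       [found[n] for n in nodes])
    print(f"NMI between planted and recovered partition: {nmi:.2f}")
    # Even on an easy benchmark the match is rarely perfect: the realised network
    # is only statistically correlated with the planted "ground truth".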

I also worry about what we do when we run network community detection methods on large real data sets where there is no simple ground truth.  When I have done this, I can find a variety of possible answers for communities in the data.  Many look reasonable but none correlate perfectly with each other or with what I know from other sources. This leaves me wondering if the automatic methods are finding one truth and my other information gives another. Alternatively the automatic methods might be rubbish, good on artificial cases, not so good in reality. There is no simple way of telling.

In any case do real networks have a “ground truth”?  Quite often people have data from other sources about real networks and they use this to construct a “ground truth”.  The test is then to see if automatic methods can find this ground truth. However what if the other data is wrong? People don’t always tell the truth, they can deliberately mislead or they can misunderstand the problem. Children surveyed about their friendships may tell you who they’d like to be friends with (the most popular person in the class) and not who they actually spend time with.

Zachary Karate Club network clustered using clique graph methods


Take the famous Zachary karate club data set, used by many (including myself) as a simple test. This is a network of members of a karate club that split in two during Zachary's study. Let us accept that the professionalism of Zachary has produced data that is a true reflection of the situation, despite the difficulty of measuring associations in social science. If you look at the published paper it actually gives two truths: one based on which of the two factions the members actually joined, and one based on an automatic community detection method. I suspect most people are (unwittingly) using the latter as the ground truth when testing their work. Perhaps this is a further example supporting the claim that academics only read 20% of their references. Worse, the data given in the published karate club paper is not consistent – the unweighted adjacency matrix is not symmetric. So which truth was used for all those papers using the Karate club network?
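
For anyone who wants to see the two truths diverge for themselves, here is a minimal sketch using the copy of the karate club network that ships with NetworkX, which stores the faction each member joined as a node attribute. The detection method is greedy modularity maximisation, picked purely for convenience.

    import networkx as nx
    from networkx.algorithms import community

    G = nx.karate_club_graph()   # the 34-member network bundled with NetworkX

    # One "truth": the faction each member actually joined, stored per node.
    faction = nx.get_node_attributes(G, "club")   # values "Mr. Hi" / "Officer"

    # Another "truth": what a topology-only method finds.
    detected = community.greedy_modularity_communities(G)

    print("factions:", sorted(set(faction.values())))
    print("detected:", len(detected), "communities, sizes",
          sorted(len(c) for c in detected))
    # The detected partition usually has more than two groups, so even on this
    # tiny, well-studied network the two "truths" do not coincide exactly.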


American College Football network clustered using clique graph clustering methods

Another example comes from some work I did on overlapping community methods. Like many other people, I downloaded a standard data set from Mark Newman’s web site, an extremely useful resource. The American College Football data was created by Girvan and Newman (in Community structure in social and biological networks) and represents the games played between American College Football teams in one season. Also provided is the conference membership of each team. Teams play more games against teams from their own conference than against teams from any one other conference. In fact this data is so well clustered that surely no method should get anything wrong beyond a few independent teams, as my visualisations here illustrate (taken from my clique based clustering paper). So I looked at the “mistakes” made by my method. After about two afternoons of wading through interminable web sites of stats on American College football and Wikipedia pages on the College Conference system, I realised that in fact most of the “mistakes” were not from the automatic community detection but lay in “the ground truth”, that is in the conferences assigned to teams in the data file. It turns out that the assignments in the original football.gml file are for the 2001 season while the file records information about the games played in the 2000 season. For instance the Big West conference existed for football until 2000 while the Sun Belt conference only started in 2001. There were 11 conferences and 5 independents in 2001 but 10 conferences and 8 independents in 2000. Care is needed as American College athletic conferences cover many sports, with some sports joining or dropping out of any one conference from time to time. Teams can also switch conferences. In fact around 10% of the college teams playing American football at the top level changed conferences around 2000-2001.
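
Here is a sketch of the kind of check I mean, assuming you have downloaded football.gml and that the conference assignment is stored in the node attribute called value (check your copy of the file): compare each detected community with the majority conference of its members and list the disagreements.

    from collections import Counter
    import networkx as nx
    from networkx.algorithms import community

    G = nx.read_gml("football.gml")                  # teams as nodes, games as edges
    conference = nx.get_node_attributes(G, "value")  # assumed conference label

    detected = community.greedy_modularity_communities(G)

    # Flag teams whose stated conference disagrees with the majority conference
    # of the community they were placed in: candidates for errors in the method
    # or, as it turned out here, in the "ground truth" itself.
    for i, block in enumerate(detected):
        majority, _ = Counter(conference[n] for n in block).most_common(1)[0]
        for n in block:
            if conference[n] != majority:
                print(f"{n}: labelled conference {conference[n]}, "
                      f"but grouped with conference {majority}")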

So often the “ground truth” is just another truth, not some absolute truth! The errors in the Zachary Karate club and American College Football data do not matter in one sense, as they still provide valid and valuable tests for methods. The conclusions in the hundreds of papers using these data sets and their questionable ground truths would not change. Indeed it highlights one role for automatic methods. You can see that where Girvan and Newman’s method gets the “wrong” answer in their original paper (Community structure in social and biological networks) it is in fact highlighting problems with their conference data. Validation of data is a very useful if boring job.

A final question will always be whether there is a single truth. For instance, I am in the theoretical physics group of the physics department of the Faculty of Natural Sciences at Imperial College London. That top-down hierarchical truth is important when allocating desks and teaching. However another truth would emerge if you studied my research relationships. Those are with staff and students based in other physics research groups and with colleagues from other departments and even other faculties.

So I was really pleased to see the authors of Community detection in networks: structural clusters versus ground truth questioning the meaning of truth in community detection from a quantitative point of view. Clustering of data, finding communities in data, is of tremendous value commercially and for research, but there is still a lot more work to do before we understand these different truths.

References

M. Girvan, M.E.J. Newman, Community structure in social and biological networks, PNAS (2002) 99, 7821-7826

W. Zachary, Information-Flow Model For Conflict And Fission In Small-Groups Journal Of Anthropological Research, (1977) 33  452–473

D. Hric, R.K. Darst,and S.Fortunato,  Community detection in networks: structural clusters versus ground truth,  arXiv:1406.0146

T.S. Evans, American College Football Network Files. figshare. (2012).  http://dx.doi.org/10.6084/m9.figshare.93179

T.S. Evans, and R. Lambiotte, Line Graphs, Link Partitions and Overlapping Communities Phys.Rev.E, 2009, 80, 016105 [arXiv:0903.2181].

T.S. Evans, Clique Graphs and Overlapping Communities, J. Stat. Mech. (2010) P12037, http://dx.doi.org/10.1088/1742-5468/2010/12/P12037 [arXiv:1009.0638]

Why Google scholar is no worse than anything else

Just read an interesting couple of blogs by Stacy Konkiel from the ImpactStory team, one entitled “4 reasons why Google Scholar isn’t as great as you think it is“ nicely followed up by “7 ways to make your Google Scholar Profile better“ which keeps things constructive.  My immediate response to the first blog is that the problems are not unique to Google scholar and that you could highlight the same issues with most bibliometric sources.  Certainly the criticisms also apply to the two other commercial bibliometric data sources, Scopus and Web of Science. The four points were as follows.

1) Google Scholar Profiles include dirty data

What is “dirty data”? A site like Impactstory pulling in data from a wide range of non-traditional sources ought to be a little more careful about throwing this term about! One person’s dirty citation is another’s useful lead. It does seem Google scholar is more open to gaming but at least it is easy to spot this using Google scholar if you see some anomalous data. Scopus and Web of Science make their decisions behind closed doors about what to include and what not; how many ‘weak’ journals and obscure conference proceedings are included there, how many book citations are excluded? I’ve heard at least one story about the way bibliometric data was used as a pawn in a dispute between two companies over other commercial interests. I just have no idea how much manipulation of data goes on inside a commercial company. On the altmetrics side of the story, most departments still regard any social media counts as dirty.

2) Google Scholar Profiles may not last

Surely this is a problem with anything, commercial or not. Your institution may switch subscription and cut off your access even if the data is still out there. Google certainly has a poor reputation on this front. In so many ways, we always gamble when we invest time in a computer product – not sure my PL/1 programming knowledge is much use these days.

3) Google Scholar Profiles won’t allow itself to be improved upon

Scopus and Web of Science also carefully control what you can do with their data. In any case you need a subscription before you can start to do anything. So surely this is a criticism of all closed data systems.

4) Google Scholar Profiles only measure a narrow kind of scholarly impact

Again, I don’t see Scopus and Web of Science producing much more than bare citation counts and h-indices. The UK's REF 2014 research assessment procedure only quoted bare citation counts from Scopus. This is a problem of education. Until more people understand how to use bibliometric data nothing much will happen, and I know h-indices still get thrown about during promotion discussions at my institution (again see an Impactstory blog about why people should stop using the h-index).

My Approach

I tend to think of all sources as data. Your interpretation should vary as you take each into account. Like all data and measurements, results derived from bibliometric information need to be checked and validated using several independent sources and alternative methods.

For instance, I have access to these three commercial sources, and they tend to give citation counts which differ. Web of Science is generally the most conservative, Scopus is in the middle and Google Scholar leads the counts. So I can use all three to give a balanced view and to weed out any problems. They also have different strengths and weaknesses. Google is ahead of the curve and shows where the other two will go a year or two later. My work on archaeology has a large component in books, which Google Scholar reflects but the other two fail to capture. Scopus is very weak on my early Quantum Field Theory work, while both Web of Science and Google Scholar are equally strong in this area and time period.

The tips discussed in “7 ways to make your Google Scholar Profile better” are very useful but many apply to all data sources. For instance, Scopus just added two papers by another T.S. Evans working near London to my profile, even though they are in a completely different research field, the address is completely different (not even in London) and there is basically zero overlap between those papers and my work. It makes you worry about the quality of the automatic detection software used by commercial bibliometric firms. I can’t fix this myself, I have to email Scopus, while I can tidy up Google Scholar myself whenever I want. Currently I also feel that the Google Scholar recommendations are the most useful source of targeted information on papers I should look at, but I am always looking for improvements.

Overall, I feel you need to keep a very balanced approach. Never trust a statistic until you’ve found at least two other ways to back it up independently.

CitNetExplorer – Citation Network Analyser and Visualisation

I have just come across an interesting citation network analyser and visualiser – CitNetExplorer. Looks to be a very professional package that I will certainly be using.

One of the interesting things about citation networks is that their vertices have an order given by their publication date. This is a very strong constraint on the system, so when you analyse or visualise such a network you should take the time ordering into account. The simplest example is that you should not just look at the vertex degree in these networks, but at the in-degree (citation count) and out-degree (length of bibliography) separately.

Perhaps the most obvious aspect of the constraint comes when you try to visualise the network. You can just put such a network into a standard network package and treat it as a directed network, which it is. However any standard visualisation will undoubtedly place the vertices all over the two-dimensional surface used for display. Standard visualisations pay no attention to the time-ordering of the vertices, yet you almost certainly want to show that information when displaying a citation network as it is such a critical part of the definition; so many of the properties will depend on the age of the publication, for instance. I have encountered this myself and played around with a few ad-hoc solutions but came to the conclusion I needed to write something myself, adapting a standard layout method to set one dimension of the vertex coordinates while the second dimension is set by the vertex's time. Since the same problem is encountered when making diagrams showing the critical paths in a set of tasks (such as Gantt charts) there are packages which will do this. However you will also want to do different types of analysis on a citation network, plus it is likely to be much bigger than a normal Gantt chart.
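
As a toy illustration of what I mean (not the code behind my own ad-hoc attempts), one can fix the vertical coordinate of each vertex by its publication year and let a standard layout choose the horizontal coordinate, for example with NetworkX and matplotlib:

    import networkx as nx
    import matplotlib.pyplot as plt

    # Toy citation DAG with hypothetical publication years.
    G = nx.DiGraph([("D", "B"), ("D", "C"), ("B", "A"), ("C", "A")])
    year = {"A": 1998, "B": 2001, "C": 2002, "D": 2005}

    # Horizontal position from a standard layout, vertical position fixed by time.
    xy = nx.spring_layout(G, seed=1)
    pos = {v: (xy[v][0], year[v]) for v in G}

    nx.draw_networkx(G, pos, node_color="lightgrey", arrows=True)
    plt.ylabel("publication year")
    plt.show()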

This is where CitNetExplorer comes in. It is written by Nees Jan van Eck and Ludo Waltman at the CWTS (Centre for Science and Technology Studies) in Leiden, so it comes from one of the leading institutes in bibliometric research. It is very early days and I have only had a short play, but for me its good points are:

  • Free for noncommercial and teaching purposes.
  • Cross platform as written in java.
  • Stable on my Windows 7 machines.
    As it is written in java, it is likely to be stable on other platforms too.
  • Well presented with a reassuring professional feel.
  • Good graphical display.
    The publications are laid out using their publication date for the vertical coordinate and a layout algorithm to place the publications horizontally.
  • Good default options.
    I got an instantly readable figure every time I tried it.
  • Good range of graphical output options.
    Vector graphics, especially postscript (eps), is essential for me. Note these are all under the Screenshot menu option.
  • Two basic network format output options.
    A pajek .net and a simple text file format (see below)
  • Various basic analysis tools.
    This includes transitive reduction, which is something I have been very interested in and which can throw up some new insights into the citation counts of papers (see arXiv:1310.8224).

The forty most highly cited papers in hep-th (1992-2003) after transitive reduction, as an example of output from CitNetExplorer.

So this looks to be a really nice package. Of course, I am never satisfied so what would I like to see in future versions:

  • Open source.
    It would be nice to be able to learn from their computational work and to add to this myself. Maybe some type of plug-in could be added to solve the latter problem. I have a few more tricks for citation networks in the pipeline for instance.
  • More input options.
    There are only two and one is tied to Thomson Reuters' WoS (Web of Science) database. In the example given by the authors you perform a search on WoS and then save the results in a text file (saverecs.txt). Note you must select the “Web of Science Core Collection” not the “All Databases” option, which the example clearly shows but I didn’t read, otherwise the output file will not include the full citation information needed to construct the citation network. This file is a simple text file so you should be able to combine several by hand if, like me, you are limited to 500 records per file.
    The alternative is a pair of relatively simple text files. These are not as yet explained in the documentation. Basically there are two files. The first, namepub.txt, lists the properties of the publications, and the order in this file assigns each publication an index (the publication on line 2 is vertex 1, line 3 defines vertex 2 and so on). The second file is called namecite.txt and is an edge list written in terms of the vertex indices. Look at the first few lines of the example data James Clough made from the open source KDD cup arXiv citation network data that we have been using in our recent work; a rough sketch of how you might write such a pair of files follows this list. Alternatively, if you can produce a file from WoS, open it in CitNetExplorer and then save it in what is called CitNetExplorer format. These CitNetExplorer files are easy to look at, edit and prepare in a spreadsheet or a basic text editor, and appear to be tab separated.
  • Visualisation editing.
    No layout is perfect so it is essential to be able to move the vertices by hand. One of my favourite visualisation packages, visone, shows what you can do in java, and even my own ariadne package built on the jung library gave that functionality automatically.
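
For what it is worth, here is a rough sketch of how you might generate such a pair of input files from your own citation data. The only facts I rely on are the ones above: the order of publications in the first file defines the vertex index, the second file is an edge list over those indices, and the files appear to be tab separated. The header lines and column layout are guesses, so compare against a file saved by CitNetExplorer itself before using this in anger.

    import networkx as nx

    # Toy citation DAG; an edge u -> v means "u cites v". Years are illustrative.
    G = nx.DiGraph([("paper C", "paper A"), ("paper C", "paper B"),
                    ("paper B", "paper A")])
    year = {"paper A": 2001, "paper B": 2005, "paper C": 2009}

    # The line a publication appears on defines its vertex index (header on line 1).
    pubs = sorted(G.nodes(), key=lambda v: year[v])
    index = {v: i + 1 for i, v in enumerate(pubs)}

    with open("namepub.txt", "w") as f:           # one publication per line
        f.write("title\tyear\n")                  # column names are a guess
        for v in pubs:
            f.write(f"{v}\t{year[v]}\n")

    with open("namecite.txt", "w") as f:          # tab-separated edge list of indices
        f.write("citing\tcited\n")                # column names are a guess
        for u, v in G.edges():
            f.write(f"{index[u]}\t{index[v]}\n")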

Rather less seriously, I am not sure about the name. I would pronounce the “Cit” in “CitNetExplorer” as “sit” or perhaps “chit”, so I would have kept the “e” in “cite”, CiteNetExplorer, but it's not my product. As I’m getting bored typing it, I’m sure it will become just CNE in any case.

Links

CitNetExplorer http://www.citnetexplorer.nl/

Van Eck, N.J., & Waltman, L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. [arXiv:1404.5322]

Van Eck, N.J., & Waltman, L. (2014). Systematic retrieval of scientific literature based on citation relations: Introducing the CitNetExplorer tool. In Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval (BIR 2014), pages 13-20.

James R. Clough, Jamie Gollings, Tamar V. Loach, Tim S. Evans (2013). Transitive Reduction of Citation Networks. [arXiv:1310.8224]

Clough, James; Evans, Tim; Loach, Tamar (2013). Transitive Reduction of Citation Networks. (data set) figshare
http://dx.doi.org/10.6084/m9.figshare.834935

Clough, James; Evans, Tim (2014). KDD cup arXiv data for CitNetExplorer. figshare fileset.
http://dx.doi.org/10.6084/m9.figshare.1021647

Sculplexity: sculptures of complexity using 3D printing


I have just had an unusual paper published by European Physics Letters (doi: 10.1209/0295-5075/104/48001). 3D printing (or additive manufacturing technology, to give the field its more formal name) is very much in vogue: the first 3D printed gun, Obama’s 2013 State of the Union address, and so forth. I like to know about new technological advances, even better if I can play with them. So how can a theoretical physicist get involved with 3D printing?

One of the ways I learn about a new topic is to set it up as a project for final year undergraduate students or as a masters level project. We have some of the best students in the UK here so all I have to do is come up with an outline and the students usually rise to the challenge.

What I needed was a connection between my work in complexity and 3D printing. This link came from one of those serendipitous moments. Another advantage of being at Imperial is that it is part of a Victorian complex of arts and science institutions, so we are surrounded by national museums. One is the V&A (Victoria and Albert Museum), which is dedicated to all aspects of design, from all ages and all locations. It also has some amazing tea rooms, covered in tiles designed by William Morris, great when I have a visitor and a little more time. It was on one of those trips that I just happened to walk past something that caught my eye. At one level it is just a table. However, I saw this table as a branching process. To the designers, it was inspired by the tree-like structures of nature. The designers had embraced technology to do this, using Computer Aided Design (CAD) and 3D printing. On further investigation, this turned out to be the Fractal.MGX table designed by WertelOberfell and Matthias Bär, the first 3D printed object the V&A had acquired.


The Fractal.MGX table designed by WertelOberfell and Matthias Bär, photo credit Stephane Briolant (image rights WertelOberfell).

Branching processes occur in many problems in complex systems and they have a long history of mathematical investigation. So here was the link I was looking for. The question I asked was what other physics models familiar to me could be turned into a 3D printed object? How would you actually do this conversion? Does the tool, the 3D printer, impose its own limitations on the type of mathematical model we can use? Are these new limitations interesting mathematically in their own right? Until now researchers had only seen these mathematical models visualised using two-dimensional representations, often not even using perspective to give the impression of a 3D object. Making a 3D printed object opens up uncharted territory. My project shows one can move from traditional visualisations to new “tactilisations”. So can we gain new insights by using touch rather than vision?

The approach might also be useful for outreach as well as research. The same things that got my students and me interested in the project might intrigue school children, a retired judge or whoever. These objects might be particularly useful when explaining science to those whose sense of touch is better than their sight. However we could also go back to where this project started and see if models of complexity can produce something of aesthetic value alone – hence Sculplexity: sculptures of complexity.

The basic idea is simple. A 3D printer builds up its object in layers, so the height of the object can be thought of as a time. Suppose I have a model which defines a flat (two dimensional) picture. Typically this will be a grid with some squares full and some empty. The model also has to describe how this picture evolves in time. For instance there is a famous example known as Conway’s Game of Life, for which there are many 2D visualisations. What I do is use the model at each point in time to define what the printer should print at one height. The next time step in the model then defines what to print on top of the first layer, and so forth.
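
To make this concrete, here is a toy sketch (not the modified forest fire model we actually printed, and ignoring all the printer constraints we had to deal with) that evolves Conway's Game of Life and stacks the successive frames into a 3D block of voxels, one time step per layer:

    import numpy as np

    def life_step(grid):
        """One step of Conway's Game of Life on a 0/1 array (toroidal edges)."""
        nbrs = sum(np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0))
        return ((nbrs == 3) | ((grid == 1) & (nbrs == 2))).astype(int)

    rng = np.random.default_rng(0)
    grid = rng.integers(0, 2, size=(40, 40))

    # Stack successive time steps as layers of a 3D block: filled cells become
    # solid voxels and time becomes height.
    layers = []
    for _ in range(60):
        layers.append(grid)
        grid = life_step(grid)
    solid = np.stack(layers, axis=0)            # shape (time, x, y)

    print("voxel block:", solid.shape, "with", int(solid.sum()), "filled voxels")
    # A printable mesh could then be built from `solid` (e.g. via marching cubes),
    # subject to the printer's own constraints such as overhangs and connectivity.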

In fact, while the basic idea is straightforward, the implementation turned out to be much harder than I expected. It is to the real credit of the undergraduate students working with me on this project, Dominic Reiss and Joshua Price, that we pushed this idea through and actually got a final 3D printed object representing our modified forest fire model. OK, so our final result is a bit of a lump of black plastic compared to the inspiration for this project, the Fractal.MGX table in the V&A. But this is just the start.

Now that we have shown that this can be done, there is so much more to explore. We are adding another dimension to representations of mathematical models and the possibilities for 3D printing are endless. All we have done is take the first step in terms of 3D printing and mathematical models. We have highlighted the key problems and given at least one way to fix them. I can already see how to extend the existing approach, new solutions to some of the problems, and different ways to create an object from a wider variety of theoretical models. Imagination and ingenuity are all that are required.


Top view of the output produced from the Sculplexity project. Yes, it's on a bed of rice, since rice was used in an experiment to simulate another classic model of complexity - the sandpile model.

Note added

I have since found some other work using 3D printing to visualise maths which is full of useful ideas. So look for the work of Aboufadel, Krawczyk and Sherman-Bennett, as well as that by Knill and Slavkovsky, listed below.

References

  1. Reiss D.S., Price J.J. and Evans T.S., 2013. Sculplexity: Sculptures of Complexity using 3D printing, European Physics Letters 104 (2013) 48001,  doi 10.1209/0295-5075/104/48001.
    Copy of Sculplexity: Sculptures of Complexity using 3D printing on personal web page.
    Chosen to be part of IOPselect, a collection of papers chosen by the editors for their novelty, significance and potential impact on future research.
    (altmetrics for paper).
  2. Evans T.S., 2013. Images of 3D printing output for Sculptures of Complexity – Sculplexity, http://dx.doi.org/10.6084/m9.figshare.868866
  3. Reiss D.S. and Price J.J., 2013. Source-code for Complex Processes and 3D Printing, https://github.com/doreiss/3D_Print, doi 10.6084/m9.figshare.718155 .
  4. Reiss D.S., 2013. Complex Processes and 3D Printing, project report, http://dx.doi.org/10.6084/m9.figshare.718146.
  5. Price J.J., 2013. Complex Processes and 3D Printing, project report, http://dx.doi.org/10.6084/m9.figshare.718147.
  6. 3D printing used as a tool to explain theoretical physics by Laura Gallagher, Imperial College News and Events, 9th December 2013
  7. “3D-printed models of complex theoretical physics”, The IET Engineering and Technology Magazine, 9th December 2013.
  8. Aboufadel E., Krawczyk S.V. and Sherman-Bennett, M. “3D Printing for Math Professors and Their Students“, arXiv:1308.3420, 2013.
  9. Knill O. and Slavkovsky E., “Illustrating Mathematics using 3D Printers“, arXiv:1306.5599, 2013

Little LaTeX Lessons

I know most of the world uses Word and Microsoft Office products, so this post is for the chosen few – the small minority who do use LaTeX. LaTeX produces beautiful documents, with equations if needed, which is a big selling point for many. LaTeX has been around since the 1980s and is based on TeX, produced by the iconic computer scientist Donald Knuth in 1978. Unlike most things in the ethereal world of information technology, it seems LaTeX is not going to die even in the age of tablets (see The revolution will be typeset, D. Steele, Physics World, Jan 2013, p35).
However LaTeX is not simple to use. There is lots of help: many free guides which a search for LaTeX guide will turn up, or one of my favourites, Tobi Oetiker’s The Not So Short Introduction to LaTeX2e.
What is missing are some of the little things.  At least, these useful snippets may be out there but they are buried under the important things. So I thought I would post my top tiny tips, the little bits it took me ages to discover, the things I see again and again in the reports from my students.
  1. Never leave an empty line after the end of an equation or eqnarray unless you really mean to start a new paragraph after the equation.
    An empty line in regular text always indicates the start of a new paragraph and the next line of text will be indented. Usually equations are meant to be read as being in the middle of a paragraph.
  2. Quotes in LaTeX are built from the two different types of single quote; do not use the double quote symbol.
    The right way to get quotes is to put one or two single backward quote characters ` (grave accent, ASCII 96, on a funny key to the left of the number 1 on my UK keyboard) at the start of a phrase, and match them with the same number of normal single quotes (apostrophe, ASCII 39, on the key with the @ symbol on my UK keyboard). Never use the double quote character itself (ASCII 34, on the key with the number 2 on my UK keyboard). LaTeX will interpret matched pairs of single quotes as a double quote and will produce a nice result.
  3. For labels as subscripts, use \mathrm{text} to get the text in roman rather than the italic style usually used for maths.
    Thus x_{\mathrm{max}} looks better than plain x_{max}. If you use this a few times, why not define a new command
    e.g. \newcommand{\xmax}{x_{\mathrm{max}}} which is usually placed before the \begin{document}.
  4. Likewise all the standard functions of mathematics have their own commands.  The function is then typeset in roman so that it stands out from the italic style usually used for maths.  That is, use \ln not ln, as the latter sometimes looks like it might be two variables l and n multiplied together.
  5. For “much less than” and “much more than” symbols do not use double less than or double greater than signs.
    There are special commands \ll and \gg which look much better than doing << and >>.
  6. To see all the labels used in equations figures, sections etc. while you are writing a document, put a
    \usepackage{showkeys}

    command near the top of the LaTeX file, just after the documentclass command.

  7. Dashes and hyphens:-
    • one for a hyphenated-word,
    • two for a number range 1–2,
    • three for a punctuation dash — like this (note the spaces either side of the three dashes).
  8. To get the name of the file used to start LaTeX use something like
    \texttt{{\jobname}.tex}

    To get the names of all the constituent LaTeX files is harder.

  9. Sometimes I want to tweak the spacing in a formula, for example when creating my own symbol by placing two existing symbols on top of each other. LaTeX has some standard commands for spaces and it is useful to know some of them for minor tweaks.
    • \; a thick space
    • \: a medium space
    • \, a thin space
    • \! a negative thin space
    • \qquad a large space

    So for example I might write

            \begin{eqnarray}
             A &=& B \, , \qquad B=C
             \\
             I &=& \int_{0}^{1} dx \; x \cos(2 \pi x)
             \end{eqnarray}

 

Can you game google scholar?

The answer appears to be yes, according to a recent paper entitled Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting by Emilio Delgado López-Cózar, Nicolás Robinson-García, and Daniel Torres-Salinas from the Universities of Granada and Navarra in Spain. I thought their experiment was illuminating and, while it is an obvious one to try, the results seemed pretty clear to me. The rest of the paper is generally informative and useful too. For instance there is a list of other studies which have looked at how to manipulate bibliographic indices.

For the experiment, six false “papers” were created, with authorship assigned to an imaginary author. Each of the six documents cited the same set of 129 genuine papers. The cited papers all had at least one coauthor from the same EC3 research group as the authors of the study. This could generate 774 (= 129 x 6) new but artificial citations to members of the EC3 group (more if some papers had more than one EC3 group member, but this is not noted). These six fake documents were then placed on web pages in an academic domain, much as many academics can do freely. Twenty five days later they were picked up by Google Scholar, and large increases in the citation counts of the authors of this study were seen.

The basic conclusion does seem clear. As it stands, it is easy for many academic authors to boost their Google Scholar counts using false documents. In that sense it seems one should not use these Google Scholar counts for any serious analysis without at least some checks on the citations themselves. Of course, Google Scholar makes that easy to do, and free.

However, I do not feel we should rush to dismiss Google Scholar too quickly. Any system can be gamed. Useful references are given in the paper to other examples and studies of bibliometric manipulation, both in human edited/refereed sources and in uncontrolled electronic cases. A major point of the paper is that manipulation is possible in both cases, just that it is much easier to do for web pages and Google Scholar. What is less clear from the paper is that the solutions may be similar to those employed by traditional indices of refereed sources.

As the authors point out, manipulation of the refereed/edited literature can be and is spotted – journals are excluded from traditional bibliographic databases if they are caught manipulating indices. The easiest way to do it is to look for sudden and unexpected increases in statistics. One should always treat statistics with care and there needs to be some sort of assurance that the numbers are sound. Looking for unusual behaviour and studying outliers should always be done as a check whenever statistics are being used. The authors themselves present the very data that should be able to flag a problem in their case. As they point out, their indices under Google Scholar went up by amazing amounts in a short time. Given this indicator of an issue, it would be trivial to discover the source of the problem, as Google makes it easy to find the source of the new citations. Then of course, if such manipulation was being used for an important process, e.g. promotion or getting another job, it becomes fraud, and the research community and society at large already have severe sanctions to deal with such situations. It may be easy to do but the sanctions may be enough to limit the problem.

So to my mind the main message of this paper is not so much that Google Scholar can be manipulated easily, but that currently there are no simple tools to spot such issues. The timeline of the citations to a set of papers, be they for a person, research group or journal, cannot be obtained easily. One can get the raw citation lists themselves, but you would have to construct the timeline yourself, not an easy job.

However the same is also true of traditional paper based citation counts.  It is harder to manipulate them perhaps, but it is also hard to check on a person’s performance over time.  I imagine that checks like this will be done and the information to perform such checks will be provided in future for all such alt-metric measures based on information where there is little if any editorial control.

However there is another approach to this problem. The authors of this paper reflect the focus of Google Scholar and most other bibliometric sites on the crudest of indices, citation counts and the h-index. Indeed too many academics quote these. The UK’s REF procedure, which is used to assign research funding, will produce raw citation counts for individual papers in many fields (Panel criteria and working methods, January 2012, page 8, para 51). This will be based on Elsevier’s SCOPUS data (Process for gathering citation information for REF 2014, November 2011), except for Computer Science where, interestingly, they claim Google Scholar will be used in a “systematic way” (Panel criteria and working methods, January 2012, page 45, para 57 and 61). Yet it is well known that raw citation counts and the h-index are badly flawed measures, almost useless for any comparison (of people, institutes or subjects), which is inevitably what they are used for. Indeed where the REF document says citation data will be used, it specifically lists many of the obvious problems in interpreting the data (Panel criteria and working methods, January 2012, page 8, para 51), so I am sure I can hear the sound of hands being washed at this point.

One solution to the weakness of Google Scholar citation counts, or indeed counts derived from other sources, is to look for better measures. For example, in this study the six dummy papers will never gain any citations. An index based on a weighted citation count, such as PageRank, would assign little or no value to a citation from an uncited paper.
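
To illustrate the point on a toy citation network (an edge u -> v meaning u cites v), PageRank gives the citations from uncited fake papers far less weight than the citation from a paper which is itself cited. This sketch just uses the NetworkX implementation with its default settings; the node names are invented.

    import networkx as nx

    # Toy citation network: an edge u -> v means "u cites v".
    # "fake1" and "fake2" are never cited themselves.
    G = nx.DiGraph([("fake1", "target"), ("fake2", "target"),
                    ("real1", "target"), ("real2", "real1"), ("real3", "real1")])

    print("raw citation counts:", dict(G.in_degree()))

    # PageRank-style weighting: the citations from the uncited papers contribute
    # far less to "target" than the citation from "real1", which is itself cited.
    for node, score in sorted(nx.pagerank(G, alpha=0.85).items(),
                              key=lambda kv: -kv[1]):
        print(f"{node}: {score:.3f}")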

Of course any index can be gamed. PageRank was the original basis of Google’s web index and people have been gaming it for as long as it has existed: google bombs, where many false web pages all point to the page being boosted, are the equivalent for web pages of the Google Scholar experiment performed in this paper. It is equally well known that Google strives to detect this and will exclude pages from its lists if people are found to be cheating. So Google has developed mechanisms to detect and counter artificial boosting of a web page’s rank. There is no reason (except perhaps a commercial one) why similar techniques could not be used on academic indexes.

My google Scholar citation count for the second part of 2012

A few other points struck me as worth noting. The authors waited 25 days for Google to index their false papers, yet only allowed 17 days for Google to remove them. Slightly odd, as the data was valid up to the date on the paper, 29th May 2012, yet the arXiv submission was made 6 months later; a pity this information was not updated. There is a much wider debate here on who owns data and whether individuals can or should be able to delete personal data, e.g. from Facebook. What exactly does Google do if documents disappear? Monitoring my own Google Scholar counts, there was a massive rise then fall over a period of about a month in September/October 2012, before the count settled down to pretty much the same trend as it had been following earlier in 2012. It does seem that Google Scholar is monitoring and changing its inputs.

As with many experiments of this kind, the ethics are a little unclear. It is interesting to note that the authors reported that other researchers, who were not part of the team performing this experiment, noticed changes in their citation counts coming from the six fake papers. Would this have been allowed by an ethics committee? Should it have been run past an ethics committee? Am I allowed to say what kind of documents are allowed to cite my own papers? My colleagues in the bibliometrics business suggest there is no legal bar to anyone citing a paper, even if there is no automatic right to use the content of that paper.

And finally, surely the increase in citation counts reported for the authors of this paper should be divisible by 6, as the authors imply the same set of 129 papers was used as the bibliography in each of the six fake papers. Yet the results reported in their Figure 2 are not all divisible by 6.

This paper seems to be telling us that google Scholar is still at an early stage of its development.  However given the resources and experience of google at indices derived from open sources like the web, I would not be surprised if it was soon much harder to manipulate these indices.

Note added: There is a lot of other work on Google Scholar, including other studies of how it might be tricked. A good source of papers on Google Scholar, with references to other studies, are the papers of Péter Jacsó.

Emilio Delgado López-Cózar, Nicolás Robinson-García, & Daniel Torres-Salinas (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. EC3 Working Papers 6: 29 May, 2012. arXiv:1212.0638v1
