Exploring Big Historical Data

I’ve really enjoyed reading my copy of Exploring Big Historical Data: The Historian’s Macroscope (Macroscope for short here) by Shawn Graham, Ian Milligan and Scott Weingart. As the authors suggest, the book will be ideal for students or researchers from the humanities asking if they can use big data ideas in their work. While history is the underlying context here, most of the questions and tools are relevant whenever you have text-based data, large or small. For physical scientists, many of whom are not used to text data, Macroscope prompts you to ask all the right questions. So this is a book which can really cross the disciplines. Even if some readers are like me and find some aspects of the book very familiar, they will still find some stimulating new ideas. Failing that, they will be able to draw on the simplicity of the explanations in Macroscope for their own work.

I know enough about text and network analysis to see that the details of the methods were skipped over, but enough of a broad overview was given for someone to start using the tools. PageRank and tf-idf (term frequency–inverse document frequency) are examples where that practical approach was followed. The humanities have a lot of experience of working with texts and a physical scientist like myself can learn a lot from that experience. I have heard these lessons piecemeal in talks and articles over the last ten years or so but I enjoyed having them reinforced in a coherent way in one place.

I worry a bit that the details in Macroscope of how to use one tool or another will rapidly date, but on the other hand it means a novice has a real chance of trying these ideas out from this book alone. It is also where the online resources will come into their own. So I am already planning to recommend this text to my final year physics students tackling projects involving text. My students can handle the technical aspects without the book but even there they will find this book gives them a quick way in.
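For readers who have not met tf-idf before, here is a minimal sketch of the idea in plain Python. It is not taken from Macroscope: the function name `tf_idf`, the whitespace tokenisation and the three toy documents are all my own assumptions, and real tools use more careful tokenisation and smoothing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenisation (an assumption): lower-case and split on whitespace.
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, invented purely for illustration.
docs = [
    "quantum optics and quantum information",
    "plasma physics and fusion",
    "optics and photonics",
]
scores = tf_idf(docs)
```

A word that appears in every document, such as "and" here, gets weight zero, while a word that is frequent in one document and rare elsewhere is weighted up. That is the whole trick.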

Imperial College Physics Staff

Staff in the Physics Department of Imperial College London clustered on the basis of the abstracts of their recent papers.

I can see that this book works as I picked up some of the simpler suggestions and used them on a pet project, which is to look at the way that the staff in my department are related through their research interests. I want to see if any bottom-up structure of staff research I can produce from texts written by staff matches up to existing and proposed top-down structures of faculties – departments – research groups. I started by using Python to access the Scopus API. I’m not sure you can call Elsevier’s pages on this API documentation and even Stack Overflow struggled to help me, but the blog Getting data from the Scopus API helped a lot. A hand-collected list of Scopus author ids enabled me to collect all the abstracts from recent papers coauthored by each staff member. I used Python libraries to cluster and display the data, following a couple of useful blogs on this process, and got some very acceptable results. However, I then realised that I could use the text modelling discussed in the book on the data I had produced. Sure enough a quick and easy tool was suggested in Macroscope, one I didn’t know: Voyant Tools. I just needed to add a few more lines to my code in order to produce text files, initially one per staff member containing all their recent abstracts in one document. With the Macroscope book in one hand, I soon had a first set of topics, something easy to look at and consider. This showed me that words like Physical and American were often keywords, the second of these being quite surprising initially. However, a quick look at the documents with a text editor (a tool that is rightly never far away in Macroscope) revealed that many abstracts start with a copyright statement such as “2015 American Physical Society”, something I might want to remove as this project progresses. 
I am very wary of such data clustering in general but with proper thought, with checks and balances of the sort which are a key part of Macroscope, you can extract useful information which was otherwise hidden.
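The copyright clean-up mentioned above might look something like the following sketch. It is not the code I actually ran: the names `PUBLISHERS`, `strip_copyright` and `build_documents`, the staff identifier and the sample abstracts are all hypothetical, and the pattern assumes the statement sits at the very start of the abstract in the form "year, publisher name".

```python
import re

# Hypothetical list of publisher names; extend it after inspecting the text files.
PUBLISHERS = ["American Physical Society"]

# Matches an optional copyright mark, a four-digit year and a known publisher
# name at the start of an abstract (this layout is an assumption).
_COPYRIGHT = re.compile(
    r"^\s*(?:©|\(c\)|copyright)?\s*\d{4}\s+(?:"
    + "|".join(re.escape(name) for name in PUBLISHERS)
    + r")\.?\s*",
    flags=re.IGNORECASE,
)


def strip_copyright(abstract):
    """Remove a leading copyright statement, leaving other text untouched."""
    return _COPYRIGHT.sub("", abstract, count=1)


def build_documents(abstracts_by_author):
    """Join each author's cleaned abstracts into one document per author."""
    return {
        author: "\n".join(strip_copyright(text) for text in abstracts)
        for author, abstracts in abstracts_by_author.items()
    }


# Invented sample data, for illustration only.
sample = {
    "staff_001": [
        "2015 American Physical Society. We study spin chains.",
        "A model of plasma turbulence.",
    ],
}
documents = build_documents(sample)
```

Each value in `documents` could then be written to its own text file and loaded into a tool such as Voyant Tools. Keeping the publisher list explicit is deliberate: it forces you to look at the text before deleting anything from it.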

So even for someone like me who has used or knows about sophisticated tools in this area and is (over) confident that they can use such tools, the technical side of Macroscope should provide a very useful short cut despite my initial uncertainty. Beyond that I found that having the basic issues and ideas behind these approaches reinforced and well laid out was really helpful for me. Those starting out, like some of my own physical science masters and bachelors students working on some of my social science projects, will find this book invaluable. A blog or intro document will often show you how to run a tool but it will not always emphasise the wider principles and context for such studies, something you get with Macroscope.

I should make clear that I do have some formal connections with this book, one of my contributions to the pool of academic goodwill. I suggested the general topic of digital humanities and Shawn Graham in particular as a potential author at an annual meeting of the physics and maths advisory committee for ICP (Imperial College Press). For free sandwiches we pass on ideas for topics or book projects to the publisher. I also commented on the formal proposal from all three authors to ICP, for which I often get a free book. My copy of Macroscope was obtained for reviewing a recent book proposal for ICP. Beyond this I get no remuneration from ICP. It is nice to see a topic and an author I highlighted come together in a real book but the idea is the easy bit and hardly novel in this case. Taking up the idea and making it into a practical publishing project is down to Alice Oven and her ICP colleagues, and to the authors Shawn Graham, Ian Milligan and Scott Weingart. That’s particularly true here as the book was produced in an unusual open source way and ICP had the guts to go along with the authors to try this different type of approach to publishing.

References

Exploring Big Historical Data: The Historian’s Macroscope
Shawn Graham (Carleton University, Canada),
Ian Milligan (University of Waterloo, Canada),
Scott Weingart (Indiana University, USA)
ISBN: 978-1-78326-608-1 (hardback)
ISBN: 978-1-78326-637-1 (paperback)

Myths and Networks

I have just read an intriguing paper by Carron and Kenna entitled ‘Universal properties of mythological networks’. In it they analyse the character networks in three ancient stories: Beowulf, the Iliad and the Irish story Táin Bó Cuailnge. That is, the characters form the nodes of a network and they are connected if they appear together in the same part of the story. It has caused quite a bit of activity. It has prompted two posts on The Networks Network already and has even sparked activity in the UK newspapers (see John Sutherland writing in the Guardian, Wednesday 25 July 2012, and the follow-up comment by Ralph Kenna, one of the authors). Well, summer is the traditional silly season for newspapers.
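To make the construction concrete, here is a minimal sketch of building such a co-occurrence network in Python. This is my own illustration, not Carron and Kenna's procedure: the function name `character_network` is hypothetical, I assume the text has already been divided into "scenes" listing who appears in each, and the toy scenes below are invented rather than taken from their data.

```python
from collections import defaultdict
from itertools import combinations


def character_network(scenes):
    """Link every pair of characters that share at least one scene.

    Returns an adjacency dict mapping each character to a set of neighbours.
    """
    adjacency = defaultdict(set)
    for scene in scenes:
        cast = sorted(set(scene))
        for name in cast:
            adjacency[name]  # make sure solitary characters are still nodes
        for a, b in combinations(cast, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


# Toy scenes, invented for illustration only.
scenes = [
    ["Beowulf", "Hrothgar", "Wealhtheow"],
    ["Beowulf", "Grendel"],
    ["Beowulf", "Wiglaf"],
]
network = character_network(scenes)
```

How you decide what counts as "the same part of the story" is the real modelling choice, and the resulting network can change a lot with it.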

However, I think it is too easy to dismiss the article. I think Tom Brughmans’ post on The Networks Network has it right that “as an exploratory exercise it would have been fine”. I disagreed with much in the paper, but it did intrigue me and many papers fail to do even this much. So overall I think it was a useful publication. I think there are ideas there waiting to be developed further.

I like the general idea that there might be some information in the character networks which would enable one to say if a story was based on fact or was pure fiction. That is, if the character networks have the same characteristics as a social network it would support the idea that the story was based on historical events. I was intrigued by some of the measures suggested as a way to differentiate between different types of literary work. However, like both Tom Brughmans and Marco Büchler, I was unconvinced the authors’ measures really do the job suggested. I’d really like to see a lot more evidence from many more texts before linking a particular measurement to a particular feature in character networks.

For instance, Carron and Kenna suggest that in hierarchical networks the degree times the clustering coefficient is a constant for every node, eqn (2). That is, each of your friends is always connected to the same (on average) number of your friends. By way of contrast, in a classical (Erdős-Rényi) random graph the clustering coefficient is a constant. However, I don’t see that as hierarchical but as an indication that everyone lives in similar size communities, some sort of fictional character Dunbar number. I’m sure you could have a very flat arrangement of communities and get the same result. Perhaps we mean different things by hierarchical.
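The quantity in question is easy to compute, so here is a short sketch for anyone wanting to check it on their own network. It uses the standard local clustering coefficient (the fraction of pairs of a node's neighbours that are themselves linked). The function name `local_clustering` and the toy graph, a triangle with one pendant node, are my own assumptions and not from the paper.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of this node's neighbours that are linked to each other."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as zero
    links = sum(
        1 for a, b in combinations(sorted(neighbours), 2) if b in adjacency[a]
    )
    return 2.0 * links / (k * (k - 1))


# Toy undirected graph: triangle A-B-C plus a pendant node D attached to A.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Degree times clustering coefficient for every node.
degree_times_clustering = {
    node: len(graph[node]) * local_clustering(graph, node) for node in graph
}
```

If the claim in eqn (2) held, the values in `degree_times_clustering` would be roughly the same for every node; plotting them against degree for a real character network would be the obvious first check.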

Another claim was that in collaboration networks less than 90% of nodes are in the giant component. The Newman paper referred to is about scientific collaboration derived from coauthorships, which is very different from the actual social network of scientists (science is not done in isolation; no one is really isolated). I’m not sure the Newman paper tells us anything about character structure in fictional or non-fictional texts. I cannot see why one would introduce any set of characters in any story (fictional or not) who are disconnected from the rest. Perhaps one could do so in some clever tale with two strands separated in time yet connected in terms other than social relationships (e.g. through geography or action) – David Mitchell’s “Cloud Atlas” comes to my mind – but these are pretty contrived structures.
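For completeness, the giant component fraction is simple to measure with a breadth-first search. The sketch below is my own, with a hypothetical function name `giant_component_fraction` and an invented toy graph of six nodes in two disconnected pieces. It assumes an undirected network stored as an adjacency dict.

```python
from collections import deque


def giant_component_fraction(adjacency):
    """Return the fraction of nodes lying in the largest connected component."""
    if not adjacency:
        return 0.0
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        largest = max(largest, size)
    return largest / len(adjacency)


# Toy graph: a path of four nodes plus a separate pair.
toy = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "x": {"y"},
    "y": {"x"},
}
fraction = giant_component_fraction(toy)
```

For a story in which every character meets someone connected to the main cast, this fraction is simply 1, which is why I doubt it discriminates much between texts.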

I think a real problem in the detail of the paper, as Marco Büchler points out, is that these texts and their networks are just too small. There is no way one can talk rigorously about power laws, and certainly not to two decimal place accuracy. I thought Michael Stumpf and Mason Porter’s commentary (Critical Truths about Power Laws) was not needed since everyone knew the issues by now (I don’t in fact agree with some of the interpretation of mathematical results in Stumpf and Porter). Perhaps this mythological networks paper shows I was wrong. At best, power law forms for small networks (and small to me means under a million nodes in this context) give a reasonable description or summary of the fat-tailed distributions found here, but many other functional forms will do this too. I see no useful information in the specific forms suggested by Carron and Kenna.

Another point raised in the text was the idea that you could extract subnetworks representing ‘friendly’ social networks. That is interesting but really they are suggesting we need to do a semantic analysis of the links in the text, indicating where links are positive or negative (if they are that simple of course) and form signed networks (e.g. see Szell et al. on how this might be done on a large scale http://arxiv.org/abs/1003.5137). I think that is a much harder job to do in these texts than the simple tricks used here suggest but it is an important aspect in such analysis and I take the authors’ point.

Finally, I was interested that they mention other character networks derived from five other fictional sources. I always liked the Marvel comic character example for instance (Alberich et al, http://arxiv.org/abs/cond-mat/0202174) as it showed that while networks were indeed trendy and hyped (everything became a network) there was often something useful hiding underneath and trying to get out in even the most bizarre examples. However, what caught my eye in the five extra examples mentioned by Carron and Kenna was that they treated these five as ‘fictional literature’. One, Shakespeare’s Richard III, is surely a fictionalised account of real history written much closer to the real events and drawing on ‘historical’ accounts. I would have expected it to show the same features as they claim for their three chosen texts.

So I was intrigued, and that always makes a paper or talk worthwhile to me. However, while I was interested, I’d need to see much more work on the idea. You might try many different tests and measurements and see if they cumulatively point in one direction or another – I imagine a PCA type plot showing different types of network in tight clusters in some ‘measurement’ space. I’d still need convincing on a large number of trial texts. These do now exist though, so surely there is a digital humanities project here? Or is it already happening somewhere?