Exploring Big Historical Data

I’ve really enjoyed reading my copy of Exploring Big Historical Data: The Historian’s Macroscope (Macroscope for short here) by Shawn Graham, Ian Milligan and Scott Weingart. As the authors suggest, the book will be ideal for students or researchers from the humanities asking whether they can use big data ideas in their work. While history is the underlying context here, most of the questions and tools are relevant whenever you have text-based data, large or small. For physical scientists, many of whom are not used to text data, Macroscope prompts you to ask all the right questions, so this is a book which can really cross disciplines. Even readers who, like me, find some aspects of the book very familiar will still find some stimulating new ideas; failing that, they will be able to draw on the simplicity of the explanations in Macroscope for their own work.

I know enough about text and network analysis to see that the details of the methods were skipped over, but enough of a broad overview was given for someone to start using the tools. PageRank and tf-idf (term frequency–inverse document frequency) are examples where that practical approach was followed. The humanities have a lot of experience of working with texts, and a physical scientist like myself can learn a lot from that experience. I have heard these ideas piecemeal in talks and articles over the last ten years or so, but I enjoyed having them reinforced in a coherent way in one place. I worry a bit that the details in Macroscope of how to use one tool or another will rapidly date, but on the other hand it means a novice has a real chance to try these ideas out from this book alone; it is also where the online resources will come into their own. So I am already planning to recommend this text to my final-year physics students tackling projects involving text. My students could handle the technical aspects without the book, but even then they will find it gives them a quick way in.
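To give a flavour of the kind of practical explanation I mean: tf-idf weights a term highly in a document when it is frequent there but rare across the collection. A minimal sketch in Python (the toy “abstracts” are invented for illustration; real tools use more careful tokenisation and often a smoothed idf):

```python
import math
from collections import Counter

# Toy corpus standing in for paper abstracts (invented for illustration).
docs = [
    "quantum dots in photonic crystals",
    "photonic band gaps in crystals",
    "social network analysis of citations",
]

def tf_idf(docs):
    """Per-document tf-idf: raw term count times idf = log(N / df),
    where df is the number of documents containing the term."""
    n = len(docs)
    tokenised = [doc.split() for doc in docs]
    df = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        scores.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return scores

scores = tf_idf(docs)
# "crystals" appears in two of the three documents, so its idf is low;
# "quantum" appears in only one, so it scores higher in that document.
```

The point Macroscope makes well is not the formula but the interpretation: the high-scoring terms are the ones that distinguish a document from the rest of the collection.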

Imperial College Physics Staff

Staff in the Physics Department of Imperial College London clustered on the basis of the abstracts of their recent papers.

I can see that this book works, as I picked up some of the simpler suggestions and used them on a pet project: looking at the way that the staff in my department are related through their research interests. I want to see if any bottom-up structure of staff research that I can produce from texts written by staff matches up to the existing and proposed top-down structures of faculties, departments and research groups. I started by using Python to access the Scopus API. I’m not sure you can call Elsevier’s pages on this API documentation, and even Stack Overflow struggled to help me, but the blog post “Getting data from the Scopus API” helped a lot. A hand-collected list of Scopus author IDs enabled me to collect all the abstracts from recent papers coauthored by each staff member. I used Python libraries to cluster and display the data, following a couple of useful blogs on this process, and got some very acceptable results.

However, I then realised that I could use the text modelling discussed in the book on the data I had produced. Sure enough, a quick and easy tool was suggested in Macroscope, one I didn’t know: Voyant Tools. I just needed a few more lines of code to produce text files, initially one per staff member containing all their recent abstracts in one document. With the Macroscope book in one hand, I soon had a first set of topics, something easy to look at and consider. This showed me that words like “Physical” and “American” were often keywords, the second of these being quite surprising initially. However, a quick look at the documents with a text editor (a tool that is rightly never far away in Macroscope) revealed that many abstracts start with a copyright statement such as “2015 American Physical Society”, something I might want to remove as this project progresses.
I am very wary of such data clustering in general, but with proper thought, and with the checks and balances that are a key part of Macroscope, you can extract useful information that would otherwise stay hidden.
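For anyone curious, the mechanics of the workflow above are simple. A sketch of the two steps, with invented staff names and abstract snippets standing in for the data I actually pulled from the Scopus API (the file names, directory, and crude whitespace tokenisation are all my illustrative choices, not the book’s):

```python
import math
from collections import Counter
from pathlib import Path

# Invented stand-ins for the abstracts collected via the Scopus API.
abstracts_by_staff = {
    "A_Smith": ["We study photonic crystals and band gaps.",
                "Light propagation in photonic lattices."],
    "B_Jones": ["Network analysis of citation data.",
                "Community detection in large networks."],
}

# Step 1: one plain-text file per staff member, ready to load into Voyant Tools.
out_dir = Path("voyant_corpus")
out_dir.mkdir(exist_ok=True)
for name, abstracts in abstracts_by_staff.items():
    (out_dir / f"{name}.txt").write_text("\n\n".join(abstracts), encoding="utf-8")

# Step 2: crude bag-of-words vectors and a cosine similarity between staff,
# the raw ingredient for clustering them by research interest.
vectors = {name: Counter(" ".join(texts).lower().split())
           for name, texts in abstracts_by_staff.items()}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

sim = cosine(vectors["A_Smith"], vectors["B_Jones"])
```

A real run would weight the vectors with tf-idf and hand the similarity matrix to a clustering library, but the pairwise similarities above are the essential input either way.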

So even for someone like me, who has used or knows about sophisticated tools in this area and is (over)confident about using them, the technical side of Macroscope provides a very useful shortcut, despite my initial uncertainty. Beyond that, I found that having the basic issues and ideas behind these approaches reinforced and well laid out was really helpful. Someone starting out, like some of my own physical science masters and bachelors students working on my social science projects, will find this book invaluable. A blog or intro document will often show you how to run a tool, but it will not always emphasise the wider principles and context for such studies, something you do get with Macroscope.

I should make clear that I do have some formal connections with this book, one of my contributions to the pool of academic goodwill. I suggested the general topic of digital humanities, and Shawn Graham in particular as a potential author, at an annual meeting of the physics and maths advisory committee for ICP (Imperial College Press). For free sandwiches we pass on ideas for topics or book projects to the publisher. I also commented on the formal proposal from all three authors to ICP, for which I often get a free book; my copy of Macroscope was obtained for reviewing a recent book proposal for ICP. Beyond this I get no remuneration from ICP. It is nice to see a topic and an author I highlighted come together in a real book, but the idea is the easy bit and hardly novel in this case. Taking up the idea and making it into a practical publishing project is down to Alice Oven and her ICP colleagues, and to the authors Shawn Graham, Ian Milligan and Scott Weingart. That is particularly true here, as the book was produced in an unusual open-source way and ICP had the guts to go along with the authors in trying this different approach to publishing.


Exploring Big Historical Data: The Historian’s Macroscope
Shawn Graham (Carleton University, Canada),
Ian Milligan (University of Waterloo, Canada),
Scott Weingart (Indiana University, USA)
ISBN: 978-1-78326-608-1 (hardback)
ISBN: 978-1-78326-637-1 (paperback)

The Pools of Academic Goodwill

Much of academia is not run like a commercial business, or at least not yet. Many of the jobs I do are not paid for by the recipient of my efforts: referee reports for journals, examining PhDs and writing references are three examples which come to mind straight away. Rather than my being directly paid for these tasks by the recipient, my home institution understands that it is paying for me to spend some of my time on external matters. Of course my university also draws from that pool, as do I: journals use referees for my papers, my students need examiners, and I needed references to pursue my own academic career. Overall, everything probably balances out.

For some of this work I may get paid, though anyone in the commercial world would probably find the rates at best humorous and at worst insulting. For a PhD viva in the UK I get around £150 (around US$200), which is about 24 hours’ work at the UK’s minimum wage. I reckon it takes me three working days to read a thesis (if there are no problems and if I am relatively familiar with the work), so that leaves the actual exam unpaid even at minimum wage rates. I was recently an examiner for a PhD in Vienna; the trip alone took more than 24 hours, and in this case it was expenses only.

Some types of external academic work may have benefits for me. I read the Viennese PhD thesis from cover to cover: a pleasure that I would never have had if I had not been an examiner on this particular thesis, and such detailed reading time is a precious commodity these days. Some of this work can also support the case for my own career progression; being an external examiner for another university’s undergraduate programme, for example, is a measure of esteem that might count in my favour in a review meeting. Of course the link between such work and promotion is very tenuous, while the work itself is quite demanding and invariably underpaid. Again, we all do this work because we understand that we need examiners for our own PhD students and our own undergraduate exams; we too will draw from the pool of academic goodwill.

A more interesting case is the value of the work done by academics refereeing for journals, which has been estimated at about £1.9bn per year, £165 million of it for the UK alone. There is real value in this work spent commenting on academic papers, yet while journals charge others for their services and make profits, none of the fees charged by journals make it to referees. So journals also draw on academic goodwill.

In the book Whackademia, Richard Hil says that academics are no longer trusted as professionals but are instead to be monitored and measured. He suggests that before this, internal pressure and support from within the academic community ensured everyone made a contribution, even if in different ways. Hil argues that the new neo-liberal, business-like approach encourages an individualism that fails to value contributions to a shared pool of academic goodwill and so actually reduces overall returns. We have to maximise our individual measured outputs, so everything else, useful or not, gets dropped.

So if the UK government, and maybe others, want to push a more business-like approach on universities, they ought to think carefully. Perhaps they should first try to weigh the cost of business-style consultancy against the value of academic goodwill.