Exploring Big Historical Data

Exploring Big Historical Data

I’ve really enjoyed reading my copy of Exploring Big Historical Data: The Historian’s Macroscope (Macroscope for short here) by Shawn GrahamIan Milligan and Scott Weingart. As the authors suggest the book will be ideal for students or researchers from humanities asking if they can use big data ideas in their work. While history is the underlying context here, most of the questions and tools are relevant whenever you have text based data, large or small. For physical scientists, many of whom are not used to text data, Macroscope prompts you to ask all the right questions. So this is a book which can really cross the disciplines.  Even if some readers are like me and they find some aspects of the book very familiar, they will still find some new stimulating ideas.  Failing that, will be able to draw on the simplicity of the explanations in Macroscope for their own work. I know enough about text and network analysis to see the details of the methods were skipped over but enough of a broad overview was given for someone to start using the tools. PageRank and tf-idf (term frequency–inverse document frequency) are examples where that practical approach was followed. Humanities has lot of experience of working with texts and a physical scientist like myself can learn a lot from their experience. I have heard this piecemeal in talks and articles over the last ten years or so but I enjoyed having them reinforced in a coherent way in one place. I worry a bit that that the details in Macroscope of how to use one tool or another will rapidly date but on the other hand it means a novice has a real chance to be able to try these ideas out just from this book alone. It is also where the on line resources will come into their own. So I am already planning to recommend this text to my final year physics students tackling projects involving text. My students can handle the technical aspects without the book but even there they will find this book gives them a quick way in.

Imperial College Physics Staff

Staff in the Physics Department of Imperial College London clustered on the basis of the abstracts of their recent papers.

I can see that this book works as I picked up some of the simpler suggestions and used it on a pet project which is to look at the way that the staff in my department are related through their research interests. I want to see if any bottom-up structure of staff research I can produce from texts written by staff matches up to existing and proposed top-down structures of faculties – departments – research groups. I started using by using python to access to the Scopus api. I’m not sure you can call Elsevier’s pages on this api documentation and even stackoverflow struggled to help me but the blog Getting data from the Scopus API helped a lot. A hand collected list of Scopus author ids enabled me to collect all the abstracts from recent papers coauthored by each staff member. I used python libraries to cluster and display the data, following a couple of useful blogs on this process, and got some very acceptable results. However I then realised that I could use the text modelling discussed in the book on the data I had produced. Sure enough a quick and easy tool was suggested in Macroscope, one I didn’t know, Voyant Tools.  I just needed a few more lines to my code in order to produce text files, initially one per staff member containing all their recent abstracts in one document. With the Macroscope book in one hand, I soon had a first set of topics, something easy to look at and consider. This showed me that words like Physical and American were often keywords, the second of these being quite surprising initially. However, a quick look at the documents with a text editor (a tool that is rightly never far away in Macroscope) revealed that many abstracts start with a copyright statement such as “2015 American Physical Society”, something I might want to remove as this project progresses. I am very wary of such data clustering in general but with proper thought, with checks and balances of the sort which are a key part of Macroscope, you can extract useful information which was otherwise hidden.

So even for someone like me who has used or knows about sophisticated tools in this area and is (over) confident  that they can use such tools, the technical side of Macroscope should provide a very useful short cut despite my initial uncertainty. Beyond that I found that having the basic issues and ideas behind these approaches reinforced and well laid out was really  helpful for me. For someone starting out, like some of my own physical science masters and bachelors students working on some of my social science projects, they will find this book invaluable. A blog or intro document will often show you how to run a tool but they will not always emphasise the wider principles and context for such studies, something you get with Macroscope.

I should make clear that I do have some formal connections with this book, one of my contributions to the pool of academic goodwill. I suggested the general topic of digital humanities and Shawn Graham in particular as a potential author at an annual meeting of the physics and maths advisory committee for ICP (Imperial College Press). For free sandwiches we pass on ideas for topics or book projects to the publisher. I also commented on the formal proposal from all three authors to ICP, for which I often get a free book. My copy of Macroscape was obtained for reviewing a recent book proposal for ICP. Beyond this I get no remuneration from ICP. It is nice to see a topic and an author I highlighted to come together in a real book but the idea is the easy bit and hardly novel in this case. Taking up the idea and making it into a practical publishing project is down to Alice Oven and her ICP colleagues, and to the authors Shawn Graham, Ian Mulligan and Scott Weingart. That’s particularly true here as the book was produced in an unusual open source way and ICP had the guts to go along with the authors to try this different type of approach to publishing.

References

Exploring Big Historical Data: The Historian’s Macroscope
Shawn Graham (Carleton University, Canada),
Ian Milligan (University of Waterloo, Canada),
Scott Weingart (Indiana University, USA)
ISBN: 978-1-78326-608-1 (hardback)
ISBN: 978-1-78326-637-1 (paperback)