Position paper: From libraries as patchwork to datasets as assemblages?


My position paper for Always Already Computational: Collections as Data. Every attendee wrote one – read the others at Collections as Data – National Forum Position Statements.

From libraries as patchwork to datasets as assemblages?

Dr Mia Ridge, Digital Curator, British Library

The British Library’s collections are vast, and vastly varied, with 180-200 million items in most known languages. Within that, there are important, growing collections of manuscript and sound archives, printed materials and websites, each with its own collecting history and cataloguing practices. Perhaps 1-2% of these collections have been digitised, a process spanning many years and many distinct digitisation projects, and an ensuing patchwork of imaging and cataloguing standards and licences. This paper represents my own perspective on the challenges of providing access to these collections and others I’ve worked with over the years.

Many of the challenges relate to the volume and variety of the collections. The BL is working to rationalise the patchwork of legacy metadata systems into a smaller number of strategic systems.[1] Other projects are ingesting masses of previously digitised items into a central system, from which they can be displayed in IIIF-compatible players.[2]

The BL has had an ‘open metadata’ strategy since 2010, and published a significant collection of metadata, the British National Bibliography, as linked open data in 2011.[3] Some digitised items have been posted to Wikimedia Commons,[4] and individual items can be downloaded from the new IIIF player (where rights statements allow). The BL launched a data portal, https://data.bl.uk/, in 2016. It’s work-in-progress – many more collections are still to be loaded, the descriptions and site navigation could be improved – but it represents a significant milestone many years in the making. The BL has particularly benefitted from the work of the BL Labs team in finding digitised collections and undertaking the paperwork required to make them freely available. The BL Labs Awards have helped gather examples of creative, scholarly and entrepreneurial re-use of digitised collections, and BL Labs Competitions have led to individual case studies in digital scholarship while helping the BL understand the needs of potential users.[5] Most recently, the BL has been working with the BBC’s Research and Education Space project,[6] adding linked open data descriptions about articles to its website so they can be indexed and shared by the RES project.

In various guises, the BL has spent centuries optimising the process of delivering collection items on request to the reading room. Digitisation projects are challenging for systems designed around the ‘deliverable item’: the digital user may wish to access or annotate a specific region of a page of a particular item, but the manuscript itself may be catalogued (and therefore addressable) only at the archive box or bound volume level. The visibility of research activities with items in the reading rooms is not easily achieved for offsite research with digitised collections. Staff often respond better to discussions of the transformational effect of digital scholarship in terms of scale (e.g. it’s faster and easier to access resources) than to discussions of newer methods like distant reading and data science.
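To illustrate what addressing ‘a specific region of a page’ can look like in practice, here is a minimal sketch of building a IIIF Image API request for a rectangular region. The server address and image identifier are hypothetical placeholders, not real BL endpoints, and the sketch assumes a server that accepts the ‘max’ size keyword (Image API 2.1 or later).

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, width, height,
                    size="max", rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API URL for a rectangular pixel region of one image.

    The path follows the Image API pattern:
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    region = f"{x},{y},{width},{height}"
    # Identifiers are URL-encoded, including any slashes they contain
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only
url = iiif_region_url("https://iiif.example.org/image",
                      "ms_volume_12/f004r", 100, 250, 800, 600)
print(url)
```

The point of the sketch is that the region is addressable in the URL itself, even when the catalogue record only exists at the level of the box or volume.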

The challenges the BL faces are not unique. The cultural heritage technology community has been discussing the issues around publishing open cultural data for years,[7] in part because making collections usable as ‘data’ requires cooperation, resources and knowledge from many departments within an institution. Some tensions are unavoidable in enhancing records for use externally – for example, curators may be reluctant or short of the time required to pin down their ‘probable’ provenance or date range, let alone guess at the intentions of an earlier cataloguer or learn how to apply modern ontologies in order to assign an external identifier to a person or date field.

While publishing data ‘as is’ in CSV files exported from a collections management system might have very little overhead, the results may not be easily comprehensible, or may require so much cleaning to remove missing, undocumented or fuzzy values that the resulting dataset barely resembles the original. Publishing data benefits from workflows that allow suitably cleaned or enhanced records to be re-ingested, and export processes that can regularly update published datasets (allowing errors to be corrected and enhancements shared), but these are all too rare. Dataset documentation may mention the technical protocols required but fail to describe how the collection came to be formed or what was excluded from digitisation or from the publishing process, and may not mention the backlog of items without digital catalogue records, let alone digitised images. Finally, users who expect beautifully described datasets with high quality images may be disappointed when their download contains digitised microfiche images and sparse metadata.
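As a rough illustration of how quickly gaps show up in an ‘as is’ export, the sketch below counts empty and placeholder values per column in a CSV file. The sample rows, column names and list of placeholder strings are all invented for the example; a real export would need its own list, worked out with the people who know the catalogue.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a collections management system export
SAMPLE_EXPORT = """id,title,creator,date
1,Letter to a friend,,c. 1850?
2,Map of London,J. Smith,1746
3,,Unknown,n.d.
4,Sound recording,,
"""

# Assumed placeholder strings that carry no real information
PLACEHOLDERS = {"", "unknown", "n.d.", "n/a"}


def audit_columns(csv_text):
    """Return (row count, {column: number of empty or placeholder values})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    gaps = Counter()
    total = 0
    for row in reader:
        total += 1
        for column, value in row.items():
            if (value or "").strip().lower() in PLACEHOLDERS:
                gaps[column] += 1
    return total, dict(gaps)


total, gaps = audit_columns(SAMPLE_EXPORT)
for column, count in sorted(gaps.items()):
    print(f"{column}: {count} of {total} rows empty or placeholder")
```

Note that a fuzzy value like ‘c. 1850?’ passes this check untouched: counting blanks is the easy part, and deciding what to do with uncertain values remains a curatorial question.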

Rendering collections as datasets benefits from an understanding of the intangible and uncertain benefits of releasing collections as data and of the barriers to uptake, ideally grounded in conversations with or prototypes for potential users. Libraries not used to thinking of developers as ‘users’ or lacking the technical understanding to translate their work into benefits for more traditional audiences may find this challenging. My hope is that events like this will help us deal with these shared challenges.

[1] The British Library, ‘Unlocking The Value: The British Library’s Collection Metadata Strategy 2015 – 2018’.

[2] The International Image Interoperability Framework (IIIF) standard supports interoperability between image repositories. Ridge, ‘There’s a New Viewer for Digitised Items in the British Library’s Collections’.

[3] Deliot et al., ‘The British National Bibliography: Who Uses Our Linked Data?’

[4] https://commons.wikimedia.org/wiki/Commons:British_Library

[5] http://www.bl.uk/projects/british-library-labs, http://labs.bl.uk/Ideas+for+Labs

[6] https://bbcarchdev.github.io/res/

[7] For example, the ‘Museum API’ wiki page listing machine-readable sources of open cultural data was begun in 2009 http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs following discussion at museum technology events and on mailing lists.

Photo of beach view
The view from UC Santa Barbara is alright, I suppose

Workshop: Information Visualisation, CHASE Arts and Humanities in the Digital Age 2017

I ran a full-day workshop on Information Visualisation for the CHASE Arts and Humanities in the Digital Age training programme at Birkbeck, London, in February 2017. The abstract:

Visualising data to understand it or convince others of an argument contained within it has a long history. Advances in computer technology have revolutionised the process of data visualisation, enabling scholars to ask increasingly complex research questions by analysing large-scale datasets with freely available tools.

This workshop will give you an overview of a variety of techniques and tools available for data visualisation and analysis in the arts and humanities. The workshop is designed to help participants plan visualisations by discussing data formats used for the building blocks of visualisation, such as charts, maps, and timelines. It includes discussion of best practice in visual design for data visualisations and practical, hands-on activities in which attendees learn how to use online tools such as Viewshare to create visualisations.

At the end of this course, attendees will be able to:

  • Create a simple data visualisation
  • Critique visualisations in terms of choice of visualisation type and tool, suitability for their audience and goals, and other aspects of design
  • Recognise and discuss how data sets and visualisation techniques can aid researchers

Please remember to bring your laptop.

Slides

Exercises for CHASE’s AHDA 2017 Introduction to Information Visualisation

  • Exercise 1: comparing n-gram tools
  • Exercise 2: Try entity extraction
  • Exercise 3: exploring scholarly data visualisations
  • Viewshare Exercise 1: Ten minute tutorial – getting started
  • Viewshare Exercise 2: Create new views and widgets

Workshop: Crowdsourcing and Cultural Heritage, Rice University


As part of my trip to Texas for SXSW, I was invited to give a workshop on ‘Crowdsourcing and Cultural Heritage’ in the Fondren Library at Rice University’s Humanities Research Center Sawyer Seminar series on March 7, 2016. My slides are below. My visit was a great chance to find out more about the teaching and projects at the Research Center, and my thanks go to the organisers for their excellent hospitality.

Abstract: This workshop will provide an overview of crowdsourcing in cultural heritage and consider the ethics and motivations for participation. International case studies will be discussed to provide real life illustrations of design tips and to inspire creative thinking.

Photo of campus gate
Rice University

Exercises for CHASE’s Introduction to Information Visualisation

These exercises were prepared for the CHASE Arts and Humanities in the Digital Age event’s workshop on Information Visualisation but they’re also useful for people who want to learn more about data visualisations in cultural heritage and the humanities.

Exercise 1: compare simple text tools

Time: c. 5 minutes.

Goal: compare the ability of two different tools to help you understand a new text corpus

1.     Load the word cloud site

2.     Then, grab some text:

  • Open another browser tab
  • Go to http://pastebin.com/Nd0a86tm
  • Select and copy the 8 lines of text. The easiest way is to click into the box under ‘RAW Paste Data’
  • Paste them into the text box on the Wordle site and hit ‘go’
  • You can customise your visualisation using the menu. Which options create a more informative visualisation?

3.     Load the word tree site

  • Go to http://www.jasondavies.com/wordtree/
  • Paste the text into the ‘Paste Text’ box and hit ‘Generate WordTree!’ (Grab the text again from Step 2 if necessary)
  • You can click on words on the screen – which words produce the most options?

4.     Discuss

Bearing in mind that this is an unusual corpus, which tool gave you a better sense of its content? Why?

Are these tools better for exploring or explaining data? Why?

If tidying up the data provided – removing punctuation, making spelling consistent, etc – would improve the visualisation, then try editing the text and re-running the visualisation. Did it help? What else could you do?
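If you would rather tidy the text with a script than by hand, here is a minimal sketch of the idea: lower-case the text, strip punctuation, and compare word counts before and after. The sample sentence is made up for illustration (it is not the workshop corpus), and making spelling consistent would need an extra mapping of variants that is not attempted here.

```python
import re
from collections import Counter


def tidy(text):
    """Lower-case the text, replace punctuation with spaces, split into words."""
    text = text.lower()
    # Keep letters, digits, underscores, whitespace and straight apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


# Made-up sample text, not the workshop corpus
sample = "The cat sat. The Cat sat! the cat, sitting..."

raw_counts = Counter(sample.split())   # counts before tidying
tidy_counts = Counter(tidy(sample))    # counts after tidying

print(raw_counts.most_common(3))
print(tidy_counts.most_common(3))
```

Before tidying, ‘The’, ‘the’, ‘cat’, ‘Cat’ and ‘cat,’ are all counted as different words, which is exactly the kind of noise that makes a word cloud or word tree less informative.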

Exercise 2: exploring scholarly data visualisations

Time: c. 10-15 minutes.

Goal: get hands-on experience and practice critical analysis.

Pair up with your neighbour to explore and discuss one of the visualisations listed below.

Instructions

  1. In your browser, go to one of the sites below
  2. Take a few minutes to explore the visualisation
  3. Then discuss with your neighbour:
    • What do you think is being presented here?
    • Can you easily see where to start and how to use it?
    • What stories or trends can you start to see?
    • Does it work better at one scale over another?
    • Do you find it more effective at aggregate or detail level?
    • Does it present an argument or provide a space for you to explore and develop one?
    • What arguments (statements about the data) does the site present?
    • What have you learned from the visualisation that you might not have learned from looking at the data or reading a description of it?
  4. Be prepared to report back to the group. e.g. summarise the site’s purpose, visualisation formats and data types, or share unresolved questions or the most interesting parts of your discussion


University of Richmond, ‘Visualizing Emancipation’

http://www.americanpast.org/emancipation/

Further information: http://dirt.terrypbrock.com/2012/04/visualizing-emancipation-examining-its-process-through-digital-tools/

Stanford ‘Mapping the Republic of Letters’

http://www.stanford.edu/group/toolingup/rplviz/rplviz.swf

Further information: http://openglam.org/2012/03/21/mapping-the-republic-of-letters/, http://danbri.org/words/2010/11/22/603

Locating London’s Past

http://www.locatinglondon.org/

GAPVis Ancient Places

http://gap.alexandriaarchive.org/gapvis/index.html#index

Further information: http://googleancientplaces.wordpress.com/

Digital Harlem :: Everyday Life 1915-1930

http://digitalharlem.org/

Further information: http://digitalharlemblog.wordpress.com/ http://writinghistory.trincoll.edu/evidence/robertson-2012-spring/

Digital Public Library of America’s timeline, map, bookshelf

http://dp.la/

Further information: http://dp.la/info/ and http://dp.la/info/news/blog/

Orbis

http://orbis.stanford.edu/

Further information: http://hestia.open.ac.uk/updating-orbis/

Lost Change

http://tracemedia.co.uk/lostchange/

Further information: http://blog.britishmuseum.org/2014/02/19/lost-change-mapping-coins-from-the-portable-antiquities-scheme/

The State of the Union in Context

http://benschmidt.org/poli/2015-SOTU

Further exercises

Learn more: explore and analyse more visualisations

Sketch out ideas for a visualisation

  • Work out what data you need and the best way to prepare and present it. http://www.dear-data.com has some lovely examples of creative sketches.

Create your own visualisations

These sites can be used with your own or public data:

If you have sensitive data you must check whether any data you load will be made public.

HILT Summer School 2015: ‘Crowdsourcing Cultural Heritage’


Resources for the course on Crowdsourcing Cultural Heritage that I’m teaching with Ben Brumfield at HILT 2015.

Course Google Doc for collaborative note-taking, links, etc.

Flickr Group for HILT 2015 Crowdsourcing photos

Mia’s storify of the week and the class presentation for the HILT Show and Tell.

Projects made in the class

Well done @cmderose_wisc @nebrown63 @ElizHansen @ESPaul @vac11 @kmthomas06 @WendyJ1226 @HistorianOnFire @Jim_Salmons @TimlynnBabitsky + Nancy!

Monday: overview, speed dating

HILT Crowdsourcing Slides and Exercises for Monday

Session 2: links to find a project you love! For non-English language projects, try Crowdsourcing the world’s heritage.

Prompts for thinking about projects:

  • How clear was the purpose of the site? How well was it reflected in the ‘call to action’ and other text?
  • How easy was it to get started?
  • Were the steps to complete the task clear?
  • How enjoyable was the task?
  • Did the reward (if any) feel appropriate?
  • Looking at the site overall, does the project appear to be effective?
  • What is the input content? What is the output content?
  • What validation methods appear to have been used?
  • Who is the probable audience and what motivates them to participate?
  • How does the project let participants know they’re making a difference?
  • Does the site support communication between participants?
  • How was the site marketed to potential participants?
  • Did the site anticipate your questions about the tasks?

HILT Crowdsourcing Slides and Exercises Tuesday

http://tinyurl.com/EminentScotsmen

http://tinyurl.com/Graves1845

HILT Crowdsourcing Slides Wednesday

HILT Crowdsourcing Slides Thursday

HILT Crowdsourcing Slides Friday

Photo
HILT 2015 Crowdsourcing class


Workshop: Information Visualisation, CHASE Arts and Humanities in the Digital Age

I’ve been asked to give a workshop on Information Visualisation for the CHASE Arts and Humanities in the Digital Age training programme in June 2015.

The workshop will introduce students to the use of visualisations for understanding, analysing and presenting large-scale datasets in the Humanities, enabling scholars to ask increasingly complex research questions.

Slides, sample data and instructions for exercises are downloadable here: CHASE InfoVis Handouts 2015.

Links for the various exercises are collected below for ease of access.

Exercise 1: Exploring network visualisations

Exercise 2: Comparing N-gram tools

Books

Newspapers

Exercise 3: Trying entity recognition

Exercise 4: Exploring scholarly data visualisations

Exercise 5: create a chart using Google Fusion Tables

Google Fusion Tables: https://www.google.com/fusiontables/data?dsrcid=implicit

An Excel version of this exercise is available at http://www.openobjects.org.uk/2015/03/creating-simple-graphs-with-excels-pivot-tables-and-tates-artist-data/

Exercise 6: Geocoding data and creating a map using Google Fusion Tables

Google Fusion Tables: https://www.google.com/fusiontables/data?dsrcid=implicit

Exercise 7: Applying data visualisation to your own work

Explore more visualisations:

Sketch ideas for visualisations:

Try visualising data in different tools:

Try visualising existing data

Workshop: Visualising Collections, Geffrye Museum

Ananda Rutherford organised a workshop for the Documenting Homes project at the Geffrye Museum, which is researching visualisation models for presenting the archive and other collections information across digital platforms. The workshop is a chance to explore the role of visualisations in organising, interrogating and interpreting collections in context and to develop critical and planning skills for designing visualisations. It will include guided exercises for turning data in a spreadsheet into simple visualisations and an optional hour for trying out visualisation tools with your own data.

Contact me for the workshop slides and datasets. The exercises are below.


HILT Summer School: ‘Crowdsourcing Cultural Heritage’

In August 2014 I taught ‘Crowdsourcing Cultural Heritage’ with Ben Brumfield at HILT (Humanities Intensive Learning + Teaching) at MITH in Maryland. Thanks to all the participants for making it such a great workshop!

The Course Syllabus and Slide Decks are available for download below.

If you found this post useful, you might be interested in my book, Crowdsourcing Our Cultural Heritage.
