Digitization is not a new activity for libraries and cultural heritage institutions, and indeed has become a critical tool for preserving and providing access to archival collections including rare books, manuscripts, and photographs. The potential research value of digitized collections is also not a new phenomenon. However, translating images of content into machine readable data that can be searched, sorted, and otherwise manipulated had not received much attention until crowdsourcing, citizen science, and other types of community collaboration models and platforms were constructed. A definition of transcription is useful to understand some of the competing elements when considering whether and how to transcribe digitized items. Huitfeldt and Sperberg-McQueen distinguish between transcription as an act, as a product, and as a relationship between documents.1 Cultural heritage institutions need to explicitly facilitate the creation and dissemination of each in order to host a successful transcription program. While crowdsourcing methods directly address the act of transcription, libraries are often better suited to produce viable representations of transcription products and relationships in digital repositories. Crowdsourcing thus becomes one of several methods or tools for libraries to develop successful transcription workflows.


Image from William Brewster’s Diary from 1865 that identifies several birds by their common species names (http://biodiversitylibrary.org/page/40222552).

     Transcription helps bridge the gap between digitization and use by enhancing access through full text search, enriching metadata collection, and opening collections to digital textual analysis. Digitized natural history manuscript items are largely hidden because most archival collections lack item level description. While minimal processing is certainly a better option than maintaining an extensive backlog of unprocessed material, digitized handwritten documents cannot be discovered based on their unique content without a machine readable version of the text. Indexing transcriptions facilitates discovery of historical records and improves catalog search results. Full text transcriptions open the digital collections to new types of searching, sorting, categorizing, and pattern finding. Research derived from these new data sets can illustrate changes over time across much larger bodies of collections and types of information resources.
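To make the link between transcription and discovery concrete, here is a minimal sketch in Python of how transcribed pages could be indexed for full text search. It is an illustration only, not a description of how BHL indexes its content: the function names `build_index` and `search`, the page identifiers, and the sample diary text are all invented for this example.

```python
import re
from collections import defaultdict


def build_index(transcriptions):
    """Map each lowercase word to the set of page IDs whose transcription contains it."""
    index = defaultdict(set)
    for page_id, text in transcriptions.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page IDs that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    # A page matches only if it appears in the posting set of every query word.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical transcribed pages (invented text, not real BHL content).
pages = {
    "page-1": "Saw two Song Sparrows near the river this morning.",
    "page-2": "Heavy rain all day; no birds observed.",
    "page-3": "A Song Sparrow and a Hermit Thrush in the orchard.",
}

index = build_index(pages)
print(search(index, "song"))
print(search(index, "hermit thrush"))
```

Without a transcription, none of these pages could be found by the bird names written on them. The sketch matches exact words only, so "sparrow" does not find "Sparrows"; a production search system would add stemming, name recognition, and similar refinements.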


Image and graph from CLIR report illustrating that the number of species observations recorded in field notebooks at a given location is typically larger than the number of specimens collected. From “Grinnell to GUIDs: Connecting Natural Science Archives and Specimens.”

     This is particularly important for biodiversity heritage literature and archives collections because of their significant value in documenting species occurrences, botanical observations, climate patterns, and meteorological events. Transcriptions facilitate the manipulation of this data and support research that extracts knowledge from formal and informal collecting and observation events. The growing research interest in all types of natural history resources, including specimen records, species publications, and field recordings, can be further enhanced by integrating access to data and images across scientific disciplines, institutions, and resource types.2 The Biodiversity Heritage Library can connect content by pulling information from specimens, collecting events, and historical documentation together into its portal and making it available to aggregators like GBIF and EOL. Initiatives for transcribing documents and records make this information available for digital use and should be developed as strategies optimized to support large scale integration.

     Transcription projects are time consuming, intellectually intensive, and expensive for an organization to facilitate. Crowdsourcing has nonetheless been identified as a sustainable model for generating transcriptions for large collections and for institutions with diverse holdings, and it is an exciting way to draw on a diverse range of users for metadata enhancement. Biodiversity research has a strong tradition of relying on non-scientist community members to collect data. These citizen science programs and the resulting data are understood as a “public good that is generated through increasingly collaborative tools and resources while supporting public participation in science and Earth stewardship.”3 Tracking and understanding biodiversity at varying scales requires fine-grained data to be collected over regions and continents, years and decades. Professional scientists alone are not generally capable of delivering the volume of data, analysis, and interpretation needed to support large-scale biodiversity research questions.4 “Studying large-scale patterns in nature requires a vast amount of data to be collected across an array of locations and habitats over span of years or even decades.”5

     Crowdsourcing transcriptions can be understood as a method of gathering data over wider geographical and temporal spaces. By transforming the existing data into a machine readable format, field notes, collection lists, and observation notes become a powerful and rich source of biodiversity information. By transcribing and generating structured data sets from field notes, scientists of yore can be recruited for current research projects. BHL’s content spans hundreds of years and the entire globe, creating a potential pool of observation data that can inform today’s research. In the same way that science departments have turned to public participation to enlist the public in creating scientific knowledge, crowdsourcing transcriptions creates global networks that can generate data to be analyzed for population trends, range changes, shifts in phenologies, climate changes, etc.

     Transcriptions will allow BHL to extract data from digitized items to improve the discoverability of hidden collections. My NDSR project shares goals with the Art of Life and Purposeful Gaming projects, which enriched item metadata to better facilitate access to collections. The Art of Life grant sought to “liberate natural history illustrations from the digitized books and journals in the online Biodiversity Heritage Library through the development of software tools for automated identification and description of visual resources.”6 Images in BHL were described structurally at the page level, facilitating navigation by human users and citation resolvers, but they lacked sufficient descriptive metadata to enable dynamic filtering and inquiry. The Art of Life grant project built new software tools and algorithms to automatically identify illustrations found within the text pages of the BHL corpus and push those illustrations to crowdsourcing environments like Flickr and Wikimedia Commons for description. Similarly, full text searching is significantly hampered by poor output from OCR software, and historic literature has proven to be particularly problematic because its varying fonts, typesetting, and layouts are difficult for OCR to render accurately. The Purposeful Gaming project was developed to identify a method for quickly and efficiently harnessing large numbers of users to review and correct particularly problematic works by presenting the task as a game. Each project improves the discoverability of and access to digital texts by enriching descriptive metadata for items at the page level to support full-text searching, data mining, and markup of content in BHL collections.
The NDSR transcription project complements Art of Life and Purposeful Gaming by developing a similar method for generating machine readable content that will enhance access to handwritten text, a final category of “hidden content” in BHL.


Screenshot from BHL Book Viewer for the Journals of William Brewster that shows the poor OCR output and the inability to index pages or Scientific Names without quality transcriptions (http://biodiversitylibrary.org/page/44700560).

     The Internet’s speed, reach, temporal flexibility, anonymity, interactivity, and convergence bring people into conversation with each other, lower barriers to information by creating easier access to professional bodies of knowledge, increase access to useful tools, and enable an online participatory culture.7 By externalizing transcriptions of manuscript items we can leverage the collective intelligence and wisdom of crowds and bring a large and diverse set of skills, tools, and ideas to bear on archival materials and special collections. The Internet encourages ongoing co-creation of new ideas in which content is generated through a mix of bottom-up (from the people) and top-down (policy-makers, businesses, and media organizations) processes.8 Libraries are ideal institutions to encourage and utilize crowdsourcing initiatives due to their unique placement at the intersection of these processes. Libraries and cultural heritage institutions have the advantages of mission statements and codified ideologies dedicated to enriching the knowledge of the people as well as the organizational structures to mobilize, energize, and capitalize reciprocally on the capabilities of their users. This symbiotic relationship is not only mutually beneficial but is likely one of the spaces in which GLAMs (galleries, libraries, archives, and museums) can thrive in the digital age.

--Katie Mika, BHL NDSR Resident at the Ernst Mayr Library


1. Huitfeldt, Claus, and C. M. Sperberg-McQueen. “What Is Transcription?” Literary and Linguistic Computing 23, no. 3 (September 1, 2008): 295–310. doi:10.1093/llc/fqn013.

2. Christina Fidler, Barbara Mathé, Rusty Russell, and Russel D. White. “Grinnell to GUIDs: Connecting Natural Science Archives and Specimens.” In Proceedings of the CLIR Cataloging Hidden Special Collections and Archives Symposium, 2015. https://www.clir.org/pubs/reports/pub169/fidler-et-al.

3. Janis L. Dickinson, Jennifer Lynn Shirk, and David N. Bonter. “The Current State of Citizen Science as a Tool for Ecological Research and Public Engagement.” Frontiers in Ecology and the Environment 10 (August 1, 2012): 291–97.

4. Theobald, E. J., A. K. Ettinger, H. K. Burgess, L. B. DeBey, N. R. Schmidt, H. E. Froehlich, C. Wagner, et al. “Global Change and Local Solutions: Tapping the Unrealized Potential of Citizen Science for Biodiversity Research.” Biological Conservation 181 (January 2015): 236–44. doi:10.1016/j.biocon.2014.10.021.

5. Bonney, Rick, Caren B. Cooper, Janis Dickinson, Steve Kelling, Tina Phillips, Kenneth V. Rosenberg, and Jennifer Shirk. “Citizen Science: A Developing Tool for Expanding Science Knowledge and Scientific Literacy.” BioScience 59, no. 11 (December 1, 2009): 977–84. doi:10.1525/bio.2009.59.11.9.

6. Trish Rose-Sandler. “The Art of Life: Data Mining and Crowdsourcing the Identification and Description of Natural History Illustrations from the Biodiversity Heritage Library.” Grant narrative, 2012.

7. Daren C. Brabham. Crowdsourcing. MIT Press, 2013. http://wtf.tw/ref/brabham.pdf.

8. Ibid.