Jessica M. Otis (http://orcid.org/0000-0001-5519-8331)
As those of us in the US gear back up for the new school year, I thought now might be a good time to write a blog post introducing myself and – more importantly – my position to the SDFB community. So hello, everyone! My name is Jessica Otis and my research focuses on the history of popular mathematics in early modern England.
I’m also the new CLIR/DLF Early Modern Data Curation Postdoctoral Fellow for SDFB.
The operative part of my title is “Data Curation” and that’s what I want to focus on in this blog post. One of my main responsibilities will be to provide data curation services for the SDFB project, including data management planning, metadata generation, and crowdsourcing oversight. For many people, “data” is a strange word to hear in a humanities context. So what does “data curation” mean? And why are these three elements of data curation important to SDFB and the humanities more generally?
Data Management Planning
Data management planning is something that scientists and social scientists still have to deal with more than humanists, although that’s changing. For most humanists, the “data” that forms the foundation for our research are physical, not digital, objects. We work with textual artifacts, such as court records or novels, as well as more material artifacts ranging from bricks to textiles to skeletons. But no matter how physical our original sources, we are all increasingly functioning in a digital world. Thus humanists must also develop methods for electronically “managing” our data – determining how to collect it, process it, index it, store it, and preserve it for future scholarship.
Many of us are already collecting, processing, and indexing our data using digital methods. During my own dissertation research, for example, I used a digital camera to photograph historical documents, created Word files of notes and transcriptions associated with the digital images, and used Excel spreadsheets to keep track of everything. Many historians I know have similar processes. Storage and preservation can then range from keeping hard drive backups to placing the data in an online repository.
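The kind of image-plus-notes-plus-spreadsheet workflow described above can just as easily be kept in a plain CSV index, which has the advantage of being software-independent and easy to preserve. Here is a minimal sketch in Python; the filenames, archive names, and shelfmarks are invented for illustration, not drawn from my actual research files.

```python
import csv
from pathlib import Path

# One row per photographed document, linking each image file
# to the archive it came from and the notes file that describes it.
# All names below are hypothetical examples.
photos = [
    {"image": "IMG_0412.jpg", "archive": "National Archives",
     "shelfmark": "SP 16/12", "notes": "notes_sp16-12.docx"},
    {"image": "IMG_0413.jpg", "archive": "National Archives",
     "shelfmark": "SP 16/13", "notes": "notes_sp16-13.docx"},
]

# Write the index as a CSV file, a format any spreadsheet program can open.
index_path = Path("photo_index.csv")
with index_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image", "archive", "shelfmark", "notes"])
    writer.writeheader()
    writer.writerows(photos)
```

Because CSV is plain text, an index like this remains readable decades from now, no matter what happens to the spreadsheet software that created it.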
No matter what storage method you choose, the best practice is to keep a backup of your data in a secure, off-site location, in case a natural disaster takes out your computer, your house, or your entire city! Off-site storage can be as simple as keeping a spare hard drive at a relative’s house or in a safe deposit box. However, there are also a number of options for online repositories. These include commercial services such as Dropbox and university-supported services such as UCSD’s Chronopolis or UVA’s Libra.
We are also beginning to see the creation of large numbers of discipline- or subject-specific repositories, such as the UK’s History Data Service. For a centralized list of many such repositories, see Databib. You’ll notice humanities repositories are drastically outnumbered on Databib, but more repositories should begin to appear as the demand for such services increases.
For bigger projects, curating data obviously requires more work than it does for individual researchers. This is particularly evident in the long-term storage and preservation of large amounts of data that aren’t the preserve of a single individual, such as the relationship probabilities and typology data being compiled by the SDFB project. What is going to happen to our data five, ten, or even fifty years down the road?
One of the ways we plan to address this question is to make our data as “open access” as possible – that is, anyone will be able to come to the SDFB website and download our data for their own use. If there are copies of our data on hard drives all over the world, the chances of a catastrophic data loss will be dramatically reduced. Or, as they say in the preservation world, LOCKSS: lots of copies keep stuff safe.
Open access is not a one-size-fits-all solution for preserving data, but in the case of SDFB, I believe this is something we owe to the crowd-sourcing community we hope to build. Our data will belong to them as much as to the project’s creators. By making our data open to the community, other scholars can download our data and hopefully find new, exciting ways to analyze it in the years to come.
Metadata Generation
Metadata is “data about your data,” and it can be embedded directly into many types of files. One of my favorite examples is the geographical tag attached to most digital photos. It’s not part of the actual photo that people see, but someone who’s interested in creating a world map of cat photos (seriously: http://iknowwhereyourcatlives.com ) can look at the photo metadata and find out where each photo was taken. Metadata is even more useful to me as a professor: when students *swear* they had a paper done on time, I can check the timestamp that programs like Word record in a document’s metadata, which indicates the last time the file was modified.
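You don’t even need Word to see this kind of evidence: every file on disk carries a last-modified timestamp in its filesystem metadata. A minimal sketch in Python, using an invented filename:

```python
import os
import datetime

# Create a hypothetical student paper so there is a file to inspect.
path = "term_paper.docx"  # invented example filename
with open(path, "w") as f:
    f.write("My paper, definitely finished on time.")

# The filesystem records when the file was last modified --
# "data about the data" that the file's author never typed in.
mtime = os.path.getmtime(path)
modified = datetime.datetime.fromtimestamp(mtime)
print(f"{path} was last modified at {modified.isoformat()}")
```

The timestamp printed here is filesystem metadata rather than the richer metadata embedded inside a Word document, but it illustrates the same principle: the file describes itself.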
Metadata is useful for research, as well. Ever go back to research notes from a few years ago and wonder what in the world you were trying to say in a cryptic few lines? Frustratingly, you end up trying to read the mind of your younger self. This is a problem caused by a lack of metadata and it’s magnified exponentially when you’re trying to read the mind of another scholar. You can download the SDFB data all day long, but it won’t be useful if you can’t make heads or tails of what’s actually in the files you downloaded. Part of my job, then, is making sure SDFB has metadata in place to help other people understand the data behind our network visualizations.
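One common way to provide that kind of documentation is a metadata “sidecar” file that travels with the dataset and explains what each field means. The sketch below is purely illustrative: the filenames and field names are my invention, not the actual SDFB schema.

```python
import json

# A hypothetical metadata sidecar describing a downloadable dataset.
# Field names and descriptions are invented for illustration.
metadata = {
    "dataset": "relationships.csv",
    "description": "Inferred relationships between early modern people",
    "fields": {
        "person_a": "Standardized name of the first person",
        "person_b": "Standardized name of the second person",
        "probability": "Estimated likelihood (0-100) that the two were acquainted",
        "relationship_type": "Assigned typology label, e.g. 'friend'",
    },
}

# Write the sidecar next to the dataset it documents.
with open("relationships.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

A scholar who downloads the data years from now can open the sidecar and immediately understand what each column records, without having to read the mind of the dataset’s creator.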
Crowdsourcing Oversight
One of the most exciting moments of this project will be when the new website goes live and the SDFB community is able to begin expanding on our original network research. As powerful as computers are, they’re still no match for the human brain when it comes to imprecise, inconsistent, and incomplete data – all of which are unfortunately common in the early modern period. Nor is there an algorithmic way to easily classify early modern relationships without the knowledge base that humanist scholars acquire as part of their training and research.
Crowdsourcing requires oversight to make certain the data users add to our network – information about people and their relationships – are valid. We are also working with our programmers to create new features for the website that will enable users to engage in scholarly debate about the data, analogous to the comment section on a blog post or a Wikipedia talk page. These shared spaces will require oversight as well, and we will need to develop strategies for indicating when there is no community consensus regarding certain data in our network.
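Human oversight can be assisted, though never replaced, by automated sanity checks. One obvious check: two people can only plausibly have known each other if their lifespans overlap. A minimal sketch, with data structures of my own invention (the lifespan dates themselves are historical):

```python
def lifespans_overlap(a, b):
    """Return True if two (birth_year, death_year) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def validate_relationship(person_a, person_b):
    """Flag a proposed relationship whose participants could never have met."""
    if not lifespans_overlap(person_a["lifespan"], person_b["lifespan"]):
        return False, "lifespans do not overlap"
    return True, "plausible"

# Hypothetical record format; the dates are the historical lifespans.
bacon = {"name": "Francis Bacon", "lifespan": (1561, 1626)}
newton = {"name": "Isaac Newton", "lifespan": (1643, 1727)}
hobbes = {"name": "Thomas Hobbes", "lifespan": (1588, 1679)}

print(validate_relationship(bacon, newton))  # → (False, 'lifespans do not overlap')
print(validate_relationship(bacon, hobbes))  # → (True, 'plausible')
```

A check like this can automatically flag impossible entries for review, leaving human moderators free to concentrate on the harder judgment calls that require a scholar’s knowledge of the period.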
This is a different type of data curation, but is nonetheless vital to the long-term success of the SDFB project. And I feel safe in speaking for the whole SDFB team when I say that we are hoping for this to be an active, expanding community of knowledge for years to come. So thanks for welcoming me to the community and I look forward to building great things with you.