High-Degree Nodes: Tales from the Raw NER Data

Jessica M. Otis ( http://orcid.org/0000-0001-5519-8331 )

This is the third in a series of posts on patterns I found while curating Named Entity Recognition (NER) data from the ODNB.  The first two were on nine days’ wonders and the women “behind” men.

Today, I’m going to delve into a little bit of graph theory, so I want to begin by defining what I mean by “degree.”  We’re all familiar with the way the word is used in the phrase “six degrees of separation” – the length of the path or number of steps it takes to get from Person A to Person B.  However, in graph theory, “degree” has another, very specific definition – the number of edges (relationships) associated with each node (person) on the graph.  So, for example, the node “Samuel Harsnett” has a degree of seven because he has seven relationship edges depicted in the graph below.
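To make the graph-theory definition concrete, here is a minimal Python sketch that counts degree from a list of relationship edges. The edge list is invented for illustration: the "Person N" names are placeholders rather than Samuel Harsnett's actual connections, and `degree_of` is a hypothetical helper, not part of any SDFB code.

```python
from collections import Counter

def degree_of(edges):
    """Return a Counter mapping each node to the number of edges touching it."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

# Hypothetical edge list: seven placeholder relationships for one node,
# plus one relationship between two of the placeholders.
edges = [("Samuel Harsnett", f"Person {i}") for i in range(1, 8)]
edges.append(("Person 1", "Person 2"))

degrees = degree_of(edges)
print(degrees["Samuel Harsnett"])  # 7
```

Each edge adds one to the degree of both people it joins, which is why a node's degree is simply the number of relationships it takes part in.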

Several of the nodes in our existing graph have extremely high degrees for obvious reasons.  Each of the monarchs, for example, was the center of a web of patronage and every ambitious man or woman in the country wanted to have a relationship with them.  These are generally “weak” relationship ties, as most of a monarch’s relationships with aspiring courtiers would be in passing rather than intimate.  However, it is precisely these weak relationship ties that enable them to act as bridges or hubs between more strongly-related social groups.  These hubs then allow us to find short paths from one person to another in the network – our “degrees of separation.”  (For more on this subject, see “The Strength of Weak Ties,” by Mark S. Granovetter.)
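The effect of a hub on path length can be sketched with a breadth-first search over a toy network. Everything here is an assumption made for illustration: the two small groups, the "Monarch" node, and the helper names are hypothetical and do not come from the SDFB data.

```python
from collections import deque

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def shortest_path_length(adjacency, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two people are not connected at all

# Two tightly knit groups with no ties between them (hypothetical names).
groups = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("B2", "B3")]
without_hub = build_adjacency(groups)

# Add a hub with one weak tie into each group.
with_hub = build_adjacency(groups + [("Monarch", "A2"), ("Monarch", "B2")])

print(shortest_path_length(without_hub, "A1", "B3"))  # None
print(shortest_path_length(with_hub, "A1", "B3"))     # 4
```

Two weak ties are enough to turn a pair of disconnected groups into a network where everyone is within a few steps of everyone else.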

Since high-degree nodes are useful in navigating our social network, it is very interesting when our NER programs turn up potentially high-degree nodes – people who weren’t significant enough to merit their own biographies in the ODNB, but who had relationships with large numbers of people.

Several of these potentially high-degree nodes are schoolmasters, which should not surprise anyone who teaches large numbers of students on a regular basis.  Our NER programs turned up men like Thomas Smelt, an ardent royalist who taught at the Northallerton Free School in Yorkshire, and Edward Sylvester, who ran a grammar school in Oxford.  These schoolmasters’ relationships with an ever-changing roster of boys particularly enable them to create short paths between people from different generations.  As we begin adding to the original ODNB data, it will be interesting to see if these nodes can compete with the degree of more studied courtier and politician nodes.

Other potentially high-degree nodes are the printers and publishers.  This includes men such as William Barley, a prolific music publisher at the turn of the seventeenth century, and John Kingston, whose diverse publication list included everything from almanacs to a sermon by John Foxe.  These printers and publishers would have had only passing relationships with many of their authors, but this is still enough to create short paths between anyone who published during the early modern period.  While the SDFB network is currently based only on ODNB data, one of our priorities is to incorporate the publishing data from the ESTC, which should significantly decrease the average path length in our network.

While I’m discussing high-degree nodes, it’s worth mentioning that there are several nodes in our network that will initially appear to have abnormally high degree: the diarist nodes.  The social networks of men like Sir Thomas Aston, Henry Machyn, and Samuel Pepys can be extensively mapped thanks to the diaries they left behind.  In the short term, these will be interesting cases to examine as models for how the rest of the network might eventually appear once we’ve acquired more relationship data.  In the long term, I look forward to the day when the schoolmasters can put the philanderous Samuel Pepys in his place.

"The realm of human affairs, strictly speaking, consists of the web of human relationships which exists wherever men live together."

— Hannah Arendt (born on this day in 1906), The Human Condition

Behind Every Great Man: Tales from the Raw NER Data

Jessica M. Otis ( http://orcid.org/0000-0001-5519-8331 )

This is the second in a series of posts on patterns I found while curating Named Entity Recognition (NER) data from the ODNB.  The first, on nine days’ wonders, can be read here.

Looking at the NER data has been a fascinating experience, as it exposed some of the benefits and drawbacks of both our source material and our current methodology.  Today I want to tackle an issue that has been raised on this blog before – the lack of women in our dataset.

What do Thomas Bellenden, John Boteler, and Richard Stubbe all have in common?  Besides the fact that you won’t find their biographies in the ODNB, that is.  All three of them passed the five-mentions threshold applied to our NER data, but not because of anything they did.  Instead, they survived in the dataset because of the women in their lives – women who usually, though not always, are themselves absent from our dataset.

Sir Thomas Bellenden of Auchinoul was a justice-clerk who died in 1546.  He appears in only one biography as an active participant – a 1540 meeting with Henry Balnaves and Sir William Evers – but crops up in six more on the strength of his family ties.  His eldest son, John, became a judge and legal writer, meriting his own biography.  His sister, Katherine, married a courtier named Oliver Sinclair, while his daughters Katherine and Margaret produced four biography-worthy sons between them.  Thus Sir Thomas passed the five-mentions threshold primarily because of the women in his life.

We see a similar pattern with John Boteler.  Despite being the first Baron Boteler, he only appears in the NER data as a reference point for his daughters: Audrey, Anne, Mary, and Olivia or Olive.  In looking at the way his daughters are commonly referenced in the biographies, it becomes clear that they never stood a chance of being identified by the NER, even if they could pass our five-mentions threshold.  The woman who today we would identify as Audrey Boteler, Audrey Anderson, or even Audrey Leigh, Baroness of Chichester, is referred to by her first name and her relationship to the men around her.  So while “Audrey, widow of Sir Francis Anderson and eldest daughter of John Boteler, Baron Boteler of Brantfield” is recognizably a name to a human, the way she is identified by her first name in combination with her social relationships prevents her recognition by our NER software.

Last, but not least, is Richard Stubbe, a lawyer who died in 1619.  He appears in six biographies – once in reference to a nephew, once in reference to his wife, and four times in reference to his daughter, Alice.   But unlike the other women I mention in this blog post – and the dozens of other women whose examples I could have chosen – Alice Stubbe actually outperforms her father in the dataset.  Under her married name, Alice L’Estrange merits a biography of her own.
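The five-mentions threshold that recurs in these examples can be sketched as a simple filter. This is only an illustration under stated assumptions: it treats a "mention" as an appearance in a distinct biography, reads "passing" as "at least five," and uses invented biography identifiers. The source does not spell out how the real threshold was computed, and `names_passing_threshold` is a hypothetical helper.

```python
from collections import Counter

MENTION_THRESHOLD = 5  # assumed to mean "at least five"

def names_passing_threshold(mentions, threshold=MENTION_THRESHOLD):
    """Keep names that appear in at least `threshold` distinct biographies.

    `mentions` is an iterable of (biography_id, name) pairs; repeats of a
    name within one biography are counted once.
    """
    counts = Counter(name for _bio, name in set(mentions))
    return {name: n for name, n in counts.items() if n >= threshold}

# Hypothetical NER output: a full name found in six biographies, and a
# bare first name that the recognizer only picked up three times.
mentions = [(f"bio-{i}", "Richard Stubbe") for i in range(6)]
mentions += [(f"bio-{i}", "Audrey") for i in range(3)]
mentions.append(("bio-0", "Richard Stubbe"))  # repeat within one biography

passing = names_passing_threshold(mentions)
print(passing)  # {'Richard Stubbe': 6}
```

A name that is never recognized in the first place, or is recognized only as a bare first name, cannot accumulate mentions under any threshold, which is the difficulty the Boteler daughters illustrate.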

While there are several conclusions that could be drawn from this data, I’m enough of an optimist that I’d like to end on a relatively positive note: behind every great man, there is an entire network of great women whom I hope to see emerge in the SDFB dataset and network visualizations.  They are mothers, sisters, and daughters – women whose relationships with men form the foundation of the kinship networks, and whose relationships with other women form back channels for the negotiation, preferment, and other alliances necessary for men to succeed.  Or, at the very least, survive the five-mentions threshold of our NER.

Nine Days’ Wonder: Tales from the Raw NER Data

Jessica M. Otis ( http://orcid.org/0000-0001-5519-8331 )

During my first few weeks on Six Degrees of Francis Bacon, I spent a lot of time poking into every nook and cranny of the project, trying to get up to speed on what the team had already done.  I soon identified the dataset itself – particularly the Named-Entity Recognition (NER) data, compiled from the ODNB using Stanford’s NLP toolset and Lingpipe – as being the most productive place for me to start working.

Much of the NER data had been excluded from our current working node set for practical reasons.  It’s easy to see at a glance that “River Thames” is not a person, but it’s much more time consuming to determine that “Richard Ingoldsby” was in fact three different people with overlapping life spans.  Now multiply that by 1,200!  Luckily, I have the research skills of an historian and the patience of a programmer debugging code.  There are probably still a few errors lurking in the NER data, but I’ve dealt with most of them.

While I was curating the NER data, a number of interesting patterns showed up in the data that are worth examining to explore the benefits and drawbacks of both our source material and our current methodology.  The ODNB limits its biographical entries to people of historical significance – those who “shaped British life between the 4th century BC and the year 2008.”[1]  By contrast our NER programs capture a range of people who are not considered historically significant themselves, but who nonetheless repeatedly appear in the biographies of the historically significant.

One category of such people is those who were famous for a single incident.  In the late sixteenth or seventeenth centuries, such people might have been called a “nine days’ wonder.”  Or, in more modern jargon, they are people who experienced “fifteen minutes of fame.”

These are men like Thomas Sandys, whose claim to ODNB fame rested on being sued by the East India Company for violating its trading privileges, or Edmund Hampden, one of the ship money objectors in the Five Knights’ Case.  Both legal cases had significant consequences – indeed, I suspect I am not the only one to make time for the Five Knights’ Case in my early modern survey courses – but the participants themselves are less important.  They appear from nowhere in historical biographies, then disappear again with equal speed.

Not all the nine days’ wonders were associated with legal cases.  Robert Nowell’s claim to ODNB fame lies in his deathbed philanthropy, particularly establishing a trust fund for poor scholars at Oxford.  Others briefly claimed the limelight in more active ways.  Captain Robert Gorges sailed to Massachusetts in 1623-4 and established a short-lived government.  He figures in the ODNB biographies of several presumed shipmates and colonial authorities before returning to England and vanishing from the historical records.

These nine days’ wonders do not merit a full biography in the ODNB.  Indeed, as of this writing, only Robert Gorges has enough notoriety for his own Wikipedia entry.  Yet they have still carved out a place for themselves in the SDFB social network.  The links they established – between lawyers or ship money objectors, poor scholars or shipmates – may have been transitory, but it is these moments of transition that historians often study.  And it is precisely these sorts of unassuming nodes, which nevertheless connect nodes of known historical importance, that I am excited to see emerging from the SDFB project.

Six Degrees of Francis Bacon and Undergraduate Research Part III

Jordan Cox, Emmett Eldred, Alexandra George, Sarah Hodgson, Rebecca Smith

(Part I, Part II)

Six Degrees of Francis Bacon helps address the problematic outcome of students taking away over-simplified understandings of historical societies.  This is particularly useful for courses whose main focus is one specific aspect of those societies, such as literature, art history, musicology, the history of science, or intellectual history.

Undergraduates in such courses sometimes encounter individuals and artifacts in remarkably thin social contexts.  This is hardly because our teachers are poor.  Rather, a single semester focused on great and/or canonical figures - Shakespeare, Milton, Bacon, Rubens, Galileo, or Hobbes - leaves little time for a deep engagement with contexts.  As a result, the so-called geniuses of the early modern period remain shrouded in myths, not least among them that people “back then” had no social networks.  Perhaps students might be introduced to the most daring conspiracies, the most consequential wars, and the most remarkable friendships.  But even this limited exposure to action, adventure, and scandal does little to make Shakespeare and Milton more real than giants or centaurs.  Working with Six Degrees of Francis Bacon, we found, helped to address such problematic outcomes.

Six Degrees addressed in particular a notable disconnect between what scholars know and what students learn. Expert scholars have had several years to immerse themselves in the rich web of people and ideas that crisscrossed their period of expertise.  But the traditional model of teaching  simply cannot do justice to the rich contexts for the works that sparked their curiosity in the first place.  Who has time for names like Augustine Phillips and Philip Henslowe when there’s King Lear?  Our expert professors know that Shakespeare’s plays, Donne’s poems, Hobbes’ philosophy, and Newton’s physics weren’t pulled from the ether but instead had meaningful roots in conversations, exchanges of letters, financial arrangements, reading, and contemporary events.  

But many students still take away a version of things not far away from the Romantic idea of the solitary genius.  Sure, professors might tell them that major figures had interesting relationships just like mere mortals, but there just isn’t time to understand the past in anything like its full richness and complexity.  Students are left with authors somehow hovering above the very social networks that best explain and enrich authors’ works.  Six Degrees of Francis Bacon became a way for us to pierce some enduring myths of solitary genius and see early modern figures in fuller historical context.

We can imagine future literature, art history, musicology, history of science, and intellectual history courses in which students will be able to explore connections on SDFB that their professors don’t have time to discuss or treat in depth.

About the Authors

Jordan Cox is a student at Carnegie Mellon University interested in the interactions between technology and writing.

Emmett Eldred (email; @emmetteldred on Twitter) is a sophomore at Carnegie Mellon University, studying Creative Writing, Professional Writing, and Ethics, History, and Public Policy.

Alexandra George (email) is a student in Carnegie Mellon University’s class of 2017, working toward her B.A. in Professional Writing.

Sarah Hodgson is a student at Carnegie Mellon University.

Rebecca Smith (email; @rmsmithcmu on Twitter) reports that her Research Training Course with Six Degrees of Francis Bacon at Carnegie Mellon University helped her identify a passion for working with teams in both the humanities and in technology. She looks forward to pursuing this love with a minor in Human-Computer Interaction, and hopes to find similar interdisciplinary work after graduation.

Six Degrees of Francis Bacon and Undergraduate Research Part II

Jordan Cox, Emmett Eldred, Alexandra George, Sarah Hodgson, Rebecca Smith

In this, our first post about SDFB from a student’s perspective, we will outline briefly some of the mechanics of our course, specifically who was involved, where and when we met, the major resources at our disposal, and the project-specific roles we took on. This post is meant to be mostly descriptive.  Subsequent posts will draw out some lessons from the framework we describe here.    

The most important tools we used in our course included a purpose-built web application, Google Docs, Google forms, a conference room with a white board, and the online Oxford Dictionary of National Biography.  

Much of our work involved an online database and web application that displays, through a graphic, roughly 6000 persons who lived in Great Britain between the years of 1550 and 1700.  The web application was constructed by the PIs of the project in collaboration with Steve Melnikoff (Knalij) and Carnegie Mellon University Information Systems undergraduates Katarina Shaw, Adetunji Olojede, Amiti Uttarwar, Miko Bautista, Leonard Sokol, and advisor, Raja Sooriamurthi.  Earlier work with colleagues in Statistics had inferred roughly 19 million relationships by (a.) extracting names of persons from the 62 million words of the DNB (58,000 entries) and (b.) developing models that predicted the co-occurrence of any two names.  The web application allows students to see and interact with the fruits of the earlier text mining and inference stage. Each person is represented by a colorful dot, or node, and relationships are represented by connecting the nodes with lines. 19 million inferences can only be so accurate, however. The computer makes potential relationships between people visible, but only experts can verify probabilities in any particular case.  Thus, the project PIs needed human researchers, and that was where the five of us came in.  Assessing and validating inferred relationships was a main priority of our work.
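The co-occurrence idea behind step (b.) can be illustrated with a far simpler stand-in than the statistical models the project developed: counting how many documents mention both of two names. The entries and name sets below are invented for illustration, and `cooccurrence_counts` is a hypothetical helper; it is a sketch of the general idea, not the project's inference method.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for every pair of names, how many documents contain both.

    `documents` maps a document id to the set of names extracted from it.
    Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
    """
    counts = Counter()
    for names in documents.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return counts

# Hypothetical extraction results for three entries.
documents = {
    "entry-1": {"John Donne", "Anne Clifford", "Francis Bacon"},
    "entry-2": {"John Donne", "Anne Clifford"},
    "entry-3": {"Francis Bacon"},
}

counts = cooccurrence_counts(documents)
print(counts[("Anne Clifford", "John Donne")])  # 2
```

Raw counts like these only suggest that a relationship might exist, which is why each inferred link still needed a human researcher to check it.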

Though our Research Training Course met weekly for one hour, the majority of the learning and work involved took place outside of the classroom. Our starting point was to use the web application to validate relationships between two people joined in the database.

The first step involved finding who or what interested us  (i.e. a group of people, a specific person, or a time period). We explored relationships via the web application and called regularly on The Oxford Dictionary of National Biography (ODNB) for contexts and support.  Eventually, each of us would identify a person we wanted to learn more about (say, John Donne), read his or her biographical entry in the DNB and then follow suit for another person (perhaps Lady Anne Clifford) who the application suggested was connected to Donne. From there we had several more steps.

The first was to decide whether Donne and Anne Clifford had a relationship. If this was the case and significant evidence was given in the DNB to validate it, then on the web interface our first task was to describe the relationship using a drop down menu of relationship types.  The classifications for the relationships included possibilities such as “knew of one another”, “close friends”, “coworkers” and so on.


After classifying the relationship, our next step was to rate how confident we were that the relationship existed. The choices ranged from Certain, which receives a 95% confidence estimate, to Possible, receiving a 50% confidence estimate, to Highly unlikely, which receives only a 5% confidence estimate that any relationship existed. Underneath the drop down box for our level of confidence was an open text box, where we added additional background information and context to clarify the relationship, providing evidence to support its existence. Normally these entries were only a few sentences, but if ample information existed, some entries ran up to a paragraph in length. To finish our evaluation of the relationship, we cited our sources and identified ourselves as contributors.
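The evaluation steps described above can be sketched as a small record-building function. Only the three confidence labels and percentages named in the text are included; the field names, the `make_assessment` helper, and the sample values are assumptions made for illustration and do not reflect the web application's actual data model.

```python
# Confidence labels and estimates named in the text. The interface offered a
# range of choices; any others are deliberately not guessed at here.
CONFIDENCE_ESTIMATES = {
    "Certain": 0.95,
    "Possible": 0.50,
    "Highly unlikely": 0.05,
}

def make_assessment(person_a, person_b, relationship_type, confidence_label,
                    note, sources, contributor):
    """Bundle one relationship evaluation into a plain dictionary."""
    if confidence_label not in CONFIDENCE_ESTIMATES:
        raise ValueError(f"Unknown confidence label: {confidence_label!r}")
    return {
        "people": (person_a, person_b),
        "relationship_type": relationship_type,
        "confidence": CONFIDENCE_ESTIMATES[confidence_label],
        "note": note,
        "sources": list(sources),
        "contributor": contributor,
    }

# Hypothetical entry; the values are placeholders, not a real evaluation.
assessment = make_assessment(
    "John Donne", "Anne Clifford",
    relationship_type="knew of one another",
    confidence_label="Possible",
    note="Placeholder note; real entries ran from a few sentences to a paragraph.",
    sources=["ODNB"],
    contributor="student-1",
)
print(assessment["confidence"])  # 0.5
```

Storing the numeric estimate alongside the note and sources keeps the judgment and its supporting evidence together in one record.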



Some cases were more difficult than others.  If it seemed likely that some sort of relationship existed between two people, but there was not enough information to classify the type of a relationship from the DNB, then we turned to additional sources. Looking to the sources used by the DNB authors, using our library’s catalog system to find biographies, and also searching through online databases and journals became ways of finding more information about how two people could have a relationship. In this case, one has to be committed to go on a journey with these two people. It is hard to gauge when entering the process which relationships will be easiest to find and which will prove slightly more challenging, but there are definitely relationships that pose problems in terms of tracking down information and deciding what type of relationship, if any, existed. These types of relationships, and even those whose information is gathered solely from the DNB, are instances where a degree of uncertainty still remains.

Hardest was researching relationships that, in reality, did not exist. It is much easier to prove something exists than to disprove it, a fact we all found out rather abruptly, as we all seemed to avoid these cases until Professor Warren prodded us about them. If the application was giving us a spurious relationship, we had to dig deep enough to be confident the two had never known each other or that the supposed relationship was otherwise mistaken.


A typical kind of problem concerned famous diarists, such as Francis Meres or Samuel Pepys, and the people about whom they wrote. Many times these proved to be one-sided relationships where the diarist knew of the other person but the person probably did not know the diarist. What made these so difficult is that sometimes many resources had to be explored to confirm that the two people did not know one another. In many cases, the two names just do not exist alongside each other. It is rare that a source would explicitly state that one person has no relationship with another person. Often it was up to us to draw a conclusion. Our assignment was to complete ten of these relationships per week.

Aside from spending a majority of the time outside of the classroom, we held weekly meetings devoted to the discussion of our individual findings. Every Monday at 4:30 pm, we met in Professor Warren’s office to discuss issues, concerns, exciting finds, or ideas that we wanted to discuss collectively.

These discussions gave us an idea of where we stood as a team in the project. Early on, we decided it would be good if we created specific project-related roles for ourselves. Our weekly meetings normally began with role-specific updates from each member. Emmett, who took the role of Chief Research Specialist of Early Modern British History, filled us in on queries he had fielded from group members throughout the week. He described the kinds of questions others in the group were asking him about early modern British history, and drew generalizations of interest to the group.  Jordan, our Quality Control leader, monitored an “issues log,” a Google form where we entered any problems we had encountered while researching and using the interface.


Often, many of the issues that came up were problems of disambiguation (when one node seemed to refer to two or more people of the same name).  Other “quality control” issues included identifying cases of duplication, when the same person appeared twice under different names (for example Elizabeth I. and Elizabeth I); relationship vocabulary (how to describe, say, the relationship of a patron and the writer s/he supports); confidence (coordinating our evaluations of relationship likelihood); or typos and grammar (which we wanted to flag for a later time when our contributions might be edited).   One of us, Rebecca, took charge of Technology and Pedagogy, coordinating with a development team from Information Systems and updating both them and us about ongoing developments to improve researchers’ experience with the web application. Sarah, the Editor-in-Chief, coordinated the composition of the present reflections.   Alexandra, dubbed our Chief Documentarian, administered our “Thought Forum,” a Google Doc created for us to document thoughts, feelings, and experiences in real time.  Because our class time was limited to a single hour per week, the Thought Forum became a way for us to communicate without all being in the same place at the same time.

However, these meetings were not solely updates from each member and reminders of what we should be doing. The discussion space of the meetings was a time for all members to learn. Our roles only started taking form after a couple weeks, when we figured out exactly what each role entailed. This happened both according to what the person in the role was interested in and what the group demanded of each person. We also used the time in meetings to develop strategies for entering the giant pool of knowledge that is the ODNB. It was here that we really began to reach our goal of deeper learning.


About the Authors

Jordan Cox is a student at Carnegie Mellon University interested in the interactions between technology and writing.

Emmett Eldred (email; @emmetteldred on Twitter) is a sophomore at Carnegie Mellon University, studying Creative Writing, Professional Writing, and Ethics, History, and Public Policy.

Alexandra George (email) is a student in Carnegie Mellon University’s class of 2017, working toward her B.A. in Professional Writing.

Sarah Hodgson is a student at Carnegie Mellon University.  

Rebecca Smith (email; @rmsmithcmu on Twitter) reports that her Research Training Course with Six Degrees of Francis Bacon at Carnegie Mellon University helped her identify a passion for working with teams in both the humanities and in technology. She looks forward to pursuing this love with a minor in Human-Computer Interaction, and hopes to find similar interdisciplinary work after graduation.

What’s In A Name? The Many Nodes of King James VI and I

Jessica M. Otis (http://orcid.org/0000-0001-5519-8331)

In honor of the Scottish Independence Referendum coming up this Thursday, I thought it would be an appropriate time to look at the monarch who started England and Scotland down the path to union: King James VI of Scotland, who later also became King James I of England.

One of James’ main goals upon taking the throne of England was the political unification of England and Scotland.  He assumed the style of King of Great Britain and introduced the Union Jack.  But political unification would remain a distant dream in his lifetime and only be achieved during the reign of his great-granddaughter Queen Anne, in 1707.

Intriguingly, the disjunction between England and Scotland manifested in an early version of the SDFB network.  Due to conventions in the historiography, James is a man of many names – King James, James Stewart, James Stuart, James VI, James I, and James VI and I.  When the SDFB team ran Named Entity Recognition programs on the Oxford Dictionary of National Biography, this multiplicity of names led to a striking error: James was assigned to four different nodes.  This, in turn, gave us distinct visualizations of the disjunction between the English and Scottish courts in the historical record.

Here is a network centered on the node of James Stewart, a name that covers both James the king and a number of his relatives, who were disambiguated in a later version of the network.  The visualization also includes all the people within one degree of separation – that is, the nodes directly connected to the James Stewart node by a single edge.  These edges indicate a high probability that James Stewart and these other people’s names will co-occur in biographical entries.
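Selecting everyone within one degree of separation of a node amounts to collecting its direct neighbors. The sketch below assumes a plain list of undirected edges; the edge list is illustrative rather than taken from the SDFB data, and `one_degree_neighbors` is a hypothetical helper.

```python
def one_degree_neighbors(edges, center):
    """Return the set of nodes joined to `center` by a single edge."""
    neighbors = set()
    for a, b in edges:
        if a == center:
            neighbors.add(b)
        elif b == center:
            neighbors.add(a)
    return neighbors

# Illustrative edges only; the last one is two steps from "James Stewart".
edges = [
    ("James Stewart", "James VI"),
    ("James Stewart", "James I"),
    ("King James", "James Stewart"),
    ("James VI", "Elizabeth I"),
]

print(sorted(one_degree_neighbors(edges, "James Stewart")))
# ['James I', 'James VI', 'King James']
```

Anyone reachable only through an intermediary, like the last edge's far endpoint here, falls outside the one-degree view.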

You can already see the other three nodes for James: James VI, James I, and King James.  Aside from a few yellow-coded nodes that belong to members of the English court, most of James Stewart’s connections are to orange nodes.  The density of these same-colored nodes is not a coincidence; it is a product of a clustering algorithm that groups nodes together based on their shared connections, and assigns a different color to major connected components.  James Stewart is therefore primarily connected to this dense group of orange – Scottish – nodes, with a few outlying connections. 

When we shift our focus to the James VI node, we maintain connections to two of the three other James nodes – historians, apparently, are unlikely to refer to James as both James I and James VI in a single biography.  James VI maintains connections to many, though not all, of his original orange nodes, while picking up a large number of additional nodes in orange, yellow, and purple.

There are obviously further errors persisting in this dataset – James was dead long before the birth of his grandson, Charles II, not to mention William Pitt or Samuel Johnson.  However, it is interesting to note that James VI, unlike James Stewart, appears in the same biographies as important members of the English court in both the sixteenth and the early seventeenth centuries (yellow and purple, respectively). 

An even more striking shift appears when we look to the James I node.  Here, James I has lost the vast majority of his connections to the orange-coded nodes and instead picked up a host of associations with very English purple nodes.  Indeed his node itself has become purple, reflecting his ascension to the English throne in 1603 and thus his “rebirth” – according to NER, at least – as James I.

This dramatic shift has led the SDFB team to speculate about the disjunction between the English and Scottish courts.  Did James’ transformation from James VI of Scotland to James I of England unify the two, or did they remain two relatively distinct social networks that were only joined by a few people who made the journey between Edinburgh and London?  How much of this striking visualization reflects the realities of the early seventeenth century and how much the conventions of historians discussing the early modern period?  We don’t yet know, but it is one of the many questions we’re hoping the SDFB network will eventually be able to answer.

The last image I want to show you is the King James node, stripped of any Roman numerals.  Again there are a few obvious errors, as the NER cannot distinguish between James VI/I and James VII/II; however, the majority of references to King James appear to have referred to the first (orange) James, not his (pink) grandson.

In this King James, unlike his Roman numeraled alter egos, we see the node colors shifting towards an equilibrium.  He maintains connections with a large number of the orange nodes that were associated with James Stewart and James VI, while also keeping touch with the purple nodes of James I.  There is one very prominent node missing, however: King James and Elizabeth I no longer have any connection.

What Thursday’s referendum will do to the connections between Edinburgh and London, however, we’ll have to wait to see.

Six Degrees of Francis Bacon and Undergraduate Research Part I

Christopher Warren (http://orcid.org/0000-0002-9881-682X)

How might a web application like Six Degrees of Francis Bacon be used for undergraduate teaching?  What might undergraduate students discover by thinking and learning with SDFB?

Last spring, I worked in an experimental learning setting with five thoughtful, adventurous undergraduates in Carnegie Mellon University’s Dietrich College of Social Sciences and the Humanities.   We came together as part of something called a Research Training Course, part of a broad Dietrich College initiative intended to bring exceptional first- and second-year humanities undergraduates closer to the cutting-edge of academic research.  The students responded to a course description circulated by the Dean’s Office that emphasized an innovative learning environment focused on reconstructing the social network of Great Britain in the years 1550-1700.   None of the students had previously studied early modern Britain in a university setting.  They received nine course credits for their participation.   

Throughout the semester, the students worked with an interactive web application capable of producing network visualizations for roughly 6,000 individuals who lived in early modern Britain. A screenshot is included above.  (Additional screenshots from the alpha web app will appear in coming posts -  development is ongoing). While the students were invited to explore the network according to their particular interests, a foundational task was to research relationships depicted in the visualizations and to evaluate their validity.  It mattered decisively to the experience that not everything presented to them in the network visualizations was empirically true.  

In addition to researching the history and culture of early modern Britain, the students reflected on their collaborative digital learning process in some remarkably fruitful ways.  Over the next month or so, we’ll be dedicating some space on the blog to their written reflections—reflections that we’ll be posting in installments, with the ultimate goal of possibly turning the series into a publishable essay.    

The students’ work together was highly collaborative.  As such, the writing you’ll be seeing in this series is highly collaborative too.  It is officially the work of students Jordan Cox, Emmett Eldred, Alexandra George, Sarah Hogdson, Rebecca Smith, and occasionally myself, as the course instructor.  But as befits the students’ shared interest in networks, few of us can identify where exactly one student’s contributions end and another’s begin.  Each of the ideas, phrases, examples, and facts in this series, even the overarching organizational structure, was introduced by one of the group somewhere along the line, but everyone refined and ratified each other’s labors throughout.  Six Degrees of Francis Bacon is networks all the way down.

Some of the upcoming topics include: the roles and processes students developed to cope with the unique challenges presented by our idiosyncratic course; the importance of error and imagination in historical thought; the ways social networks challenge persistent notions of solitary genius; and the heightened importance of historical investigation when students perceive that historical truth really is at stake.     

We look forward to sharing the students’ reflections in coming posts.

What Is A Data Curation Fellow?

Jessica M. Otis (http://orcid.org/0000-0001-5519-8331)

As those of us in the US gear back up for the new school year, I thought now might be a good time to write a blog post introducing myself and – more importantly – my position to the SDFB community.  So hello, everyone!  My name is Jessica Otis and my research focuses on the history of popular mathematics in early modern England.

I’m also the new CLIR/DLF Early Modern Data Curation Postdoctoral Fellow for SDFB.

The operative part of my title is “Data Curation” and that’s what I want to focus on in this blog post.  One of my main responsibilities will be to provide data curation services for the SDFB project, including data management planning, metadata generation, and crowdsourcing oversight.  For many people, “data” is a strange word to hear in a humanities context.  So what does “data curation” mean?  And why are these three elements of data curation important to SDFB and the humanities more generally?

Data Management Planning

Data management planning is something that scientists and social scientists still have to deal with more than humanists, although that’s changing.  For most humanists, the “data” that form the foundation for our research are physical, not digital, objects.  We work with textual artifacts, such as court records or novels, as well as more material artifacts ranging from bricks to textiles to skeletons.  But no matter how physical our original sources, we are all increasingly functioning in a digital world.  Thus humanists must also develop methods for electronically “managing” our data – determining how to collect it, process it, index it, store it, and preserve it for future scholarship.

Many of us are already collecting, processing, and indexing our data using digital methods.  During my own dissertation research, for example, I used a digital camera to photograph historical documents, created Word files of notes and transcriptions associated with the digital images, and used Excel spreadsheets to keep track of everything.  Many historians I know have similar processes.  Storage and preservation can then range from keeping hard drive backups to placing the data in an online repository.

No matter what storage method you choose, the best practice is to choose a secure, off-site location for a backup of your data, in case a natural disaster takes out your computer, your house, or your entire city!  Off-site storage of your data can be as simple as keeping a spare hard drive at a relative’s or in a safety deposit box.  However, there are also a number of options for online repositories.  These include commercial services such as DropBox and university-supported services such as UCSD’s Chronopolis or UVA’s Libra.

We are also beginning to see the creation of large numbers of discipline- or subject-specific repositories, such as the UK’s History Data Service.  For a centralized list of many such repositories, see Databib.  You’ll notice humanities repositories are drastically outnumbered on Databib, but more repositories should begin to appear as the demand for such services increases.

For bigger projects, curating data obviously requires a bit more work than for individual researchers.  This is particularly evident when we look at the long term storage and preservation of large amounts of data that aren’t the preserve of a single individual, such as the relationship probabilities and typology data being compiled by the SDFB project.  What is going to happen to our data five, ten, or even fifty years down the road?

One of the ways we plan to address this question is to make our data as “open access” as possible – that is, anyone will be able to come to the SDFB website and download our data for their own use.  If there are copies of our data on hard drives all over the world, the chances of a catastrophic data loss will be dramatically reduced.  Or, as they say in the preservation world, LOCKSS: lots of copies keeps stuff safe.

Open access is not a one-size-fits-all solution for preserving data, but in the case of SDFB, I believe this is something we owe to the crowd-sourcing community we hope to build.  Our data will belong to them as much as to the project’s creators.  By making our data open to the community, other scholars can download our data and hopefully find new, exciting ways to analyze it in the years to come. 

Metadata Generation

Metadata is “data about your data,” which can be embedded directly into many types of files.  One of my favorite examples of this is the geographical tags that are attached to many digital photos.  They’re not part of the actual photo that people see, but someone who’s interested in creating a world map of cat photos (seriously: http://iknowwhereyourcatlives.com ) can look at the photo metadata and find out where the photo was taken.  More usefully for a professor faced with students *swearing* they had the paper done on time, programs like Word often put a timestamp in a document’s metadata that indicates the last time the file was modified.
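For readers curious about where such a timestamp lives, here is a minimal sketch in Python.  It assumes the file is a .docx, which is a zip archive whose core properties are stored in a part named docProps/core.xml.  The function name last_modified is my own, for illustration only, and older .doc files store their metadata differently.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for the "modified" element in a .docx file's core properties.
DCTERMS = "{http://purl.org/dc/terms/}"


def last_modified(docx_path):
    """Return the last-modified timestamp recorded inside a .docx file.

    Hypothetical helper for illustration. Returns None when the file
    carries no core-properties part or no "modified" element.
    """
    with zipfile.ZipFile(docx_path) as archive:
        try:
            core_xml = archive.read("docProps/core.xml")
        except KeyError:
            return None  # the archive has no core-properties part
    root = ET.fromstring(core_xml)
    node = root.find(f"{DCTERMS}modified")
    return node.text if node is not None else None
```

The value comes back as text in the form the word processor wrote it, typically something like an ISO date and time.  It records the last save, and it can be edited like any other part of the file, so it is evidence rather than proof.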

Metadata is useful for research, as well.  Ever go back to research notes from a few years ago and wonder what in the world you were trying to say in a cryptic few lines?  Frustratingly, you end up trying to read the mind of your younger self.  This is a problem caused by a lack of metadata and it’s magnified exponentially when you’re trying to read the mind of another scholar.  You can download the SDFB data all day long, but it won’t be useful if you can’t make heads or tails of what’s actually in the files you downloaded.  Part of my job, then, is making sure SDFB has metadata in place to help other people understand the data behind our network visualizations.

Crowdsourcing Oversight

One of the most exciting moments of this project will be when the new website goes live and the SDFB community is able to begin expanding on our original network research.  As powerful as computers are, they’re still no match for the human brain when it comes to imprecise, inconsistent, and incomplete data – all of which are unfortunately common in the early modern period.  Nor is there an algorithmic way to easily classify early modern relationships without the knowledge base that humanist scholars acquire as part of their training and research.

Crowd-sourcing requires oversight to make certain the data – information about people and their relationships – that users add to our network are valid.  We are also working with our programmers to create new features for the website that will enable users to engage in scholarly debate about the data, analogous to the comment section on a blog post or to a Wikipedia talk page.  These shared spaces will require oversight, as well, and strategies will need to be developed to indicate when there is no community consensus regarding certain data in our network.

This is a different type of data curation, but is nonetheless vital to the long-term success of the SDFB project.  And I feel safe in speaking for the whole SDFB team when I say that we are hoping for this to be an active, expanding community of knowledge for years to come.  So thanks for welcoming me to the community and I look forward to building great things with you.

On Categories of Relations in Networks: or, Most Abstract Blog Post Title Ever?

Dan Shore (https://orcid.org/0000-0001-7073-3208)

Any project that sets out to map a social network - a network of relations between persons - will need to decide how to represent and categorize the types of relationships between people - the various ways that people are associated with one another.  The big decision will be a matter of choosing between a controlled vocabulary (we pick a limited number of relationship types in advance) and an uncontrolled vocabulary (users can add new relationship types without restrictions).  Some of our initial thinking about this significant decision can be found in this podcast (especially around the 22-minute mark), but this post is intended to survey the advantages and disadvantages of these choices more fully.

The advantage of a controlled vocabulary of types is that all relationships can be sorted, searched, and ordered by finite categories chosen in advance.  There’s no danger of users adding redundant or specious types or of unseen overlapping hierarchies.  Presumably two nodes can be connected by more than one relationship type (one person could be an uncle and a trade master and a guardian of another), so that one way to achieve specificity is by layering multiple relationship types.  But the downsides of controlled vocabularies are even more glaring.  They are inherently tendentious.  One can always ask why some relationship (teacher and student? father and “natural” (i.e. born out of wedlock) son?) is omitted while others are included.  By the same token they are inherently normative, giving recognition to some relationships but not others.  They universalize, imposing relationship types across different periods and communities.  Controlled vocabularies will normalize and universalize regardless of how carefully one assembles them, subjecting them both to historical critique (are their terms as appropriate in 1500 as in 1700?) and localist critique (do the same types apply in rural communities as in urban ones? in the north as in the south of England?).  Historians and literature scholars notice and care about these problems and will (should?) bridle at the constraints they place on their ability to characterize connections between people.  Worse still, controlled vocabularies standardize as matters of fact precisely the things that historians and literature scholars treat as central matters of concern and debate.

You can see the problem of controlled vocabularies most vividly in any popular social networking site.  I may wish to be connected to someone on Facebook but think it imprecise or even absurd to identify her or him as my “friend.”  This ends up changing the very definition of the term “friend,” making it into the general type of social relations qua relations, rather than one particular type of relation among others.  This problem of imposing relationship categories is exacerbated when we’re aiming to reconstruct the network of a period distant in time and culture from our own.

So why not just use an uncontrolled vocabulary of relationship types?  Why not let users characterize relationships with unconstrained subtlety, detail, and specificity?  The disadvantages of an uncontrolled vocabulary of types are roughly the negation of its advantages.  Uncontrolled types can proliferate endlessly, making them nearly useless for searching, sorting, filtering, or ordering.  If, as Aristotle observes, there is no science of particulars, only of categories, then dispensing with categories also dispenses with the science, leaving only the proliferation of disparate, particular relations.  Since an uncontrolled vocabulary is not shared between members of a community (you have your preferred types, I have mine), this means that the community lacks a shared set of categories for querying or analyzing the network - or at least, overlap in categories will be the product of local and fleeting agreement.  Without a controlled vocabulary of relationship types, it wouldn’t be feasible to filter the network to display only persons related through “Family” or through “Profession,” since those general categories would be thrown into the mix indiscriminately with more specific categories like “Step-Son” or “co-Member of Parliament.”  Put simply, an uncontrolled vocabulary of relations would negate many of the practical benefits for which we’ve decided to reconstruct the social network in the first place.

That said, I believe that we (the Six Degrees team) have already decided, of necessity, to use an uncontrolled vocabulary for nodes, which amounts to letting users tag persons with basically an unlimited range of group membership descriptors.  There’s no way around this because there’s no principled way to decide, in advance of historical inquiry, what kinds of groups an early modern person could have participated in.  The groups in which persons take part change over time, they overlap, and they are debatable (what was the status of the group “Ranters?”) both in their own time and in historical retrospect.  The only option for nodes is to have contributors deploy the fullest range of group types, including both general (“Puritan”) and specific (“Arminian”) tags, and without any attempt to impose hierarchical relations.  Any attempt to enumerate and categorize all of the radical sects of the late 1640s and early 1650s into a taxonomic scheme would, I think, be to repeat the futile project of Edwards’s _Gangraena_.  It would be beset with the problem of overlapping hierarchies.  For example, Milton could (arguably) be tagged as a Puritan (or “left Protestant,” anti-episcopal, etc.) and as an Arminian, but Arminianism is a subcategory of Anglican as well, even though Puritan and Anglican are, for most scholars, exclusive categories; a classical categorization scheme wouldn’t work.  So no controlled vocabulary and no hierarchy for node types.

How different are relationship types from nodes?  One intuition is that while groups (i.e. Ranters or members of the “Hartlib Circle”) are highly contingent historical categories, some relationship types have validity across periods and cultures.  All periods (so the intuition goes) have notions of what it means to be related by family, even if the kinds of relations that are counted as family relationships vary dramatically between and even within periods and cultures.  In the early modern period, as in earlier and later periods, the notion of a “natural” or “bastard” or “illegitimate” child occupies a liminal role in the family, inside in some respects or with respect to some family members, but outside in others.  Yet this kind of liminal case, even as it troubles the coherence of the category “family,” at the same time demonstrates its indispensability.  A more current example: our concept of what counts as a family has changed dramatically (and for the better, I scarcely need to say) as a result of the gay rights movement.  People now speak publicly and proudly of same sex spouses, same sex partners, “gaybys” and other relations as family relations in a way that wasn’t the case decades, much less centuries, ago.  But this change in the content of the category “family” demonstrates, rather than undermines, the perdurability and generality of the category itself.  (It’s unclear whether anti-normativity and anti-marriage gay theorists would think it possible or desirable to dispense with the normative category of family relations tout court; this is a question worth asking.)

Digital humanities projects (as opposed to DH scholarship) force us to stop poking at our basic categories somewhere and make a decision.  This halt to fundamental questioning is the thing about DH that makes humanists like me uncomfortable.  Humanities disciplines have taught us to think of ourselves as poking, deconstructing, troubling, and questioning categories indefinitely.  But the discomfort with any halt to questioning is not peculiar to DH.  The decisions required for DH just make it harder to forget that it is only possible to trouble or deconstruct any particular category or set of categories by leaving in place a whole set of background categories and assumptions.  This is as true of radical critique as it is of any digital database.  Total skepticism about categories just isn’t possible - or desirable, since it would mean the cessation of thought, not thought’s highest pitch.  We can trouble anything, but we can’t trouble everything at once.  An advantage of a DH project like Six Degrees of Francis Bacon is that it lets us clarify our background categories in a systematic and visible way, essentially disclosing new objects for critique.  Its relative positivism (it is concerned to record, store, and make systematically available facts about how people were related) need not be opposed to critiques of categories.  Rather, the project can serve as a basis for further critique.  In the terms of Bruno Latour, we can’t dispense with “matters of fact” if we hope to pursue “matters of concern” (in this case rethinking relationship types).

In that practical spirit, let me propose one possible way forward on the question of relationship types.  Instead of choosing between a practically useful but theoretically indefensible controlled vocabulary, on the one hand, and a theoretically defensible but practically disastrous uncontrolled vocabulary on the other, we should mix the two.  High level, relatively general and perdurable categories of relations – like family relations, work relations, or pedagogical relations – would be controlled.  These would, for example, allow users to search and sort and filter all Royalist nodes connected by family relations.  But the lower level types, which would be sub-specifications of the higher levels, would be uncontrolled.  That is, we would leave scholars/users free to elaborate the types of familial relationships without constraint, even if this makes it harder to filter and sort coherently at the lower level.  We would have a split-level hierarchy, one that (as with node types) would put no constraints on overlapping hierarchies (Master and Apprentice could be classed as both a pedagogical and a professional relation).  The network would support debates about the family category by enabling debates over the specific kinds of relations that fall under the general category of family relations.  This proposal offers a practical and technical compromise (and by compromise I mean not wholly satisfactory in any respect) to a fundamentally theoretical - i.e. conceptual and ontological - question: what kinds of relations between people are there? 
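To make the split-level idea concrete, here is a minimal sketch in Python of how such a scheme might be modeled.  It is an illustration under stated assumptions, not the project’s actual data model: the three top-level category names, the Relationship class, the filter_by_category function, and the placeholder persons are all hypothetical.  The top level is checked against a fixed list, the subtype is free text, and a single relationship may sit under more than one top-level category.

```python
from dataclasses import dataclass

# Controlled top level: a small set fixed in advance.
# These three names are illustrative, not the project's actual list.
CONTROLLED_CATEGORIES = frozenset({"Family", "Profession", "Pedagogy"})


@dataclass
class Relationship:
    source: str
    target: str
    categories: frozenset  # one or more controlled categories; overlap is allowed
    subtype: str           # uncontrolled: any free-text label a contributor supplies

    def __post_init__(self):
        self.categories = frozenset(self.categories)
        unknown = self.categories - CONTROLLED_CATEGORIES
        if not self.categories or unknown:
            raise ValueError(
                f"categories must come from the controlled list; got {sorted(unknown)}"
            )


def filter_by_category(relationships, category):
    """Return the relationships filed under one controlled category."""
    if category not in CONTROLLED_CATEGORIES:
        raise ValueError(f"{category!r} is not a controlled category")
    return [r for r in relationships if category in r.categories]


# Placeholder persons; the subtypes are free text chosen by contributors.
relationships = [
    Relationship("Person A", "Person B", {"Pedagogy", "Profession"}, "Master and Apprentice"),
    Relationship("Person C", "Person D", {"Family"}, "Step-Son"),
    Relationship("Person E", "Person F", {"Profession"}, "co-Member of Parliament"),
]

for rel in filter_by_category(relationships, "Profession"):
    print(rel.source, rel.target, rel.subtype)
```

Filtering works reliably only at the controlled level.  A query for “Profession” returns both the master-and-apprentice pair and the fellow members of Parliament, while the free-text subtypes stay available for reading and debate but are never used as filter keys.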

"We live in an Elizabethan world of our own reductive devising, populated by the Queen and Ben Jonson and the Dark Lady and the Bard and a theatre full of groundlings. But the real Elizabethan world had a lot more people in it than that."

—   Adam Gopnik, “The Poet’s Hand”

PODCAST: Christopher Warren on Six Degrees of Francis Bacon

SDFB co-PI Christopher Warren recently presented in Oxford University’s Cultures of Knowledge seminar series “Negotiating Networks.”

The podcast of his presentation, “Bacon and Edges: Reassembling the Early Modern Social Network,” can be found here.

BLOG POST: Daniel Shore on “Extensions of the Book”

SDFB co-PI Daniel Shore has written a guest blog post at the Folger Shakespeare Library’s blog, “The Collation.”  

His post, “Extensions of the Book,” can be found here.

PODCAST: Ruth Ahnert and Sebastian Ahnert on “Tudor Letter Networks: The Case for Quantitative Network Analysis”

SDFB team members Ruth and Sebastian Ahnert recently spoke in Oxford University’s Cultures of Knowledge seminar series “Negotiating Networks.”

The podcast of their presentation, “Tudor Letter Networks: The Case for Quantitative Network Analysis,” can be found here.  

Job Opportunity with Six Degrees of Francis Bacon: Early Modern Data Curation Fellow

Carnegie Mellon University’s Department of English and University Libraries jointly seek an Early Modern Data Curation Fellow to lead data curation activities for the Six Degrees of Francis Bacon (SDFB) project, a digital reconstruction of the early modern social network that scholars and students can collaboratively expand, revise, curate, and critique. The fellow will leverage expertise in early modern studies along with technical aptitude in order to contribute meaningfully to a rich data lifecycle, including collecting, processing, textmining, analyzing, and archiving data related to the early modern social network. 

Click on the links above for further details.