Global Graph

image

Above is a thumbnail from a large network visualization produced for SDFB by the talented folks at KNALIJ.  Click here to view the whole image, which is large (~12 MB) but can be zoomed and navigated using your web browser.  The image includes only the top nodes and edges in our inferred network.  For a rather unwieldy visualization of all 6,000-odd nodes and their edges, without labels, click here.

The proximity of the nodes is determined by their connection strength.  If multiple nodes are all connected with a high degree of confidence, they will cluster together.  So, for example, you can see members of the Elizabethan court clustered in the bottom left-hand corner.  The graph takes the shape of a circle because it’s what’s called a force-directed graph, in which links or edges are treated as springs whose stiffness varies based on confidence estimates.  It’s as though the nodes and their connections had been compressed, then left to settle in place in accordance with Hooke’s law for springs and elasticity.  Node size is a function of the number of connections, which is why a figure like Charles II is significantly larger than, say, Samuel Palmer.  The color of the nodes is an indication of community.  Nodes are members of the same community when they share a set number of edges with other members of the community.  When a node is part of multiple communities, its color is determined by the community with which it shares the most edges.
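For readers who want to see the spring idea in miniature, here is a rough sketch in Python.  It is not KNALIJ’s code, whose details we don’t know; the function names, the repulsion term, and every parameter value are assumptions made for illustration.  Each edge pulls like a Hooke’s-law spring whose stiffness is the edge’s confidence, every pair of nodes pushes apart slightly, and node size is simply the number of connections.

```python
import math
import random


def force_layout(nodes, edges, iterations=2000, rest=1.0,
                 repulsion=0.1, step=0.05, max_move=0.1, seed=0):
    """Toy force-directed layout; edges are (node, node, confidence) springs."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, which keeps the layout from collapsing.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = max(math.hypot(dx, dy), 1e-3)
                f = repulsion / d ** 2
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
                force[b][0] -= f * dx / d
                force[b][1] -= f * dy / d
        # Hooke's law: the pull along an edge is stiffness * (length - rest).
        for a, b, confidence in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = max(math.hypot(dx, dy), 1e-3)
            f = confidence * (d - rest)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d
        # Take a small, capped step in the direction of the net force.
        for n in nodes:
            mx, my = step * force[n][0], step * force[n][1]
            m = math.hypot(mx, my)
            if m > max_move:
                mx, my = mx * max_move / m, my * max_move / m
            pos[n][0] += mx
            pos[n][1] += my
    return pos


def degree_sizes(nodes, edges):
    """Node size as the number of connections."""
    sizes = {n: 0 for n in nodes}
    for a, b, _ in edges:
        sizes[a] += 1
        sizes[b] += 1
    return sizes


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Placeholder nodes: A-B is a high-confidence edge, A-C a low-confidence one.
nodes = ["A", "B", "C"]
edges = [("A", "B", 1.0), ("A", "C", 0.1)]
layout = force_layout(nodes, edges)
sizes = degree_sizes(nodes, edges)
```

In this toy case the stiff A-B spring settles shorter than the slack A-C spring, which is the same effect that draws high-confidence groups into visible clusters.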

What is remarkable about the image, from our perspective, is how much meaningful information it displays given the relatively sparse dataset on which it is based.  All that we sent to KNALIJ was a matrix of nodes and edges with confidence intervals.  But from this minimal data their clustering and community inference algorithms have inferred a remarkable amount. 

For example, though our data includes no dates or other temporal information, the graph has an obvious, though not entirely consistent, chronological organization.  Starting with the Elizabethan court in the lower left-hand corner, the graph proceeds counterclockwise through the reigns of James, Charles I, and Charles II.  Nodes at 12 noon are largely post-Restoration and/or 18th-century.  Nodes at 10 o’clock are part of James’s Scottish court.

At first we wondered why the center of the graph is basically empty.  Then we realized that to occupy the center, a node would need to share edges with communities stretching over 150 years.  The empty center is, in effect, a sign of the temporal scope of our network.  Presumably a network stretched over a longer time period would have an even more pronounced doughnut hole.

It’s worth at this point acknowledging some of the embarrassing things that this image makes evident about the current state of our inferred network.  We still have some named entity recognition problems.  The “Society of Antiquaries” should not show up in our network.  There’s more work to do on date limitations, since figures appear from both much earlier (King John) and later (Lloyd George) than our proposed date range of 1550-1700.  As we’ve discussed in earlier posts, there are still de-duping problems, especially with regard to monarchs.  Some of these should be simple to iron out: King James and James I should not have separate nodes.

But in other cases the duplication provides potentially significant information. Even though James VI of Scotland became James I of England, it is fascinating to see different communities and networks surrounding the two names.  Nor, to scholars of the period at least, is it self-evident that James VI of Scotland and James I of England should be treated as the same person.  As Jenny Wormald asked long ago, “James VI and I: Two Kings or One?”

King James’s northern and southern subjects shared one attitude: both treated this man, who embarked on his dual role three months short of his thirty-seventh birthday, as their king, dividing him as far as possible into two separate individuals.

At stake in the question “Two Kings or One?” is the category of Britain itself.

In some cases the use of colors to indicate communities shows fascinating breakdowns in social coherence.  A light blue Elizabeth I is surrounded by a sea of relatively unbroken light blue Protestantism.  But the pink Charles I is cut off from his Laudian community and hemmed in by the darker blues of Cromwell, Fairfax, and Henry Vane.  Henrietta Maria appears to have her own small and dispersed community, set apart from the rest of the mid-century milieu, that is more closely connected to the courts of Charles II and James II, and doubtless to the court in exile starting in 1644.

As tempting as it is to turn these images into narrative, it would be unwise to draw any strong conclusions at this point.  We can’t be sure which aspects of the visualization are artifacts of our highly imperfect network data, or of the arbitrary thresholds with which the visualization algorithms organize that data into a coherent image.  Revised data, or different thresholds (particularly thresholds manipulable by users), could and doubtless will yield very different pictures.

That said, we see the filtered SDFB graph as a rather large map of problems.  Why does Henrietta Maria have a community distinct from her husband and those most proximate to her?  Why does Jacob Tonson sit so far out to the upper right-hand corner?  Why are certain nodes so evidently out of place?  When and why don’t communities align with proximity, color with clustering?  The problems call out for further explanation, interpretation, and speculation, especially by experts with knowledge of the period, of particular figures, or of graph learning and/or visualization.

Gender and Name Recognition

In a recent post we talked about why SDFB currently does a poor job of including women, how we can fix it, and how it might eventually do an even better job of including women than some of our current intellectual tools.  

There is, however, an additional reason why women are excluded that we didn’t mention in the last post: the mismatch between the asymmetric naming conventions surrounding marriage (especially as they appear in the ODNB) and the capabilities of Named Entity Recognition (NER) and de-duplication (“de-duping”) programs.  

The naming conventions will surprise no one.  Women in the 17th century regularly took their husbands’ surnames when they married.  Multiple marriages meant that a woman would have multiple surnames.  For example, one of the founders of the Society of Friends (or Quakers) called herself Margaret Askew, Margaret Fell, and Margaret Fox at different stages of her life.  Identified, as a result, by multiple names in the ODNB, Margaret had approximately three times more difficulty meeting the 5-mentions threshold that would, for practical reasons, become our initial cutoff for inclusion.  

Scholars who study societies where women conventionally take their husbands’ names, as well as those who live in such societies, have developed general rules, and indeed intuitions, about how women’s names change as a result of marriage.  These rules and intuitions are not foolproof - when misapplied they can lead to scholarly errors and even social embarrassment - but they do a good job of handling most cases.  Scholars of 17th-century England have no conceptual problem recognizing that the names Margaret Askew, Margaret Fell, and Margaret Fox all refer to the same person.  Scholars have conventional ways (some simple, some more complex) of designating this identity of reference.  Fell’s ODNB entry (authored by Bonnelyn Young Kunze), for example, begins, “Fell [née Askew], Margaret (1614–1702).”  The French word “née” is one such convention of obvious and longstanding use.  But other ways of acknowledging identity are tacit and of more recent vintage.  If one searches the ODNB for “Margaret Fox,” one is silently directed to the entry for Margaret Fell.  Fell is never referred to as “Margaret Fox” in the entry (though she is in one of the sources); rather the identity is encoded only in the site’s redirection to the entry.  

Though the conventional rules and intuitions surrounding name changes are familiar enough to those who use or study them, NER and de-duping programs have to learn them from scratch.  In some respects this is similar to other problems of name duplication. “Charles I,” “King Charles,” and “Charles Stewart” all refer to the same person.  Briefly, and amusingly, we also had a “Charles I. King” among our set.  To ensure that they don’t appear as multiple nodes in the network (“King Charles knew Charles I who also knew Charles I. King!”) we’ve simply had to tell the network estimation algorithm that they are the same person.   
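To make “telling the algorithm they are the same person” concrete, here is a minimal sketch of a hand-built alias table applied to a list of edges.  The table, the function names, and the sample edges are illustrative assumptions, not our actual data or code.

```python
# Hand-curated aliases (illustrative): each variant maps to one canonical name.
ALIASES = {
    "King Charles": "Charles I",
    "Charles Stewart": "Charles I",
}


def canonical(name):
    """Return the canonical form of a name, or the name itself if unlisted."""
    return ALIASES.get(name, name)


def merge_edges(edges):
    """Rewrite edges with canonical names, dropping self-loops and repeats."""
    merged = set()
    for a, b in edges:
        a, b = canonical(a), canonical(b)
        if a != b:  # "King Charles knew Charles I" collapses to nothing
            merged.add(tuple(sorted((a, b))))
    return merged


raw_edges = [
    ("King Charles", "Henrietta Maria"),
    ("Charles I", "Henrietta Maria"),
    ("King Charles", "Charles I"),
]
merged = merge_edges(raw_edges)
print(merged)
```

A table like this works well for a handful of monarchs precisely because each entry is written by hand, which is also why it does not scale to a much larger class of names.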

But changes in women’s names as a result of marriage are different in a few key respects.  There’s little need to develop rules for de-duping figures like Kings and Queens (male name + roman numeral = “King” Name; female name + roman numeral = “Queen” name).  Such examples are few enough that it just makes sense to do it on an individual basis.  But women who marry are obviously a much, much larger class, such that developing general rules for de-duping would be essential to making sure SDFB adequately includes and represents them in the network.  It would be useful to develop de-duping procedures, for example, that recognize that what follows the term “née” is an alternative last name for the same person.  And it’s not simply a matter of de-duping either.  The NER program needs to recognize the different and often more elaborate formats of women’s names to begin with.  It needs, for example, to be able to read a string like “Fell [née Askew], Margaret (1614–1702)” and recognize it as a name at all.
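As a sketch of what such a rule might look like, the snippet below reads a headword in the “Surname [née Birthname], Forename (birth–death)” pattern and returns the name variants that should point to one person.  The regular expression and function name are assumptions written for this one pattern; real ODNB headwords come in more shapes than this handles.

```python
import re

# Matches headwords such as "Fell [née Askew], Margaret (1614–1702)".
# The bracketed "née" part is optional; the dash may be an en dash or a hyphen.
HEADWORD = re.compile(
    r"^(?P<surname>[^\[,]+?)\s*"
    r"(?:\[née\s+(?P<birth_surname>[^\]]+)\])?"
    r",\s*(?P<forename>[^()]+?)\s*"
    r"\((?P<birth>\d{4})[–-](?P<death>\d{4})\)$"
)


def name_variants(headword):
    """Return canonical name, known variants, and life dates, or None."""
    m = HEADWORD.match(headword.strip())
    if m is None:
        return None
    surnames = [m["surname"]]
    if m["birth_surname"]:
        # Whatever follows "née" is an alternative surname for the same person.
        surnames.append(m["birth_surname"])
    return {
        "canonical": f'{m["forename"]} {m["surname"]}',
        "variants": [f'{m["forename"]} {s}' for s in surnames],
        "years": (int(m["birth"]), int(m["death"])),
    }


parsed = name_variants("Fell [née Askew], Margaret (1614–1702)")
print(parsed)
```

Note what the rule cannot do: “Margaret Fox” never appears in the headword, so that variant would still have to come from somewhere else, such as the site’s silent redirect.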

The point, we suppose, is that the inclusion of women in a resource like Six Degrees of Francis Bacon will depend on more than good will, scholarly self-critique, self-awareness, or even careful research.  While these virtues remain important, it will also require good programming, programming that takes into account both the gendered naming conventions of the period and the notations by which we record those conventions.

An Entry of One’s Own, or Why Are There So Few Women In the Early Modern Social Network?

in honor of International Women’s Day


In this post, we will address what has long seemed to us a conspicuous shortcoming in the Six Degrees of Francis Bacon (SDFB) data: the relatively small number of early modern women.  As Helen Smith cheekily put it on Twitter, there’s “more sausage than Bacon” in “Six Degrees of Francis Bacon.”  Clearly, this is something necessitating further work, and it is worth emphasizing that we are currently in very early stages.  

How will women feature more prominently?  Our ultimate goal is to create architecture for scholars to curate, add, validate, and revise relationships.  Groups like the Society for the Study of Early Modern Women are well placed to help fill in what are currently obvious silences in the graph.  And, as we mine further data sources, including scholarship from the last half-century on women writers in the period, resources like the Brown Women Writers’ project will continue to offer rich information about networks of women writers.      

But the reasons why relatively few women appear in our earliest graphs are not self-evident, and those reasons open into intriguing questions about historiography, scale, and the kinds of relationships privileged by the DNB.  Our work with the DNB data certainly shows us much about early modern history and culture but it also yields insights into the way early modern history and culture get refracted through the particular, biased, and fallible lens - lenses? - of the DNB.     


The Oxford Dictionary of National Biography has roughly 58,000 biographical entries.  Once we had performed our Named Entity Recognition, our list of entities was already nine times as big, totaling in the end about 450,000 entities.  We tried to bracket non-persons (cities, organizations, and other such entities captured by our wide net) and, because we were interested in particular in the early modern period, we further limited our set to people who lived between 1550 and 1700.  Limiting our set by these years was less straightforward than one might think.  Some individuals appear in the body of the DNB who don’t have their own entries.  These individuals rarely appear with life dates.  We therefore had to develop further methods to infer approximate years of life.

But even after we limited our data initially, our data set still remained too unwieldy for the kinds of validation and analysis we needed to do.  Our quantitatively minded readers may not necessarily be scared off by such numbers, but humanists will surely appreciate the difficulty of trying to validate inferences from a data set with tens of thousands of names and, squaring that number, hundreds of millions of possible relationships.

So, after dividing DNB entries into roughly 500-word chunks, or records, we introduced a threshold: we would limit our set further by working only with names that appeared five or more times in those 500-word records.  Consider this: taking only persons prominent enough for their names to appear in five or more DNB records, there are roughly 6,294 people who were alive between the years 1550 and 1700 who fit that criterion.  Each of the 6,294 people could therefore have been associated with any of 6,293 others.  This means that just at the highest levels of prominence (remember, we aren’t even counting people whose names appear in four DNB records!), estimating the early modern social network involves inquiring into roughly 39 million possible relationships (assuming both one-way and two-way relationships).
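The arithmetic, and the thresholding step behind it, can be sketched in a few lines of Python.  The function names are ours for illustration, and matching names by plain substring search is a simplifying assumption rather than a description of our NER pipeline.

```python
def split_into_records(text, size=500):
    """Cut a text into consecutive records of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def names_meeting_threshold(records, names, min_records=5):
    """Keep names that occur in at least `min_records` records."""
    counts = {name: sum(name in record for record in records)
              for name in names}
    return {name for name, n in counts.items() if n >= min_records}


def ordered_pairs(n):
    """Each of n people paired with each of the n - 1 others, direction kept."""
    return n * (n - 1)


print(ordered_pairs(6294))       # 39608142, the "roughly 39 million"
print(ordered_pairs(6294) // 2)  # 19804071 if direction is ignored
```

The 39 million figure counts each pair twice, once in each direction; counting unordered pairs only would give a little under 20 million.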

So why are there so few women in the early modern social network?  Early modern women have far fewer of their own DNB entries, and even when one counts their appearances in records derived from others’ entries, as we’ve tried to do, those appearances rarely total five or more.  

As it grows, we hope to use SDFB to rectify such biases.  While the social network inferred from the DNB currently does a poor job of including women and their associations, we believe that SDFB has the potential to enlarge our understanding of women in early modern England.  As impressive and important as the Women Writers’ Project is, for example, it is limited by the fact that it is dedicated only to women who were writers, and specifically those who published texts between 1526 and 1850.


What about those women who never set pen to paper, but who played crucial roles in creating and convening sub-networks of artists, intellectuals, diplomats, and politicians nonetheless?  Early modern women assembled the society and culture of early modern England no less than men did, and by recording their associations SDFB is uniquely positioned to represent, and even to help us discover, the various ways in which they did so - including those that did not involve writing.     

“More sausage than Bacon,” it turns out, is, among other things, an argument for developing more sophisticated approaches to the DNB, for mining sources more sensitive to women’s networks, for rectifying historical biases through more research on women, and for enlisting the expertise of individual humanists with detailed knowledge about early modern social networks. SDFB’s present universe of 6,294 names and their possible relations is a very good start, but it is clear it is just that: a start.

Network Inference, Visualization, and the Generative Difficulties of “Knew”: The Case of James Harrington (1611-1677)

While graphs like the one immediately below focusing on James Harrington (1611-1677) help make early modern social networks visible, they are based upon data like that in the included chart.  

 

image

In this case, we have used line thickness, or edge weight, to indicate how likely it was, according to our analysis of the DNB, that two people knew one another (more on the generative difficulties of “knew” later).  The thicker the line, the higher the “confidence estimate,” which is to say, the more chance that two people knew one another, at least as far as the collective enterprise of the DNB is concerned.  Another way to think about the confidence estimate is to see it as the answer to a specific question: “when the SDFB team runs its algorithm over a random selection from the full DNB-derived data set 100 times, how many times are these people connected?”

image


In the visualization above, we’ve introduced a threshold at 51% (or 51 times).  Had we used a lower threshold, we would have introduced more names to the visualization (and more visual clutter), but our confidence—or more precisely, the confidence we’ve derived from the DNB—in those connections would have been lower. 
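The “how many times out of 100” idea can be sketched as follows.  This is only an illustration: the inference step here is a deliberately crude stand-in (two names are linked if they share a record in the sample), and the function names, sample size, and toy records are all assumptions rather than our actual model or data.

```python
import random
from collections import Counter
from itertools import combinations


def infer_edges(records):
    """Stand-in inference: link two names if they share at least one record."""
    edges = set()
    for names in records:
        edges.update(combinations(sorted(names), 2))
    return edges


def confidence_estimates(records, runs=100, sample_size=None, seed=0):
    """Count, per edge, in how many of `runs` random samples it is inferred."""
    rng = random.Random(seed)
    k = sample_size or len(records) // 2
    counts = Counter()
    for _ in range(runs):
        counts.update(infer_edges(rng.sample(records, k)))
    return counts


# Toy records: A and B co-occur often, A and C once, B and C never.
records = [{"A", "B"}] * 20 + [{"A", "C"}]
confidence = confidence_estimates(records)

# Keep only edges found in at least 51 of the 100 runs.
shown = {edge for edge, n in confidence.items() if n >= 51}
```

Lowering the cutoff from 51 admits more edges into `shown`, each with a weaker count behind it, which is the trade-off between clutter and confidence described above.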

James Harrington was of course the author of Oceana (1656) and several republican pamphlets, many issued around the Restoration when his “Rota” club was most active.  Here, Harrington appears in a network composed largely of “commonwealths-men” and late seventeenth-century controversialists, precisely as one familiar with the standard account of his milieu might expect.

Yet the visualization can also help us move beyond what we already know.  Graphs such as this one are intended to spur new inquiry.  When we think, for example, about the paths of books (given, purchased, lent and not lent), the social lives of manuscripts (shared, copied, annotated, altered, torn), relations of patronage or affection (given, or withheld), we find ourselves in a new world of hypotheses and scholarly conjectures. Intellectual affinities, linguistic patterns, and group boundaries all take on new dimensions.     

Visualization even prompts us to clarify what we mean when we say two people “knew” one another. Were they friends, enemies, lovers?  Political allies of convenience? Of conviction?  Can a reader “know” an author if he hasn’t met her in the flesh?  What if the medium is the letter as opposed to the printed book?  Inference and visualization operate in this way as a kind of “experimental metaphysics,” to use Bruno Latour’s term. Far from binding us to dry quantitative analysis, visualizations and the confidence estimates on which they’re based enable us to move swiftly toward complicated questions of affect, intimacy, and ideology. If we want to say, for example, that exchange of letters constitutes a relationship but reading one another’s books does not, what are the ideological suppositions, modes of address, “radicals of presentation” (Frye), and textual effects underpinning such a claim?

Consider here the usefully challenging case of John Toland, whose relation with Harrington is, according to our model, 99% certain.  James Harrington died when Toland was seven years old.  Common sense suggests that something’s gone wrong.  However, we know from J.G.A. Pocock, Blair Worden, Justin Champion, and others that Toland is the figure most responsible for carrying Harrington’s republican torch into the 18th century through his influential biography of Harrington, published with his similarly significant edition of Oceana (1700).  Toland, we might say, “knew” Harrington as well as anyone in Harrington’s lifetime.  While the relation was not reciprocal (Harrington did not know Toland), there are strong grounds—empirical, theoretical—for including this relationship. Put somewhat differently, we would skew our results considerably if we tried to force Toland out of Harrington’s network by altering the algorithm.  If the DNB tells us they were related, do we need to say they were not?  And this raises questions about the historical significance of the textual trace (life writing, editing), about ideological proximity vs. spatio-temporal proximity, and about the kinds of relationships privileged by the DNB itself.

At the same time, our “confidence estimates” might just as easily be called “doubt estimates,” and these “doubt estimates” can have considerable scholarly value too.  Consider two possible uses.  First, while low scores suggest a slim chance of a relation, they also show scholars (and students) where there’s a relatively high scholarly payoff for demonstrating evidence of connections.  This is one of the reasons why we err on the side of inclusion, giving confidence estimates nearly down to nil.  In shorthand, it helps scholars know more about what we don’t know about.  This “metaknowledge” is a key step on the path to new discoveries and arguments.

Secondly, low scores take us toward a category we’ve come to think of as “white space” in the social graph, specifically, the category of non-relation.  What we’ve already said about the possibilities generated by networks might make “white space” or “non-relation” seem relatively uninteresting.  Presence is much more fun than absence, right?  But here, it’s possible to develop more fine-grained understandings of groups, individuals, and their “publics,” where publics are understood, in Michael Warner’s sense, as self-organized relations among strangers.  Insofar as tropes, images, ideas and so forth develop and operate within networks, we can posit a rough and ready dichotomy of network, on the one side, and public on the other. While undoubtedly reductive, such a dichotomy can be a productive heuristic, generating more concrete thinking about the kinds of languages and practices deemed worthy of “export” and the people and groups who made up audiences for publication, understood in its broadest sense.        







Global Birth Estimation Graph

Although DNB entries include birth dates, many names appear unassociated with such dates.  After our initial Named Entity Recognition approach to the DNB, our list of names included many people who lived outside of our date range (1550-1700).

We estimated birth years of named entities by matching them against extracted years of birth from biography subjects. In the case of multiple matches, we went with the person with the longest biography.  
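In code, the matching rule might look something like the sketch below.  The record layout, the exact-match comparison, and the sample entries are invented for illustration; they are not drawn from the DNB or from our pipeline.

```python
def estimate_birth_year(name, biographies):
    """Estimate a birth year by matching a name against biography subjects.

    If several subjects share the name, use the one with the longest biography.
    """
    matches = [b for b in biographies if b["name"] == name]
    if not matches:
        return None  # no subject by this name, so no estimate
    return max(matches, key=lambda b: b["word_count"])["birth_year"]


# Invented entries: two subjects share a name but differ in biography length.
biographies = [
    {"name": "John Smith", "birth_year": 1580, "word_count": 4200},
    {"name": "John Smith", "birth_year": 1790, "word_count": 900},
    {"name": "Jane Doe", "birth_year": 1612, "word_count": 1500},
]

print(estimate_birth_year("John Smith", biographies))  # 1580
print(estimate_birth_year("Nobody Known", biographies))  # None
```

Preferring the longest biography amounts to a bet that the most prominent bearer of a name is the one most likely to be mentioned elsewhere, and it will sometimes be wrong.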

In order to test how successfully we had estimated birth dates, and to test whether our methodology is doing something reasonable in estimating links, we plotted each node using a graph-drawing algorithm that emphasizes closeness of nodes if there is an edge (relationship estimate) between them.

Nodes cluster largely as one would expect (nodes tend to be connected to other nodes with similar birth years).  Outliers are easily identified and analyzed.

image
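We found outliers by looking at the plot, but the same check can be roughed out numerically.  The sketch below is one possible approach, not the method we used: it compares each node’s estimated birth year with the median birth year of its neighbours and flags large gaps.  The function name, the 60-year cutoff, and the toy data are all assumptions.

```python
from statistics import median


def birth_year_outliers(birth_years, edges, max_gap=60):
    """Flag nodes whose birth year is far from the median of their neighbours."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    outliers = {}
    for node, linked in neighbours.items():
        years = [birth_years[n] for n in linked if n in birth_years]
        if node in birth_years and years:
            gap = abs(birth_years[node] - median(years))
            if gap > max_gap:
                outliers[node] = gap
    return outliers


# Toy data: P, Q, and R are near contemporaries; S is centuries too early.
birth_years = {"P": 1600, "Q": 1605, "R": 1610, "S": 1167}
edges = [("P", "Q"), ("P", "R"), ("Q", "R"), ("S", "P")]
outliers = birth_year_outliers(birth_years, edges)
print(outliers)
```

A flagged node is only a prompt for a closer look: the cause could be a wrong birth estimate, a wrong edge, or a real relationship across generations.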

Descriptions

  • For the ego-centric plots, the ‘Francis Bacon’ plot shows an evenly spread out layout of nodes. The ‘James Harrington’ plot shows an unevenly spread out layout, where node placement is influenced by an (estimated) measure of relationship strength.
  • For the dyad plot, nodes are evenly spread out.
  • For the trail plot, nodes are evenly spread out within each ‘cluster’ (considering Milton and Fox as their own separate clusters).
  • For all of these plots, larger names / nodes represent names that appear more often in our text sources.