Turning Books Into Data: Creating a Historical Database of 20th Century Women Scientists in the U.S.

  1. An introductory descriptive paragraph, which should include a problem statement, and say *what* your tool/thing will do. This is your abstract, or elevator pitch. This should not have the full theoretical framing of the project.

Abstract 

“Hidden Figures” is the name of a book and a movie about a group of African-American women who calculated rocket trajectories for the early U.S. space program. Margot Lee Shetterly, the author of the book, says the most frequent question she is asked is, “Why haven’t we heard of these women before?” The reasons, simply put: their work was unheralded at the time it occurred and was later left out of the voluminous historiography of the period. The work of women scientists in the U.S., more generally, has been neglected by historians, with the result that the documentation of their existence and their work is fragmentary and elusive. To counter the difficulty of identifying women scientists in this period, I propose to create a historical database of 20th Century Women Scientists in the U.S. to increase the possibility that their work will be recognized. As a starting point, I will take historiographic resources that already exist, but are not in electronic form, and digitize them. For the pilot effort, I will scan, OCR, and extract names and biographical information from a print source (the second volume of Margaret Rossiter’s “Women Scientists in America”) and place the results in a database to be used as the foundation for visualizations and network analysis by scholars. The database would be available for scholarly research both as an online resource and as a downloadable dataset.

Project and research outputs: A dataset, an online resource and a white paper about the process of making the project.

Introduction [This is not required by the assignment.]

The historiography of 20th century scientists in the U.S. presents a number of paradoxes. Until the latter part of the century, the total number of all scientists in the U.S. was fairly small and the percentage of women in a particular field of science could be represented on a spectrum ranging from minimal-but-noteworthy to a-tiny-sliver. And, since historians of science have chiefly studied “leaders” (often constructed as lone male geniuses), reading the literature leaves one with the impression that women had almost no presence in and vanishingly little impact on the fields in which they toiled.

As part of a long-term project in the history of computing and the history of science, I have researched the presence of women working in scientific and technical roles in the early U.S. space program. In the process, I have found surprising abundance where I had expected scarcity. I keep discovering women scientists, mathematicians, engineers, and technicians (including women of color) who are generally not reflected in the voluminous histories of the U.S. space program. I say “discover” advisedly because the documentation of their existence and their work is fragmentary and elusive.

For a number of reasons, including the structural sexism in science that inhibits the recognition of their work, as well as the aforementioned tendency by historians to write hagiographies of individuals, women scientists are often overlooked and their successes occluded. This makes it difficult to locate their papers or biographical information, or even simply to identify them.

I have been wondering if there might be shortcuts to uncovering the stories of these women and their roles. I have been imagining the possibilities for text mining and network visualization and considering the potential for turning some of the available print materials about these women into datasets. As a pilot project, I propose to take the second volume of Margaret Rossiter’s classic trilogy “Women Scientists in America,” transform the text into processable data, and place it into a database to be used as the foundation for visualizations and network analysis.

  2. A set of personas and/or user stories.

These personae would be applicable only when this resource is more fully built out, not during the pilot stage.

  a) Government agency employee seeking to understand the way relationships between women worked to effect change in history

A historian working at a government agency (you might be surprised how many of these exist) could use the data to understand the influence that agency’s employees had on the trajectory of environmental legislation. They could see, for example, links between the environmentalist Rachel Carson, her close friend, the illustrator and photographer Shirley Briggs (they met while working for the government), and Cornelia Cameron, a world-traveling geologist based in Iowa. Briggs’s and Cameron’s papers are at the University of Iowa, including copies of some of the best-known photographs of Carson.

  b) Historian writing a macrohistory of women scientists in the 1950s-60s

There were very few women working full-time as scientists and mathematicians at universities in the U.S. in the 1950s-60s. But there were women with degrees in science and math performing scientific and technical work in related fields at government agencies and in private industry. Using this resource, it would be possible to locate and identify women who trained as scientists or mathematicians and worked in these capacities: building some of the earliest stored-program electronic computers for scientific calculation at the National Bureau of Standards (later NIST) in Washington, D.C. or Boulder, Colorado; as nuclear physicists at Los Alamos, New Mexico; designing satellite instruments at Caltech; as librarians at the Naval Research Laboratory in Washington, D.C.; programming the IBM computer tracking Sputnik at the Navy’s Vanguard Computer Center in Washington, D.C.; calculating rocket trajectories at the Army Ballistic Missile Agency in Huntsville, Alabama; and as test engineers for Convair Aircraft in San Diego.

 

  3. How you will make the full-fledged version. This is your “ideal world” version that fulfills all of your visions and fantasies (what tools you will use, how you will get them, how confident you are that all the moving parts will work together, etc.)

As an initial experiment, I will digitize the second volume of Margaret Rossiter’s “Women Scientists in America.” (The third volume, published in 2013, is the only one available in ebook form.) To do this, I will scan the book, OCR* the pages, extract the text, parse it computationally, and place it into a database. Once the book has been transformed into data, I will experiment with network analysis of the corpus.

*OCR is optical character recognition https://en.wikipedia.org/wiki/Optical_character_recognition
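For illustration only, here is a minimal sketch of how that scan-to-database workflow might be orchestrated in Python. The helper functions are hypothetical placeholders for the tools listed below; this is a sketch of the shape of the pipeline, not a finished implementation.

```python
# Hypothetical end-to-end sketch of the scan -> OCR -> parse -> database pipeline.
# The helpers are placeholders for the tools proposed below.
from pathlib import Path

def ocr_page(image_path: Path) -> str:
    """Placeholder: run the chosen OCR engine on one scanned page."""
    raise NotImplementedError

def parse_entries(raw_text: str) -> list:
    """Placeholder: split OCR'd text into records (name, field, institution, pages...)."""
    raise NotImplementedError

def load_into_database(records: list) -> None:
    """Placeholder: insert cleaned records into the project database."""
    raise NotImplementedError

def run_pipeline(scan_dir: str) -> None:
    records = []
    for page in sorted(Path(scan_dir).glob("*.tif")):   # step 1: the scanned pages
        text = ocr_page(page)                           # step 2: OCR
        records.extend(parse_entries(text))             # step 3: parse computationally
    load_into_database(records)                         # step 4: place into a database

# run_pipeline("scans/rossiter_vol2/")   # hypothetical directory of page scans
```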

Proposed hardware and software

Scanner: GC library

OCR engine: ABBYY FineReader, possibly using the online version (ABBYY SDK) that acts as a distributed computing service. (Distributed computing breaks up tasks and places them on many, potentially thousands, of computers on a network, allowing for much faster processing.)

Cleaning data: OpenRefine

Named Entity Recognition: Overview Docs (cloud-based data journalism tool)

Database: TBD, possibly a relational (SQL) database or MongoDB

Network analysis: Gephi

  4. Your assessment of how much time this will take, and how much of the skills you currently know and what you would have to learn.

Sticky parts of workflow

There is no one-size-fits-all technique to OCR an entire book. I use ABBYY FineReader regularly, but have never used it on such a large amount of material. I have found that, even for smaller amounts of material, the results can be scattershot. However, the online version, ABBYY SDK, allows for training the OCR engine to pattern-match against a particular corpus. Training is likely a necessity for successful outcomes when batch processing a large number of scans. If desirable, the training data can then be ported to the local OCR engine.
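As a rough illustration of what a batch OCR pass over the page scans could look like, here is a short sketch that uses the open-source Tesseract engine (via the pytesseract library) as a stand-in for ABBYY FineReader, whose SDK has its own, service-based API. The directory names are hypothetical.

```python
# Minimal batch-OCR sketch using the open-source Tesseract engine via pytesseract,
# standing in for ABBYY FineReader; directory names are hypothetical.
from pathlib import Path
from PIL import Image
import pytesseract

def ocr_batch(scan_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in sorted(Path(scan_dir).glob("*.tif")):
        text = pytesseract.image_to_string(Image.open(page))
        (out / f"{page.stem}.txt").write_text(text, encoding="utf-8")

# ocr_batch("scans/test_batch/", "ocr_output/")
```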

Time to learn tools           

ABBYY SDK: 2-3 days

OpenRefine: a few hours with tutorials

Overview Docs: a few hours of experiments and reading documentation

A new database: several days with video tutorials, asking for advice from Digital Fellows, attending workshops and reading documentation

Gephi or d3.js: attending a workshop and several weeks of experiments to learn this software even at a nuts-and-bolts level.

Time to perform labor

Scanning: The scanning process will take most of a day to do carefully, including checking the scans, making sure the files are named appropriately, and backing them up.

OCR: The best way to proceed is to first make test batches. Depending on the results, it will likely be necessary to train the OCR engine. Never having done this before, with the expected learning curve, I think that two days will be sufficient to get a strong sense of the viability of this method and another two days to OCR the entire set of scans.

Data cleaning: If done on the entire book, my estimate is that it will take one to two weeks full-time for the first pass. It is likely that additional problems will be discovered once I actually start working with the data, but further attempts to massage it thereafter should not be as time-consuming.
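If some of this cleaning ends up happening in code rather than interactively in OpenRefine, a first pass might look like the following pandas sketch. The file and column names are assumptions about how the extracted records will be structured.

```python
# Hypothetical first-pass cleaning of extracted records with pandas; file and
# column names are assumptions (OpenRefine could do the same work interactively).
import pandas as pd

df = pd.read_csv("extracted_entries.csv")           # assumed export from the parsing step
df["name"] = df["name"].str.strip().str.title()     # normalize whitespace and capitalization
df["institution"] = df["institution"].str.strip()
df = df.drop_duplicates(subset=["name", "institution"])
df.to_csv("cleaned_entries.csv", index=False)
```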

Named entity recognition: I know the least about the amount of time this step will take; but Overview Docs is streamlined for use by journalists who are not programmers, and it extracts entities quickly and automatically. If it turns out that the results from Overview are not what I am looking for, I would try open-source Python tools: the pandas library in a Jupyter notebook, paired with an NLP library for the entity extraction itself. (I took a workshop on pandas and Jupyter during the GCDRI event.) One of the Digital Fellows is a computational linguist and an expert user of these technologies and could advise. I am estimating that this step will take three days.
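For a sense of what that notebook-based fallback might look like, here is a sketch using the open-source spaCy library for the entity extraction. spaCy is my assumption here, not something named in the plan, and the input file path is hypothetical.

```python
# Sketch of a notebook-style named-entity pass with spaCy (an assumed open-source
# alternative to Overview Docs); run `python -m spacy download en_core_web_sm` first.
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical path to one OCR'd page of text
text = open("ocr_output/page_001.txt", encoding="utf-8").read()

doc = nlp(text)
for ent in doc.ents:
    if ent.label_ in ("PERSON", "ORG", "GPE"):   # people, organizations, places
        print(ent.text, ent.label_)
```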

Configuring the database and placing data in database: Two days initially with iterative labor thereafter as necessary.
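Since the database choice is still open, here is one minimal sketch of what the schema could look like if a relational option such as SQLite were chosen. The table and column names are placeholders, not a settled design.

```python
# Minimal SQLite sketch of one possible schema; table and column names are
# placeholders, and SQLite is only one of the database options under consideration.
import sqlite3

conn = sqlite3.connect("women_scientists.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS scientists (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        field       TEXT,
        institution TEXT,
        pages       TEXT     -- page references back to Rossiter, vol. 2
    )
""")
with conn:
    conn.execute(
        "INSERT INTO scientists (name, field, institution, pages) VALUES (?, ?, ?, ?)",
        ("Example Name", "Geology", "Example Agency", "45-47"),  # placeholder row
    )
conn.close()
```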

Network analysis: To do the initial set-up, I would work with the Digital Fellows. These tools are notoriously difficult to work with, but can produce remarkable-looking results. Depending on the end goals of the project, which are still in formation, one could easily spend months tinkering with the settings of these tools to achieve a desired result.
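Before moving into Gephi itself, a rough co-occurrence network could be assembled in Python with networkx and exported in a format Gephi reads (GEXF). Treating two people who appear on the same page as connected is just one assumed way of defining an edge, and the sample data below simply reuses names from the user story above.

```python
# Rough sketch: build a co-occurrence network (people appearing on the same page)
# with networkx and export it for Gephi. The edge definition and sample data are
# assumptions for illustration only.
from itertools import combinations
import networkx as nx

# Hypothetical input: {page_number: [names found on that page]}
page_index = {
    12: ["Rachel Carson", "Shirley Briggs"],
    45: ["Shirley Briggs", "Cornelia Cameron"],
}

G = nx.Graph()
for page, names in page_index.items():
    for a, b in combinations(sorted(set(names)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

nx.write_gexf(G, "cooccurrence.gexf")   # open this file in Gephi
```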

  5. How you will make the stripped-down version. The stripped-down version is the minimally viable product. It is the most *bare bones* version to prove that what you are trying to get at is viable. (what tools you will use, how you will get them, how confident you are that all the moving parts will work together, etc.)

MVP version: scan and OCR only the index of the book, not the entire volume.

To do this as an MVP project, I might perform the same tasks but work only with the index of the book (approximately 80 pages). Using the index, besides scaling back the amount of time it will take to perform all of the tasks, will also allow far easier extraction of named entities (people, locations, organizations) because the pre-work, so to speak, has already been performed by a human indexer.
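To give a sense of why the index is the easier target, here is a hedged sketch of parsing OCR'd index lines into a CSV file. The assumed entry format ("Surname, First, 12, 45-47") is a guess at how the index is laid out; the real layout will need to be checked against the scans.

```python
# Hypothetical parser for OCR'd index lines of the form "Cameron, Cornelia, 12, 45-47";
# the actual layout of Rossiter's index is an assumption and must be verified.
import csv
import re

ENTRY = re.compile(r"^(?P<surname>[^,]+),\s*(?P<first>[^,\d]+),\s*(?P<pages>[\d,\s-]+)$")

def parse_index(txt_path: str, csv_path: str) -> None:
    rows = []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            match = ENTRY.match(line.strip())
            if match:
                rows.append([match["surname"].strip(),
                             match["first"].strip(),
                             match["pages"].strip()])
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["surname", "first_name", "pages"])
        writer.writerows(rows)

# parse_index("ocr_output/index.txt", "index_entries.csv")
```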

Proposed hardware and software

Scanner: GC library

OCR engine: ABBYY FineReader, possibly using the online version (ABBYY SDK) that acts as a distributed computing service. (Distributed computing breaks up tasks and places them on many, potentially thousands, of computers on a network, allowing for much faster processing.)

Cleaning data: OpenRefine

Named Entity Recognition: Overview Docs (cloud-based data journalism tool) or open-source Python tools (pandas plus an NLP library) in a Jupyter notebook.

Database: possibly FileMaker

  6. Your assessment of how much time this will take, and how much of the skills you currently know and what you would have to learn.

Scanning: The scanning process will take half a day to do carefully, including checking the scans, making sure the files are named appropriately, and backing them up.

OCR: The best way to proceed with this is to first make test batches. Depending on the results, even for an MVP, it will likely be necessary to train the OCR engine. Never having done this before, and allowing for the expected learning curve, I think that two days will be sufficient to get a strong sense of the viability of this method, plus another two days to OCR the entire set of scans.

Data cleaning: Because the index has fewer pages, fewer words per page, and is already structured (to an extent), this should take far less time than the full-fledged version of the project. I would estimate two days, with some additional iterations later.

Named entity recognition: I think Overview Docs could handle the index material rather easily, perhaps in as little as a few hours.

Configuring the database and placing data in the database: I have worked with FileMaker for many years and can set up a simple structure rather quickly if I have a CSV file. Half a day initially, with iterative labor thereafter as necessary.

Network analysis: I think for a pilot project, I might forgo this step. Alternatively, I could work with one of the Digital Fellows to make a rudimentary example of what a useful visualization would look like.

Responses

  • Lynne Turner says:
    Eileen – I really like this project. Your extended introductory context frames how structural sexism and historical treatments emphasizing hagiography have rendered invisible the contributions of women to science during the mid/late 20th century. You of course have a solid grasp of the technical tools available for this research project – I am interested in how the text analysis/visualization works for a project of this scope. I like that you have factored writing a white paper about the project into the plan. Will the white paper focus on both the process for creating the resource as well as an analytical treatment of your research? Also, where will the resource be housed after it is created to maximize ongoing use by government agencies, scientists, researchers, educators, etc.? Finally, while this is beyond the scope of your project, which will create a research tool to fill a gap in historical knowledge, a future iteration could develop from this into a teaching tool.
  • This is such a necessary and exciting project, and I am a little jealous that you came up with something so clear-cut in its scope, intention, development and functionality (my projects are rarely so focused). The possibilities for visualization are nearly endless – only limited by the level of detail you include when structuring the data. For the full (not MVP) project, how would you decide what to include and exclude, beyond the basics like the 5 Ws? Aside from the technical challenges–which you are confident about handling–that seems like it might be the most difficult part of the process.
  • Eileen, first and foremost, congratulations for your idea and acknowledging the women in all these areas that need to be acknowledged.
    You have spoken about your project throughout the semester, and reading it in such detail gave me a clearer picture. I cannot wait to visit your site. Perhaps you can include some Spanish or Latin American women who have contributed but whom no one knows about. I could then use it in my classes.
  • I am good with this as it is outlined: seems well thought through, needed, and doable. I think the harder work is going to be to deliver it to audiences who want and need it, and to convince others that they do as well. I do not think that this needs to be part of the ITP IS project, but it needs to be on your horizon and in your mind as you make it. How do you get the resources to these historians? Do you track what they do with it? Do you engage with them to make it better for their needs and uses? I really like the white paper, so perhaps it is part of the process of getting it to the hands that want and need the database.
  • Much to be excited about here, Eileen.

    My concerns are about the viability of a single data source when attempting to do network analysis. This is not a conceptual or even a research concern, but more of a rhetorical challenge at this stage of the project. The innovation and the bulk of the labor here reside in the process that you will be establishing for getting data into a format whereby it is deployable (searchable, sortable, visualizable). We can only begin to explore your core research questions once enough data is ingested in the db. Maybe I’m wrong and the Rossiter book will give you enough to work with, but you haven’t given us enough yet to support an argument that it will.

    It certainly is a viable ITP IS project for you to set up a process by which data sources can be ingested into a db that can then provide the basis for an interactive website for researchers and teachers. OCRing the data, parsing it into a usable structure, storing it, and then identifying how it will be retrieved are doable if challenging tasks. You are well positioned to do this work and document it for the broader community, which would be a tremendous contribution.

    How will you address intellectual property concerns?
