
Digital Humanities Initiative

The CUNY Digital Humanities Initiative (CUNY DHI), launched in Fall 2010, aims to build connections and community among those at CUNY who are applying digital technologies to scholarship and pedagogy in the humanities. All are welcome: faculty, students, and technologists, experienced practitioners and beginning DHers, enthusiasts and skeptics.

We meet regularly on- and offline to explore key topics in the Digital Humanities and to share our work, questions, and concerns. See our blog for more information on upcoming events (it’s also where we present our group’s work to a wider audience). Help edit the CUNY Digital Humanities Resource Guide, our first group project. And, of course, join the conversation on the Forum.



Hathi Trust Research Center feature extraction

  • Hi All —

    This message was posted on the MLA Commons; I thought it would be of interest to members of this group.

    Harriett Green started the topic Hathi Trust Research Center feature extraction

    “Dear friends and colleagues,

    The HathiTrust Research Center (HTRC) is proud to announce the alpha release of a new dataset, consisting of page-level features extracted from a quarter-million text volumes.

    HTRC Extracted Features Dataset:

    Features are data attributes defined in such a way that they can be identified by a computer and analyzed at scale. The HTRC Feature Extraction alpha dataset has already processed the underlying text, identifying headers and footers, rejoining hyphenated words, and offering page-level details such as:

    – term-frequency counts, per section (head/body/footer), per page

    – occurrences of terms as different parts of speech

    – line counts and sentence counts

    – character counts at the start or end of lines
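
As a rough illustration of what page-level features of this kind look like, the sketch below computes a few of them for a single page of plain text. It is not the HTRC's pipeline or its data format: the function names, the fixed one-line header and footer split, the regex tokenizer, and the naive sentence counter are all assumptions made for the example, and part-of-speech counts are left out because they would need a tagger.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; not the HTRC implementation.
TOKEN_RE = re.compile(r"[A-Za-z']+")
SENTENCE_END_RE = re.compile(r"[.!?]+(?:\s|$)")


def section_features(lines):
    """Compute simple features for one section (a list of non-empty lines)."""
    text = " ".join(lines)
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return {
        "term_counts": Counter(tokens),
        "line_count": len(lines),
        # Naive sentence count: runs of terminal punctuation followed by
        # whitespace or the end of the section.
        "sentence_count": len(SENTENCE_END_RE.findall(text)),
        # Counts of the characters that begin and end each line.
        "begin_line_chars": Counter(ln[0] for ln in lines),
        "end_line_chars": Counter(ln[-1] for ln in lines),
    }


def page_features(page_text, header_lines=1, footer_lines=1):
    """Split a page into header/body/footer and compute features for each.

    Assumption: the first `header_lines` and last `footer_lines` non-empty
    lines are the header and footer. Real header detection is more involved.
    """
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    header = lines[:header_lines]
    rest = lines[header_lines:]
    cut = max(len(rest) - footer_lines, 0)
    body, footer = rest[:cut], rest[cut:]
    sections = (("header", header), ("body", body), ("footer", footer))
    return {name: section_features(sec) for name, sec in sections}


sample_page = """CHAPTER ONE
The whale surfaced. The crew watched the
whale in silence.
12"""

features = page_features(sample_page)
print(features["body"]["term_counts"].most_common(2))
print(features["body"]["line_count"], features["body"]["sentence_count"])
```

For the sample page, the body section reports two lines, two sentences, and `the` as its most frequent term.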

    Since it is currently in alpha version, we are looking for feedback on how data like this can help you in your research and how we can better serve the scholarly community.

    Today’s dataset is built upon the HathiTrust’s non-Google-digitized public domain volumes — that is, the original scanned representations of all the texts can be accessed through the HathiTrust. We have features for 67,932,813 pages from 250,178 volumes, spanning nearly six hundred years. The median date of the material is 1899, and the text is primarily English. While this alpha release originates from public domain data, this type of extracted feature dataset also provides a road map toward non-consumptive research on works not in the public domain, since the features, though useful for scholarly research purposes, are not sufficient to reconstruct the text itself.

    The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois. In conjunction with the HathiTrust Digital Library, the HTRC team strives to meet the technical challenges that researchers face when dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

    Questions? Please contact .
