Digital Humanities Initiative

The CUNY Digital Humanities Initiative (CUNY DHI), launched in Fall 2010, aims to build connections and community among those at CUNY who are applying digital technologies to scholarship and pedagogy in the humanities. All are welcome: faculty, students, and technologists, experienced practitioners and beginning DHers, enthusiasts and skeptics.

We meet regularly on- and offline to explore key topics in the Digital Humanities, and share our work, questions, and concerns. See our blog for more information on upcoming events (it’s also where we present our group’s work to a wider audience). Help edit the CUNY Digital Humanities Resource Guide, our first group project. And, of course, join the conversation on the Forum.

Technical question: how to batch-convert multiple docs to .txt files

  • Hello all,

    I have a technical question. I am using the Mallet natural language processing toolkit in my research, and as part of this process I have to convert many (thousands of) documents into .txt files. Does anyone know of a program or function that can batch-convert many docs into .txt files at once, without having to manually open and convert each document individually? Thanks for reading!

    Anders

Viewing 3 replies - 1 through 3 (of 3 total)
  • Hi, Anders. I suspect that command-line tools like pdftotext plus a
    shell-script will probably get the job done; it’s what I used (at Micki’s
    suggestion, and based on her code) to convert from pdf to txt, also for
    MALLET.

    What filetype are the docs to begin with? If it’s pdf, you might be able to
    just fork my script at
    https://github.com/benmiller314/Dissertation-Research/blob/working/Shell%20scripts%20and%20commands/ben_clean_and_consolidate.sh.
    I tried to make the comments clear enough that I could figure them out
    later, even if I forgot all the scripting language I learned to write it,
    but let me know if you have questions (or if it helps!)

    Best,
    Ben
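
For readers who have not opened the linked script, the general "pdftotext plus a shell loop" idea can be sketched roughly as below. This is not Ben's script, only a minimal illustration: it assumes pdftotext (from Poppler or Xpdf) is installed and on the PATH, it handles a single folder without recursing, and the function name batch_pdf_to_txt is made up for this example.

```shell
# Sketch: convert every PDF in one folder to a .txt file beside it.
# Assumes pdftotext (from Poppler or Xpdf) is installed and on the PATH.
# batch_pdf_to_txt is a hypothetical name used only for illustration.
batch_pdf_to_txt() {
    dir="$1"
    for pdf in "$dir"/*.pdf; do
        # If no PDF matches, the glob pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        # report.pdf -> report.txt in the same folder
        pdftotext "$pdf" "${pdf%.pdf}.txt" || echo "failed: $pdf" >&2
    done
}
```

After pasting the function into a terminal session, it could be called as `batch_pdf_to_txt ~/Documents/filestoconvert`. Quoting "$pdf" keeps filenames with spaces intact, and a failed conversion is reported on stderr without stopping the rest of the batch.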

  • In addition to what type of files you’re starting with, what operating system are you running? If you’re on a Mac, and you’re dealing with .html, .rtf, .doc or .docx files, you can use the built-in textutil command-line tool to batch-convert documents. The following commands assume all your files are .docx files (change to .doc, .html, .rtf, etc., as necessary).

    Open a terminal window (/Applications/Utilities/Terminal.app) and type the following (where /base/path/to/files is the directory where your files are stored, e.g. ~/Documents/filestoconvert would be the folder “filestoconvert” in the Documents folder within your home directory):

    textutil -convert txt /base/path/to/files/*.docx
    

    If you want to concatenate all of the text into a single .txt file, use this command instead:

    textutil -cat txt /base/path/to/files/*.docx
    

    If you need to convert files recursively within a directory structure, you can try this:

    find /base/path/to/files -name '*.docx' -print0 | xargs -0 textutil -convert txt
    

    Edit: Obviously, if you want to concatenate the files into a single long ass text file recursively, use the previous command but replace “-convert” with “-cat”

    • This reply was modified 9 years, 7 months ago by Keith Miyake.

  • Hi Keith and Benjamin, many thanks for this great advice! Yes, the files are PDFs, and I’m working on a Mac. Ben, I’ve forked your script on GitHub; thanks for your work here. Looking forward to using these tools!
