Digital Humanities Initiative

Technical question: how to batch-convert multiple docs to .txt files

    Anders Wallace

    Hello all,

    I have a technical question. I am using the MALLET natural language processing toolkit in my research, and as part of this process I have to convert thousands of documents into .txt files. Does anyone know of a program or function that can batch-convert many docs into .txt files at once, without my having to open and convert each document manually? Thanks for reading!


    Benjamin Miller

    Hi, Anders. I suspect that command-line tools like pdftotext plus a
    shell script will probably get the job done; it’s what I used (at Micki’s
    suggestion, and based on her code) to convert from pdf to txt, also for

    What filetype are the docs to begin with? If it’s pdf, you might be able to
    just fork my script from at
    I tried to make the comments clear enough that I could figure them out
    later, even if I forgot all the scripting language I learned to write it,
    but let me know if you have questions (or if it helps!)
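
    For readers who just want the general shape of the pdftotext-plus-shell-script approach, here is a minimal sketch. It is not the script mentioned above; the function name convert_pdfs is made up for illustration, and it assumes pdftotext (shipped with Poppler or Xpdf) is installed and on your PATH.

```shell
# Sketch: convert every .pdf in one directory to a .txt file beside it.
# Assumption: pdftotext is installed; convert_pdfs is a hypothetical name.
convert_pdfs() {
  local dir="${1:-.}"
  local pdf
  for pdf in "$dir"/*.pdf; do
    # If no PDFs match, the glob stays literal; skip it.
    [ -e "$pdf" ] || continue
    # Write report.pdf -> report.txt; report failures but keep going.
    pdftotext "$pdf" "${pdf%.pdf}.txt" || echo "failed: $pdf" >&2
  done
}

# Example call: convert_pdfs ~/Documents/filestoconvert
```

    The quoting around "$pdf" is there so that filenames with spaces survive the loop.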


    Keith Miyake

    In addition to what type of files you’re starting with, what operating system are you running? If you’re on a Mac, and you’re dealing with .html, .rtf, .doc or .docx files, you can use the built-in textutil command-line tool to batch-convert documents. The following commands assume all your files are .docx files (change to .doc, .html, .rtf, etc. as necessary).

    Open a terminal window (/Applications/Utilities/) and type the following (where /base/path/to/files is the directory where your files are stored, e.g. ~/Documents/filestoconvert would be the folder “filestoconvert” in the Documents folder within your home directory):

    textutil -convert txt /base/path/to/files/*.docx

    If you want to concatenate all of the text into a single .txt file, use this command instead:

    textutil -cat txt /base/path/to/files/*.docx

    If you need to convert files recursively within a directory structure, you can try this:

    find /base/path/to/files -name '*.docx' -print0 | xargs -0 textutil -convert txt

    Edit: Obviously, if you want to concatenate the files recursively into a single long-ass text file, use the previous command but replace “-convert” with “-cat”.
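
    If the files turn out to be PDFs rather than .docx, the same recursive find pattern could probably drive pdftotext instead of textutil. A rough sketch, assuming pdftotext is installed (the function name convert_pdfs_recursive is made up for illustration):

```shell
# Sketch: walk a directory tree and convert each .pdf to .txt in place.
# Assumption: pdftotext is on PATH. Given only an input file, pdftotext
# writes the text to a file with the same name and a .txt extension.
convert_pdfs_recursive() {
  local root="${1:-.}"
  # -exec ... \; runs pdftotext once per file, so odd filenames are safe.
  find "$root" -type f -name '*.pdf' -exec pdftotext {} \;
}

# Example call: convert_pdfs_recursive /base/path/to/files
```

    Note that -name '*.pdf' is case-sensitive, so files ending in .PDF would be skipped.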

    • This reply was modified 7 years, 11 months ago by Keith Miyake.


    Anders Wallace

    Hi Keith and Benjamin, many thanks for this great advice! Yes, the files are PDFs, and I’m working on a Mac. Ben, I’ve forked your script on GitHub; thanks for your work here. Looking forward to using these tools!
