Digital Humanities Initiative

Technical question: how to batch-convert multiple docs to .txt files

Viewing 4 posts - 1 through 4 (of 4 total)
  • #38026
    Anders Wallace
    Participant

    Hello all,

    I have a technical question. I am using the Mallet natural language processing toolkit in my research, and as part of this process I have to convert many documents (thousands) into .txt files. Does anyone know a program or function that can batch-convert many docs into .txt files at once, without having to open and convert each document manually? Thanks for reading!

    Anders

    #38032
    Benjamin Miller
    Participant

    Hi, Anders. I suspect that command-line tools like pdftotext plus a
    shell script will get the job done; it’s what I used (at Micki’s
    suggestion, and based on her code) to convert from PDF to txt, also for
    MALLET.

    What filetype are the docs to begin with? If it’s PDF, you might be able to
    just fork my script at
    https://github.com/benmiller314/Dissertation-Research/blob/working/Shell%20scripts%20and%20commands/ben_clean_and_consolidate.sh.
    I tried to make the comments clear enough that I could figure them out
    later, even if I forgot all the scripting language I learned to write it,
    but let me know if you have questions (or if it helps!)
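
    For anyone who wants the general shape before opening the script, here is a minimal sketch of the pdftotext-plus-loop idea. It is not the linked script. It assumes pdftotext is installed and that the PDFs sit directly in one folder; convert_pdfs and the PDFTOTEXT variable are names made up for illustration.

```shell
#!/bin/sh
# Sketch: convert every PDF directly inside a folder to a .txt file beside it.
# PDFTOTEXT is a hypothetical override hook; by default the pdftotext
# command-line tool is used and must already be installed.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

# convert_pdfs DIR: hypothetical helper, not taken from the linked script.
convert_pdfs() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # If the glob matched nothing it stays literal, so skip that case.
        if [ ! -e "$pdf" ]; then
            continue
        fi
        txt="${pdf%.pdf}.txt"
        # Leave existing .txt files alone so an interrupted run can resume.
        if [ -e "$txt" ]; then
            continue
        fi
        # Report failures but keep going through the rest of the folder.
        "$PDFTOTEXT" "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}

# Example call (the path is a placeholder):
# convert_pdfs ~/Documents/filestoconvert
```

    Skipping files that already have a .txt is a design choice that suits very large batches; delete that check if you would rather overwrite.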

    Best,
    Ben

    #38041
    Keith Miyake
    Participant

    In addition to what type of files you’re starting with, what operating system are you running? If you’re on a Mac and you’re dealing with .html, .rtf, .doc, or .docx files, you can use the built-in textutil command-line tool to batch-convert documents. The following commands assume all your files are .docx files (change to .doc, .html, .rtf, etc. as necessary).

    Open a terminal window (/Applications/Utilities/Terminal.app) and type the following (where /base/path/to/files is the directory where your files are stored, e.g. ~/Documents/filestoconvert would be the folder “filestoconvert” in the Documents folder within your home directory):

    textutil -convert txt /base/path/to/files/*.docx
    

    If you want to concatenate all of the text into a single .txt file, use this command instead:

    textutil -cat txt /base/path/to/files/*.docx
    

    If you need to convert files recursively within a directory structure, you can try this:

    find /base/path/to/files -name '*.docx' -print0 | xargs -0 textutil -convert txt
    

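    Edit: Obviously, if you want to concatenate the files recursively into a single long-ass text file, use the previous command but replace “-convert” with “-cat”. One caveat: with thousands of files, xargs may split the list across more than one textutil run, so check that the combined file really contains everything.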
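
    With thousands of files it is worth checking that nothing was skipped. A rough sketch of one way to do that, assuming each converted file lands next to its source with the same base name; list_unconverted is a made-up helper name, and it will misreport filenames that contain newlines.

```shell
#!/bin/sh
# list_unconverted DIR EXT: hypothetical helper that prints every file under
# DIR ending in .EXT that has no .txt file of the same base name beside it.
list_unconverted() {
    dir=$1
    ext=$2
    find "$dir" -type f -name "*.$ext" | while IFS= read -r src; do
        # Strip the final extension and look for the matching .txt file.
        txt="${src%.*}.txt"
        if [ ! -e "$txt" ]; then
            printf '%s\n' "$src"
        fi
    done
}

# Example call (the path is a placeholder):
# list_unconverted /base/path/to/files docx
```

    An empty result means every source file has a .txt counterpart; it says nothing about whether the text inside came out cleanly.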

    • This reply was modified 7 years, 11 months ago by Keith Miyake.
    #38101
    Anders Wallace
    Participant

    Hi Keith and Benjamin, many thanks for this great advice! Yes, the files are PDFs, and I’m working on a Mac. Ben, I’ve forked your script on GitHub; thanks for your work here. Looking forward to using these tools!
