Internet Research Team

How to batch-convert multiple documents into .txt files

    Anders Wallace

    Hi, I have a technical question. I am using the Mallet natural language processing toolkit in my research, and as part of this process I have to convert many (thousands of) documents into .txt files. Does anyone know of a program or function that can batch-convert many docs into .txt files at once, without having to manually open and convert each document individually? Thanks for reading!


    Micki Kaufman

    Hi Anders –

    To do this most efficiently you need two things: a command-line utility to convert whatever filetype you have into .txt files, and a shell script, macro or other automation tool to batch-process each file.

    In the packet I created for the Digital Fellows Workshops, you’ll find two examples of just such a shell script. Each is designed to loop through the contents of a folder called ‘source’ (in the same directory as the script), processing one file at a time until the entire list of files is done.

    Let’s look at a version of ‘’ customized to turn html files into txt files. We’ll call the script ‘’:

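    # Convert each file in 'source' to text in 'output' (the folder must exist).
    function largeloop ()
    {
        # Read one filename per line from standard input.
        while read -r line1; do
            html2text "source/$line1" -o "output/$line1.txt"
        done
    }
    ls source | largeloop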

    Just save that code as a file called ‘,’ into the same folder that contains the ‘source’ folder. Then from the command line, type:


    Note: I wrote the above example to use the open-source command-line tool ‘html2text,’ which can be found here:

    The same script can drive other converters, such as pdftotext, if you swap the html2text line for the corresponding command.
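
    As a rough sketch of that kind of modification (not taken from the packet), the loop could pick a converter based on the file extension. The function names convert_one and batch_convert are hypothetical, the html2text call simply reuses the input/-o form from the script above, and pdftotext is assumed to be installed and called in its usual "pdftotext input.pdf output.txt" form:

```shell
#!/bin/bash
# Hypothetical helper: convert a single file, choosing the tool by extension.
# Assumes html2text accepts "input -o output" (as in the script above) and
# that pdftotext is installed and accepts "input output".
convert_one() {
    local in="$1" out="$2"
    case "$in" in
        *.html|*.htm) html2text "$in" -o "$out" ;;
        *.pdf)        pdftotext "$in" "$out" ;;
        *)            echo "skipping $in: no converter configured" >&2
                      return 1 ;;
    esac
}

# Hypothetical helper: convert every regular file in a source folder,
# writing NAME.txt files into a destination folder (created if missing).
batch_convert() {
    local src="$1" dest="$2" file
    mkdir -p "$dest"
    for file in "$src"/*; do
        [ -f "$file" ] || continue
        convert_one "$file" "$dest/$(basename "$file").txt" \
            || echo "not converted: $file" >&2
    done
}
```

    Calling "batch_convert source output" from the folder that contains ‘source’ would then process the whole folder. Like the original script, this keeps the old extension in the new name, so report.pdf would become output/report.pdf.txt.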

    Good luck!

    PS My packet can be found here:

    Anders Wallace

    Hi Micki, thanks so much! This looks really great. I’ll download your toolkit, and I’m looking forward to trying it out. Thanks for creating this :).
