Internet Research Team

The Internet Research Team is a student-led group of scholars interested in exploring, discussing, and using online and digital research methods. The group also includes faculty and staff and meets regularly throughout the year. We invite people of all levels of technical skills who are conducting or have an interest in online and digital research to join the group here on the Commons and attend the meetings.

For more information, please contact us at cunyirt@gmail.com.

Edwin Mayorga & Micki Kaufman, Coordinators
Collette Sosnowy, PhD & Kiersten Greene, PhD, Founders

How to batch-convert multiple documents into .txt files

  • Hi, I have a technical question. I am using the Mallet natural language processing toolkit in my research, and as part of this process I have to convert many (thousands of) documents into .txt files. Does anyone know of a program or function that can convert many docs into .txt files in one batch, without having to open and convert each document manually? Thanks for reading!

    Anders

Viewing 2 replies - 1 through 2 (of 2 total)
  • Hi Anders –

    To do this most efficiently you need two things: a command-line utility that converts whatever file type you have into .txt files, and a shell script, macro or other automation tool that runs that utility on each file in turn.

    In the packet I created for the Digital Fellows Workshops, you’ll find two examples of just such a shell script. Each one loops through the contents of a folder called ‘source’ (in the same directory as the script) and processes one file at a time until the whole list is done.

    Let’s look at a version of ‘loop-rename.sh’ customized to turn html files into txt files. We’ll call the script ‘loop-textify.sh’:

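    #!/bin/bash

    # Convert every file in 'source' to text, writing the results to 'output'.
    mkdir -p output

    function largeloop ()
    {       while IFS= read -r line1; do
            # html2text expects its options before the input file;
            # the quotes protect file names that contain spaces.
            html2text -o "output/$line1.txt" "source/$line1"
            done
    }

    ls source | largeloop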

    Just save that code as a file called ‘loop-textify.sh’ in the same folder that contains the ‘source’ folder. Make it executable once with ‘chmod +x loop-textify.sh’, then from the command line, type:

    ./loop-textify.sh

    Note: I wrote the above example to use the open-source command-line tool ‘html2text,’ which can be found here: http://www.mbayer.de/html2text/readme.shtml

    To convert other formats, replace the html2text line with the matching command-line tool (for example, pdftotext for PDF files) and adjust its arguments accordingly.
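    As a rough sketch of that kind of modification, the loop can be written so the converter is passed in as an argument. The function name ‘batch_convert’ is made up for this example, and it assumes the converter takes the input file and then the output file as its last two arguments, which is how pdftotext is called. html2text does not follow that convention (it needs ‘-o’), so it would still need its own line as in the script above.

```shell
#!/bin/bash

# Hypothetical helper (not part of the packet): run a converter over every
# file in a source folder and write one .txt file per input.
# Usage: batch_convert SRC_DIR OUT_DIR CONVERTER [ARGS...]
# The converter is invoked as: CONVERTER [ARGS...] INPUT OUTPUT
batch_convert() {
    local src=$1 out=$2
    shift 2
    mkdir -p "$out"
    local f base
    for f in "$src"/*; do
        [ -f "$f" ] || continue           # skip subfolders and an empty folder
        base=$(basename "$f")
        # Drop the old extension (if any) and add .txt
        "$@" "$f" "$out/${base%.*}.txt"
    done
}

# Example call, assuming pdftotext is installed:
# batch_convert source output pdftotext -layout
```

    Globbing with "$src"/* instead of piping ‘ls’ keeps file names with spaces intact, which matters when there are thousands of documents from mixed sources.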

    Good luck!

    PS My packet can be found here: http://www.mickikaufman.com/packet.zip

  • Hi Micki, thanks so much! This looks really great. I’ll download your toolkit, looking forward to trying it out. Thanks for creating this :).
