Internet Research Team

Public Group active 3 months, 1 week ago

How to batch-convert multiple documents into .txt files

This topic contains 2 replies, has 2 voices, and was last updated 4 years, 2 months ago.

Viewing 3 posts - 1 through 3 (of 3 total)
  • #38027

    Anders Wallace
    Participant

    Hi, I have a technical question. I am using the Mallet natural language processing toolkit in my research, and as part of this process I have to convert many (thousands of) documents into .txt files. Does anyone know of a program or function that can convert many documents into .txt files in one batch, without having to open and convert each one manually? Thanks for reading!

    Anders

    #38038

    Micki Kaufman
    Participant

    Hi Anders –

    To do this most efficiently you need two things: a command-line utility to convert whatever filetype you have into .txt files, and a shell script, macro or other automation tool to batch-process each file.

    In the packet I created for the Digital Fellows Workshops, you’ll find two examples of just such a shell script. Each one loops through the contents of a folder called ‘source’ (in the same directory as the script) and processes the files one at a time until the whole list is done.

    Let’s look at a version of ‘loop-rename.sh’ customized to turn html files into txt files. We’ll call the script ‘loop-textify.sh’:

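    #!/bin/bash
    
    # Make sure the 'output' folder exists before writing into it.
    mkdir -p output
    
    function largeloop ()
    {       while IFS= read -r line1; do
            # Options first, then the input file, following html2text's usage synopsis.
            # The quotes keep filenames containing spaces in one piece.
            html2text -o "output/$line1.txt" "source/$line1"
            done
    }
    
    # Feed each filename in 'source' to the loop, one per line.
    ls source | largeloop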

    Just save that code as a file called ‘loop-textify.sh’ in the same folder that contains the ‘source’ folder, and make it executable (chmod +x loop-textify.sh). Then from the command line, type:

    ./loop-textify.sh

    Note: I wrote the above example to use the open-source command-line tool ‘html2text,’ which can be found here: http://www.mbayer.de/html2text/readme.shtml

    The same script can drive pdftotext or other converters: replace the html2text line with the equivalent command for the tool you need.
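
    As a rough sketch of that kind of modification, here is one way the loop might be generalized so the converter is passed in as an argument. This is not part of the packet: the function name convert_all and its argument order are made up for illustration, and it assumes the converter accepts the input file followed by the output file, which is how pdftotext is normally called. Tools that take the output through a flag, such as html2text with -o, would need a small wrapper.

```shell
#!/bin/bash

# Hypothetical helper (not from the packet): run a converter over every
# file with a given extension in a source folder.
# Usage: convert_all <source-dir> <output-dir> <extension> <command...>
# The command is invoked as: <command...> <input-file> <output-file>
convert_all () {
    local src="$1" out="$2" ext="$3"
    shift 3
    mkdir -p "$out"
    local f base
    for f in "$src"/*."$ext"; do
        [ -e "$f" ] || continue               # nothing matched the pattern
        base="$(basename "$f" ."$ext")"       # strip the folder and extension
        "$@" "$f" "$out/$base.txt"
    done
}

# Only run when a 'source' folder and pdftotext are both present.
if [ -d source ] && command -v pdftotext >/dev/null 2>&1; then
    convert_all source output pdf pdftotext
fi
```

    Using a glob in a for loop instead of piping ls means filenames with spaces survive intact, and each output file is named after its source with the original extension replaced by .txt.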

    Good luck!

    PS My packet can be found here: http://www.mickikaufman.com/packet.zip

    #38102

    Anders Wallace
    Participant

    Hi Micki, thanks so much! This looks really great. I’ll download your toolkit, looking forward to trying it out. Thanks for creating this :).
