Splitting PDF document


  • #53203
    John Bass
    Participant

      I may need to process large PDFs, say 200 pages each, and split them into separate PDFs based on page headings.  The resulting files would simply be written to a folder and would not need to be sent in an HL7 message.  Has anyone done anything like this, or is anyone willing to share ideas on how to do it?

      • #76908
        Mitchell Rawlins
        Participant

          To split out files I would use pdftk.

          The difficulty will be determining which pages belong together.  If each page can stand alone, pdftk’s burst feature is all we need.

          pdftk can also uncompress PDFs, which may let you use a lexical analyzer to determine which pages belong together.
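          For reference, the two pdftk operations mentioned so far look roughly like this.  The function names and the page_%04d.pdf pattern are placeholders chosen for illustration.  Note that even in an uncompressed PDF the text may still be encoded through embedded fonts, so a plain search for a heading is not guaranteed to find it.

```shell
#!/bin/sh
# Sketch only: function names and the output naming pattern are placeholders.

# Write every page of $1 as its own PDF (page_0001.pdf, ...) in directory $2.
burst_pages() {
    pdftk "$1" burst output "$2/page_%04d.pdf"
}

# Write an uncompressed copy of $1 to $2 so its page streams can be inspected
# with ordinary text tools.
uncompress_pdf() {
    pdftk "$1" output "$2" uncompress
}
```

          After sourcing these, something like burst_pages big.pdf pages followed by uncompress_pdf pages/page_0001.pdf plain.pdf would give a single page to examine in a text editor.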

          My first approach would be:

          1) split out all the pages into separate files

          2) figure out a way to classify each of the separate PDFs.  My first lead here is the uncompress feature of pdftk.  Failing that, I would do a lot of Google searches and eventually look at borrowing libraries from a PDF viewer such as Evince or Okular.

          3) join up the groups of pages into their own PDFs.

          This may not be the best approach, but it’s what I would take if trying to do this sort of thing.
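          The three steps above could be sketched roughly as follows.  This is not a tested recipe: it assumes that a page’s heading is its first non-blank line of text, and it reads each page with Poppler’s pdfinfo and pdftotext rather than pdftk’s uncompress output.  It also skips the burst-and-rejoin round trip by asking pdftk to cat each page range straight from the original file.  The function names and the part_NNN.pdf naming are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: the heading of a page is its first non-blank text line.

# Print "page<TAB>heading" for every page of the PDF given as $1.
page_headings() {
    local pdf pages p heading
    pdf=$1
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    p=1
    while [ "$p" -le "$pages" ]; do
        # Extract the text of page $p only and keep its first non-blank line.
        heading=$(pdftotext -f "$p" -l "$p" -layout "$pdf" - |
            awk 'NF { gsub(/^ +| +$/, ""); print; exit }')
        printf '%s\t%s\n' "$p" "$heading"
        p=$((p + 1))
    done
}

# Read "page<TAB>heading" lines on stdin and print "start-end<TAB>heading"
# for each run of consecutive pages that share a heading.
group_pages() {
    awk -F'\t' '
        NR > 1 && $2 != heading { print start "-" last "\t" heading; start = $1 }
        NR == 1 { start = $1 }
        { heading = $2; last = $1 }
        END { if (NR) print start "-" last "\t" heading }
    '
}

# Write one PDF per group of pages from $1 into the directory $2.
split_by_heading() {
    local pdf outdir range heading n tab
    pdf=$1
    outdir=$2
    n=0
    tab=$(printf '\t')
    page_headings "$pdf" | group_pages |
    while IFS=$tab read -r range heading; do
        n=$((n + 1))
        pdftk "$pdf" cat "$range" output "$outdir/$(printf 'part_%03d.pdf' "$n")"
    done
}
```

          Usage would be something like split_by_heading big.pdf /some/folder after sourcing the functions.  Grouping only merges consecutive pages, so a heading that reappears later in the document starts a new file.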

        • #76909
          David Barr
          Participant

            Here’s an example of something similar that I’ve done using Ghostscript, Pdftk and Poppler.

            Code:

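            #!/bin/bash -x
            echo "START cumsum.sh $(date)"
            infile="$1"
            outfile="$2"
            outdir=/cygdrive/c/data/chartmaxx/data
            # Base name of the original file (third argument), extension removed.
            origname=$(echo "$3" | cut -d. -f1)
            cd /cygdrive/c/scripts/cumsum
            # Run the input through Ghostscript's pdfwrite device, then extract its text with Poppler.
            gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile="$origname.pdf" "$infile" < /dev/null
            pdftotext -layout "$origname.pdf" "$outfile"
            cp "$outfile" "$origname.txt"
            # Build a timestamp from the first line holding only a date and time (../../.... ..:..).
            # NOTE: the archived post lost text at the [...] below, including the rest of the
            # perl command, the loop header, and the assignments of variables such as $acct
            # and $pages, so the lines through "done" are incomplete as shown.
            dttm=$(perl -ne 'if(/^ +(..)\/(..)\/(....) +(..):(..)$/) { print "$3$1$2$4$5"; exit }' [...] $outdir/${origname}_$acct.pdf.hl7
             eval $(echo pdftk $origname.pdf cat $pages output ${outdir}/${origname}_${acct}.pdf)
            done
            # Count the reports by their end marker, then clean up and send a summary.
            count=$(grep -c 'END OF REPORT' "$origname.txt")
            rm "$origname.txt" "$origname.pdf"
            email -s "Date_$dt" zzz@valleymed.org <<EOF
            CUMSUMS REPORTS for [$dt]. Number processed: [$count].
            End of message.
            EOF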
