Splitting PDF document

  • #53203
    John Bass
    Participant

    It looks like I may need to process large PDFs (around 200 pages each) and split them into separate PDFs based on page headings.  The resulting files only need to be written to a folder; they do not need to be sent in an HL7 message.  Has anyone done something like this, or is anyone willing to share ideas on how to do it?

    • #76908
      Mitchell Rawlins
      Participant

      To split out files I would use pdftk.

      The difficulty will be determining which pages belong together. If each page can stand alone, then pdftk’s burst feature, which writes every page to its own file, is all we need.
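
For the stand-alone case, a minimal sketch of the burst step might look like the following. The function name `burst_pages` and the `page_%04d.pdf` naming pattern are illustrative choices, not something from this thread; `burst` and `output` are standard pdftk operations.

```shell
# Split every page of a PDF into its own file with pdftk's burst operation.
# burst_pages and the page_%04d.pdf pattern are hypothetical, for illustration only.
burst_pages() {
  infile=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  # pdftk replaces %04d with the page number: page_0001.pdf, page_0002.pdf, ...
  pdftk "$infile" burst output "$outdir/page_%04d.pdf"
}
```

It could be called as `burst_pages big_report.pdf ./pages`. Note that pdftk's burst also writes a `doc_data.txt` report into the current directory, which may need cleaning up afterwards.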

      pdftk also has a function to uncompress PDFs, which may allow you to use a lexical analyzer to determine which PDFs belong together.

      My first approach would be:

      1) split out all the pages into separate files

      2) figure out a way to classify each of the separate PDFs.  My first lead here is the uncompress feature of pdftk.  Otherwise I’m going to be doing a lot of Google searches, and finally looking to grab some libraries out of a PDF handler like evince or okular.

      3) join up the groups of pages into their own PDFs.

      This may not be the best approach, but it’s what I would take if trying to do this sort of thing.
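
The three steps above can be sketched as one script, assuming the PDFs contain extractable text and that the heading is the first non-blank line of each page. Both are assumptions; scanned pages would need OCR first, and headings that include a page number would need trimming before comparison. This sketch swaps the burst/classify/join sequence for a shortcut: it reads each page's heading with Poppler's `pdfinfo` and `pdftotext`, then lets `pdftk ... cat` copy each run of consecutive pages straight out of the original, so no per-page files are written. The function names and the `part_NNN.pdf` output names are made up for illustration.

```shell
# Print the first non-blank line of one page, trimmed (assumed to be the heading).
# Usage: page_heading FILE PAGE
page_heading() {
  pdftotext -f "$2" -l "$2" -layout "$1" - |
    awk 'NF { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, ""); print; exit }'
}

# Read "page<TAB>heading" lines on stdin and print "range<TAB>heading"
# for each run of consecutive pages that share the same heading.
group_pages() {
  awk -F '\t' '
    function range(a, b) { return (a == b) ? a : a "-" b }
    NR == 1 { start = $1 }
    NR > 1 && $2 != heading {
      printf "%s\t%s\n", range(start, last), heading
      start = $1
    }
    { heading = $2; last = $1 }
    END { if (NR > 0) printf "%s\t%s\n", range(start, last), heading }
  '
}

# Write one PDF per heading group into OUTDIR.
# Usage: split_by_heading FILE OUTDIR
split_by_heading() {
  infile=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  pages=$(pdfinfo "$infile" | awk '/^Pages:/ { print $2 }')
  page=1
  while [ "$page" -le "$pages" ]; do
    printf '%s\t%s\n' "$page" "$(page_heading "$infile" "$page")"
    page=$((page + 1))
  done | group_pages | {
    n=0
    while IFS="$(printf '\t')" read -r range heading; do
      n=$((n + 1))
      out="$outdir/$(printf 'part_%03d.pdf' "$n")"
      printf '%s -> %s (%s)\n' "$range" "$out" "$heading"
      # </dev/null keeps pdftk from consuming the loop's input.
      pdftk "$infile" cat "$range" output "$out" < /dev/null
    done
  }
}
```

It could be run as `split_by_heading big_report.pdf ./split`. Only `group_pages` is pure text processing, so it can be tried on hand-written input before any PDF tools are involved.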

    • #76909
      David Barr
      Participant

      Here’s an example of something similar that I’ve done using Ghostscript, pdftk and Poppler.

      Code:

      #!/bin/bash -x
      echo "START cumsum.sh $(date)"
      infile="$1"
      outfile="$2"
      outdir=/cygdrive/c/data/chartmaxx/data
      origname=$(echo $3 | cut -d. -f1)
      cd /cygdrive/c/scripts/cumsum
      gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=$origname.pdf $infile < /dev/null
      pdftotext -layout $origname.pdf $outfile
      cp $outfile $origname.txt
      dttm=$(perl -ne 'if(/^ +(..)\/(..)\/(....) +(..):(..)$/) { print "$3$1$2$4$5"; exit }' $outdir/${origname}_$acct.pdf.hl7
       eval $(echo pdftk $origname.pdf cat $pages output ${outdir}/${origname}_${acct}.pdf)
      done
      count=$(grep -c 'END OF REPORT' $origname.txt)
      rm $origname.txt $origname.pdf
      email -s "Date_$dt" zzz@valleymed.org <<EOF
      CUMSUMS REPORTS for [$dt]. Number processed: [$count].
      End of message.
      EOF
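
The loop that sets `$pages` and `$acct` did not survive in the post above, so the following is not a reconstruction of it. It is only a sketch of one way page ranges could be derived from `pdftotext` output, assuming each report ends on a page containing the `END OF REPORT` marker that the script counts, and relying on `pdftotext` separating pages with a form feed character. `report_ranges` is a hypothetical name.

```shell
# Print one page range per report, given pdftotext output on stdin.
# Pages are separated by form feeds; a report is assumed to end on the
# page that contains the marker (default: END OF REPORT).
# Pages after the last marker are not printed.
report_ranges() {
  awk -v marker="${1:-END OF REPORT}" '
    function range(a, b) { return (a == b) ? a : a "-" b }
    BEGIN { RS = "\f"; start = 1 }          # one record per page
    index($0, marker) { print range(start, NR); start = NR + 1 }
  '
}
```

For example, `pdftotext -layout in.pdf - | report_ranges` would print one range per line, and each range could then be passed to `pdftk in.pdf cat RANGE output ...` in the same way the script passes `$pages`.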
