Splitting PDF document

This topic has 2 replies, 3 voices, and was last updated 13 years, 1 month ago by David Barr.

Creator

Topic
July 19, 2012 at 3:08 pm #53203
John Bass
Participant
It looks like I may need to process large PDFs (200 pages, say) and split them into separate PDFs based on page headings. The resulting files would simply be written to a folder and would not need to be sent in an HL7 message. Has anyone done anything like this, or is anyone willing to share ideas on how to do it?

Viewing 1 reply thread

Author

Replies
- July 19, 2012 at 3:50 pm #76908
  Mitchell Rawlins
  Participant
  To split out files I would use pdftk.
  
  The difficulty will be determining which pages belong together; if each page can stand alone, then pdftk’s burst feature is all we need.
  
  pdftk also has a function to uncompress PDFs, which may let you use a lexical analyzer to determine which pages belong together.
  
  My first approach would be:
  
  1) split out all the pages into separate files
  
  2) figure out a way to classify each of the separate PDFs. My first lead here is the uncompress feature of pdftk. Failing that, I would do a lot of Google searches, and as a last resort look at borrowing libraries from a PDF viewer like Evince or Okular.
  
  3) join up the groups of pages into their own PDFs.
  
  This may not be the best approach, but it’s what I would take if trying to do this sort of thing.
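
A minimal sketch of the three steps above, for anyone who wants a starting point. It is not tested against real reports. It assumes pdftk and pdftotext (from Poppler) are installed and that the first non-blank line of text on a page is its heading; the function names and the output naming scheme are made up for illustration.

```shell
#!/bin/bash
# Sketch only: split a PDF into one file per run of pages sharing a heading.
# Assumes pdftk and pdftotext (Poppler) are installed, and that the first
# non-blank line of a page is its heading (an assumption about the input).

# Print the heading of a one-page PDF: its first non-blank line of text.
page_heading() {
    pdftotext -layout "$1" - |
        awk 'NF { gsub(/^[ \t]+|[ \t\r]+$/, ""); print; exit }'
}

# Make a heading safe to use in a file name (hypothetical naming scheme).
safe_name() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

split_by_heading() {
    local infile outdir workdir page heading prev n
    infile=$1 outdir=$2 prev="" n=0
    workdir=$(mktemp -d) || return 1
    mkdir -p "$outdir"

    # 1) One file per page; %04d keeps the glob below in page order.
    pdftk "$infile" burst output "$workdir/pg_%04d.pdf" || return 1

    set --    # reuse the positional parameters as the current group of pages
    for page in "$workdir"/pg_*.pdf; do
        # 2) Classify the page; a page with no text stays with the previous group.
        heading=$(page_heading "$page")
        [ -n "$heading" ] || heading=$prev
        if [ $# -gt 0 ] && [ "$heading" != "$prev" ]; then
            # 3) Heading changed: join the pages collected so far.
            n=$((n + 1))
            pdftk "$@" cat output "$outdir/$(printf '%03d' "$n")_$(safe_name "${prev:-untitled}").pdf"
            set --
        fi
        set -- "$@" "$page"
        prev=$heading
    done

    # Join whatever is left as the final group.
    if [ $# -gt 0 ]; then
        n=$((n + 1))
        pdftk "$@" cat output "$outdir/$(printf '%03d' "$n")_$(safe_name "${prev:-untitled}").pdf"
    fi
    rm -rf "$workdir"
}
```

Calling `split_by_heading big.pdf /some/output/folder` would then write files such as `001_<heading>.pdf`, `002_<heading>.pdf` and so on. The running number is there so that a heading which comes back later in the document does not overwrite an earlier file.
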
- July 19, 2012 at 9:34 pm #76909
  David Barr
  Participant
  Here’s an example of something similar that I’ve done using Ghostscript, Pdftk and Poppler.
  
  Code:

      #!/bin/bash -x
      echo "START cumsum.sh $(date)"
      infile="$1"
      outfile="$2"
      outdir=/cygdrive/c/data/chartmaxx/data
      origname=$(echo $3 | cut -d. -f1)
      cd /cygdrive/c/scripts/cumsum
      # Rewrite the input as a clean PDF, then extract its text.
      gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=$origname.pdf $infile < /dev/null
      pdftotext -layout $origname.pdf $outfile
      cp $outfile $origname.txt
      # Turn the first "MM/DD/YYYY HH:MM" line into YYYYMMDDHHMM.
      # [...] marks text missing from the post as published; $acct, $pages and $dt,
      # and the start of the loop closed by "done", are not defined in what remains.
      dttm=$(perl -ne 'if(/^ +(..)\/(..)\/(....) +(..):(..)$/) { print "$3$1$2$4$5"; exit }' [...] $outdir/${origname}_$acct.pdf.hl7
      eval $(echo pdftk $origname.pdf cat $pages output ${outdir}/${origname}_${acct}.pdf)
      done
      count=$(egrep 'END OF REPORT' $origname.txt | wc -l)
      rm $origname.txt $origname.pdf
      email -s "Date_$dt" zzz@valleymed.org <<EOF
      CUMSUMS REPORTS for [$dt].
      Number processed: [$count].
      End of message.
      EOF
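
The script above uses `$pages` in its `pdftk ... cat` call, but the part that sets it is not in the post. The following is not that code, just one possible way to get page ranges out of the `pdftotext` output. It assumes every report ends on a page containing `END OF REPORT` (the marker the script counts) and relies on `pdftotext` putting a form feed after each page, which it does by default. The function names and the `report_NNN.pdf` naming are made up for illustration.

```shell
#!/bin/bash
# Sketch only, not the code that is missing from the post above.
# Assumes each report ends on a page containing "END OF REPORT" and that
# pdftotext separated the pages with form feeds (its default behaviour).

# Print one pdftk page range per report, e.g. "1-2" then "3".
# Reads the text file named in $1, or standard input if none is given.
report_ranges() {
    awk 'BEGIN { RS = "\f"; start = 1 }
         /END OF REPORT/ {
             # NR is the current page number because records are pages.
             print (start == NR ? start : start "-" NR)
             start = NR + 1
         }' "$@"
}

# Write each report to its own PDF (hypothetical names: report_001.pdf, ...).
split_reports() {
    local pdf txt outdir pages n
    pdf=$1 txt=$2 outdir=$3 n=0
    mkdir -p "$outdir"
    report_ranges "$txt" | while read -r pages; do
        n=$((n + 1))
        pdftk "$pdf" cat "$pages" output "$outdir/$(printf 'report_%03d.pdf' "$n")"
    done
}
```

With the variables from the script, the call would look like `split_reports $origname.pdf $origname.txt $outdir`. Pages after the last `END OF REPORT` marker are left out, since they do not form a complete report.
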

The forum ‘Cloverleaf’ is closed to new topics and replies.