How to remove headers and footers from a PDF file using PDFBox in Java


I am using a PDF parser to convert a PDF to text. The code below converts a PDF to a text file using Java. My PDF file contains the following data:

    Data Sheet (header)
    PHP Courses for PHP Professionals (header)
    Networking Academy
    We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
    Copyrights reserved. (footer)

Sample code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    // PDFBox imports; these package names assume PDFBox 1.8.x
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {
        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        // PDFTextParser constructor
        public PDFTextParser() {
        }

        // Extract text from a PDF document; returns null on any failure
        String pdfToText(String fileName) {
            File f = new File(fileName);
            if (!f.isFile()) {
                return null;
            }

            try {
                parser = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                return null;
            }

            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                e.printStackTrace();
                try {
                    if (cosDoc != null) cosDoc.close();
                    if (pdDoc != null) pdDoc.close();
                } catch (Exception e1) {
                    e1.printStackTrace();
                }
                return null;
            }
            return parsedText;
        }

        // Write the parsed text from the PDF to a file
        void writeTextToFile(String pdfText, String fileName) {
            try {
                PrintWriter pw = new PrintWriter(fileName);
                pw.print(pdfText);
                pw.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Extracts text from a PDF document and writes it to a text file
        public static void test() {
            String args[] = {"c://sample.pdf", "c://sample.txt"};
            if (args.length != 2) {
                System.exit(1);
            }

            PDFTextParser pdfTextParserObj = new PDFTextParser();
            String pdfToText = pdfTextParserObj.pdfToText(args[0]);

            // Nothing is written when extraction failed
            if (pdfToText != null) {
                pdfTextParserObj.writeTextToFile(pdfToText, args[1]);
            }
        }

        public static void main(String args[]) throws IOException {
            test();
        }
    }

The above code works for extracting the PDF text, but my requirement is to ignore the header and footer and extract only the content from the PDF file. Required output:

    Networking Academy
    We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.

Please suggest how to do this. Thanks.

In general there is nothing special about header or footer texts in PDFs. It is possible to tag that material differently, but tagging is optional and the OP did not provide a sample PDF to check.

Thus, some manual work (or some failure-intensive image analysis) is generally necessary to find the regions on the pages for header, content, and footer material.

As soon as you have the coordinates for these regions, though, you can use PDFTextStripperByArea, which extends PDFTextStripper, to collect the text by regions. Simply define a region for the page content using the largest rectangle that includes the content but excludes headers and footers, and after pdfStripper.getText(pdDoc) call getTextForRegion for the defined region.
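To make the "largest rectangle" step concrete, here is a minimal sketch that only computes the content rectangle and uses nothing beyond the JDK. The class name `ContentRegion`, the method `contentRegion`, and the header and footer heights are hypothetical values for illustration; in practice they would have to be measured on the actual PDF. It also assumes coordinates in PDF points with the origin at the top-left of the page and y growing downwards, which is how PDFTextStripperByArea regions are usually specified.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of PDFBox.
final class ContentRegion {

    /**
     * Returns the largest rectangle on a page that excludes a header band at
     * the top and a footer band at the bottom. All values are in points
     * (1/72 inch). Assumption: origin at the top-left corner, y grows down.
     */
    static Rectangle2D contentRegion(double pageWidth, double pageHeight,
                                     double headerHeight, double footerHeight) {
        if (headerHeight < 0 || footerHeight < 0
                || headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException(
                    "header and footer must be non-negative and fit on the page");
        }
        // Full page width; starts below the header and stops above the footer.
        return new Rectangle2D.Double(0, headerHeight, pageWidth,
                pageHeight - headerHeight - footerHeight);
    }

    public static void main(String[] args) {
        // Assumed example: A4 page (595 x 842 pt), 70 pt header, 50 pt footer.
        Rectangle2D region = contentRegion(595, 842, 70, 50);
        System.out.println(region);
    }
}
```

With PDFBox on the classpath, a rectangle like this would be registered via `addRegion("content", region)` on a `PDFTextStripperByArea`, and the text read back with `getTextForRegion("content")` once the stripper has processed the page (for a single page, `extractRegions(page)` does this). The region name `"content"` is arbitrary.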

