How to remove headers and footers from a PDF file using PDFBox in Java
I am using a PDF parser to convert a PDF to text. The code below converts a PDF to a text file using Java. My PDF file contains the following data:
Data Sheet (header)
PHP Courses for PHP Professionals (header)
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
All copyrights reserved. (footer)

Sample code:
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.PrintWriter;

    // PDFBox 1.x package names
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {

        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        // PDFTextParser constructor
        public PDFTextParser() {
        }

        // Extract the text from a PDF document
        String pdfToText(String fileName) {
            File f = new File(fileName);
            if (!f.isFile()) {
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                e.printStackTrace();
                return null;
            } finally {
                // Release the document whether or not extraction succeeded
                try {
                    if (cosDoc != null) cosDoc.close();
                    if (pdDoc != null) pdDoc.close();
                } catch (Exception e1) {
                    e1.printStackTrace();
                }
            }
            return parsedText;
        }

        // Write the parsed text from the PDF to a file
        void writeTextToFile(String pdfText, String fileName) {
            try {
                PrintWriter pw = new PrintWriter(fileName);
                pw.print(pdfText);
                pw.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Extracts the text from a PDF document and writes it to a text file
        public static void test() {
            String args[] = {"c://sample.pdf", "c://sample.txt"};
            if (args.length != 2) {
                System.exit(1);
            }
            PDFTextParser pdfTextParserObj = new PDFTextParser();
            String pdfToText = pdfTextParserObj.pdfToText(args[0]);
            if (pdfToText != null) {
                pdfTextParserObj.writeTextToFile(pdfToText, args[1]);
            }
        }

        public static void main(String args[]) throws IOException {
            test();
        }
    }

The above code works for extracting the text from the PDF. But my requirement is to ignore the header and footer and extract only the content from the PDF file. Required output:
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.

Please suggest how to do this. Thanks.
In general there is nothing special about header or footer text in PDFs. It is possible to tag that material differently, but tagging is optional, and the OP did not provide a sample PDF to check.
Thus, some manual work (or some failure-intensive image analysis) is necessary to find the regions on the pages that hold header, content, and footer material.
As soon as you have the coordinates of these regions, though, you can use PDFTextStripperByArea, which extends PDFTextStripper, to collect the text by region. Simply define a region for the page content using the largest rectangle that includes the content but excludes headers and footers, and call getTextForRegion for the defined region in place of pdfStripper.getText(pdDoc).
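The region itself is plain geometry, so it can be sketched without PDFBox. The following is a minimal sketch, not the answerer's code: the class name ContentRegion, the method between, and the sample measurements (a 612 x 792 point page with a 72 point header band and a 50 point footer band) are all assumptions made for illustration. It also assumes the rectangle is measured from the top-left corner of the page with y growing downward, which is how the area stripper is generally understood to interpret regions; verify this against a real page.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: computes the largest rectangle that spans the full page
// width while excluding a header band at the top and a footer band at the bottom.
final class ContentRegion {

    private ContentRegion() {
    }

    // All values are in points. The origin is assumed to be the top-left
    // corner of the page, with y increasing toward the bottom.
    static Rectangle2D between(double pageWidth, double pageHeight,
                               double headerHeight, double footerHeight) {
        if (pageWidth <= 0 || pageHeight <= 0 || headerHeight < 0 || footerHeight < 0) {
            throw new IllegalArgumentException("dimensions must be non-negative and the page non-empty");
        }
        double contentHeight = pageHeight - headerHeight - footerHeight;
        if (contentHeight <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for content");
        }
        // x = 0 and the full width keep everything between the two bands.
        return new Rectangle2D.Double(0, headerHeight, pageWidth, contentHeight);
    }

    public static void main(String[] args) {
        // Assumed sample values: US Letter page, header and footer heights measured by hand.
        Rectangle2D region = between(612, 792, 72, 50);
        System.out.println(region);
    }
}
```

With PDFBox, a rectangle like this would be registered on a PDFTextStripperByArea via addRegion("content", region), the page processed with extractRegions(page), and the text read back with getTextForRegion("content"). The header and footer heights still have to be measured for the actual document, and the loop over pages is left out here.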