• JAVASCRIPT > PDF files

      Extract text from PDF

      Some code already exists: "extract text from pdf in Javascript", http://hublog.hubmed.org/archives/001948.html and https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

      1) Which files from the sources above are required for the extraction?

      2) I don’t know exactly how to adapt this code for a standalone application rather than a web page.

       

      ANSWERS

      Here is a nice example of how to use pdf.js to extract the text: http://git.macropus.org/2011/11/pdftotext/example/

      Of course, you will have to remove a lot of code for your purpose.

       

      I’ve made an easier approach with the same library, pdf.js (using the latest version), that doesn’t need to post messages between iframes.

      The following example extracts all the text from the first page of the PDF only:

      /**
       * Retrieves the text of a specific page within a PDF document obtained through pdf.js
       *
       * @param {number} pageNum Specifies the number of the page (pages are numbered from 1)
       * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
       * @returns {Promise<string>} Resolves with the text of the page
       **/
      function getPageText(pageNum, PDFDocumentInstance) {
          // Return the promise chain directly so that any error also reaches the caller
          return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
              // The main trick to obtain the text of the PDF page: use the getTextContent method
              return pdfPage.getTextContent();
          }).then(function (textContent) {
              var textItems = textContent.items;
              var finalString = "";
      
              // Append the string of each text item to the final string
              for (var i = 0; i < textItems.length; i++) {
                  finalString += textItems[i].str + " ";
              }
      
              // Resolve with the text retrieved from the page
              return finalString;
          });
      }
      
      /**
       * Extract the text from the PDF
       */
      
      var PDF_URL  = '/path/to/example.pdf';
      PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
      
          // Total number of pages, available if more than the first page is needed
          var totalPages = PDFDocumentInstance.numPages;
          var pageNumber = 1;
      
          // Extract the text
          return getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
              // Show the text of the page in the console
              console.log(textPage);
          });
      
      }).catch(function (reason) {
          // PDF loading or text extraction error
          console.error(reason);
      });

      Read the article about this solution here.

      The library has changed since the first solution was posted, so that solution should no longer work with the latest version of pdf.js. The code above should work in most cases.
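
Because the library's entry point has changed between versions, here is a rough sketch of how the page-by-page approach might be extended to every page while tolerating both styles of `getDocument`. The helper names `openDocument` and `getAllPagesText` are made up for this illustration, and the `pdfjsLib` parameter is an assumption standing for whatever object your pdf.js build exposes (older builds used the global `PDFJS`). It assumes that older builds return a thenable from `getDocument`, while newer ones return a loading task whose `promise` property resolves to the document.

```javascript
/**
 * Hypothetical helper: opens a document and always returns a Promise for it.
 * Assumption: `pdfjsLib` is the object exposed by your pdf.js build.
 */
function openDocument(pdfjsLib, source) {
    var task = pdfjsLib.getDocument(source);
    // Older builds: the task itself is thenable. Newer builds: use task.promise.
    return Promise.resolve(typeof task.then === 'function' ? task : task.promise);
}

/**
 * Hypothetical helper: collects the text of every page, in page order.
 * Resolves with an array containing one string per page.
 */
async function getAllPagesText(pdfDocument) {
    var pages = [];
    // pdf.js numbers pages from 1, not 0
    for (var pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        var pdfPage = await pdfDocument.getPage(pageNum);
        var textContent = await pdfPage.getTextContent();
        var strings = textContent.items.map(function (item) {
            return item.str;
        });
        pages.push(strings.join(' '));
    }
    return pages;
}
```

It could then be called as `openDocument(pdfjsLib, '/path/to/example.pdf').then(getAllPagesText)`, which gives an array of page texts to join or process as needed. Pages are read one after another to keep the order obvious; for large documents, requesting them in parallel with `Promise.all` may be faster.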

       

      http://git.macropus.org/2011/11/pdftotext/example/

       

      view-source:http://git.macropus.org/2011/11/pdftotext/example/

       

      https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

       


       

       

 
