PDF Filter

From Okapi Framework

Latest revision as of 14:27, 11 October 2016

Overview

The PDF Filter is an Okapi component that implements the IFilter interface for PDF files. The filter does not handle complex formatting such as tables or multi-level lists. Its typical use case is scraping the text from a PDF for quick-and-dirty word counts and leverage analysis.

Warning: This filter does not merge back into PDF format. Instead, it produces a plain text file upon merging.

Processing Details

Input Encoding

PDF files are binary files and do not have a specific encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to be "UTF-16". Any encoding selected in tools like Rainbow will be ignored.
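
The effect of the forced encoding can be illustrated with a minimal, standard-library-only sketch. It does not use the Okapi API: the class ForcedEncodingSketch and its methods effectiveEncoding and encode are hypothetical names invented for this illustration, and the sample string is made up. The only point shown is that text held as a Java string round-trips through UTF-16 regardless of the encoding a user selected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ForcedEncodingSketch {

    // Hypothetical helper (not part of Okapi): whatever encoding the
    // user selected, the extracted text is treated as UTF-16.
    static Charset effectiveEncoding(Charset userSelected) {
        return StandardCharsets.UTF_16;
    }

    // Encodes already-extracted text with the forced encoding,
    // ignoring the user's selection.
    static byte[] encode(String extractedText, Charset userSelected) {
        return extractedText.getBytes(effectiveEncoding(userSelected));
    }

    public static void main(String[] args) {
        // Made-up sample text; the euro sign cannot be represented in ISO-8859-1.
        String extractedText = "R\u00e9sum\u00e9: 10 \u20ac";

        // The user "selects" ISO-8859-1, but the selection has no effect.
        byte[] bytes = encode(extractedText, StandardCharsets.ISO_8859_1);

        // Decoding as UTF-16 recovers the original text unchanged.
        String decoded = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(decoded.equals(extractedText));
    }
}
```

Because the selection is ignored, characters that the selected encoding could not represent are not lost in this sketch.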

Segmentation

TextUnits are created following the default rules of the Plain Text filter. That is, any text followed by a newline will create a new TextUnit or paragraph.
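
The newline rule can be sketched in plain Java as follows. This is an illustration only and does not call the Okapi API: the class NewlineSegmentationSketch and its method splitIntoUnits are hypothetical, each returned string stands in for one TextUnit, and skipping blank lines is an assumption of this sketch rather than documented filter behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineSegmentationSketch {

    // Hypothetical helper (not part of Okapi): every run of text that
    // ends at a line break becomes one unit.
    static List<String> splitIntoUnits(String extractedText) {
        List<String> units = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : extractedText.split("\\R")) {
            // Assumption of this sketch: blank lines produce no unit.
            if (!line.trim().isEmpty()) {
                units.add(line);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from a PDF.
        String extracted = "Title\nFirst paragraph.\n\nSecond paragraph.";
        for (String unit : splitIntoUnits(extracted)) {
            System.out.println("[" + unit + "]");
        }
    }
}
```

One consequence of this rule is that a paragraph whose lines are hard-wrapped in the extracted text would be split into several units, one per line.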

Parameters

This filter has no parameters.

Limitations

  • This filter merges back in plain text format, not PDF.