Difference between revisions of "PDF Filter"

From Okapi Framework
(Created page with "{{Filters Header}} ==Overview== The PDF Filter is an Okapi component that implements the IFilter interface for PDF files. {{WarningBox|This is a filter does not merge back i...")
 
Line 11:
  ===Input Encoding===
- TODO?
+ PDF files are binary files and do not have a specific encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to be "UTF-16". Any encoding selected in tools like Rainbow will be ignored.
  ===Segmentation===

Revision as of 14:19, 11 October 2016

Overview

The PDF Filter is an Okapi component that implements the IFilter interface for PDF files.

Warning: This filter does not merge back into PDF format. Instead, it produces a plain text file upon merging.


Processing Details

Input Encoding

PDF files are binary files and do not have a specific text encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to "UTF-16". Any encoding selected in tools like Rainbow is ignored.
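
The behavior described above can be illustrated with a minimal Java sketch. It is not taken from the Okapi source: the class name PdfEncodingSketch and the helper effectiveEncoding are hypothetical, and the sample string only stands in for text extracted from a PDF. The sketch assumes nothing beyond the standard library and shows why a user-selected encoding has no effect once the text is held as a Java string.

```java
import java.nio.charset.StandardCharsets;

public class PdfEncodingSketch {

    // Hypothetical helper (not an Okapi API): whatever encoding the user
    // selected, the encoding reported for the extracted text is "UTF-16".
    static String effectiveEncoding(String userSelectedEncoding) {
        return "UTF-16";
    }

    // Encodes a Java string to UTF-16 bytes and decodes it again.
    // Java strings are UTF-16 internally, so no characters are lost.
    static String roundTripUtf16(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16);
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF (assumption for illustration).
        String extracted = "Caf\u00e9 \u2013 PDF text";

        // The user-selected encoding is ignored.
        System.out.println(effectiveEncoding("windows-1252"));

        // The extracted text survives a UTF-16 round trip unchanged.
        System.out.println(roundTripUtf16(extracted).equals(extracted));
    }
}
```

Because the extracted text never exists as bytes in a document-specific encoding, there is nothing for a user-selected encoding to decode; the sketch models that by discarding the argument.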

Segmentation

TODO?

Parameters

This filter has no parameters.

Limitations

  • This filter merges back in plain text format, not PDF.