PDF Filter
Jhargraveiii (talk | contribs)
Revision as of 14:24, 11 October 2016
Overview
The PDF Filter is an Okapi component that implements the IFilter interface for PDF files.
Warning: This filter does not merge back into PDF format. Instead, it produces a plain text file upon merging.
Processing Details
Input Encoding
PDF files are binary files and do not have a specific encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to be "UTF-16". Any encoding selected in tools like Rainbow will be ignored.
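As an illustration of what a fixed "UTF-16" encoding means for a Java string, the following is a minimal sketch using only the standard library. It is not Okapi code: the class and method names are hypothetical, and the real filter may handle the conversion differently.

```java
import java.nio.charset.StandardCharsets;

public class Utf16EncodingSketch {

    // Hypothetical helper (not part of Okapi): encodes text as UTF-16 bytes.
    // Java's UTF-16 encoder writes a big-endian byte-order mark first.
    public static byte[] encodeUtf16(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Hypothetical helper: decodes UTF-16 bytes back into a Java string.
    public static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String extracted = "Texte extrait du PDF";
        byte[] bytes = encodeUtf16(extracted);
        // The round trip preserves the text regardless of any encoding
        // a user might have selected elsewhere.
        System.out.println(decodeUtf16(bytes).equals(extracted));
    }
}
```

Because the text is already held as a Java string, no source encoding needs to be guessed, which is consistent with the encoding setting in tools like Rainbow being ignored.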
Segmentation
TextUnits are created following the default rules of the Plain Text filter. That is, any text followed by a newline will create a new TextUnit or paragraph.
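The newline rule can be sketched in plain Java as follows. This is a simplified illustration, not the filter's actual implementation: the class and method names are hypothetical, and skipping blank lines is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineSegmentationSketch {

    // Hypothetical helper (not part of Okapi): splits extracted text into
    // one unit per line, mirroring the "text followed by a newline" rule.
    public static List<String> splitIntoUnits(String extractedText) {
        List<String> units = new ArrayList<>();
        // \R matches any line-break sequence (\n, \r\n, \r, ...).
        for (String line : extractedText.split("\\R")) {
            // Assumption: empty lines do not produce a unit.
            if (!line.isEmpty()) {
                units.add(line);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        String extracted = "First paragraph.\nSecond paragraph.\n\nThird paragraph.\n";
        List<String> units = splitIntoUnits(extracted);
        for (int i = 0; i < units.size(); i++) {
            System.out.println((i + 1) + ": " + units.get(i));
        }
    }
}
```

One practical consequence of this rule is that a paragraph wrapped across several lines in the extracted text would yield one unit per line rather than a single unit.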
Parameters
This filter has no parameters.
Limitations
- This filter merges back into plain text format, not PDF.