PDF Filter


Overview

The PDF Filter is an Okapi component that implements the IFilter interface for PDF files. The filter does not handle complex formatting such as tables or multi-level lists. Its typical use case is scraping the text from a PDF for quick-and-dirty word counts and leverage analysis.

Warning: This filter does not merge back into PDF format. Instead, it produces a plain text file upon merging.

Processing Details

Input Encoding

PDF files are binary files and do not have a specific encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to be "UTF-16". Any encoding selected in tools like Rainbow will be ignored.
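
As a rough illustration of the behavior described above, the following sketch shows what "forcing" UTF-16 could look like in plain Java: the encoding requested by the user is accepted but never used. The class and method names (ForcedEncodingSketch, effectiveEncoding, encode) are hypothetical and are not part of the Okapi API; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Okapi implementation.
final class ForcedEncodingSketch {

    // Returns the encoding actually applied. The requested encoding
    // (for example, one selected in a tool) is deliberately ignored.
    static Charset effectiveEncoding(Charset requestedByUser) {
        return StandardCharsets.UTF_16;
    }

    // Encodes text that has already been extracted as a Java String.
    static byte[] encode(String extractedText, Charset requestedByUser) {
        return extractedText.getBytes(effectiveEncoding(requestedByUser));
    }

    public static void main(String[] args) {
        String extracted = "caf\u00e9 \u20ac"; // sample text with non-ASCII characters
        byte[] bytes = encode(extracted, StandardCharsets.ISO_8859_1);

        // Decoding with UTF-16 restores the original string unchanged.
        String roundTrip = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(roundTrip.equals(extracted));
    }
}
```

Because the text is already a Java string by the time it is encoded, no information about the original PDF font encodings is needed at this stage.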

Segmentation

TextUnits are created following the default rules of the Plain Text filter. That is, any text followed by a newline will create a new TextUnit or paragraph.
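
The newline rule above can be sketched in plain Java as follows. This is a minimal approximation, not the Plain Text filter's actual code: NewlineSegmenterSketch and toUnits are hypothetical names, the units are plain strings rather than Okapi TextUnit objects, and skipping blank lines is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Okapi API.
final class NewlineSegmenterSketch {

    // Splits extracted text into one unit per line.
    static List<String> toUnits(String extractedText) {
        List<String> units = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, and others).
        for (String line : extractedText.split("\\R")) {
            // Assumption: lines containing only whitespace produce no unit.
            if (!line.trim().isEmpty()) {
                units.add(line);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        String text = "First heading\nA paragraph of text.\r\n\r\nLast line";
        List<String> units = toUnits(text);
        for (int i = 0; i < units.size(); i++) {
            System.out.println((i + 1) + ": " + units.get(i));
        }
    }
}
```

One practical consequence of this rule is that a sentence wrapped across several lines in the PDF would end up in several units, which is part of why the output suits rough word counts better than translation.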

Parameters

This filter has no parameters.

Limitations

  • This filter merges back in plain text format, not PDF.