The PDF Filter is an Okapi component that implements the IFilter interface for PDF files. The filter does not handle complex formatting such as tables or multi-level lists. Its typical use case is scraping the text from a PDF for quick-and-dirty word counts and leverage analysis.
PDF files are binary files and do not have a specific encoding. Okapi extracts all text from the PDF as a Java string and forces the encoding to "UTF-16". Any encoding selected in tools such as Rainbow is ignored.
TextUnits are created following the default rules of the Plain Text filter. That is, each run of text followed by a newline creates a new TextUnit (paragraph).
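The newline rule can be illustrated with a minimal sketch in plain Java. This is not Okapi's implementation: the class `NewlineSegmenter` and its `split` method are hypothetical names used only for illustration, and skipping blank lines is an assumption of the sketch rather than documented filter behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Okapi: shows how text extracted as a
// single Java string could be cut into paragraph-like units at newlines.
class NewlineSegmenter {

    // Splits the text at line breaks (\n, \r\n, or \r) and returns one
    // entry per non-blank line. Dropping blank lines is an assumption.
    static List<String> split(String extractedText) {
        List<String> units = new ArrayList<>();
        for (String line : extractedText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                units.add(line);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\nSecond paragraph.\n\nThird paragraph.\n";
        List<String> units = split(text);
        // Print each unit with a running index.
        for (int i = 0; i < units.size(); i++) {
            System.out.println((i + 1) + ": " + units.get(i));
        }
    }
}
```

With the sample string above, the sketch yields three units, one per non-blank line. Because the split is purely line-based, a PDF paragraph that wraps across several visual lines would likely come out as several units, which is why the results are best suited to rough word counts.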
This filter has no parameters.
- This filter merges back in plain text format, not PDF.