PHP Content Filter


Overview

The PHP Content Filter is an Okapi component that implements the IFilter interface for PHP content.

The implementation is based on the PHP syntax described in the PHP Language Reference documentation (http://www.php.net/manual/en/langref.php).

Note: This filter is not meant to process HTML files with PHP content, but rather the content of the PHP tags. Beware that many files with a .php extension are HTML files with PHP tags.

The following is an example of simple PHP content. The extractable text is the content of the quoted strings and heredoc blocks:

<?php
$str = <<<EOD
Example of string
spanning multiple lines
using heredoc syntax.
EOD;

/* More complex example, with variables. */
class foo
{
    var $foo;
    var $bar;

    function foo()
    {
        $this->foo = 'Foo';
        $this->bar = array('Bar1', 'Bar2', 'Bar3');
    }
}

$foo = new foo();
$name = 'MyName';

echo <<<EOT
My name is "$name". I am printing some $foo->foo.
Now, I am printing some {$foo->bar[1]}.
This should print a capital 'A': \x41
EOT;
?>

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

  • If the file has a Unicode Byte-Order-Mark:
    • Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
  • Else, if a header entry with a charset declaration exists in the first 1000 characters of the file:
    • If the value of the charset is "charset" (case insensitive):
      • Then the file is likely to be a template with no encoding declared, so the current encoding (auto-detected or default) is used.
      • Else, the declared encoding is used. Note that if the encoding has been detected from a Byte-Order-Mark and the encoding declared in the header entry does not match, a warning is generated and the encoding of the Byte-Order-Mark is used.
  • Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
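
The decision logic above can be sketched roughly as follows. This is an illustration in Python, not the filter's actual (Java) code: the function name choose_input_encoding, the BOM table, and the charset pattern are all assumptions made for the example. The sketch looks for a declaration even when a Byte-Order-Mark is present, so that the mismatch warning described above can be raised.

```python
import codecs
import re
import warnings

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical pattern for a charset declaration; the real filter may differ.
_CHARSET = re.compile(r"charset\s*=\s*([\w.:-]+)", re.IGNORECASE)


def _same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names, tolerating aliases such as utf8/UTF-8."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return a.lower() == b.lower()


def choose_input_encoding(data: bytes, default_encoding: str) -> str:
    bom_encoding = next((enc for bom, enc in _BOMS if data.startswith(bom)), None)
    current = bom_encoding or default_encoding

    # Decode leniently, only to inspect the first 1000 characters.
    head = data.decode(current, errors="replace")[:1000]
    match = _CHARSET.search(head)
    if match is None:
        return current

    declared = match.group(1)
    if declared.lower() == "charset":
        # Placeholder value: treat the file as a template with no declaration.
        return current
    if bom_encoding is not None:
        if not _same_encoding(declared, bom_encoding):
            warnings.warn(
                f"Declared encoding {declared!r} does not match the "
                f"Byte-Order-Mark; using {bom_encoding!r}."
            )
        return bom_encoding
    return declared
```

For example, a file without a Byte-Order-Mark that declares charset=ISO-8859-1 would be read as ISO-8859-1, while one that declares charset=CHARSET would fall back to the default encoding.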

Output Encoding

If the file has a header entry with a charset declaration, the declaration is automatically updated in the output to reflect the encoding selected for the output.

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
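
The Byte-Order-Mark rule for UTF-8 output reduces to a single condition. The following is a minimal sketch in Python; the function name utf8_output_uses_bom and its parameters are hypothetical and not part of the Okapi API.

```python
import codecs


def utf8_output_uses_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document should start with a BOM.

    A BOM is written only when the input was also UTF-8 and had one.
    """
    try:
        input_is_utf8 = codecs.lookup(input_encoding).name == "utf-8"
    except LookupError:
        # Unknown encoding name: it cannot be treated as UTF-8.
        input_is_utf8 = False
    return input_is_utf8 and input_had_bom
```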

Line-Breaks

The output uses the same type of line-break as the original input.
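
One way to preserve the line-break type is to detect it in the input and re-apply it when writing the output. The sketch below is in Python and assumes the first line-break found determines the type; the page does not say how the filter actually detects it, and both function names are hypothetical.

```python
import re

_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def detect_line_break(text: str) -> str:
    """Return the first line-break found in the text, or "\n" if there is none."""
    match = _LINE_BREAK.search(text)
    return match.group(0) if match else "\n"


def apply_line_break(text: str, line_break: str) -> str:
    """Rewrite every line-break in the text to the given type."""
    return _LINE_BREAK.sub(lambda _: line_break, text)
```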


Parameters

Localization directives

Localization directives are special comments you can use to override the default behavior of the filter regarding the parts to extract. The syntax and behavior of the directives are the same across all Okapi filters. Note that the directives override key conditions.

Use localization directives when they are present — Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file is ignored (and all extractable strings are extracted).

Extract items outside the scope of localization directives — Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract items outside localization directives lets you mark up fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set.
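
The interaction between the two options can be summarized as a small decision function. This is a sketch in Python; the function and parameter names are hypothetical and do not correspond to Okapi identifiers.

```python
from typing import Optional


def should_extract(use_directives: bool,
                   extract_outside: bool,
                   directive: Optional[bool]) -> bool:
    """Decide whether an extractable string is actually extracted.

    directive is True or False when the item is within the scope of a
    directive that says "extract" or "do not extract", and None when the
    item is outside the scope of any directive.
    """
    if not use_directives:
        # Directives are ignored: every extractable string is extracted.
        return True
    if directive is not None:
        # Inside a directive's scope, the directive decides.
        return directive
    # Outside any directive, the second option decides.
    return extract_outside
```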

Inline Codes

Has inline codes as defined below — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code. By default the expression is:

(\A[^<]*?>)|(<[\w!?/].*?(>|\Z))
|(\\a|\\b|\\f|\\n|\\r|\\t|\\v)
|(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})
|([\[{][\w_$]+?[}\]])
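
To see what the default expression treats as inline codes, it can be tried on a sample string. The sketch below uses Python's re module as a stand-in for the Java regular expression engine used by Okapi. The syntax of this particular pattern is accepted by both, but the two engines are not identical (for example, \Z behaves slightly differently before a final line terminator), so treat the result as an approximation. The helper name find_inline_codes and the sample text are made up for the example.

```python
import re

# The default expression, joined into a single pattern.
DEFAULT_CODES = re.compile(
    r"(\A[^<]*?>)|(<[\w!?/].*?(>|\Z))"
    r"|(\\a|\\b|\\f|\\n|\\r|\\t|\\v)"
    r"|(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})"
    r"|([\[{][\w_$]+?[}\]])"
)


def find_inline_codes(text: str) -> list:
    """Return every span of the text that would become an inline code."""
    return [m.group(0) for m in DEFAULT_CODES.finditer(text)]


# The sample contains a literal backslash followed by "n", as it would
# appear in the source of a PHP string.
sample = r"Click <b>[OK]</b>\n or mail info@example.org"
print(find_inline_codes(sample))
```

In this sample the tags, the bracketed placeholder, the escape sequence, and the email address are matched, while the surrounding words remain translatable text.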

Add — Click this button to add a new rule.

Remove — Click this button to remove the current rule.

Move Up — Click this button to move the current rule upward.

Move Down — Click this button to move the current rule downward.

[Top-right text box] — Enter the regular expression for the current rule. Use the Modify button to enter the edit mode. The expression must be a valid regular expression. You can check its syntax (and the effect of the rule): the expression is automatically tested against the test data in the text box below, and the result is shown in the bottom-right text box.

Modify — Click this button to edit the expression of the current rule. This button is labeled Accept when you are in edit mode.

Accept — Click this button to save any changes you have made to the expression and leave the edit mode. This button is labeled Modify when you are not in edit mode.

Discard — Click this button to leave the edit mode and revert the current rule to the expression it had before you started the edit mode.

Patterns — Click this button to display some help on regular expression patterns.

Test using all rules — Set this option to test all the rules at the same time. The syntax of the current rule is checked automatically, and its effect on the sample text is displayed in the bottom-right result box. The parts of the text that match the expressions are displayed in <> brackets. If the Test using all rules option is set, the test takes all rules of the set into account; if it is not set, only the current rule is tested.

[Middle-right text box] — Optional test data to test the regular expression for the current rule or all rules depending on the Test using all rules option.

[Bottom-right text box] — Shows the result of the regular expression applied to the test data.
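
The behavior of the test panel can be approximated with a few lines of code. The sketch below is in Python and assumes that multiple rules are combined by alternation, which this page does not state explicitly; the function name preview_rules and the two sample rules are hypothetical.

```python
import re


def preview_rules(rules, current, sample, use_all_rules):
    """Return the sample text with every match wrapped in <> brackets.

    rules is the list of regular expressions, current is the index of the
    selected rule, and use_all_rules mirrors the "Test using all rules" option.
    """
    active = rules if use_all_rules else [rules[current]]
    # Assumption: rules are combined as alternatives of one expression.
    pattern = re.compile("|".join(f"(?:{rule})" for rule in active))
    return pattern.sub(lambda m: f"<{m.group(0)}>", sample)


rules = [r"%\w", r"\{\d+\}"]
print(preview_rules(rules, 0, "Got %s of {0}", use_all_rules=False))
print(preview_rules(rules, 0, "Got %s of {0}", use_all_rules=True))
```

With only the current rule active, just %s is bracketed; with all rules active, both %s and {0} are.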

Limitations

  • Support for the define statement is not implemented yet.
  • In array declarations, both string key and string value are extracted.