PO Filter

From Okapi Framework

Overview

The PO Filter is an Okapi component that implements the IFilter interface for Gettext PO (Portable Object) resource files, as well as POT (PO template) files.

The implementation is based on the PO specifications found in the GNU gettext manual. There is also a useful representation guide for PO-to-XLIFF conversion available on the XLIFF TC pages.

The following is an example of a very simple PO file. The translatable text is the quoted content of the msgid and msgid_plural lines. Note also the header information in the first entry (the one with an empty msgid line), where encoding and plural information may be found.

# PO file for myApp

msgid ""
msgstr ""
"Project-Id-Version: myApp 1.0.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2005-10-02 05:16+0200\n"
"PO-Revision-Date: 2005-03-21 11:28/-0600\n"
"Last-Translator: unknown <email@address>\n"
"Language-Team: unknown <email@address>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "diverging after version %d of %s"
msgstr ""

msgid "You have selected %d file for deletion"
msgid_plural "You have selected %d files for deletion"
msgstr[0] ""
msgstr[1] ""

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

  • If the file has a Unicode Byte-Order-Mark:
    • Then, the corresponding encoding (e.g. UTF-8 or UTF-16) is used.
  • Else, if a header entry with a charset declaration exists in the first 1000 characters of the file:
    • If the value of the charset is "charset" (case insensitive):
      • Then the file is likely to be a template with no encoding declared, so the current encoding (auto-detected or default) is used.
      • Else, the declared encoding is used. Note that if the encoding has been detected from a Byte-Order-Mark and the encoding declared in the header entry does not match, a warning is generated and the encoding of the Byte-Order-Mark is used.
  • Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
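The decision logic above can be sketched as follows. This is only an illustration, not the filter's actual code: the function name pick_input_encoding, the charset regular expression, and the choice to scan the raw bytes as Latin-1 are all assumptions made for the example. The warning issued when a Byte-Order-Mark and the header declaration disagree is not modeled.

```python
import codecs
import re

# Known BOMs, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical pattern for the charset value of the header entry.
_CHARSET = re.compile(r"charset=([\w.:-]+)", re.IGNORECASE)


def pick_input_encoding(data: bytes, default: str) -> str:
    """Return the encoding to use for a PO file given its raw bytes."""
    # 1. A Byte-Order-Mark wins.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Look for a charset declaration near the start of the file.
    #    Latin-1 maps every byte to one character, so this never fails.
    head = data.decode("latin-1")[:1000]
    match = _CHARSET.search(head)
    if match and match.group(1).lower() != "charset":
        return match.group(1)
    # 3. Template placeholder ("charset=CHARSET") or no declaration.
    return default
```

For example, a template whose header still reads charset=CHARSET falls through to the default encoding, just like a file with no header entry at all.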

Output Encoding

If the file has a header entry with a charset declaration, the declaration is automatically updated in the output to reflect the encoding selected for the output.

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
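The two rules above reduce to a single condition. A minimal sketch, assuming the output encoding has already been determined to be UTF-8 (output_needs_bom is a hypothetical helper name, not part of the Okapi API):

```python
import codecs


def output_needs_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document gets a Byte-Order-Mark."""
    # codecs.lookup normalizes aliases such as "UTF8" to "utf-8".
    input_is_utf8 = codecs.lookup(input_encoding).name == "utf-8"
    return input_is_utf8 and input_had_bom
```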

Output Language

Important: No language information is updated in the PO header entry. Any language-related information needs to be updated manually in the header entry of the generated PO file. Note that PO files are generally bilingual files with their language-related information already set.

Line-Breaks

The output uses the same type of line-break as the original input.

Plural Forms

The filter supports plural form entries with the assumption that they are in sequential order. That is, msgstr[0] comes first, then msgstr[1], etc. All the msgstr strings of a given plural entry are processed as part of a single group that has its type value set to "x-gettext-plurals".

If a resource name is generated, its value for plural form entries has an additional index indicator. For example, if you have the following plural form entries:

msgid "untranslated-singular"
msgid_plural "untranslated-plural"
msgstr[0] "translated-singular"
msgstr[1] "translated-plural-form1"
msgstr[2] "translated-plural-form2"

The extracted resources will allow you to construct an XLIFF output looking like this:

<group restype="x-gettext-plurals">
 <trans-unit id="1" resname="P3ADE34F0-0" xml:space="preserve">
  <source xml:lang="en-US">untranslated-singular</source>
  <target xml:lang="fr-FR">translated-singular</target>
 </trans-unit>
 <trans-unit id="2" resname="P3ADE34F0-1" xml:space="preserve">
  <source xml:lang="en-US">untranslated-plural</source>
  <target xml:lang="fr-FR">translated-plural-form1</target>
 </trans-unit>
 <trans-unit id="3" resname="P3ADE34F0-2" xml:space="preserve">
  <source xml:lang="en-US">untranslated-plural</source>
  <target xml:lang="fr-FR">translated-plural-form2</target>
 </trans-unit>
</group>
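The sequential-order assumption and the index suffix on resource names can be illustrated with a short sketch. It is not the filter's code: plural_group and the dictionary layout are hypothetical, and the base name is simply passed in rather than computed.

```python
import re

# Matches lines such as: msgstr[1] "some text"
_MSGSTR_N = re.compile(r'^msgstr\[(\d+)\]\s+"(.*)"$')


def plural_group(lines, base_name):
    """Collect the msgstr[n] lines of one plural entry into a group."""
    units = []
    for line in lines:
        match = _MSGSTR_N.match(line.strip())
        if not match:
            continue  # msgid, msgid_plural, comments, etc.
        index = int(match.group(1))
        # The forms are expected in order: 0, 1, 2, ...
        if index != len(units):
            raise ValueError(f"msgstr[{index}] is out of sequence")
        # The resource name gets the form index as a suffix.
        units.append((f"{base_name}-{index}", match.group(2)))
    return {"type": "x-gettext-plurals", "units": units}
```

Called with the five lines of the entry above and the base name "P3ADE34F0", the sketch yields the three names P3ADE34F0-0, P3ADE34F0-1 and P3ADE34F0-2.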

Domains

The domains are supported as groups, with the type of the group set to "x-gettext-domain" and the resource name set to the group identifier. For example, if you have the following entry:

domain TheDomain1
msgid "Text 1 in domain 'TheDomain1'"
msgstr "Texte 1 dans le domain 'TheDomain1'"

The extracted resources will allow you to construct an XLIFF output looking like this:

<group resname="TheDomain1" restype="x-gettext-domain">
 <trans-unit id="1" resname="N9D1999AB" xml:space="preserve">
  <source xml:lang="en-US">Text 1 in domain 'TheDomain1'</source>
  <target xml:lang="fr-FR">Texte 1 dans le domain 'TheDomain1'</target>
 </trans-unit>
</group>

References

PO files may have reference comments generated from the source code. They are denoted by a leading "#:" marker.

The filter provides read-only access to them through the references resource-level property.

Extracted Comments

PO files may have comments generated from the source code. They correspond to localization notes from the developers. They are denoted by a leading "#." marker.

The filter provides read-only access to them through the note resource-level property.

Translators Comments

PO files may have translator comments, denoted by a leading "# " marker.

The filter provides read-only access to them through the transnote resource-level property.

Context Comments

PO files may have context comments, denoted by a leading "#|" marker.

The filter currently does not do anything with them.
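The four kinds of comment lines described so far can be summarized in a small lookup. The sketch below is only an illustration of the marker-to-property mapping; comment_property is a hypothetical helper, and a result of None stands for "not exposed as a property".

```python
# Marker of the comment line -> resource-level property name.
# "#|" lines are recognized but not exposed through any property.
_COMMENT_MARKERS = [
    ("#:", "references"),
    ("#.", "note"),
    ("#|", None),
    ("# ", "transnote"),
]


def comment_property(line):
    """Return the property name a PO comment line maps to, or None."""
    for marker, prop in _COMMENT_MARKERS:
        if line.startswith(marker):
            return prop
    return None  # not one of the comment kinds listed above
```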

Context Lines

PO files may have context entry lines, denoted by a leading "msgctxt" marker.

The filter provides read-only access to them through the context resource-level property.

It is also used as a differentiator for the identifiers generated by the option Generate identifiers from the source text, so two entries with the same source text but different context content will have different identifiers.

In addition, if the context is in the format generated by the Okapi tools and a valid id is specified, the text unit id of the given entry is set to the id provided in the context. For example, with the following entry:

msgctxt "okpCtx:tu=123"
msgid "Source"
msgstr "Target"

The extracted text unit will have an id value of 123.
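A minimal sketch of how such a context value could be read. It handles only the exact form shown in the example above; any other layout of Okapi-generated context strings is not covered, and the function name is hypothetical.

```python
import re

# Only the form shown in this page: okpCtx:tu=<id>
_OKAPI_CONTEXT = re.compile(r"^okpCtx:tu=(\S+)$")


def text_unit_id(context, fallback_id):
    """Return the id carried by an Okapi context line, else the fallback."""
    match = _OKAPI_CONTEXT.match(context or "")
    return match.group(1) if match else fallback_id
```

With the entry above, text_unit_id("okpCtx:tu=123", "1") gives "123", while an ordinary context such as "File menu" leaves the fallback id unchanged.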

Fuzzy Flag

PO files may have a "fuzzy flag" indicated as "#, fuzzy". This indicates that the text in the msgstr entry is only a proposed translation, and may or may not be correct for the source.

In bilingual mode, the filter provides access to this flag through the approved target property:

  • If the flag is set, the approved target property is set to "no". If the flag does not exist, the approved target property is not defined.
  • If the approved target property was defined by the filter, and is deleted or set to a value other than "no", the flag is removed from the output document.
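The round trip between the flag and the property can be sketched as below. The helper names are hypothetical and None is used to represent an undefined property; the real filter works on Okapi property objects rather than plain strings.

```python
def parse_flags(line):
    """Return the set of flags on a '#,' line (empty for other lines)."""
    if not line.startswith("#,"):
        return set()
    return {flag.strip() for flag in line[2:].split(",") if flag.strip()}


def approved_on_read(flags):
    """Value of the approved property after reading: 'no' or undefined."""
    return "no" if "fuzzy" in flags else None


def keep_fuzzy_on_write(flags, approved):
    """The fuzzy flag survives only if the property still says 'no'."""
    return "fuzzy" in flags and approved == "no"
```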

Other Flags

The following flags are currently ignored: No-Wrap Flag, X-Format Flag, No-X-Format Flag.

Lines Without Quotes

Any quoted entry found without quotes in its first line is re-written with an initial empty string. For example:

msgid
"source text"
msgstr
"target text"

is re-written:

msgid ""
"source text"
msgstr ""
"target text"
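The rewriting shown above amounts to appending an empty string to any keyword that stands alone on its line. A sketch under that reading; the list of keywords is an assumption, since only msgid and msgstr appear in the example:

```python
import re

# A keyword alone on its line, with no quoted string after it.
_BARE_KEYWORD = re.compile(r"^(msgid|msgid_plural|msgstr(\[\d+\])?|msgctxt)\s*$")


def add_missing_quotes(lines):
    """Append an empty string to keyword lines that have no quotes."""
    return [
        line.rstrip() + ' ""' if _BARE_KEYWORD.match(line) else line
        for line in lines
    ]
```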

Parameters

Options Tab

Protect approved entries — Set this option to mark as non-translatable all entries that are neither empty nor fuzzy.

Processing mode

Bilingual mode — Select this option to process the input document as a bilingual file. In this mode the msgid entry is the source text, and the msgstr entry is the translation. This is the standard way PO files are used.

Generate identifiers from the source text — Set this option to generate identifiers from the source text of the msgid entry. The values are constructed from the hash code of the source text, possibly with a domain prefix, and possibly with the text of the msgctxt entry. Note that the value may not be unique if the source text is not unique within the same domain, or has no distinct context values. The id value is set as the name of the extracted resource. This option is enabled only in bilingual mode.
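The uniqueness caveat is easier to see with a toy example. The sketch below does not reproduce the filter's hashing: CRC-32, the separator character, and the underscore after the domain are all stand-ins chosen for illustration, and make_resource_name is a hypothetical name.

```python
import zlib


def make_resource_name(source, domain=None, context=None):
    """Build an illustrative identifier from source, context and domain."""
    # Arbitrary separator so that context and source cannot run together.
    key = source if context is None else context + "\x04" + source
    digest = format(zlib.crc32(key.encode("utf-8")), "08X")
    return digest if domain is None else f"{domain}_{digest}"
```

Two entries with the source text "Open" and no context get the same name, whereas giving them the contexts "menu" and "button" makes the names differ.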

Monolingual mode — Select this option to process the input document as a monolingual file. In this mode the msgid entry is a real identifier (rather than the source text), and the corresponding text is in the msgstr entry.

In this mode the msgid value is used as the identifier for the entry. The value is set as the name of the extracted resource.

Inline Codes Tab

Has inline codes as defined below — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code. By default the expression is:

((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)
|(\{\d.*?\}))
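To see what the default expression catches, it can be tried outside the filter. The sketch below joins the three lines of the expression into one pattern and wraps each match in <> brackets. It uses Python's re module, which is an assumption of this example: the filter itself uses Java regular expressions, and the two engines are only expected to agree on simple cases like these.

```python
import re

# The default expression, split over three lines as shown above.
DEFAULT_CODES = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)"
    r"|(\{\d.*?\}))"
)


def mark_codes(text):
    """Wrap every match of the default expression in <> brackets."""
    return DEFAULT_CODES.sub(lambda m: "<" + m.group(0) + ">", text)


# printf-style placeholders from the sample PO file
print(mark_codes("diverging after version %d of %s"))
# escaped line-break (backslash + n) and a {0}-style placeholder
print(mark_codes(r"Line 1\nFile {0}"))
```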

Add — Click this button to add a new rule.

Remove — Click this button to remove the current rule.

Move Up — Click this button to move the current rule upward.

Move Down — Click this button to move the current rule downward.

[Top-right text box] — Enter the regular expression for the current rule. Use the Modify button to enter the edit mode. The expression must be a valid regular expression. Its syntax and effect are checked automatically: the expression is tested against the test data in the text box below, and the result is shown in the bottom-right text box.

Modify — Click this button to edit the expression of the current rule. This button is labeled Accept when you are in edit mode.

Accept — Click this button to save any changes you have made to the expression and leave the edit mode. This button is labeled Modify when you are not in edit mode.

Discard — Click this button to leave the edit mode and revert the current rule to the expression it had before you started the edit mode.

Patterns — Click this button to display some help on regular expression patterns.

Test using all rules — Set this option to test all the rules at the same time. The syntax of the current rule is automatically checked, and you can see the effect it has on the sample text. The results of the test are displayed in the bottom-right result box. The parts of the text that match the expressions are displayed in <> brackets. If the Test using all rules option is set, the test takes all rules of the set into account; if it is not set, only the current rule is tested.

[Middle-right text box] — Optional test data to test the regular expression for the current rule or all rules depending on the Test using all rules option.

[Bottom-right text box] — Shows the result of the regular expression applied to the test data.

Limitations

None known.