Okapi Framework - Filters
PO Filter
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=PO_Filter
The PO Filter is an Okapi component that implements the IFilter interface for
Gettext PO (Portable Object) resource files, as well as POT (PO template) files. The filter is implemented in the class
net.sf.okapi.filters.po.POFilter of the Okapi library.
The implementation is based on the PO specifications found in the GNU gettext manual. There is also a useful representation guide for PO-to-XLIFF conversion available on the XLIFF TC pages.
The following is an example of a very simple PO file. The translatable text
is marked in bold. Note also the header information in the first entry
(the one with an empty msgid line), where encoding and plural
information may be found.
# PO file for myApp
msgid ""
msgstr ""
"Project-Id-Version: myApp 1.0.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2005-10-02 05:16+0200\n"
"PO-Revision-Date: 2005-03-21 11:28/-0600\n"
"Last-Translator: unknown <email@address>\n"
"Language-Team: unknown <email@address>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "diverging after version %d of %s"
msgstr ""

msgid "You have selected %d file for deletion"
msgid_plural "You have selected %d files for deletion"
msgstr[0] ""
msgstr[1] ""
The filter decides which encoding to use for the input file using logic that includes the following checks:
whether a charset declaration exists in the first 1000 characters of the file;
whether the declared value is "charset" (case insensitive).
If the file has a header entry with a charset declaration, the
declaration is automatically updated in the output to reflect the encoding
selected for the output.
If the output encoding is UTF-8:
Important: No language information is updated in the PO header entry. Any language-related information needs to be updated manually in the header entry of the generated PO file. Note that PO files are generally bilingual files with their language-related information already set.
The output uses the same type of line break as the original input.
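The charset check on the input file mentioned earlier in this section can be sketched as follows. This is a minimal Python illustration, not the filter's own code: the function name declared_charset, the default parameter, and the choice to fall back to the default when the value is the placeholder "charset" are all assumptions made for the example.

```python
import re

# Matches the charset value in a header line such as
# "Content-Type: text/plain; charset=UTF-8\n" (the \n is literal in the file).
CHARSET_RE = re.compile(r'charset=([^\s\\";]+)', re.IGNORECASE)


def declared_charset(po_text, default="UTF-8"):
    """Return the charset declared near the top of a PO file, or the default.

    Only the first 1000 characters are inspected. A value of "charset"
    (any case), as found in unedited templates, is treated as if nothing
    had been declared; that fallback is an assumption of this sketch.
    """
    match = CHARSET_RE.search(po_text[:1000])
    if match is None:
        return default
    value = match.group(1)
    if value.lower() == "charset":
        return default
    return value
```

A caller would pass the decoded start of the file and use the returned name to re-open the file with the proper encoding.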
The filter supports
plural form entries, on the assumption that they appear in sequential
order: msgstr[0] comes first, then msgstr[1],
and so on. All the msgstr strings of a given plural entry are processed
as part of a single group whose type value is set to "x-gettext-plurals".
If a resource name is generated, its value for plural form entries has an additional index indicator. For example, if you have the following plural form entries:
msgid "untranslated-singular"
msgid_plural "untranslated-plural"
msgstr[0] "translated-singular"
msgstr[1] "translated-plural-form1"
msgstr[2] "translated-plural-form2"
The extracted resources will allow you to construct an XLIFF output looking like this:
<group restype="x-gettext-plurals">
  <trans-unit id="1" resname="P3ADE34F0-0" xml:space="preserve" translate="no">
    <source xml:lang="en-US">untranslated-singular</source>
    <target xml:lang="fr-FR">translated-singular</target>
  </trans-unit>
  <trans-unit id="2" resname="P3ADE34F0-1" xml:space="preserve" translate="no">
    <source xml:lang="en-US">untranslated-plural</source>
    <target xml:lang="fr-FR">translated-plural-form1</target>
  </trans-unit>
  <trans-unit id="3" resname="P3ADE34F0-2" xml:space="preserve" translate="no">
    <source xml:lang="en-US">untranslated-plural</source>
    <target xml:lang="fr-FR">translated-plural-form2</target>
  </trans-unit>
</group>
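The index suffix and the sequential-order assumption described above can be illustrated with a short Python sketch. It is not the filter's implementation: the function name plural_resources is hypothetical, and the base name is simply passed in rather than computed, since the way the filter derives it is not shown here.

```python
import re

# One plural translation line, for example: msgstr[1] "some text"
MSGSTR_RE = re.compile(r'^msgstr\[(\d+)\]\s+"(.*)"$')


def plural_resources(lines, base_name):
    """Return (resname, target) pairs for the msgstr[n] lines of one entry.

    base_name is a stand-in for the generated resource name.
    Raises ValueError if the indices are not 0, 1, 2, ... in order.
    """
    resources = []
    for line in lines:
        match = MSGSTR_RE.match(line.strip())
        if match is None:
            continue  # msgid, msgid_plural, comments, and so on
        index = int(match.group(1))
        if index != len(resources):
            raise ValueError(f"msgstr[{index}] is out of sequence")
        resources.append((f"{base_name}-{index}", match.group(2)))
    return resources


entry = [
    'msgid "untranslated-singular"',
    'msgid_plural "untranslated-plural"',
    'msgstr[0] "translated-singular"',
    'msgstr[1] "translated-plural-form1"',
    'msgstr[2] "translated-plural-form2"',
]
for resname, target in plural_resources(entry, "P3ADE34F0"):
    print(resname, target)
```

Each pair corresponds to one trans-unit of the "x-gettext-plurals" group.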
The domains are supported as groups, with the type of the group set to "x-gettext-domain"
and the resource name set to the group identifier. For example, if you have the following entry:
domain TheDomain1

msgid "Text 1 in domain 'TheDomain1'"
msgstr "Texte 1 dans le domain 'TheDomain1'"
The extracted resources will allow you to construct an XLIFF output looking like this:
<group resname="TheDomain1" restype="x-gettext-domain">
  <trans-unit id="1" resname="N9D1999AB" xml:space="preserve">
    <source xml:lang="en-US">Text 1 in domain 'TheDomain1'</source>
    <target xml:lang="fr-FR">Texte 1 dans le domain 'TheDomain1'</target>
  </trans-unit>
</group>
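As a rough illustration of how a domain directive opens a group, the following Python sketch collects the msgid lines that follow each "domain" line. The function name group_by_domain and the dictionary layout are assumptions for the example, and entries that appear before any domain directive are simply skipped here.

```python
def group_by_domain(lines):
    """Group msgid values under the most recent 'domain NAME' directive.

    Returns a list of dictionaries, one per domain, in file order.
    Entries found before the first directive are ignored in this sketch.
    """
    groups = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("domain "):
            name = line[len("domain "):].strip().strip('"')
            current = {
                "resname": name,                # group identifier
                "restype": "x-gettext-domain",  # group type
                "msgids": [],
            }
            groups.append(current)
        elif line.startswith("msgid ") and current is not None:
            current["msgids"].append(line[len("msgid "):])
    return groups
```

Each returned dictionary mirrors the resname and restype attributes of one group element.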
PO files may have reference comments generated from the source code. They are
denoted by a leading "#:" marker.
The filter provides read-only access to them through the references resource-level property.
PO files may have comments generated from the source code. They correspond to
localization notes from the developers. They are denoted by a leading "#."
marker.
The filter provides read-only access to them through the note resource-level property.
PO files may have translator comments, denoted by a leading "# " marker (a hash sign followed by a space).
The filter provides read-only access to them through the transnote resource-level property.
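The mapping from the three comment markers to the references, note and transnote properties can be sketched as below. This is only a Python illustration: the function name comment_properties is hypothetical, and joining several comment lines of the same kind with a line break is an assumption.

```python
# Comment marker -> property name, as listed in the text above.
MARKERS = {"#:": "references", "#.": "note", "# ": "transnote"}


def comment_properties(lines):
    """Collect the comment text of one PO entry, keyed by property name."""
    collected = {}
    for line in lines:
        name = MARKERS.get(line[:2])
        if name is None:
            continue  # not one of the three comment kinds handled here
        collected.setdefault(name, []).append(line[2:].strip())
    # Several lines of the same kind are joined with a line break (assumption).
    return {name: "\n".join(values) for name, values in collected.items()}
```

Lines that start with any other marker, or with no marker at all, are left out of the result.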
PO files may have context comments, denoted by a leading "#|"
marker.
The filter does not do anything with them currently.
PO files may have context entry lines, denoted by a leading "msgctxt"
marker.
The filter does not do anything with them currently. Such entries are apparently rarely used.
PO files may have a "fuzzy flag", indicated as "#, fuzzy". This
indicates that the text in the msgstr entry is only a proposed
translation, and may or may not be correct for the source.
When in bilingual mode, the filter provides access to this flag through the approved target property:
If the flag exists, the approved property is set to "no".
If the flag does not exist, the approved property is not defined.
If the approved property is set to a value other than "no", the flag is removed from the
output document.
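The reading side of this mapping can be sketched in Python as follows. The sketch assumes that an entry carrying the fuzzy flag is exposed with an approved value of "no", and uses None to stand for "property not defined". The function name approved_from_flags is hypothetical.

```python
def approved_from_flags(lines):
    """Return "no" if the entry carries a fuzzy flag, otherwise None.

    None stands for "the approved property is not defined".
    """
    for line in lines:
        if line.startswith("#,"):
            # A flag line may list several flags, e.g. "#, fuzzy, c-format".
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                return "no"
    return None
```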
Bilingual mode -- Select this option to process the input
document as a bilingual file. In this mode the msgid entry is the
source text, and the msgstr is the translation. Most PO files are
bilingual documents.
Generate identifiers from the source text -- Set this option to
generate identifiers from the source text of the msgid entry. The
values are constructed from the hash code of the source text, possibly with a
domain prefix. Note that the value may not be unique if the source text is not
unique within the same domain. The generated value is accessible through the
getName() method of the extracted resource. This option is enabled only
in bilingual mode.
Monolingual mode -- Select this option to process the input
document as a monolingual file. In this mode the msgid entry is a
real identifier (rather than the source text), and the corresponding text is in
the msgstr entry.
In monolingual mode the msgid value is used as the identifier
for the entry. To access it, use the getName() method of the
extracted resource.
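To show why identifiers generated from the source text may not be unique, here is a small Python sketch of the idea behind the Generate identifiers from the source text option. The exact hash function and prefix layout used by the filter are not described here, so everything in the sketch is an assumption: it uses a 32-bit polynomial string hash as a stand-in, a hypothetical "N" marker, and a hypothetical "domain_" prefix.

```python
def string_hash(text):
    """32-bit polynomial string hash, used here only as a stand-in."""
    value = 0
    for char in text:
        value = (31 * value + ord(char)) & 0xFFFFFFFF
    return value


def make_identifier(source_text, domain=""):
    """Build an identifier from a hash of the source text.

    The "N" marker and the "domain_" prefix layout are hypothetical.
    """
    prefix = f"{domain}_" if domain else ""
    return f"{prefix}N{string_hash(source_text):08X}"


# Two entries with the same source text in the same domain get the same name.
print(make_identifier("Open") == make_identifier("Open"))
# The same text in two different domains gets two different names.
print(make_identifier("Open", "menu") == make_identifier("Open", "dialog"))
```

Because the value depends only on the text and the domain, repeated source text within a domain necessarily produces repeated identifiers.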
Has inline codes as defined below: -- Set this option to apply the specified regular expression to the text of the extracted items. Any match is converted to an inline code. By default the expression is:
((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)
|(\{\d.*?\}))
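The effect of the default expression can be tried outside the filter. The sketch below uses Python's re module, on the assumption that the constructs in this expression behave the same way there as in the regular expression engine used by the filter. The sample string and the function name find_inline_codes are made up for the example.

```python
import re

# The default expression from the text, joined into a single pattern.
DEFAULT_INLINE_CODES = (
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)"
    r"|(\{\d.*?\}))"
)


def find_inline_codes(text):
    """Return every substring the default expression would turn into a code."""
    return [match.group(0) for match in re.finditer(DEFAULT_INLINE_CODES, text)]


# A raw string, so that \n is a backslash followed by "n", as in a PO file.
sample = r"diverging after version %d of %s\n{0} files, %1$s left"
print(find_inline_codes(sample))
```

The three alternatives cover printf-style placeholders, escaped control sequences written as text, and brace-delimited numbered placeholders.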
Add -- Click this button to add a new rule.
Remove -- Click this button to remove the current rule.
Move Up -- Click this button to move the current rule upward.
Move Down -- Click this button to move the current rule downward.
[Top-right text box] -- Enter the regular expression for the current rule. Use the Modify button to enter edit mode. The expression must be a valid regular expression. Its syntax and effect are checked automatically: the expression is tested against the test data in the text box below, and the result is shown in the bottom-right text box.
Modify -- Click this button to edit the expression of the current rule. This button is labeled Accept when you are in edit mode.
Accept -- Click this button to save any changes you have made to the expression and leave the edit mode. This button is labeled Modify when you are not in edit mode.
Discard -- Click this button to leave the edit mode and revert the current rule to the expression it had before you started the edit mode.
Patterns -- Click this button to display a list of "guideline" regular expression patterns, then select a pattern to insert it in the edit box. The inserted text replaces whatever text is currently selected.
Test using all rules -- Set this option to test all the rules at the same time. The syntax of the current rule is checked automatically, and its effect on the sample text is shown in the bottom-right result box, where the parts of the text that match the expressions are displayed in <> brackets. If the Test using all rules option is set, the test takes all the rules of the set into account; if it is not set, only the current rule is tested.
[Middle-right text box] -- Optional test data to test the regular expression for the current rule or all rules depending on the Test using all rules option.
[Bottom-right text box] -- Shows the result of the regular expression applied to the test data.
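The behavior of the result box can be approximated with a few lines of Python. This is a sketch under assumptions, not the editor's code: the function name mark_matches and its parameters are hypothetical, and combining all rules by alternation is a guess at how "all rules" are applied together.

```python
import re


def mark_matches(text, rules, current=0, use_all_rules=False):
    """Wrap each match in <> brackets, as described for the result box.

    rules is a list of regular expression strings. With use_all_rules set,
    the rules are combined by alternation (an assumption of this sketch);
    otherwise only rules[current] is applied.
    """
    if use_all_rules:
        pattern = "|".join(f"(?:{rule})" for rule in rules)
    else:
        pattern = rules[current]
    return re.sub(pattern, lambda match: f"<{match.group(0)}>", text)


rules = [r"%[ds]", r"\{\d+\}"]
print(mark_matches("Copy %d of {0}", rules))
print(mark_matches("Copy %d of {0}", rules, use_all_rules=True))
```

An invalid expression raises re.error when compiled, which plays the role of the syntax check in this sketch.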