OpenXML Filter

Overview

This filter processes documents created by Microsoft Office 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

Parameters

The filter parameters are divided into General Options, which apply to all formats, and format-specific options.

General Options

Translate Document Properties
When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
Translate Comments
When checked, exposes document comments for translation. Default: on.
Clean Tags Aggressively
When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.

Word Options

Translate Headers and Footers
When checked, exposes header and footer content for translation. Default: on.
Translate Numbering Level Text
When checked, exposes numbering-level text for translation. Default: off.
Translate Hidden Text
When checked, exposes hidden text for translation. Default: on.
Exclude Graphical Metadata
When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
Ignored Styles > Ignore Font Colours
When checked, font colours will be ignored. Default: off.
If Clean Tags Aggressively and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
Ignored Styles > Font Colours Minimum Ignorance Threshold
When defined, font colours are ignored starting from the specified value. It can be empty (treated as white by default) or contain a preset colour value or an RGB hex string; for example, black, Black and 000000 all set the threshold to black. Default: none.
Ignored Styles > Font Colours Maximum Ignorance Threshold
When defined, font colours are ignored up to the specified value. It can be empty (treated as white by default) or contain a preset colour value or an RGB hex string; for example, white, White and FFFFFF all set the threshold to white. Default: none.
Excluded/Included Styles
Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
Excluded/Included Highlight Colors
Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
Excluded Font Colours
Text using any selected colours will not be exposed for translation. Default: none.

Excel Options

Translate Hidden Rows and Columns
When checked, hidden rows and columns are exposed for translation. Default: off.
Colors to Exclude
Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
Translate Cells Copied
When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
Worksheet Configurations
A list of configurations, each applying to the worksheets whose names match a pattern. A configuration can exclude rows and/or columns from translation, mark rows and/or columns as metadata, and copy source columns to target columns.
Each configuration can specify:
  • Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration apply to the matching worksheets. For the pattern syntax, please refer to java.util.regex.Pattern. E.g.: Sheet1.
  • Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: A,B.
  • Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: C,D.
  • Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: 1,2.
  • Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: A,B.
  • Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: 3,4.
  • Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: C,D.
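The column lists above use ALPHA-26 numbers, that is, spreadsheet-style column letters (A to Z, then AA, AB and so on). The following is a minimal sketch of how such names map to 1-based column indexes and back. The class ColumnNames and its methods are hypothetical helpers written for this illustration; they are not part of the Okapi API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Okapi.
final class ColumnNames {
    private ColumnNames() {}

    // Converts an ALPHA-26 column name such as "A" or "AB" to a 1-based index.
    static int toIndex(String name) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("Empty column name");
        }
        int index = 0;
        for (char c : name.toCharArray()) {
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Not an ALPHA-26 column name: " + name);
            }
            index = index * 26 + (c - 'A' + 1);
        }
        return index;
    }

    // Converts a 1-based column index back to its ALPHA-26 name.
    static String toName(int index) {
        if (index < 1) {
            throw new IllegalArgumentException("Column index must be positive: " + index);
        }
        StringBuilder sb = new StringBuilder();
        while (index > 0) {
            sb.append((char) ('A' + (index - 1) % 26));
            index = (index - 1) / 26;
        }
        return sb.reverse().toString();
    }

    // Parses a comma-separated list such as "A,B" into 1-based indexes.
    static List<Integer> parseList(String csv) {
        List<Integer> result = new ArrayList<>();
        for (String part : csv.split(",")) {
            result.add(toIndex(part.trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseList("A,B")); // prints [1, 2]
        System.out.println(toIndex("AA"));    // prints 27
        System.out.println(toName(28));       // prints AB
    }
}
```

With this mapping, the example value C,D for Target Columns refers to the third and fourth columns of the worksheet.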
Let's consider a simple table as an example and find out what can be done with all those configurations.
  | A                  | B                  | C                  | D
1 | Metadata Header A1 |                    | Metadata Header C1 |
2 | Metadata Header A2 | Metadata Header B2 | Metadata Header C2 | Metadata Header D2
3 | A3                 | B3                 | C3                 | Metadata D3
4 | A4                 | B4                 | C4                 | Metadata D4
5 | A5                 | B5                 | C5                 | Metadata D5
Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
This requirement can be configured in the following way (using the net.sf.okapi.common.ParametersString format as an example):
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
Then the XLIFF would look like this after extraction and translation:
<group id="P76C545-sg1" resname="Sheet1">
  <group id="P132303AB-sg1" resname="1">
  </group>
  <group id="P132303AB-sg2" resname="2">
  </group>
  <group id="P132303AB-sg3" resname="3">
    <trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
      <source xml:lang="en">A3</source>
      <target xml:lang="es">A3-tr</target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg4" resname="4">
    <trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
      <source xml:lang="en">A4</source>
      <target xml:lang="es">A4-tr</target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg5" resname="5">
    <trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
      <source xml:lang="en">A5</source>
      <target xml:lang="es">A5-tr</target>
    </trans-unit>
  </group>
</group>
And the merged representation would be the following:
  | A                  | B                  | C                  | D
1 | Metadata Header A1 |                    | Metadata Header C1 |
2 | Metadata Header A2 | Metadata Header B2 | Metadata Header C2 | Metadata Header D2
3 | A3                 | A3-tr              | C3                 | Metadata D3
4 | A4                 | A4-tr              | C4                 | Metadata D4
5 | A5                 | A5-tr              | C5                 | Metadata D5
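The effect of the source and target columns in this first example can be simulated on a plain in-memory grid. The sketch below is only an illustration of the observable result described above; it assumes nothing about how the filter is implemented, and the class SourceTargetColumnsDemo, its methods, and the "-tr" suffix standing in for a real translation are all hypothetical.

```java
import java.util.Set;

// Illustrative simulation only; the real filter works on the XLSX parts, not on arrays.
final class SourceTargetColumnsDemo {
    private SourceTargetColumnsDemo() {}

    // The sample worksheet from the text (rows 1 to 5, columns A to D).
    static String[][] sampleSheet() {
        return new String[][] {
            {"Metadata Header A1", "", "Metadata Header C1", ""},
            {"Metadata Header A2", "Metadata Header B2", "Metadata Header C2", "Metadata Header D2"},
            {"A3", "B3", "C3", "Metadata D3"},
            {"A4", "B4", "C4", "Metadata D4"},
            {"A5", "B5", "C5", "Metadata D5"},
        };
    }

    // For every row that is not excluded, overwrites the target cell with the
    // "translated" source cell. Column indexes are 0-based, row numbers 1-based.
    static String[][] copyAndTranslate(String[][] sheet, int sourceCol, int targetCol,
                                       Set<Integer> excludedRows) {
        String[][] merged = new String[sheet.length][];
        for (int r = 0; r < sheet.length; r++) {
            merged[r] = sheet[r].clone();
            if (excludedRows.contains(r + 1)) {
                continue; // excluded rows are left untouched
            }
            merged[r][targetCol] = sheet[r][sourceCol] + "-tr"; // stand-in for translation
        }
        return merged;
    }

    public static void main(String[] args) {
        // sourceColumns=A, targetColumns=B, excludedRows=1,2
        String[][] merged = copyAndTranslate(sampleSheet(), 0, 1, Set.of(1, 2));
        for (String[] row : merged) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Running the sketch leaves rows 1 and 2 unchanged and replaces B3, B4 and B5 with A3-tr, A4-tr and A5-tr, which matches the merged table above.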
Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in the columns, and we do not want to extract the 5th row.
All these requirements can be written as the following configurations:
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
Then, the extraction to XLIFF should look like this:
<group id="P76C545-sg1" resname="Sheet1">
  <group id="P132303AB-sg1" resname="1">
  </group>
  <group id="P132303AB-sg2" resname="2">
  </group>
  <group id="P132303AB-sg3" resname="3">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
    </context-group>
    <trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
      <source xml:lang="en">A3</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
      <source xml:lang="en">B3</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg4" resname="4">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
    </context-group>
    <trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
      <source xml:lang="en">A4</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
      <source xml:lang="en">B4</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg5" resname="5">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
    </context-group>
  </group>
</group>
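To make the interplay of the excluded and metadata settings in this second example easier to follow, here is a minimal sketch that reads the configuration lines and classifies individual cells. It rests on several assumptions that the text does not confirm: java.util.Properties is used as a stand-in parser for the key=value lines, the name pattern is applied with a full match, and metadata settings take precedence over exclusions (inferred from the fact that Metadata D5 still appears as context for the excluded 5th row). The class WorksheetConfigDemo is hypothetical and not part of Okapi.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of one worksheet configuration; not Okapi code.
final class WorksheetConfigDemo {
    enum CellKind { TRANSLATABLE, METADATA, EXCLUDED }

    final Pattern namePattern;
    final Set<String> excludedRows;
    final Set<String> excludedColumns;
    final Set<String> metadataRows;
    final Set<String> metadataColumns;

    private WorksheetConfigDemo(Properties p, int n) {
        String prefix = "worksheetConfigurations." + n + ".";
        namePattern = Pattern.compile(p.getProperty(prefix + "namePattern", ""));
        excludedRows = csv(p.getProperty(prefix + "excludedRows", ""));
        excludedColumns = csv(p.getProperty(prefix + "excludedColumns", ""));
        metadataRows = csv(p.getProperty(prefix + "metadataRows", ""));
        metadataColumns = csv(p.getProperty(prefix + "metadataColumns", ""));
    }

    // Reads the first configuration (index 0) from key=value lines.
    static WorksheetConfigDemo parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new WorksheetConfigDemo(p, 0);
    }

    private static Set<String> csv(String value) {
        Set<String> items = new HashSet<>();
        for (String item : value.split(",")) {
            if (!item.trim().isEmpty()) {
                items.add(item.trim());
            }
        }
        return items;
    }

    // Assumption: metadata wins over exclusion; non-matching sheets are unaffected.
    CellKind classify(String sheetName, String column, int row) {
        if (!namePattern.matcher(sheetName).matches()) {
            return CellKind.TRANSLATABLE;
        }
        String r = Integer.toString(row);
        if (metadataRows.contains(r) || metadataColumns.contains(column)) {
            return CellKind.METADATA;
        }
        if (excludedRows.contains(r) || excludedColumns.contains(column)) {
            return CellKind.EXCLUDED;
        }
        return CellKind.TRANSLATABLE;
    }

    public static void main(String[] args) {
        WorksheetConfigDemo config = parse(
            "worksheetConfigurations.number.i=1\n"
            + "worksheetConfigurations.0.namePattern=Sheet1\n"
            + "worksheetConfigurations.0.excludedRows=5\n"
            + "worksheetConfigurations.0.excludedColumns=C\n"
            + "worksheetConfigurations.0.metadataRows=1,2\n"
            + "worksheetConfigurations.0.metadataColumns=D\n");
        System.out.println(config.classify("Sheet1", "A", 3)); // prints TRANSLATABLE
        System.out.println(config.classify("Sheet1", "C", 3)); // prints EXCLUDED
        System.out.println(config.classify("Sheet1", "D", 5)); // prints METADATA
    }
}
```

Under these assumptions, A3, B3, A4 and B4 are the only translatable cells, which agrees with the four trans-unit elements in the XLIFF above.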

PowerPoint Options

Translate Document Properties
When checked and the same option is checked under the General Options (they will be separated after the next release), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
Reorder Document Properties
When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
Reorder Relationships
When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
Translate Diagram Data
When checked, the diagram data are exposed for translation. Default: on.
Reorder Diagram Data
When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
Translate Charts
When checked, the charts are exposed for translation. Default: on.
Reorder Charts
When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
Translate Notes
When checked, the slide notes are exposed for translation. Default: off.
Reorder Notes
When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
Translate Comments
When checked and the same option is checked under the General Options (they will be separated after the next release), the document comments are exposed for translation. Default: on.
Reorder Comments
When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
Translate Masters
When checked, exposes slide masters and notes masters for translation. This also exposes for translation the content of layouts that are in use by at least one slide. Default: on.
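Taken together, the Reorder options above describe a chain: a slide part is followed by its relationship part, then its diagram data, charts, notes and comments. The sketch below illustrates that resulting order for a single slide with all Reorder options checked. It is only a reading of the option descriptions, not the filter's implementation; the class PartOrderDemo is hypothetical, and the part names are typical PPTX package paths chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the described part order; not Okapi code.
final class PartOrderDemo {
    // Declared in the order implied by the Reorder option descriptions.
    enum Kind { SLIDE, RELATIONSHIPS, DIAGRAM_DATA, CHART, NOTES, COMMENTS }

    static final class Part {
        final String name;
        final Kind kind;

        Part(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    // Returns the part names of one slide, sorted by kind (stable sort).
    static List<String> reorder(List<Part> parts) {
        List<Part> sorted = new ArrayList<>(parts);
        sorted.sort(Comparator.comparing((Part p) -> p.kind));
        List<String> names = new ArrayList<>();
        for (Part p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Part> parts = new ArrayList<>();
        parts.add(new Part("ppt/comments/comment1.xml", Kind.COMMENTS));
        parts.add(new Part("ppt/charts/chart1.xml", Kind.CHART));
        parts.add(new Part("ppt/slides/slide1.xml", Kind.SLIDE));
        parts.add(new Part("ppt/notesSlides/notesSlide1.xml", Kind.NOTES));
        parts.add(new Part("ppt/diagrams/data1.xml", Kind.DIAGRAM_DATA));
        parts.add(new Part("ppt/slides/_rels/slide1.xml.rels", Kind.RELATIONSHIPS));
        for (String name : reorder(parts)) {
            System.out.println(name);
        }
    }
}
```

The point of such an order is that everything belonging to a slide is extracted next to it, which may give translators more context than the default package order.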

Limitations