OpenXML Filter
Overview
This filter allows you to process the document types of Microsoft Office 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.
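As a rough illustration of what "based on the OpenXML format" means in practice: an OpenXML document is a ZIP package whose main part conventionally lives at a well-known path. The sketch below guesses the document kind from those conventional paths. The function name `guess_office_kind` and the `MAIN_PARTS` table are illustrative assumptions and have nothing to do with how the filter itself detects formats; a strict reader would follow the package relationships instead of relying on part names.

```python
import io
import zipfile

# Conventional locations of the main part in each package type (assumption:
# the producer used the default part names, which Office itself does).
MAIN_PARTS = {
    "word/document.xml": "DOCX",
    "xl/workbook.xml": "XLSX",
    "ppt/presentation.xml": "PPTX",
}

def guess_office_kind(file_obj):
    """Return "DOCX", "XLSX" or "PPTX" for an OpenXML package, else None."""
    if not zipfile.is_zipfile(file_obj):
        return None
    with zipfile.ZipFile(file_obj) as package:
        names = set(package.namelist())
    for part, kind in MAIN_PARTS.items():
        if part in names:
            return kind
    return None

# Build a tiny stand-in package in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<document/>")

print(guess_office_kind(buffer))
```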
Parameters
The filter parameters are divided into General Options, which apply to all formats, and format-specific options.
General Options
- Translate Document Properties
- When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
- Translate Comments
- When checked, exposes document comments for translation. Default: on.
- Clean Tags Aggressively
- When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
Word Options
- Translate Headers and Footers
- When checked, exposes header and footer content for translation. Default: on.
- Translate Hidden Text
- When checked, exposes hidden text for translation. Default: on.
- Exclude Graphical Metadata
- When not checked, labels associated with drawings and WordArt are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
- Styles to Exclude
- Text using any of the selected styles will not be exposed for translation. Default: none.
Excel Options
- Translate Hidden Rows and Columns
- When checked, hidden rows and columns are exposed for translation. Default: off.
- Colors to Exclude
- Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
- Worksheet Configurations
- A list of configurations, each of which applies to the worksheets whose name matches a pattern. A configuration can exclude rows and/or columns from translation and can mark rows and/or columns as metadata.
- Each configuration can specify:
- Name Pattern - a regular expression matched against the worksheet name; all other settings of the configuration apply to the matching worksheets. For the syntax, please refer to java.util.regex.Pattern. E.g.: Sheet1.
- Excluded Rows - a list of integers giving the row numbers that are excluded from translation/extraction. E.g.: 1,2.
- Excluded Columns - a list of ALPHA-26 numbers giving the columns that are excluded from translation/extraction. E.g.: A,B.
- Metadata Rows - a list of integers giving the row numbers that are treated and extracted as metadata. E.g.: 3,4.
- Metadata Columns - a list of ALPHA-26 numbers giving the columns that are treated and extracted as metadata. E.g.: C,D.
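To make the two notations above concrete, here is a minimal Python sketch of how an ALPHA-26 column name maps to a column number and how a name pattern can be tested against a sheet name. The helper names `column_to_number` and `sheet_matches` are hypothetical, and the sketch uses Python's `re` module rather than java.util.regex.Pattern, so the two may differ for advanced syntax. Whether the filter requires the whole sheet name to match is also an assumption here.

```python
import re

def column_to_number(letters: str) -> int:
    """Convert an ALPHA-26 column name (A, B, ..., Z, AA, ...) to a 1-based number."""
    if not letters or not letters.isascii() or not letters.isalpha():
        raise ValueError(f"not a column name: {letters!r}")
    number = 0
    for ch in letters.upper():
        # Each letter is one base-26 digit, with A=1 ... Z=26.
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number

def sheet_matches(name_pattern: str, sheet_name: str) -> bool:
    """Test a sheet name against a name pattern.

    Assumption: the whole name must match, not just a substring.
    """
    return re.fullmatch(name_pattern, sheet_name) is not None

print(column_to_number("A"), column_to_number("Z"), column_to_number("AA"))
print(sheet_matches("Sheet1", "Sheet1"), sheet_matches("Sheet1", "Sheet10"))
print(sheet_matches(r"Sheet\d+", "Sheet10"))
```

Under the whole-name assumption, a pattern such as `Sheet1` selects only that sheet, while `Sheet\d+` selects every sheet with a default numbered name.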
- Let's consider a simple table as an example and find out what can be done with all those configurations.
| Metadata Header A1 | | Metadata Header C1 | |
|---|---|---|---|
| Metadata Header A2 | Metadata Header B2 | Metadata Header C2 | Metadata Header D2 |
| A3 | B3 | C3 | Metadata D3 |
| A4 | B4 | C4 | Metadata D4 |
| A5 | B5 | C5 | Metadata D5 |
- Let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in the same row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata columns. Finally, we do not want to extract the 5th row.
- All these requirements can be written as the following configuration (shown here in the net.sf.okapi.common.ParametersString format):
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
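To check a configuration like this against the example table before running the filter, a small script can help. The Python sketch below reads the key/value lines into one dictionary per configuration and reports which lists mention a given cell. It is not the Okapi ParametersString parser: `parse_worksheet_configs` and `cell_flags` are hypothetical helpers, and the sketch deliberately does not decide which rule wins when a cell falls under several lists.

```python
PARAMS = """\
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
"""

def parse_worksheet_configs(text):
    """Collect worksheetConfigurations.<n>.<field> entries into one dict per index."""
    configs = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        parts = key.split(".")
        if not sep or len(parts) != 3 or parts[0] != "worksheetConfigurations":
            continue
        index, field = parts[1], parts[2]
        if not index.isdigit():
            continue  # skips the "number.i" counter line
        config = configs.setdefault(int(index), {})
        if field == "namePattern":
            config[field] = value
        else:
            # Row and column lists are comma-separated; keep items as strings.
            config[field] = [item.strip() for item in value.split(",") if item.strip()]
    return [configs[i] for i in sorted(configs)]

def cell_flags(config, column, row):
    """Return the set of lists in the configuration that mention the cell."""
    flags = set()
    if str(row) in config.get("excludedRows", []):
        flags.add("excluded row")
    if column.upper() in config.get("excludedColumns", []):
        flags.add("excluded column")
    if str(row) in config.get("metadataRows", []):
        flags.add("metadata row")
    if column.upper() in config.get("metadataColumns", []):
        flags.add("metadata column")
    return flags

config = parse_worksheet_configs(PARAMS)[0]
for column, row in [("A", 3), ("C", 3), ("D", 3), ("A", 1), ("D", 5)]:
    print(f"{column}{row}:", sorted(cell_flags(config, column, row)) or "translatable")
```

A cell with no flags is a candidate for translation. A cell such as D5 falls under two lists at once, which is why the sketch returns a set instead of a single verdict.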
- The extraction to XLIFF should then look like this:
<group id="P76C545-sg1" resname="Sheet1">
  <group id="P132303AB-sg1" resname="1">
  </group>
  <group id="P132303AB-sg2" resname="2">
  </group>
  <group id="P132303AB-sg3" resname="3">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
    </context-group>
    <trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
      <source xml:lang="en">A3</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
      <source xml:lang="en">B3</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg4" resname="4">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
    </context-group>
    <trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
      <source xml:lang="en">A4</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
      <source xml:lang="en">B4</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
  <group id="P132303AB-sg5" resname="5">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
    </context-group>
  </group>
</group>
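A downstream tool can pair each translatable cell with the metadata of its row by walking this structure. The Python sketch below does so for a shortened copy of the sample, using only the standard library. The function name `units_with_row_metadata` is hypothetical. The sketch also assumes the elements carry no namespace, as in the fragment shown; in a complete XLIFF 1.2 file the elements are in the XLIFF namespace, so the paths would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the extraction shown above (row 3 only).
SAMPLE = """\
<group id="P76C545-sg1" resname="Sheet1">
  <group id="P132303AB-sg3" resname="3">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
    </context-group>
    <trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
      <source xml:lang="en">A3</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
      <source xml:lang="en">B3</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
</group>
"""

def units_with_row_metadata(xml_text):
    """Return (resname, source text, row metadata values) for each trans-unit."""
    sheet_group = ET.fromstring(xml_text)
    result = []
    for row_group in sheet_group.findall("group"):
        # All <context> values of the row's "row-metadata" context group.
        metadata = [
            context.text or ""
            for context in row_group.findall("context-group[@name='row-metadata']/context")
        ]
        for unit in row_group.findall("trans-unit"):
            source = unit.findtext("source", default="")
            result.append((unit.get("resname"), source, metadata))
    return result

for resname, source, metadata in units_with_row_metadata(SAMPLE):
    print(resname, source, metadata)
```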
- Exclude Marked Columns in Each Sheet (deprecated and may be removed at any time; please use the worksheet configurations instead)
- When checked, columns selected in the "Sheet # Columns to Exclude" lists will be excluded from translation. The filter allows for sheets 1 and 2 to be configured individually. Sheets 3 and higher must be configured as a single group. Default: off.
PowerPoint Options
- Translate Notes
- When checked, exposes slide notes for translation. Default: off.
- Translate Masters
- When checked, exposes master slides for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: off.
Limitations
- Various, see the issues list.