OpenXML Filter
Overview
This filter allows you to process the document types of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.
Parameters
The filter parameters are divided into General Options, which apply to all formats, and format-specific options.
General Options
- Translate Document Properties
- When checked, it exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
- Translate Comments
- When checked, it exposes document comments for translation. Default: on.
- Clean Tags Aggressively
- When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
- Ignore Whitespace Styles
- When checked together with "Clean Tags Aggressively", the whitespace character styles (formatting) are ignored and considered equal to the subsequent ones. Default: off.
- Preserve ASCII and HighAnsi Font Categories On Detection
- When checked, the mentioned run font categories are preserved on the merge of consecutive runs. Default: off.
- Remove Embedded Excel Package
- When checked and either cached chart strings or numbers are also set for extraction, the embedded Excel package is removed, and any references to it in chart parts and related relationships are removed as well. Default: off.
Word Options
- Translate Headers and Footers
- When checked, exposes header and footer content for translation. Default: on.
- Translate Numbering Level Text
- When checked, exposes numbering-level text for translation. Default: off.
- Translate Hidden Text
- When checked, exposes hidden text for translation. Default: on.
- Translate Graphic Name
- When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
- Translate Graphic Description
- When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
- Ignored Styles > Ignore Font Colours
- When checked, font colours will be ignored. Default: off.
- If Clean Tags Aggressively and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
- Ignored Styles > Font Colours Minimum Ignorance Threshold
- When defined, font colours are ignored starting from the specified value. It can be empty (considered as a white colour by default) or contain a preset colour value or an RGB hex string. For example, black, Black and 000000 all set the threshold to black. Default: none.
- Ignored Styles > Font Colours Maximum Ignorance Threshold
- When defined, font colours are ignored up to the specified value. It can be empty (considered as a white colour by default) or contain a preset colour value or an RGB hex string. For example, white, White and FFFFFF all set the threshold to white. Default: none.
- Excluded/Included Styles
- Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
- Excluded/Included Highlight Colors
- Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
- If the switch is set to "Include", only text in the specified colors will be extracted for translation.
- If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.
Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.
Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.
Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.
- Excluded Font Colours
- Text using any selected colours will not be exposed for translation. Default: none.
- Allow Style Optimisation
- When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.
Excel Options
- Translate Hidden Rows and Columns
- When checked, hidden rows and columns are exposed for translation. Default: off.
- Colors to Exclude
- Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
- The named colors available in the UI correspond to the standard color palette of Excel 2010.
- The configuration itself also supports colors specified as RGB in the format RRGGBB, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the tsExcelExcludedColors section of the configuration file like this:
tsExcelExcludedColors.i=1
ccc0=69b3e7
- Translate Cells Copied
- When a single string appears in more than one cell in an Excel spreadsheet, Excel often stores only a single copy of the string and references it from multiple locations. This creates localization problems, as the same string may require different translations in different contexts. When this option is enabled, the OpenXML filter will extract repeated strings once for each cell that they appear in, allowing for independent translations as needed. For certain documents with huge numbers of repeated cells, this may be undesirable. Default: on.
- Preserve Styles In Target Columns
- When checked, the cell styles in target columns are preserved. Default: off.
- Extract Source And Target Columns Joined
- When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
- Extract Worksheets Explicitly Specified
- When checked, only worksheets that match their names in the Worksheet Configurations are exposed for extraction. Default: off.
- Extract Cells Explicitly Specified
- When checked, only cells specified in the Worksheet Configurations are exposed for extraction. The explicitly mentioned source and target columns are eligible for such handling. Default: off.
- Worksheet Configurations
- The list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
- For one configuration, it is possible to specify:
- Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options, please refer to java.util.regex.Pattern. E.g.: Sheet1.
- Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: A,B.
- Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: C,D.
- Target Columns Max Characters - See Configuring Size Constraints.
- Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: 1,2.
- Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: A,B.
- Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: 3,4.
- Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: C,D.
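The column lists above use ALPHA-26 notation (A, B, ..., Z, AA, AB, ...). As a rough illustration of how such values map to column positions, here is a minimal Python sketch. The function names `column_to_index` and `parse_column_list` are hypothetical helpers written for this page and are not part of Okapi.

```python
def column_to_index(letters: str) -> int:
    """Convert an ALPHA-26 column name such as "A" or "AB" to a 1-based index."""
    if not letters or not letters.isascii() or not letters.isalpha():
        raise ValueError(f"not a column name: {letters!r}")
    index = 0
    for ch in letters.upper():
        # Each letter is a base-26 digit where A=1 ... Z=26 (there is no zero digit).
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_column_list(value: str) -> list[int]:
    """Parse a comma-separated column list such as "A,B" into 1-based indexes."""
    return [column_to_index(part.strip()) for part in value.split(",") if part.strip()]


print(parse_column_list("C,D"))  # column C is the 3rd column, D the 4th
```

Under this scheme Z is column 26 and AA is column 27, which is the same convention Excel uses for its column headers.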
Example of Worksheet Configuration
- Let's consider a simple table as an example and find out what can be done with all those configurations.
| Metadata Header A1 | | Metadata Header C1 | |
|---|---|---|---|
| Metadata Header A2 | Metadata Header B2 | Metadata Header C2 | Metadata Header D2 |
| A3 | B3 | C3 | Metadata D3 |
| A4 | B4 | C4 | Metadata D4 |
| A5 | B5 | C5 | Metadata D5 |
- Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
- This requirement can be configured in the following way (using the net.sf.okapi.common.ParametersString format as an example):
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
- Then the XLIFF would look like this after extraction and translation:
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
- And the merged representation would be the following:
| Metadata Header A1 | | Metadata Header C1 | |
|---|---|---|---|
| Metadata Header A2 | Metadata Header B2 | Metadata Header C2 | Metadata Header D2 |
| A3 | A3-tr | C3 | Metadata D3 |
| A4 | A4-tr | C4 | Metadata D4 |
| A5 | A5-tr | C5 | Metadata D5 |
- Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata columns. Finally, we do not want to extract the 5th row.
- All these requirements can be written as the following configurations:
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
- Then the extraction to XLIFF should look like this:
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
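To make the structure of the extracted output easier to follow, the sketch below reads a fragment shaped like the one above and lists, per row group, the extracted cells and the row metadata attached to them. It is only an illustration using Python's standard library. The function name `rows_with_metadata` is hypothetical, and the fragment is assumed to have no XML namespace, whereas a complete XLIFF file normally declares one.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example output (row 3 only), without a namespace.
FRAGMENT = """
<group id="P76C545-sg1" resname="Sheet1">
  <group id="P132303AB-sg3" resname="3">
    <context-group name="row-metadata">
      <context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
    </context-group>
    <trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
      <source xml:lang="en">A3</source>
      <target xml:lang="es"></target>
    </trans-unit>
    <trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
      <source xml:lang="en">B3</source>
      <target xml:lang="es"></target>
    </trans-unit>
  </group>
</group>
"""


def rows_with_metadata(fragment: str) -> dict[str, dict[str, list[str]]]:
    """Map each row number to its extracted cell names and row metadata values."""
    sheet = ET.fromstring(fragment.strip())
    rows = {}
    for row in sheet.findall("group"):
        metadata = [
            context.text or ""
            for context in row.findall("context-group[@name='row-metadata']/context")
        ]
        cells = [unit.get("resname", "") for unit in row.findall("trans-unit")]
        rows[row.get("resname", "")] = {"cells": cells, "metadata": metadata}
    return rows


print(rows_with_metadata(FRAGMENT))
```

Both text units of row 3 share the same metadata value, because the metadata is attached to the row group rather than to the individual cells.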
XLSX: Dynamic extraction constraints
The OpenXML XLSX filter can expose text-unit size constraints dynamically from spreadsheet cells during extraction. This is useful when the spreadsheet contains per-row or per-cell limits, such as maximum character counts or minimum/maximum width or height values.
Availability
This functionality is available from Okapi 1.49.0.
Prior to version 1.49.0, the Target Columns Max Characters value (targetColumnsMaxCharacters) contained a list of unsigned decimal integers in the range [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to the text units in the target columns. E.g.: 25,30.
In 1.49.0, this legacy method is still supported, but has been augmented with the dynamic system described in the rest of this section.
Configuration
Dynamic constraints are configured through the worksheet configuration. The constraint columns are declared through targetColumnsMaxCharacters. They are mapped positionally to targetColumns.
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.targetColumns=A,B
worksheetConfigurations.0.targetColumnsMaxCharacters=C,D
worksheetConfigurations.0.metadataColumns=E
In this example:
| Column | Meaning |
|---|---|
| A, B | Extracted worksheet columns. |
| C, D | Dynamic constraint columns. They are not extracted as translatable text. |
| E | Ordinary metadata column. Its value is written as row metadata. |
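The positional mapping between targetColumnsMaxCharacters and targetColumns can be pictured as pairing the two lists entry by entry. The following Python sketch is an illustration only. The function name `map_constraint_columns` is hypothetical, the entries are assumed to be plain column letters, and stopping at the shorter list when the lengths differ is an assumption of this sketch, not documented filter behaviour.

```python
def map_constraint_columns(target_columns: str, constraint_columns: str) -> dict[str, str]:
    """Pair each constraint column with the target column at the same position."""
    targets = [column.strip() for column in target_columns.split(",")]
    constraints = [column.strip() for column in constraint_columns.split(",")]
    # zip() pairs the first constraint column with the first target column, and so on.
    return dict(zip(constraints, targets))


print(map_constraint_columns("A,B", "C,D"))
```

For the configuration above, column C is paired with target column A and column D with target column B.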
A dynamic constraint column may contain either a bare value or a self-describing constraint expression.
Constraint syntax
Use the following syntax in a dynamic constraint cell:
COLUMN:VALUE[:CONSTRAINT_NAME[:UNIT]]
Multiple constraints may be provided in one cell by separating them with commas:
A:1:minwidth:char,A:2:maxwidth:char
Do not add spaces around commas or colons.
| Part | Required? | Description |
|---|---|---|
| COLUMN | yes | Worksheet column of the extracted cell to which the constraint applies. |
| VALUE | yes | Numeric value of the constraint. A value of 0 removes a previously defined constraint with the same name for the same text unit. |
| CONSTRAINT_NAME | no | Constraint property. Defaults to maxwidth if omitted. |
| UNIT | no | Size unit. Defaults to char if omitted. |
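The syntax above can be read as a small grammar: comma-separated expressions, each with two to four colon-separated parts. The Python sketch below parses a cell value under that reading. It is not the filter's own parser. The names `Constraint` and `parse_constraints` are hypothetical, and treating VALUE as an integer is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    column: str
    value: int
    name: str = "maxwidth"  # default when CONSTRAINT_NAME is omitted
    unit: str = "char"      # default when UNIT is omitted


def parse_constraints(cell: str) -> list[Constraint]:
    """Parse a cell such as "A:1:minwidth:char,A:2:maxwidth:char"."""
    constraints = []
    for expression in cell.split(","):
        parts = expression.split(":")
        if not 2 <= len(parts) <= 4:
            raise ValueError(f"malformed constraint: {expression!r}")
        column = parts[0]
        value = int(parts[1])  # assumption: integer values only
        name = parts[2].lower() if len(parts) > 2 else "maxwidth"
        unit = parts[3].lower() if len(parts) > 3 else "char"
        constraints.append(Constraint(column, value, name, unit))
    return constraints


print(parse_constraints("A:1:minwidth:char,A:2:maxwidth:char"))
print(parse_constraints("B:3"))
```

Because the sketch does not strip whitespace, a value with spaces around the separators fails to parse, which matches the instruction not to add spaces around commas or colons.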
Supported constraint names
| Full name | Short form | XLIFF attribute |
|---|---|---|
| maxwidth | mxw | maxwidth |
| minwidth | mnw | minwidth |
| maxheight | mxh | maxheight |
| minheight | mnh | minheight |
The names are case-insensitive.
Supported units
| Full name | Short form |
|---|---|
| byte | b |
| char | ch |
| col | col |
| cm | cm |
| dlgunit | du |
| em | em |
| ex | ex |
| glyph | g |
| in | in |
| mm | mm |
| percent | pct |
| pixel | px |
| point | pt |
| row | r |
The units are case-insensitive.
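Since both constraint names and units accept a short form and are case-insensitive, a consumer of these values may want to normalize them to the full name first. The Python sketch below shows one way to do that with the tables above as lookup data. The names `CONSTRAINT_NAMES`, `UNITS` and `normalize` are hypothetical and are not Okapi identifiers.

```python
# Short form -> full name, taken from the tables above.
CONSTRAINT_NAMES = {
    "mxw": "maxwidth", "mnw": "minwidth", "mxh": "maxheight", "mnh": "minheight",
}
UNITS = {
    "b": "byte", "ch": "char", "col": "col", "cm": "cm", "du": "dlgunit",
    "em": "em", "ex": "ex", "g": "glyph", "in": "in", "mm": "mm",
    "pct": "percent", "px": "pixel", "pt": "point", "r": "row",
}


def normalize(token: str, short_forms: dict[str, str]) -> str:
    """Return the full name for a short or full form, ignoring case."""
    key = token.lower()
    full = short_forms.get(key, key)
    if full not in short_forms.values():
        raise ValueError(f"unsupported value: {token!r}")
    return full


print(normalize("MXW", CONSTRAINT_NAMES), normalize("Pct", UNITS))
```

With this normalization, A:20:mxw:ch and A:20:MaxWidth:Char describe the same constraint.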
Bare values
When a cell in a configured dynamic constraint column contains only a value, the value is interpreted as maxwidth for the target column mapped to that constraint column.
For example, with:
worksheetConfigurations.0.targetColumns=A
worksheetConfigurations.0.targetColumnsMaxCharacters=C
a value of:
40
in column C means:
A:40:maxwidth:char
Dynamic column with a default value
A targetColumnsMaxCharacters entry may contain both a dynamic constraint column and a default value:
worksheetConfigurations.0.targetColumns=A
worksheetConfigurations.0.targetColumnsMaxCharacters=C:20
This means:
- column C is used as the dynamic constraint column for target column A;
- if no row-specific value overrides it, target column A receives maxwidth="20" and size-unit="char".
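Taken together, a targetColumnsMaxCharacters entry can take three shapes in this documentation: a legacy static integer (25), a dynamic constraint column (C), or a dynamic column with a default (C:20). The Python sketch below tells them apart. The names `MaxCharactersEntry` and `parse_entry` are hypothetical, and rejecting other shapes is an assumption of this sketch rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class MaxCharactersEntry(NamedTuple):
    constraint_column: Optional[str]  # None for a legacy static entry
    default_max_width: Optional[int]  # None when no default is given


def parse_entry(entry: str) -> MaxCharactersEntry:
    """Classify one targetColumnsMaxCharacters entry: "25", "C" or "C:20"."""
    head, separator, tail = entry.partition(":")
    if head.isdigit() and not separator:
        # Legacy form: a fixed maximum width for the target column.
        return MaxCharactersEntry(None, int(head))
    if head.isalpha() and (not separator or tail.isdigit()):
        # Dynamic form: a constraint column, optionally with a default width.
        return MaxCharactersEntry(head, int(tail) if tail else None)
    raise ValueError(f"unrecognized entry: {entry!r}")


for entry in ("25", "C", "C:20"):
    print(entry, parse_entry(entry))
```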
Example
With the example shown above, the relevant cells are:
| Cell | Value | Effect |
|---|---|---|
| C1 | A:1:minwidth:char | Applies minwidth="1" and size-unit="char" to cell A1. |
| D1 | A:2:maxwidth:char | Applies maxwidth="2" and size-unit="char" to cell A1. |
| E1 | general metadata | Is written as ordinary row metadata. |
| C2 | B:3:minheight:em | Applies minheight="3" and size-unit="em" to cell B2. |
The extracted XLIFF contains the size attributes on the affected text units:
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
<context-group name="row-metadata">
<context context-type="x-E">general metadata</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A1" xml:space="preserve"
minwidth="1" maxwidth="2" size-unit="char">
<source xml:lang="en">A1</source>
<target xml:lang="fr"></target>
</trans-unit>
</group>
<group id="P132303AB-sg2" resname="2">
<context-group name="row-metadata"></context-group>
<trans-unit id="P147242AB-tu2" resname="Sheet1!A2" xml:space="preserve">
<source xml:lang="en">A2</source>
<target xml:lang="fr"></target>
</trans-unit>
<trans-unit id="P147242AB-tu3" resname="Sheet1!B2" xml:space="preserve"
minheight="3" size-unit="em">
<source xml:lang="en">B2</source>
<target xml:lang="fr"></target>
</trans-unit>
</group>
</group>
Precedence rules
If the same text unit receives several constraints with the same property name, the last one is used. If several constraints specify different units, the text unit still receives only one size-unit property; the last specified unit is used.
A constraint value of 0 removes the previously defined constraint with the same property name. For example:
A:20:maxwidth:char,A:0:maxwidth:char
leaves no maxwidth constraint for column A.
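The precedence rules amount to replaying the constraints for one text unit in order. The Python sketch below models that reading. The function name `resolve` is hypothetical. Two details are assumptions of the sketch rather than documented behaviour: a removal (value 0) does not count as the last specified unit, and size-unit is omitted when no constraint remains.

```python
def resolve(constraints: list[tuple[str, int, str]]) -> dict[str, str]:
    """Apply (name, value, unit) constraints for one text unit, in order."""
    properties: dict[str, str] = {}
    last_unit = None
    for name, value, unit in constraints:
        if value == 0:
            # A value of 0 removes an earlier constraint with the same name.
            properties.pop(name, None)
            continue
        properties[name] = str(value)  # a later value replaces an earlier one
        last_unit = unit               # only one size-unit survives: the last one
    if properties and last_unit is not None:
        properties["size-unit"] = last_unit
    return properties


print(resolve([("maxwidth", 20, "char"), ("maxwidth", 0, "char")]))
print(resolve([("minwidth", 1, "char"), ("maxwidth", 2, "char")]))
```

The first call mirrors the A:20:maxwidth:char,A:0:maxwidth:char example and ends with no maxwidth property.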
Notes and limitations
- If "Translate Cells Copied" is disabled, only static constraints (that is, a fixed integer width) will be honored. This is because the filter cannot apply dynamic location-based constraints to a string that appears in multiple locations.
- Static default constraints from targetColumnsMaxCharacters can still be applied without row-specific dynamic metadata.
- Plain metadata columns remain row metadata unless they are also configured as dynamic constraint columns through targetColumnsMaxCharacters.
PowerPoint Options
- Translate Document Properties
- When checked and the same option is checked under the General Options (they will be separated after the next release), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
- Reorder Document Properties
- When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
- Reorder Relationships
- When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
- Translate Diagram Data
- When checked, the diagram data are exposed for translation. Default: on.
- Reorder Diagram Data
- When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
- Translate Charts
- When checked, the charts are exposed for translation. Default: on.
- Reorder Charts
- When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
- Translate Notes
- When checked, the slide notes are exposed for translation. Default: off.
- Reorder Notes
- When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
- Translate Comments
- When checked and the same option is checked under the General Options (they will be separated after the next release), the document comments are exposed for translation. Default: on.
- Reorder Comments
- When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
- Translate Masters
- When checked, exposes slide masters and notes masters for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: on.
- Translate Graphic Name
- When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
- Translate Graphic Description
- When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
- Translate Cached Chart Strings
- When checked, the cached chart strings are exposed for translation. Default: off.
- Translate Cached Chart Numbers
- When checked, the cached chart numbers and format codes are exposed for translation. Default: off.
- Excluded/Included Highlight Colors
- Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the description under Word Options.
Limitations
- Various, see the issues list.
