Segmentation Step

From Okapi Framework

Overview

This step segments the content of extracted text units.

Takes: Filter events. Sends: Filter events.

The separation between text units is based on the structure of the original file format. For example, the content of two <p> elements in HTML gives you two text units. This step breaks the content of the text units into smaller parts, usually corresponding to sentences.

The segmentation is done using segmentation rules defined in an SRX document. This step supports SRX 2.0. The SRX (Segmentation Rules eXchange) format is a standard way of describing rules on how to break text. It uses regular expressions to specify patterns before and after a break or a non-break position.
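To illustrate how before-break and after-break patterns interact, here is a minimal Python sketch. It is not the implementation used by this step and it does not read SRX documents: the Rule class, the segment function and the sample patterns are hypothetical. It assumes that rules are tried in order at each position and that the first matching rule decides whether to break, and it uses Python's re module rather than the regular expression flavor defined for SRX.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for one SRX rule."""
    is_break: bool  # True: break here; False: exception (do not break here)
    before: str     # pattern that must match the text ending at the position
    after: str      # pattern that must match the text starting at the position


def segment(text, rules):
    """Split text at every position where the first matching rule is a break rule."""
    compiled = [
        (rule.is_break,
         re.compile("(?:" + rule.before + r")\Z"),  # anchor at the end of the left-hand text
         re.compile(rule.after))
        for rule in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # assumption: the first matching rule decides
    segments.append(text[start:])
    return segments


# Sample rules (assumptions for illustration only).
rules = [
    Rule(False, r"\b(?:Mr|Dr)\.", r"\s"),  # no break after these abbreviations
    Rule(True, r"[.?!]", r"\s"),           # break after sentence-ending punctuation
]

print(segment("Dr. Smith arrived. He was late.", rules))
# ['Dr. Smith arrived.', ' He was late.']
```

In this sketch the whitespace after a break stays at the start of the next segment. Whether such whitespace is trimmed or kept is a separate decision that the sketch does not model.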

  • Text units flagged as non-translatable are not segmented.
  • Text units with no content are not segmented.
  • Text units that are already segmented are not re-segmented, except if the option Overwrite existing segmentation is set.

Parameters

Source

Segment the source text using the following SRX rules — Set this option to segment the source text of the text units.

Enter the full path of the SRX document to use for segmenting the source text. You can use the variables ${rootDir} and ${inputRootDir} in the path.
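As a rough illustration of how such variables in a path could be resolved, the following Python sketch replaces ${name} placeholders from a dictionary. The function expand_path_variables and the directory values are hypothetical; the actual values of ${rootDir} and ${inputRootDir} are supplied by the tool running the step, not by code like this.

```python
import re


def expand_path_variables(path, variables):
    """Replace ${name} placeholders in a single pass; unknown names are left as-is."""
    def lookup(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\$\{(\w+)\}", lookup, path)


# Hypothetical directory values, for illustration only.
variables = {
    "rootDir": "/projects/demo",
    "inputRootDir": "/projects/demo/input",
}

print(expand_path_variables("${rootDir}/rules/default.srx", variables))
# /projects/demo/rules/default.srx
```

A single substitution pass is used so that a value which itself contains a placeholder is not expanded a second time.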

Edit — Click this button to open the SRX document in Ratel, the SRX editor of the Okapi framework. Note that when you exit the editor, the file being edited is set as the file to use.

Target

Segment existing target text using the following SRX rules — Set this option to segment the target text of the text unit, if target text exists for the target locales being processed. Whether you are processing a single target locale or several, only those locales are affected, even if your document contains more locales.

Enter the full path of the SRX document to use for segmenting the target text. This can be the same document as for the rules for the source. You can use the variables ${rootDir} and ${inputRootDir} in the path.

Edit — Click this button to open the SRX document in Ratel, the SRX editor of the framework. Note that when you exit the editor, the file being edited is set as the file to use.

Options

Behavior if input text is already segmented — Select one of the following three options:

Keep existing segmentation
If a text unit is already segmented, its segmentation is not modified by this step.
Overwrite existing segmentation (re-segment)
If a text unit is already segmented, re-segment it using the SRX rules set in the step parameters. The previous segmentation is lost.
Keep existing segmentation, segment further against the SRX rules
If a text unit is already segmented, keep its segmentation, but try to segment its existing segments further, creating more segments and text parts out of the existing segments. The segmentation becomes finer only if the SRX rules set in the step parameters provide a higher granularity than the existing segmentation.
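The difference between the three options can be sketched in Python as follows. This is a simplified model, not the step's code: a text unit is represented as a plain list of segment strings (more than one entry meaning "already segmented"), split_sentences is a stand-in for real SRX rules, and the names Behavior and apply_behavior are hypothetical.

```python
import re
from enum import Enum


class Behavior(Enum):
    KEEP = "keep existing segmentation"
    OVERWRITE = "overwrite existing segmentation (re-segment)"
    DEEPEN = "keep existing segmentation, segment further"


def split_sentences(text):
    """Stand-in for SRX rules: break after '.', '?' or '!' followed by whitespace."""
    return [part for part in re.split(r"(?<=[.?!])\s+", text) if part]


def apply_behavior(segments, behavior, splitter=split_sentences):
    """Return the new list of segments for one text unit."""
    already_segmented = len(segments) > 1
    if not already_segmented or behavior is Behavior.OVERWRITE:
        # Discard any existing boundaries and segment the whole content.
        # Joining with a space is a simplification of this sketch.
        return splitter(" ".join(segments))
    if behavior is Behavior.KEEP:
        return list(segments)
    # DEEPEN: keep existing boundaries, split inside each existing segment.
    result = []
    for existing in segments:
        result.extend(splitter(existing))
    return result


existing = ["Hello", "world. Bye."]  # boundary in the middle of a sentence
print(apply_behavior(existing, Behavior.KEEP))       # ['Hello', 'world. Bye.']
print(apply_behavior(existing, Behavior.OVERWRITE))  # ['Hello world.', 'Bye.']
print(apply_behavior(existing, Behavior.DEEPEN))     # ['Hello', 'world.', 'Bye.']
```

The sample input has an existing boundary in the middle of a sentence, which is where overwriting and segmenting further give different results: overwriting drops that boundary, while segmenting further preserves it and only adds new ones.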

Copy source into target if no target exists — Set this option to copy the source content into the target if no target is already available.

Verify that a target segment matches each source segment when a target content exists — Set this option to verify that, if a target is available, every source segment has a corresponding target segment in the target content. This also verifies that the source and the target have the same number of segments. Note that this verification does not ensure that the content of a target segment is the translation of its corresponding source text. It only matches the segments by their IDs.
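The two checks described above can be sketched in Python like this. The data model is an assumption for illustration: source and target are lists of (segment ID, text) pairs, and the function verify_alignment and its messages are hypothetical, not the warnings produced by the step.

```python
def verify_alignment(source, target):
    """Return a list of problems; an empty list means both checks passed.

    source and target are lists of (segment_id, text) pairs.
    """
    problems = []
    target_ids = {segment_id for segment_id, _ in target}
    # Check 1: every source segment has a target segment with the same ID.
    for segment_id, _ in source:
        if segment_id not in target_ids:
            problems.append(f"no target segment for source segment '{segment_id}'")
    # Check 2: both sides have the same number of segments.
    if len(source) != len(target):
        problems.append(
            f"segment count differs: {len(source)} source, {len(target)} target"
        )
    return problems


source = [("0", "Hello."), ("1", "Goodbye.")]
target = [("0", "Bonjour.")]
for problem in verify_alignment(source, target):
    print(problem)
```

As in the step, only IDs and counts are compared here; the text of each pair is never inspected, so a wrong translation with the right ID passes.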

When possible force the output to show the segmentation — Set this option to make the file formats that can represent segmentation show the segments, regardless of the option set in the filter configuration. For example, the XLIFF Filter by default outputs segments only for entries that had segments in the input XLIFF document. When you set this option, the filter's output option is changed to show the segments as needed, even for the text units that had no segments. If this option is off, the segments are shown only if the filter configuration allows it.

Renumber code IDs — Set this option to change the IDs of the inline codes in each resulting segment so that they start at '1'. This can be useful when working with translation memories where the inline code IDs restart at '1' in each segment. The option Restore original IDs to renumbered codes in the Desegmentation Step can be used to restore the original IDs when desegmenting.

Warning: This option cannot work with documents where the ID values are not numeric, or not sequential.
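The effect of renumbering can be pictured with the following Python sketch. It is only a model: each segment is reduced to the list of its inline code IDs, and the functions renumber_code_ids and restore_code_ids are hypothetical. In particular, the sketch keeps an explicit mapping to restore the original IDs, which is an assumption made for illustration and not necessarily how this step or the Desegmentation Step works.

```python
def renumber_code_ids(segments):
    """Renumber the code IDs of each segment so they start at 1.

    segments is a list of lists of code IDs, one list per segment.
    Returns the renumbered segments and, per segment, a map new ID -> original ID.
    """
    renumbered = []
    mappings = []
    for code_ids in segments:
        new_for_old = {}
        for code_id in code_ids:
            # Codes sharing an ID (e.g. an opening/closing pair) keep sharing one.
            if code_id not in new_for_old:
                new_for_old[code_id] = len(new_for_old) + 1
        renumbered.append([new_for_old[code_id] for code_id in code_ids])
        mappings.append({new: old for old, new in new_for_old.items()})
    return renumbered, mappings


def restore_code_ids(renumbered, mappings):
    """Undo renumber_code_ids using the saved mappings."""
    return [
        [mapping[code_id] for code_id in code_ids]
        for code_ids, mapping in zip(renumbered, mappings)
    ]


original = [[1, 1, 2], [3, 4, 4]]  # second segment starts at ID 3
renumbered, mappings = renumber_code_ids(original)
print(renumbered)                               # [[1, 1, 2], [1, 2, 2]]
print(restore_code_ids(renumbered, mappings))   # [[1, 1, 2], [3, 4, 4]]
```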

Limitations

  • Strict implementation of the SRX syntax in Java is not possible. See the "SRX and Java" page for more details.
  • Using the Renumber code IDs option when creating a translation kit requires using a separate Desegmentation Step when merging back, and may not be possible with some segment-based formats such as TTX.