Sentence Alignment Step

From Okapi Framework

Overview

This step aligns the sentences of each text unit from two documents.

Takes: Filter events. Sends: Filter events.

Currently, the events sent by this step are the same as the events it received, not the generated aligned text units. Sending the aligned text units will be implemented in a future version.

Text units from the source and target documents must be perfectly synchronized (aligned). For example, if the source document has more text units than the target document, an error is generated. If requested, the step can segment both source and target text units as they are processed, using either a default SRX file or one you specify.

The aligner algorithm takes the sentences in the source and target text units and finds the best possible alignment based on the character lengths of the sentences. Internal parameters take into account that some languages translate into fewer (or more) characters. Possible match types produced are: 1-1, 2-1, 1-2, 0-1, 1-0, 2-0, 2-3, 3-2, etc.
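To make the idea of length-based alignment concrete, here is a minimal Python sketch of a dynamic-programming aligner in the general style of length-based methods. It is not the step's actual implementation: the function name `align_by_length`, the `BEAD_PENALTY` values, the `ratio` parameter, and the restriction to five match types are all assumptions chosen for illustration.

```python
# Hypothetical match types (source count, target count) with illustrative
# penalties expressed in characters. These are NOT the step's internal values.
BEAD_PENALTY = {(1, 1): 0, (2, 1): 10, (1, 2): 10, (1, 0): 20, (0, 1): 20}


def align_by_length(src, trg, ratio=1.0):
    """Align two sentence lists using character lengths only.

    `ratio` is the assumed number of target characters per source character.
    Returns a list of (source indices, target indices) pairs.
    """
    n, m = len(src), len(trg)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in trg[j:nj])
                # Cost = match-type penalty + length mismatch in characters.
                c = cost[i][j] + penalty + abs(s_len * ratio - t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest sequence of matches.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

With three source sentences of 10, 23 and 16 characters and two target sentences of 34 and 17 characters, this sketch pairs the first two source sentences with the first target sentence (a 2-1 match) and the last sentences with each other (a 1-1 match). The `ratio` argument plays the role of the internal parameters mentioned above, under the assumption that a single scaling factor is enough.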

Entries set as non-translatable are not processed.

When processing bilingual documents (for example, TMX or PO), use only a single input list.

Parameters

Generate the following TMX document — Set this option to generate a TMX file with the aligned entries.

  • If this option is set: this step returns the event it received (possibly re-segmented).
  • If this option is not set: this step returns the bilingual text unit corresponding to the alignment (also segmented).

Enter the full path of the TMX document to generate. If the file already exists it will be overwritten.
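For readers unfamiliar with TMX, the following Python sketch shows roughly what a TMX document holding aligned entries looks like. It uses only the standard library and is not how the step itself writes its output: the function name `build_tmx` and the header values (`creationtool`, `datatype` and so on) are placeholders assumed for illustration.

```python
import xml.etree.ElementTree as ET


def build_tmx(pairs, src_lang, trg_lang):
    """Return a minimal TMX 1.4 document as a string.

    `pairs` is a list of (source sentence, target sentence) tuples.
    The header values below are placeholders, not what the step emits.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example",
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (trg_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


print(build_tmx([("Hello.", "Bonjour.")], "en", "fr"))
```

Writing the returned string to a file opened in write mode would replace any existing file at that path, which matches the overwrite behavior described above.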

Segment the source content — Set this option to segment the source content before trying to align it. If this option is not set the content is expected to be already segmented. If this option is set and the content is already segmented, the existing segmentation will be reset to the new one.

Use custom source segmentation rules — Set this option to use a specified SRX file for segmenting the source. If this option is not set, and segmentation is required, the default rules are used.

Enter the full path of the SRX file to use for segmenting the source.

Segment the target content — Set this option to segment the target content before trying to align it. If this option is not set the content is expected to be already segmented. If this option is set and the content is already segmented, the existing segmentation will be reset to the new one.

Use custom target segmentation rules — Set this option to use a specified SRX file for segmenting the target. If this option is not set, and segmentation is required, the default rules are used.

Enter the full path of the SRX file to use for segmenting the target.

Note: For this step, the default rules are hard-coded in the step itself. They are not the rules defined in the config sub-directory of the installation directory.
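As a rough illustration of how SRX-style segmentation rules behave, the Python sketch below applies an ordered list of break and no-break rules, each made of a "before break" and an "after break" regular expression. The rules shown are hypothetical examples, not the step's hard-coded defaults, and real SRX files are XML documents with language mappings that this sketch does not model.

```python
import re

# Hypothetical rules: (is_break, before-break pattern, after-break pattern).
# Rules are tried in order; the first one that matches at a position wins.
EXAMPLE_RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: no break after "Dr."
    (True, r"[.?!]+", r"\s+"),             # break after sentence punctuation
]


def segment(text, rules=EXAMPLE_RULES):
    """Split text at every position where the first matching rule is a break."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # The before pattern must end at pos, the after pattern start at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides, break or not
    segments.append(text[start:])
    return segments


print(segment("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', ' He sat down.']
```

In this sketch the whitespace between sentences stays at the start of the following segment, so joining the segments reproduces the original text. Whether the step trims that whitespace is not covered here.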

Collapse whitespace — Set this option to collapse whitespace (space, newline, etc.) to a single space before performing the segmentation and alignment.
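The effect of collapsing whitespace can be sketched in a few lines of Python. The function name `collapse_whitespace` is an assumption for illustration, and whether the step also trims leading or trailing whitespace is not stated here, so this sketch does not trim.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


print(collapse_whitespace("One  two\n\tthree"))  # One two three
```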

Output 1-1 matches only — Set this option to output only 1-1 sentence aligned matches.

Force Simple One to One Alignment — Set this option so that, for each paragraph, the sentences are aligned one to one when the source and target have the same number of sentences; otherwise, the sentences in the paragraph are joined back together and the aligned paragraph is output. If set, this option overrides the Output 1-1 matches only option.
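The decision made by the simple one-to-one option can be sketched as follows in Python. The function name `simple_one_to_one` and the `joiner` used to glue sentences back together are assumptions for illustration, not details taken from the step.

```python
def simple_one_to_one(src_sentences, trg_sentences, joiner=" "):
    """Pair sentences in order when the counts match, else pair whole paragraphs."""
    if len(src_sentences) == len(trg_sentences):
        return list(zip(src_sentences, trg_sentences))
    # Counts differ: fall back to one paragraph-level pair.
    return [(joiner.join(src_sentences), joiner.join(trg_sentences))]


print(simple_one_to_one(["A.", "B."], ["X.", "Y."]))
# [('A.', 'X.'), ('B.', 'Y.')]
print(simple_one_to_one(["A.", "B."], ["X and Y."]))
# [('A. B.', 'X and Y.')]
```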

Limitations

None known.