Quality Check Step

From Okapi Framework

Overview

This step generates a report of potential issues found by comparing the source and target of text units.

Takes: Filter events or raw documents. Sends: Filter events or raw documents.

No check is done on the entries set as non-translatable.

  • With the input as filter events: The events are processed and are sent to the next step. The quality check report is generated at the end of the batch of input files. No interactive UI is provided.
  • With the input as raw documents: The documents are added to a new quality check session and are sent to the next step. At the end of the batch the session is opened in CheckMate, the documents are processed, and the user can browse through the issues found, disable some of them, and generate the report if necessary.

Parameters

General Tab

Note: a text unit in the Okapi tools corresponds to a unit of extracted text, for example a paragraph in HTML or OpenOffice, a string table in a Properties file, etc., while a segment is the unit resulting from a segmentation. A text unit is composed of one or more segments and possible inter-segment parts. When a text unit has not been segmented it is seen as having a single segment. Some documents may already be segmented, like XLIFF. Others are typically not segmented, like TMX, where each <tu> entry corresponds to a text unit (and therefore a single segment).

Text unit verification

Verifications that are done on the whole content of each text unit:

Warn if an entry does not have a translation — This verification is always done. It checks whether each entry has a corresponding translation. A warning is generated when there is no entry for the given target language corresponding to the source entry. Empty translations are checked by a separate option: Warn if a target segment is empty when its source is not empty.

Warn if a target entry has a difference in leading white spaces — Set this option to flag the text units where the leading white spaces are different between source and target.

Warn if a target entry has a difference in trailing white spaces — Set this option to flag the text units where the trailing white spaces are different between source and target.

Segment verification

Verifications that are done on each segment of each text unit (an un-segmented text unit being seen as having a single segment):

Warn if a source segment does not have a corresponding target — This verification is always done. It checks whether all source segments have a corresponding target segment. A warning is generated when a source segment has a segment ID that does not exist in the target text unit.

Warn if there is an extra target segment — This verification is always done. It checks if all target segments correspond to an existing source segment.

Warn if a target segment is empty when its source is not empty — Set this option to flag the segments for which the translation is empty (if the corresponding source is not empty).

Warn if a target segment is not empty when its source is empty — Set this option to flag the segments for which the target is not empty while its source is empty.

Warn if a target segment is the same as its source — Set this option to flag the segments where the translation is the same as the source. This check is done only if the text of the source segment contains at least one word character, that is a character matched by the regular expression "[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]", which is basically any Unicode letter or digit. Note also that the inline codes are not part of the text of the entry. For example (where <b>, </b>, %s and %d are inline codes):

  • The entry "<b>%s</b> : %d" is not checked because it has no character that could be part of a word.
  • The entry "-------- S= %d" is checked because the character 'S' could be part of a word.
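The word-character test can be sketched as follows. This is only an illustration in Python, not the code used by the step: the function name is hypothetical, and it assumes the inline codes have already been removed from the text and that, in the two entries above, "%s" and "%d" are inline codes.

```python
import unicodedata

# Unicode general categories listed in the regular expression above.
WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Nd"}


def has_word_character(text: str) -> bool:
    """Return True if the text contains at least one letter or decimal digit.

    The text is expected to be the segment content without its inline codes.
    """
    return any(unicodedata.category(ch) in WORD_CATEGORIES for ch in text)


# Text left in the two entries above once the codes are stripped (assumed).
first_entry = " : "
second_entry = "-------- S= "

print(has_word_character(first_entry))   # nothing that could be part of a word
print(has_word_character(second_entry))  # 'S' is an uppercase letter (Lu)
```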

Verify for same language family — Set this option to perform the verification Warn if a target segment is the same as its source even if the source and target locales have the same language. For example if the source is en and the target is en-GB.

Include the codes in the comparison — Set this option if the comparison done when verifying if the target is the same as the source should take inline codes into account. If this option is set and the only difference between the source and the translation is an inline code, the segment will not be flagged as having the target the same as the source (because they will have at least one code different).

Note that when a target is found to be the same as its source, the tool checks the list of patterns that have their expected target set to "<same>". If the string matches one of those patterns no warning is generated as the target is expected to be the same.

Warn on doubled words — Set this option to flag the target segments where there is a sequential repetition of the same word, for example "is is" in "this is is an example". The check is not case-sensitive, for example "This this an example" is flagged.

Exceptions — Enter the list of words that can be repeated. For example, in French some sentences may have the expressions "vous vous" or "nous nous". To allow this, enter "vous;nous". Separate the words for which repetition is allowed with semi-colons. You must not leave any space around the semi-colons. The exceptions are not case-sensitive.
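A doubled-word check with exceptions could look like the sketch below. It is a simplified illustration, not the step's implementation: the function name and the regular expression are assumptions, and the real tool may tokenize words differently.

```python
import re


def find_doubled_words(text: str, exceptions: str = "") -> list[str]:
    """Return the words that are repeated back to back in the text.

    exceptions is a semi-colon separated list such as "vous;nous".
    Both the detection and the exceptions ignore case.
    """
    allowed = {word.casefold() for word in exceptions.split(";") if word}
    found = []
    # A whole word, some white space, then the same word again.
    for match in re.finditer(r"\b(\w+)\s+\1\b", text, flags=re.IGNORECASE):
        word = match.group(1)
        if word.casefold() not in allowed:
            found.append(word)
    return found


print(find_doubled_words("this is is an example"))
print(find_doubled_words("Vous vous trompez", exceptions="vous;nous"))
```

The first call reports "is", while the second reports nothing because "vous" is in the exception list.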

File verification

Validate XLIFF documents against schema — Set this option to validate XLIFF documents against the xliff-core-1.2-transitional.xsd schema. A document is detected as XLIFF if it is associated with a filter configuration starting with okf_xliff.

Length Tab

Warn if a source or target text unit does not fit its ITS storage size property — Set this option to check the byte length of the text units that have an its-storageSize property, that is, text units that have ITS Storage Size metadata.
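Because the storage size is a byte length, the result depends on the encoding used to store the text. The sketch below illustrates the idea; the function name is hypothetical and the UTF-8 default is an assumption based on the ITS 2.0 Storage Size data category, not on the step's code.

```python
def fits_storage_size(text: str, storage_size: int, encoding: str = "UTF-8") -> bool:
    """Return True if the encoded text fits in storage_size bytes.

    The default encoding is an assumption for this sketch.
    """
    return len(text.encode(encoding)) <= storage_size


# "café" has 4 characters but needs 5 bytes in UTF-8.
print(fits_storage_size("café", 4))
print(fits_storage_size("café", 4, encoding="ISO-8859-1"))
```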

Warn if a target is longer than — Set this option to flag any target text that is longer than the specified number of characters. The length is based on the number of characters without counting the inline codes.

Warn if a target is longer than the given percentage of the character length of its source — Set this option to flag any target text that is longer than a given percentage of its source text.

Character length above which a text is considered "long" — Enter the number of characters above which you consider a text to be a long text (vs. a short one). This allows you to set different percentages for short and longer text.

Percentage for "short" text — Enter the percentage to use when the text is shorter or equal to the character length above which a text is considered long.

Percentage for "long" text — Enter the percentage to use when the text is longer than the character length above which a text is considered long.

The length is based on the number of characters without counting the inline codes. These values must be tuned for each source/target language pair.
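The interaction between the three values can be sketched as follows. The function name and the default numbers are hypothetical, chosen only for illustration. The sketch also assumes that the length of the source text decides whether the "short" or the "long" percentage applies, and that the texts passed in no longer contain inline codes.

```python
def target_too_long(
    source: str,
    target: str,
    long_threshold: int = 20,   # hypothetical "long" text threshold
    short_percent: int = 350,   # hypothetical percentage for "short" text
    long_percent: int = 200,    # hypothetical percentage for "long" text
) -> bool:
    """Return True if the target exceeds the allowed percentage of the source."""
    # Assumption: the source length selects which percentage is used.
    percent = long_percent if len(source) > long_threshold else short_percent
    max_length = len(source) * percent / 100
    return len(target) > max_length


# "OK" is short: up to 350% of 2 characters (7 characters) is accepted.
print(target_too_long("OK", "Annuler"))
print(target_too_long("OK", "Accepter"))
```

Short strings such as button labels often grow much more in proportion than full sentences, which is why a separate, larger percentage for short text is useful.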

Warn if a target is shorter than the following percentage of the character length of its source — Set this option to flag any target text that is shorter than a given percentage of its source text. This allows you to set different percentages for short and longer text.

Character length above which a text is considered "long" — Enter the number of characters above which you consider a text to be a long text (vs. a short one). This allows you to set different percentages for short and longer text.

Percentage for "short" text — Enter the percentage to use when the text is shorter or equal to the character length above which a text is considered long.

Percentage for "long" text — Enter the percentage to use when the text is longer than the character length above which a text is considered long.

The length is based on the number of characters without counting the inline codes. These values must be tuned for each source/target language pair.

Inline Codes Tab

Warn if there is a code difference between source and target segments — Set this option to verify that the target content has the same inline codes as the source content. This function compares the content of the codes between the source and target, when content is available. Otherwise the codes' type and id are compared. Both missing codes (codes in the source but not in the target) and extra codes (codes in the target but not in the source) are indicated. A difference only in the order of the codes does not trigger a warning.

If no extra or missing codes are found, the program checks for possible issues in the sequence of opening/closing codes. The groups can be moved. A change of parent, as well as a switch of the closing and opening sequence, is reported.
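The search for missing and extra codes is an order-independent comparison, which can be sketched with a multiset. This is an illustration only: the function name is hypothetical, each code is represented by its content as a plain string, and the check of the opening/closing sequence is not covered.

```python
from collections import Counter


def compare_codes(
    source_codes: list[str], target_codes: list[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, extra) codes, ignoring the order of the codes."""
    source_count = Counter(source_codes)
    target_count = Counter(target_codes)
    # Counter subtraction keeps only the positive differences.
    missing = list((source_count - target_count).elements())
    extra = list((target_count - source_count).elements())
    return missing, extra


missing, extra = compare_codes(["<b>", "</b>", "<br/>"], ["</b>", "<i>", "<b>"])
print("Missing:", missing)
print("Extra:", extra)
```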

Try to guess opening/closing types for placeholder codes — Set this option to let the program try to detect whether or not placeholder codes are really opening or closing codes for XML formats. If an opening/closing type is detected, it is included in the verification of the opening/closing sequence.

List of the inline code types to ignore — Enter the list of all the types of inline codes the verification should ignore. For example <mrk> elements in XLIFF, or <df> elements in TTX. The types must be separated by semi-colons.

Codes allowed to be missing from the target — List of the codes that are allowed to be missing in the translation. The strings listed here are codes that are in the source segment and not in its translation, and are allowed to be missing. The list applies to all entries of the input documents. The strings are case-sensitive.

Codes allowed to be extra in the target — List of the codes that are allowed to be extra in the translation. The strings listed here are codes that are in the translation segment but not in its source, and are allowed to be extra. The list applies to all entries of the input documents. The strings are case-sensitive.

For both lists: Use Add to add a new string, Remove to remove the selected string from the list, and Remove All to clear the list.

Patterns Tab

Verify that the following source patterns are translated as expected — Set this option to verify that each source pattern defined in the list has its corresponding expected part in the target content.

  • The first column shows three options associated with this item:
    • Whether the item should be used (un-check the item to disable it).
    • Whether the item goes from source to target ("Src" indicator). That is, the source pattern is searched for first and, if it is found, the corresponding pattern is searched for in the target. Otherwise ("Trg" indicator) the target pattern is searched for first, and the corresponding pattern is then searched for in the source. This makes it possible, for example, to detect extra patterns in the target.
    • The severity of the warning.
  • The second column is the regular expression of the pattern to look for in the source.
  • The third column is the target pattern corresponding to the part found in the source. If the part should be the same as in the source, just use the "<same>" keyword.
  • The fourth column is a short description of the rule.
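A source-to-target rule ("Src" indicator) can be pictured with the sketch below. It is a rough illustration, not the step's implementation: the function name is hypothetical, it uses Python regular expressions rather than the Java syntax the tool expects, and it assumes that "<same>" means the text found in the source must appear unchanged in the target.

```python
import re


def check_source_pattern(
    source: str, target: str, source_regex: str, expected: str = "<same>"
) -> list[str]:
    """Return the source matches that have no corresponding part in the target."""
    issues = []
    for match in re.finditer(source_regex, source):
        found = match.group(0)
        if expected == "<same>":
            # Assumption: the target must contain the exact text found.
            ok = found in target
        else:
            ok = re.search(expected, target) is not None
        if not ok:
            issues.append(found)
    return issues


# The "%d" placeholder was lost in the translation.
print(check_source_pattern("Call %s at %d", "Appelez %s", r"%[sd]"))
```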

Add — Click this button to add a new pattern to the list.

Edit — Click this button to edit the pattern currently selected. You can also double-click the pattern in the table.

Remove — Click this button to remove the pattern currently selected from the table.

Move Up — Click this button to move the pattern currently selected upward in the table.

Move Down — Click this button to move the pattern currently selected downward in the table.

Import — Click this button to import patterns from an existing file into the table.

Export — Click this button to export the patterns in the table to a tab-delimited file.

Characters Tab

Warn if some possibly corrupted characters are found in the target entry — Set this option to check for special patterns that often indicate a file with corrupted characters. For example a UTF-8 file opened as ISO-8859-1, etc. This feature does not find all possible cases of corrupted characters, only some of the frequent ones.

Warn if a character is not included in the following character set encoding — Set this option to check the characters of the text against a given character set encoding. Enter the name of a valid character set encoding, such as ISO-8859-1. You can also leave this field empty to use only the given list of characters provided in the field below this one.

Allow the characters matching the following regular expression — Optionally enter a regular expression that matches a list of allowed characters. The characters specified here will be allowed even if they are not part of the character set encoding specified above. Leave this field empty to not use any regular expression.

You can enter: only a character set encoding, or only a regular expression, or both.
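The way the two settings combine can be sketched as follows. The function name is hypothetical and the logic is an assumption drawn from the descriptions above: a character is accepted if it matches the regular expression or if it can be encoded in the given character set.

```python
import re


def find_disallowed_characters(
    text: str, charset: str = "", allowed_regex: str = ""
) -> list[str]:
    """Return the characters accepted by neither the charset nor the regex."""
    if not charset and not allowed_regex:
        return []  # nothing to check against
    allowed = re.compile(allowed_regex) if allowed_regex else None
    rejected = []
    for ch in text:
        if allowed and allowed.fullmatch(ch):
            continue
        if charset:
            try:
                ch.encode(charset)
                continue  # the character exists in the encoding
            except UnicodeEncodeError:
                pass
        rejected.append(ch)
    return rejected


# The euro sign is not part of ISO-8859-1, unless it is explicitly allowed.
print(find_disallowed_characters("café €5", charset="ISO-8859-1"))
print(find_disallowed_characters("café €5", charset="ISO-8859-1", allowed_regex="[€]"))
```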

LanguageTool Tab

Perform the verifications provided by the LanguageTool server — Set this option to run the verifications provided by a LanguageTool server. To use this option you must have access to LanguageTool Checker running as a server. Most of the time this is simply a local server. You can start the application with Java Web Start: Start LanguageTool Checker from the Web. You can also do this by clicking on the Start LanguageTool from the Web button.

Note that using LanguageTool may significantly increase the processing time. In addition, using the auto-translate option (see below) increases the processing time further.

Use bilingual mode — Set this option to use the "bitext" mode, where the source text of the segment is also sent to LanguageTool. This may reduce the number of false warnings. This option does not work with older versions of LanguageTool.

Auto-translate the messages from the LanguageTool checker — Set this option to have the messages coming from the LanguageTool checker translated into a given language. Most of the time, the error messages of LanguageTool are provided in the same language as the text verified (e.g. verifying a Polish text will give you back error messages in Polish). Use this option to have the messages automatically translated using Google MT and displayed along with the original messages.

From — Enter the language of the original messages (e.g. "pl" for Polish).

Into — Enter the code of the language into which you want to translate the messages (e.g. "en" for English).

Start LanguageTool from the Web — Click this button to start LanguageTool checker directly from the Web. This command uses the Java Web Start technology to download and execute the latest version of LanguageTool from its Web site.

You will be prompted by a Security Warning dialog asking you to confirm you want to launch the application. Click Run or Yes if you want to continue. Once the application is running: go to File menu and select the command Options. Select the target language. Make sure the option Run as server on port is set, and that the port specified matches the port you want to use. Minimize the LanguageTool window and go back to your application. You are now able to use LanguageTool.

Terms Tab

Terminology

Warning: The Terms feature is under development

Verify terminology — Set this option to verify the translation terminology. The given glossary file is read and, for each segment, the tool searches for existing terms in the source, then searches the target text for the corresponding translation. If no corresponding translation is found, a warning is issued.

Full path of the glossary file to use — Enter the full path of the glossary file to use. Several formats are supported:

  • TBX (TermBase eXchange format).
  • CSV (Comma-Separated Values format): the file must be in UTF-8. The first column is the source terms, the second is the translation. Other columns are ignored. Values with a comma must be enclosed in double-quotes. Literal double-quotes in quoted values must be escaped as two double-quotes.
  • Tab-delimited: The file must be in UTF-8. The first column is the source terms, the second is the translation. Other columns are ignored.

Files with a .tbx extension are imported as TBX files. Files with a .csv extension are imported as CSV files. Files with a .txt extension (or any other extension) are imported as Tab-delimited files.
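The choice of format by extension and the reading of the two delimited formats can be sketched as follows. The function names are hypothetical, comparing the extension without regard to case is an assumption, and TBX parsing is left out.

```python
import csv
from pathlib import Path


def glossary_format(path: str) -> str:
    """Return "tbx", "csv" or "tab" according to the file extension."""
    extension = Path(path).suffix.lower()  # assumption: case is ignored
    if extension == ".tbx":
        return "tbx"
    if extension == ".csv":
        return "csv"
    return "tab"  # .txt or any other extension


def parse_glossary_lines(lines: list[str], fmt: str) -> list[tuple[str, str]]:
    """Return (source term, translation) pairs; other columns are ignored."""
    if fmt == "csv":
        # The csv module handles quoted commas and doubled double-quotes.
        rows = csv.reader(lines)
    else:
        rows = (line.rstrip("\n").split("\t") for line in lines)
    return [(row[0], row[1]) for row in rows if len(row) >= 2]


sample = ['"file, open","fichier, ouvrir",note', '"say ""hi""",dire']
print(glossary_format("terms.csv"))
print(parse_glossary_lines(sample, "csv"))
```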

Verify using string matching — Set this option to use the terms as full strings. When this option is set, the tool looks for complete match. This can be used to verify UI strings in a document for example.

Strings must be between inline codes to match — Set this option to have the strings match only if they are surrounded by inline codes.

Blacklist

Check for blacklisted terms — Set this option to detect usage of blacklisted terms. The specified blacklist file is read and, for each segment, the target text is searched for any blacklisted term. If a blacklisted term is found, a warning is issued.

Full path of blacklist file to use: — Enter the full path of the blacklist file to use. The file must be in UTF-8.

  • The first column is the blacklisted term.
  • The second column is an optional replacement suggestion.
  • From M35 onward, you can also have a third column with a comment.
  • From M35 onward, you can also have a fourth column with an integer indicating the severity (0, 1 or 2, with 0 (low) as the default).

Other columns are ignored.
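Reading one line of such a file can be sketched as follows. The names are hypothetical, and the sketch assumes that the columns are separated by tabs, which is not stated above.

```python
from typing import NamedTuple


class BlacklistEntry(NamedTuple):
    term: str
    suggestion: str
    comment: str
    severity: int


def parse_blacklist_line(line: str) -> BlacklistEntry:
    """Parse one line of a blacklist file (tab-delimited is an assumption)."""
    columns = line.rstrip("\n").split("\t")
    columns += [""] * (4 - len(columns))  # fill in the optional columns
    # Severity defaults to 0 (low) when the fourth column is absent or empty.
    severity = int(columns[3]) if columns[3].strip() else 0
    return BlacklistEntry(columns[0], columns[1], columns[2], severity)


print(parse_blacklist_line("utilize\tuse\tprefer plain words\t2"))
print(parse_blacklist_line("utilize"))
```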

Other Settings Tab

Scope

Note that entries flagged as non-translatable are never processed, regardless of the choice for the scope. When the scope is not set to all entries, it is determined by the value of the Approved property. The setting of this property is specific to each file format.

Process all entries — Select this option to process all entries.

Process only approved entries — Select this option to process only the entries that have the property Approved set to "yes".

Process only non-approved entries — Select this option to process only the entries that do not have a property Approved, or that have one not set to "yes".

Report output

Path of the report file — Enter the full path of the HTML report to generate. You can use the variable ${rootDir} in the path.

Format of the report — Select the type of report you want:

  • HTML file: An HTML file.
  • Tab-delimited file: A tab-delimited file where the first column is the location of the issue, the second column is the message of the issue, the third column is the source text, the fourth column is the target text. The file is in UTF-8.
  • XML file: An XML document.

Open the report after completion — Set this option to automatically open the report file after the process is complete.

Show full paths on the report — Set this option to show the full path of the documents with issues in the report. If this option is not set the report will show the relative paths, removing the longest common root directories.
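Removing the longest common root directories can be sketched as follows. The function name is hypothetical, and the sketch uses POSIX-style paths only to keep the example independent of the platform.

```python
import posixpath


def report_paths(paths: list[str], show_full_paths: bool = False) -> list[str]:
    """Return the paths to display in the report."""
    if show_full_paths or not paths:
        return list(paths)
    # Longest directory shared by all the documents.
    root = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    return [posixpath.relpath(p, root) for p in paths]


documents = ["/proj/docs/en/a.html", "/proj/docs/fr/a.html"]
print(report_paths(documents))
print(report_paths(documents, show_full_paths=True))
```

With the option not set, the two documents are listed as "en/a.html" and "fr/a.html".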

Overall Configuration

Import — Click this button to import an existing configuration file.

Export — Click this button to export the current configuration to a file.

Reset to Defaults — Click this button to reset all the configuration settings to their original values. You will be prompted to confirm the command.

Step-Specific Options

(Options available only when editing the parameters for a step).

Save the session using the following path — Set this option to save the session when the step is run. Enter the full path of the session file to save. You can use the variable ${rootDir} in the path.

Limitations

None known.