Scoping Report Step
Overview
This step creates a template-based report on various counts (word count, character count, etc.) and optionally leveraged data.
Takes: Filter events. Sends: Filter events.
To have leveraging statistics with this step, your pipeline needs to include, prior to this step, one or more steps that leverage translations, such as the Leveraging Step. Some filters, such as the XLIFF Filter, may also generate resources with leveraged data. To generate only word- or character-count annotations, without a report, use the Word Count Step or Character Count Step.
For a list of the types of matches possible in the counts, see the "Match Types" page.
Parameters
Project name — Enter the name that is placed in the title of the report.
Custom template — Enter the URI or the full path of the custom template to be used to generate the report. If the custom template field is left empty, or if the specified URI is not found, the default template is used.
Output path — Enter the full path of the report file to generate. You can use the ${rootDir} variable, as well as any of the source or target locale variables (${srcLoc}, ${trgloc}, etc.).
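As a rough illustration of how variables in the output path might be resolved, the following Python sketch substitutes `${name}` placeholders with supplied values. It is not Okapi's implementation: the function name `expand_path_variables`, the sample values, and the choice to leave unknown variables untouched are all assumptions made for this example.

```python
import re


def expand_path_variables(path, values):
    """Replace ${name} placeholders in `path` with entries from `values`.

    Unknown variables are left untouched (an assumption of this sketch).
    """
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))

    return re.sub(r"\$\{(\w+)\}", substitute, path)


# Hypothetical values, for illustration only.
values = {"rootDir": "/projects/site", "srcLoc": "en-us"}
print(expand_path_variables("${rootDir}/reports/scoping_${srcLoc}.html", values))
```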
Templates
Templates let the Scoping Report Step generate reports laid out the way you want. Currently, plain text and HTML formats are supported in templates. The Scoping Report Step includes a default HTML report that displays general information about the project and its items. You can specify your own custom report with the step parameter Custom template.
Templates contain text and report fields. Report fields are enclosed in square brackets. A table row is defined by enclosing a row of column fields in an additional pair of square brackets. A template can look like this:
Project Name: [PROJECT_NAME]
Creation Date: [PROJECT_DATE]
Target Locale: [PROJECT_TARGET_LOCALE]
File,Exact Previous Version Matches,Exact Local Context Matches,100% Matches,Fuzzy Matches,Repetitions,Total,
[[ITEM_NAME],[ITEM_EXACT_PREVIOUS_VERSION],[ITEM_EXACT_LOCAL_CONTEXT],[ITEM_EXACT],[ITEM_FUZZY],[ITEM_GMX_REPETITION_MATCHED_WORD_COUNT],[ITEM_TOTAL_WORD_COUNT],]
Total,[PROJECT_EXACT_PREVIOUS_VERSION],[PROJECT_EXACT_LOCAL_CONTEXT],[PROJECT_EXACT],[PROJECT_FUZZY],[PROJECT_GMX_REPETITION_MATCHED_WORD_COUNT],[PROJECT_TOTAL_WORD_COUNT]
This template will produce something similar to this:
Project Name: Community website
Creation Date: 17.03.2011 23:21:23 CET
Target Locale: fr-ca
File,Exact Previous Version Matches,Exact Local Context Matches,100% Matches,Fuzzy Matches,Repetitions,Total,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\aa324.html,10,23,12,57,132,23,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\form.html,31,22,13,17,19,17,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\W3CHTMHLTest1.html,10,23,12,57,12,54,
Total,210,323,512,357,312,154
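To make the bracket convention concrete, here is a minimal Python sketch of how a template of this shape could be expanded. It is an illustration only, not the step's actual code: the names `fill_fields` and `render`, the regular expressions, the sample counts, and the choice to print 0 for a field with no value are all assumptions of the example.

```python
import re

# A single report field, e.g. [PROJECT_NAME].
FIELD = re.compile(r"\[([A-Z_]+)\]")
# A table row: an outer pair of brackets around one or more column fields.
ROW = re.compile(r"\[((?:\[[A-Z_]+\][^\[\]]*)+)\]")


def fill_fields(text, values):
    """Replace each [FIELD] with its value; missing values become 0 (assumption)."""
    return FIELD.sub(lambda m: str(values.get(m.group(1), 0)), text)


def render(template, project, items):
    """Expand row blocks once per item, then fill the project-level fields."""
    def expand_row(match):
        row = match.group(1)
        return "\n".join(fill_fields(row, item) for item in items)

    # Rows are expanded first so that ITEM_ fields are not consumed
    # by the project-level pass.
    expanded = ROW.sub(expand_row, template)
    return fill_fields(expanded, project)


# Hypothetical data, for illustration only.
template = (
    "Project Name: [PROJECT_NAME]\n"
    "File,Fuzzy Matches,Total,\n"
    "[[ITEM_NAME],[ITEM_FUZZY],[ITEM_TOTAL_WORD_COUNT],]\n"
    "Total,[PROJECT_FUZZY],[PROJECT_TOTAL_WORD_COUNT]\n"
)
project = {
    "PROJECT_NAME": "Community website",
    "PROJECT_FUZZY": 74,
    "PROJECT_TOTAL_WORD_COUNT": 40,
}
items = [
    {"ITEM_NAME": "aa324.html", "ITEM_FUZZY": 57, "ITEM_TOTAL_WORD_COUNT": 23},
    {"ITEM_NAME": "form.html", "ITEM_FUZZY": 17, "ITEM_TOTAL_WORD_COUNT": 17},
]
print(render(template, project, items))
```

The row block is repeated once per item, which is why a single bracketed row in the template yields one output line per file in the report.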
Report fields
Templates should contain placeholders for calculable report data. Those placeholders are called report fields and are filled in automatically by the Scoping Report Step.
Note that the values of most fields are calculated by separate steps, e.g. the Word Count Step, Character Count Step, or Leveraging Step. The Scoping Report Step is essentially a presentation layer that displays information provided by other steps. If you forget to include a required step in your pipeline, you will see zeros in the generated report.
Report fields can contain word or character counts for the entire project or for an individual item in the project. Report fields for those two scopes are prefixed with PROJECT_ and ITEM_, respectively.
The tables below show how report fields are related to count categories, and list example steps that provide information for related word or character counts.
General project fields
Report field | Example of provider | Description |
PROJECT_NAME | | Name of the project as set in the step parameters. |
PROJECT_DATE | | Date and time when the report was generated. |
PROJECT_SOURCE_LOCALE | | Source locale, obtained automatically. |
PROJECT_TARGET_LOCALE | | Target locale, obtained automatically. |
PROJECT_TOTAL_WORD_COUNT | Word Count Step | Total number of words, both translatable and non-translatable, in all items of the project. |
PROJECT_TOTAL_CHARACTER_COUNT | Character Count Step | Total number of characters, excluding whitespace and punctuation, both translatable and non-translatable, in all items of the project. |
PROJECT_WHITESPACE_CHARACTER_COUNT | Character Count Step | Total number of whitespace characters, both translatable and non-translatable, in all items of the project. |
PROJECT_PUNCTUATION_CHARACTER_COUNT | Character Count Step | Total number of punctuation characters, both translatable and non-translatable, in all items of the project. |
PROJECT_OVERALL_CHARACTER_COUNT | Character Count Step | Total number of characters, including whitespace and punctuation, both translatable and non-translatable, in all items of the project. |
General item fields
Report field | Example of provider | Description |
ITEM_NAME | | Name of the item (full file name). |
ITEM_SOURCE_LOCALE | | Source locale, obtained automatically. |
ITEM_TARGET_LOCALE | | Target locale, obtained automatically. |
ITEM_TOTAL_WORD_COUNT | Word Count Step | Total number of words, both translatable and non-translatable, in the current item. |
ITEM_TOTAL_CHARACTER_COUNT | Character Count Step | Total number of characters, excluding whitespace and punctuation, both translatable and non-translatable, in the current item. |
ITEM_WHITESPACE_CHARACTER_COUNT | Character Count Step | Total number of whitespace characters, both translatable and non-translatable, in the current item. |
ITEM_PUNCTUATION_CHARACTER_COUNT | Character Count Step | Total number of punctuation characters, both translatable and non-translatable, in the current item. |
ITEM_OVERALL_CHARACTER_COUNT | Character Count Step | Total number of characters, including whitespace and punctuation, both translatable and non-translatable, in the current item. |
Project fields for Okapi count categories
Report field | Example of provider | Okapi word count category | Description |
PROJECT_EXACT_UNIQUE_ID | Leveraging Step | EXACT_UNIQUE_ID | Matches EXACT and matches a unique id. |
PROJECT_EXACT_PREVIOUS_VERSION | Leveraging Step | EXACT_PREVIOUS_VERSION | Matches EXACT and comes from the preceding version of the same document (i.e., if v4 is leveraged, this match must come from v3, not v2 or v1). |
PROJECT_EXACT_LOCAL_CONTEXT | Leveraging Step | EXACT_LOCAL_CONTEXT | Matches EXACT and a small number of segments before and/or after. |
PROJECT_EXACT_DOCUMENT_CONTEXT | Repetition Analysis Step | EXACT_DOCUMENT_CONTEXT | Matches EXACT and comes from the same document. |
PROJECT_EXACT_STRUCTURAL | Leveraging Step | EXACT_STRUCTURAL | Matches EXACT and the structural type of the segment (title, paragraph, list element, etc.). |
PROJECT_EXACT | Leveraging Step | EXACT | Matches text and codes exactly. |
PROJECT_EXACT_TEXT_ONLY_UNIQUE_ID | Leveraging Step | EXACT_TEXT_ONLY_UNIQUE_ID | Matches EXACT_TEXT_ONLY and matches a unique id. |
PROJECT_EXACT_TEXT_ONLY_PREVIOUS_VERSION | Leveraging Step | EXACT_TEXT_ONLY_PREVIOUS_VERSION | Matches EXACT_TEXT_ONLY and comes from a previous version of the same document. |
PROJECT_EXACT_TEXT_ONLY | Leveraging Step | EXACT_TEXT_ONLY | Matches text exactly, but there is a difference in one or more codes. |
PROJECT_EXACT_REPAIRED | Leveraging Step | EXACT_REPAIRED | Matches text and codes exactly, but only after some automated repair (e.g. number replacement, code repair, capitalization, punctuation, etc.). |
PROJECT_FUZZY_UNIQUE_ID | Leveraging Step | FUZZY_UNIQUE_ID | Matches FUZZY and matches a unique id. |
PROJECT_FUZZY_PREVIOUS_VERSION | Leveraging Step | FUZZY_PREVIOUS_VERSION | Matches FUZZY and comes from a previous version of the same document. |
PROJECT_FUZZY | Leveraging Step | FUZZY | Matches text and/or codes partially. |
PROJECT_FUZZY_REPAIRED | Leveraging Step | FUZZY_REPAIRED | Matches text and/or codes partially, and some automated repair (e.g. number replacement, code repair, capitalization, punctuation, etc.) was applied to the target. |
PROJECT_PHRASE_ASSEMBLED | - | PHRASE_ASSEMBLED | Matches assembled from phrases in the TM or other resources (different algorithms could be used). |
PROJECT_MT | Leveraging Step | MT | Indicates a translation coming from an MT engine. |
PROJECT_CONCORDANCE | - | CONCORDANCE | TM concordance or phrase match (usually a word or term only). |
PROJECT_NOCATEGORY | | n/a | Does not match any of the Okapi word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count. |
PROJECT_NONTRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match any of the non-translatable Okapi word count categories. |
PROJECT_TRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match none of the non-translatable Okapi word count categories, and thus need translation. |
Character count categories are also available; replace WORD with CHARACTER or add the suffix _CHARACTER to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
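The PROJECT_NOCATEGORY calculation described in the table above amounts to a simple subtraction. The Python sketch below uses a hypothetical helper name and made-up counts; it only illustrates the arithmetic and says nothing about how the step stores or retrieves its counts.

```python
def no_category_count(total_word_count, category_counts):
    """Words left over after subtracting every categorized count from the total."""
    return total_word_count - sum(category_counts.values())


# Hypothetical counts, for illustration only.
category_counts = {"EXACT": 120, "FUZZY": 45, "MT": 30}
print(no_category_count(250, category_counts))  # 250 - (120 + 45 + 30) = 55
```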
Item fields for Okapi count categories
Report field | Example of provider | Okapi word count category | Description |
ITEM_EXACT_UNIQUE_ID | Leveraging Step | EXACT_UNIQUE_ID | Matches EXACT and matches a unique id. |
ITEM_EXACT_PREVIOUS_VERSION | Leveraging Step | EXACT_PREVIOUS_VERSION | Matches EXACT and comes from the preceding version of the same document (i.e., if v4 is leveraged, this match must come from v3, not v2 or v1). |
ITEM_EXACT_LOCAL_CONTEXT | Leveraging Step | EXACT_LOCAL_CONTEXT | Matches EXACT and a small number of segments before and/or after. |
ITEM_EXACT_DOCUMENT_CONTEXT | Repetition Analysis Step | EXACT_DOCUMENT_CONTEXT | Matches EXACT and comes from the same document. |
ITEM_EXACT_STRUCTURAL | Leveraging Step | EXACT_STRUCTURAL | Matches EXACT and the structural type of the segment (title, paragraph, list element, etc.). |
ITEM_EXACT | Leveraging Step | EXACT | Matches text and codes exactly. |
ITEM_EXACT_TEXT_ONLY_UNIQUE_ID | Leveraging Step | EXACT_TEXT_ONLY_UNIQUE_ID | Matches EXACT_TEXT_ONLY and matches a unique id. |
ITEM_EXACT_TEXT_ONLY_PREVIOUS_VERSION | Leveraging Step | EXACT_TEXT_ONLY_PREVIOUS_VERSION | Matches EXACT_TEXT_ONLY and comes from a previous version of the same document. |
ITEM_EXACT_TEXT_ONLY | Leveraging Step | EXACT_TEXT_ONLY | Matches text exactly, but there is a difference in one or more codes. |
ITEM_EXACT_REPAIRED | Leveraging Step | EXACT_REPAIRED | Matches text and codes exactly, but only after some automated repair (e.g. number replacement, code repair, capitalization, punctuation, etc.). |
ITEM_FUZZY_UNIQUE_ID | Leveraging Step | FUZZY_UNIQUE_ID | Matches FUZZY and matches a unique id. |
ITEM_FUZZY_PREVIOUS_VERSION | Leveraging Step | FUZZY_PREVIOUS_VERSION | Matches FUZZY and comes from a previous version of the same document. |
ITEM_FUZZY | Leveraging Step | FUZZY | Matches text and/or codes partially. |
ITEM_FUZZY_REPAIRED | Leveraging Step | FUZZY_REPAIRED | Matches text and/or codes partially, and some automated repair (e.g. number replacement, code repair, capitalization, punctuation, etc.) was applied to the target. |
ITEM_PHRASE_ASSEMBLED | - | PHRASE_ASSEMBLED | Matches assembled from phrases in the TM or other resources (different algorithms could be used). |
ITEM_MT | Leveraging Step | MT | Indicates a translation coming from an MT engine. |
ITEM_CONCORDANCE | - | CONCORDANCE | TM concordance or phrase match (usually a word or term only). |
ITEM_NOCATEGORY | | n/a | Does not match any of the Okapi word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count. |
ITEM_NONTRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match any of the non-translatable Okapi word count categories. |
ITEM_TRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match none of the non-translatable Okapi word count categories, and thus need translation. |
Character count categories are also available; replace WORD with CHARACTER or add the suffix _CHARACTER to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
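The naming rule for character fields can be sketched as a small helper. This Python example applies the rule literally as stated above; the function name is hypothetical, and the resulting field names should be checked against the fields your version of the step actually supports.

```python
def character_field_name(word_field):
    """Apply the naming rule literally: swap WORD for CHARACTER, else append _CHARACTER."""
    if "WORD" in word_field:
        return word_field.replace("WORD", "CHARACTER")
    return word_field + "_CHARACTER"


for name in ("ITEM_EXACT", "ITEM_TRANSLATABLE_WORD_COUNT"):
    print(name, "->", character_field_name(name))
```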
Project fields for GMX count categories
Report field | Example of provider | GMX word count category | Description |
PROJECT_GMX_PROTECTED_WORD_COUNT | | ProtectedWordCount | An accumulation of the word count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <mrk mtype="protected"> elements). |
PROJECT_GMX_EXACT_MATCHED_WORD_COUNT | Leveraging Step | ExactMatchedWordCount | An accumulation of the word count for text units that have been matched unambiguously with a prior translation and thus require no translator input. |
PROJECT_GMX_LEVERAGED_MATCHED_WORD_COUNT | Leveraging Step | LeveragedMatchedWordCount | An accumulation of the word count for text units that have been matched against a leveraged translation memory database. |
PROJECT_GMX_REPETITION_MATCHED_WORD_COUNT | Repetition Analysis Step | RepetitionMatchedWordCount | An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching. |
PROJECT_GMX_FUZZY_MATCHED_WORD_COUNT | Leveraging Step | FuzzyMatchedWordCount | An accumulation of the word count for text units that have been fuzzy matched against a leveraged translation memory database. |
PROJECT_GMX_ALPHANUMERIC_ONLY_TEXT_UNIT_WORD_COUNT | | AlphanumericOnlyTextUnitWordCount | An accumulation of the word count for text units that have been identified as containing only alphanumeric words. |
PROJECT_GMX_NUMERIC_ONLY_TEXT_UNIT_WORD_COUNT | | NumericOnlyTextUnitWordCount | An accumulation of the word count for text units that have been identified as containing only numeric words. |
PROJECT_GMX_MEASUREMENT_ONLY_TEXT_UNIT_WORD_COUNT | | MeasurementOnlyTextUnitWordCount | An accumulation of the word count from measurement-only text units. |
PROJECT_GMX_NOCATEGORY | | n/a | Does not match any of the GMX word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count. |
PROJECT_GMX_NONTRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match any of the non-translatable GMX word count categories. |
PROJECT_GMX_TRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match none of the non-translatable GMX word count categories, and thus need translation. |
Character count categories are also available; replace WORD with CHARACTER or add the suffix _CHARACTER to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
Item fields for GMX count categories
Report field | Example of provider | GMX word count category | Description |
ITEM_GMX_PROTECTED_WORD_COUNT | | ProtectedWordCount | An accumulation of the word count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <mrk mtype="protected"> elements). |
ITEM_GMX_EXACT_MATCHED_WORD_COUNT | Leveraging Step | ExactMatchedWordCount | An accumulation of the word count for text units that have been matched unambiguously with a prior translation and thus require no translator input. |
ITEM_GMX_LEVERAGED_MATCHED_WORD_COUNT | Leveraging Step | LeveragedMatchedWordCount | An accumulation of the word count for text units that have been matched against a leveraged translation memory database. |
ITEM_GMX_REPETITION_MATCHED_WORD_COUNT | Repetition Analysis Step | RepetitionMatchedWordCount | An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching. |
ITEM_GMX_FUZZY_MATCHED_WORD_COUNT | Leveraging Step | FuzzyMatchedWordCount | An accumulation of the word count for text units that have been fuzzy matched against a leveraged translation memory database. |
ITEM_GMX_ALPHANUMERIC_ONLY_TEXT_UNIT_WORD_COUNT | | AlphanumericOnlyTextUnitWordCount | An accumulation of the word count for text units that have been identified as containing only alphanumeric words. |
ITEM_GMX_NUMERIC_ONLY_TEXT_UNIT_WORD_COUNT | | NumericOnlyTextUnitWordCount | An accumulation of the word count for text units that have been identified as containing only numeric words. |
ITEM_GMX_MEASUREMENT_ONLY_TEXT_UNIT_WORD_COUNT | | MeasurementOnlyTextUnitWordCount | An accumulation of the word count from measurement-only text units. |
ITEM_GMX_NOCATEGORY | | n/a | Does not match any of the GMX word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count. |
ITEM_GMX_NONTRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match any of the non-translatable GMX word count categories. |
ITEM_GMX_TRANSLATABLE_WORD_COUNT | Word Count Step | n/a | Number of words that match none of the non-translatable GMX word count categories, and thus need translation. |
Character count categories are also available; replace WORD with CHARACTER or add the suffix _CHARACTER to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
Limitations
None known.