Rainbow

Pipelines

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=Rainbow

Overview

A pipeline is a customizable sequence of steps that takes the input documents provided in Rainbow's input lists and runs them through each step in sequential order.

A step is a small component that takes a specific input (e.g. a raw document), performs a specific task on it (e.g. parsing the document to extract text units), and sends a specific output (e.g. filter events) to the next step in the pipeline.

The Okapi distribution comes with a set of pre-existing steps you can use right out of the box, and you may have additional plugin steps as well.

The steps communicate through their input and output. There are two main types of input and output:

  1. Raw document, which is simply a file.
  2. Filter events, which are the content of a raw document broken down into standardized parts that include text units, start and end groups, etc.

Any step that takes a raw document as input can be placed at the front of the pipeline or after any step that outputs a raw document. Any step that takes filter events as input needs to be placed after a step that outputs filter events.
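These ordering rules can be sketched with a small model (hypothetical Python classes for illustration only, not the actual Okapi API): each step declares what it takes and what it sends, and a pipeline is valid only when adjacent steps match.

```python
# Minimal, hypothetical model of a pipeline; the real Okapi API differs.

RAW, EVENTS = "raw document", "filter events"

class Step:
    takes = RAW      # input type this step consumes
    sends = RAW      # output type it passes to the next step
    def run(self, data):
        return data  # a real step would transform its input here

def validate(pipeline):
    """Check that each step's input matches the previous step's output."""
    current = RAW    # the pipeline always starts from a raw document
    for step in pipeline:
        if step.takes != current:
            return False
        current = step.sends
    return True

class RawDocumentToFilterEvents(Step):
    takes, sends = RAW, EVENTS

class FilterEventsToRawDocument(Step):
    takes, sends = EVENTS, RAW

# A filter-events step placed first is invalid; after extraction it is fine.
print(validate([FilterEventsToRawDocument()]))                               # False
print(validate([RawDocumentToFilterEvents(), FilterEventsToRawDocument()]))  # True
```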

The step Raw Document to Filter Events is the normal way to get a filter events output from a raw document input. And the step Filter Events to Raw Document provides a way to create a raw document from an input of filter events, in the format of the original raw document. So, to simply extract and merge an input file, without doing anything special to it, the pipeline would be made of:

  1. raw document ==> Raw Document to Filter Events ==> filter events
  2. filter events ==> Filter Events to Raw Document ==> raw document

To perform any kind of action that modifies the extracted content, you simply add the necessary step or steps between those two. For example, imagine that you want to make sure the Japanese text of a file uses full-width characters and never half-width ones: you would use the Full-Width Conversion step, which converts the extracted text from half-width characters to full-width characters. You would have the following pipeline:

  1. raw document ==> Raw Document to Filter Events ==> filter events
  2. filter events ==> Full-Width Conversion ==> filter events
  3. filter events ==> Filter Events to Raw Document ==> raw document
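To illustrate what such a middle step does to the extracted text, here is a minimal half-width to full-width conversion in Python (a sketch of the idea only; it handles the ASCII range, while the actual step also covers half-width Katakana and has its own options):

```python
def to_full_width(text):
    """Convert half-width ASCII characters to their full-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # ideographic (full-width) space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # ASCII block maps to U+FF01..U+FF5E
        else:
            out.append(ch)                  # leave everything else untouched
    return "".join(out)

print(to_full_width("Okapi 123"))  # → "Ｏｋａｐｉ　１２３"
```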

Any number of steps could be added between the extraction and the merging, allowing you to perform several tasks in a single process.

Basic Raw-Document-Based Pipeline

Not all steps require filter events. Some steps take a raw document and output a raw document, for example when they perform tasks on the file as a whole rather than on its translatable text.

For example, the step Encoding Conversion reads a text file (without any filter) in a given encoding, and re-writes it in a different one. This step can be used in a one-step pipeline:

  1. raw document ==> Encoding Conversion ==> raw document
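The idea behind this kind of raw-document step can be sketched in a few lines of Python (a simplified stand-in, not the step's actual implementation):

```python
def convert_encoding(in_path, out_path, in_enc, out_enc):
    """Re-write a text file from one encoding to another, as a whole file."""
    with open(in_path, "r", encoding=in_enc) as f:
        text = f.read()       # decode using the input encoding
    with open(out_path, "w", encoding=out_enc) as f:
        f.write(text)         # re-encode using the output encoding

# e.g. convert_encoding("doc.txt", "doc-sjis.txt", "utf-8", "shift_jis")
```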

Note that a few steps, such as Search And Replace, can take either a raw document or filter events as input and output either as well, depending on the step's options.

Complex Pipelines

The real power of pipelines comes to light when you chain multiple steps without having to re-extract the text, or perform different raw-document tasks one after the other automatically. You can also change the type of input/output several times in a single pipeline. For example, you could convert the content of an input document to full-width Japanese characters, then make sure its encoding is Shift-JIS, and then make sure all line-breaks are Unix line-breaks.

  1. raw document ==> Raw Document to Filter Events ==> filter events
  2. filter events ==> Full-Width Conversion ==> filter events
  3. filter events ==> Filter Events to Raw Document ==> raw document
  4. raw document ==> Encoding Conversion ==> raw document
  5. raw document ==> Line-Break Conversion ==> raw document
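Chaining raw-document steps like steps 4 and 5 amounts to feeding the whole file content through one transformation after another; as a sketch (hypothetical helper functions, not the Okapi API):

```python
def to_unix_line_breaks(text):
    """Normalize Windows (CRLF) and old Mac (CR) line-breaks to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def run_raw_steps(text, steps):
    """Feed the raw content through each step, in pipeline order."""
    for step in steps:
        text = step(text)
    return text

converted = run_raw_steps("line1\r\nline2\rline3", [to_unix_line_breaks])
# converted == "line1\nline2\nline3"
```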

You can even add more steps to this pipeline that would go back to act on filter events. For example, to do a last search and replace on a specific pattern only inside the extracted text:

  1. raw document ==> Raw Document to Filter Events ==> filter events
  2. filter events ==> Full-Width Conversion ==> filter events
  3. filter events ==> Filter Events to Raw Document ==> raw document
  4. raw document ==> Encoding Conversion ==> raw document
  5. raw document ==> Line-Break Conversion ==> raw document
  6. raw document ==> Raw Document to Filter Events ==> filter events
  7. filter events ==> Search And Replace ==> filter events
  8. filter events ==> Filter Events to Raw Document ==> raw document

Note that some steps generate outputs that are not necessarily passed on to the pipeline. For example, the Used Character Listing step generates one output file that contains the list of all characters used in all the text unit content of all the input files. It takes filter events as input and passes them on (unmodified) as output.
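Such a pass-through step can be sketched as follows (hypothetical code, not the actual Used Character Listing step): it records the characters it sees in text-unit events but forwards every event unchanged.

```python
def used_character_listing(events, report):
    """Collect every character seen in text units; pass all events through."""
    for kind, payload in events:
        if kind == "text_unit":
            report.update(payload)   # remember the characters used
        yield (kind, payload)        # forward the event unmodified

seen = set()
events = [("start_document", ""), ("text_unit", "abc"), ("end_document", "")]
out = list(used_character_listing(events, seen))
print(out == events)   # True: the events are passed on unchanged
print(sorted(seen))    # ['a', 'b', 'c']
```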

Frequently Asked Questions

Q: How do I specify what filter configuration to use with the Raw Document to Filter Events step?
A: In the Rainbow Input List tabs, you associate one filter configuration per input file.

Q: How do I specify the name of the output file?
A: In Rainbow's Other Settings tab, you can define the names and locations of all the output files based on their input file paths.

Q: How many steps can I chain together?
A: There is no specific limit to the number of steps in a pipeline; only the available memory limits what you can do (and pipeline definitions do not take up a lot of memory).

Q: What kind of file formats are supported?
A: It depends. Filter events are generated only for files associated with filters, so to use a step that takes filter events, you must have a filter that supports the given input file format. Some other steps act directly on the raw documents; often the only requirement for those is that the input be a text-based file.