Okapi Framework - Developer's Guide
Pipelines
IMPORTANT: This page may not reflect the latest changes made to the pipeline mechanism since the previous release. The pipeline mechanism is under development and its API may change from release to release.
Pipelines are a powerful mechanism to apply a sequence of actions to an input document (or a set of them). They allow you to construct processes customized to specific projects very easily, re-using the same components. For example, many tasks can be broken down into these main parts:
Extract the text >> Apply some changes to the text >> Merge back the modified text into its original format.
With the framework, this type of sequence is implemented using the following interfaces:
IFilter >> IPipelineStep >> IFilterWriter
The pipeline is the glue that puts these parts together and allows you to include as many steps as you need.
A pipeline is represented by the IPipeline interface. The framework offers several implementations of it; the simplest is the Pipeline class.
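For orientation, a pipeline can in principle be driven by hand. The sketch below uses the addStep(), startBatch(), process() and endBatch() methods of IPipeline; as noted above, the API may change from release to release, so treat this as a sketch and verify the methods against the Javadoc of your release:

// Minimal hand-driven pipeline (sketch only)
IPipeline pipeline = new Pipeline();
pipeline.addStep(new RawDocumentToFilterEventsStep());
pipeline.addStep(new FilterEventsWriterStep());
pipeline.startBatch();
// rawDoc would be a RawDocument describing one input file. Note that
// steps creating filters also need a filter configuration mapper
// (discussed below), which a pipeline driver sets up for you.
pipeline.process(rawDoc);
pipeline.endBatch();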
In practice, the easiest way to set up and execute a pipeline is to use a pipeline driver. It is represented by the IPipelineDriver interface and provides all you need to process one or more input documents through a pipeline.
The first step is to create the driver:
// Create a pipeline driver
IPipelineDriver driver = new PipelineDriver();
The next step is to add the different steps you want in the pipeline. In this example we are going to simply extract the translatable text from the original format and re-write it back. These two operations are very common and have corresponding steps already coded for you: the RawDocumentToFilterEventsStep class and the FilterEventsWriterStep class.
// Add the filter step
driver.addStep(new RawDocumentToFilterEventsStep());
// Add the filter writer step
driver.addStep(new FilterEventsWriterStep());
Because our pipeline uses a filter, we need a way to know which filter to use with which input document. This is done through two settings: an IFilterConfigurationMapper needs to be set in the pipeline context, so any step that needs to create a filter for an input document can use it to look up the filter configuration ID of the document and retrieve the filter and the filter's parameters to use; and each input document must carry its filter configuration ID, as shown later when the batch items are created.
Each calling application can provide its own implementation of IFilterConfigurationMapper or, as in this example, directly use the one provided with the library: the FilterConfigurationMapper class.
// Create the filter configuration mapper
IFilterConfigurationMapper fcMapper = new FilterConfigurationMapper();
In a real application, you would use some kind of discovery mechanism to get the different filters available to you and add their default configurations to the mapper. But, if needed, you can also easily hard-code this:
// Fill the mapper with the default configurations of a few filters
fcMapper.addConfigurations("net.sf.okapi.filters.html.HtmlFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.openoffice.OpenOfficeFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.properties.PropertiesFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.xml.XMLFilter");
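Alternatively, a very simple discovery mechanism could read the filter class names from a configuration file. The file name and format below are hypothetical:

// Hypothetical discovery: read one fully-qualified filter class
// name per line from a plain-text file and register each of them
BufferedReader reader = new BufferedReader(new FileReader("filters.txt"));
String line;
while ( (line = reader.readLine()) != null ) {
    line = line.trim();
    if ( line.length() > 0 ) fcMapper.addConfigurations(line);
}
reader.close();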
The last task is to associate the mapper with the pipeline:
// Set the filter configuration mapper
driver.setFilterConfigurationMapper(fcMapper);
Now the driver is all set up to process your documents. Executing the pipeline for a given input document is done by first providing the document and its parameters, and then by invoking the driver.
Usually we have more than one document to process. A set of input documents is called a batch, and both the driver and the pipeline are designed to work with batches. A batch item corresponds to the input for a single execution of the pipeline. It is usually made of a single input document, but some steps may require several input documents per batch item. For example, a step that performs an alignment between a source document and its translation needs two input documents for each batch item (see the sketch after the next paragraph).
Feeding the batch items to the pipeline is done through the IBatchItemContext interface. It provides the methods to access the parameters of one or more input documents per batch item. One important advantage of using an interface for this is that your application can store its input data any way it wants, and can expose it the way the pipeline expects simply by implementing that interface.
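As a sketch of the two-documents-per-item case mentioned above, the BatchItemContext implementation provided with the library can hold more than one input document. The constructor and add() method used here are assumptions to check against the Javadoc of your release:

// A batch item holding a source document and its translation,
// e.g. for a hypothetical alignment step (file names are made up,
// and the BatchItemContext signatures are assumptions)
RawDocument srcDoc = new RawDocument(
    (new File("myFile.html")).toURI(), "UTF-8", LocaleId.fromString("en"));
srcDoc.setFilterConfigId("okf_html");
RawDocument trgDoc = new RawDocument(
    (new File("myFile_fr.html")).toURI(), "UTF-8", LocaleId.fromString("fr"));
trgDoc.setFilterConfigId("okf_html");
BatchItemContext item = new BatchItemContext(
    srcDoc, (new File("myFile.aligned.html")).toURI(), "UTF-8");
item.add(trgDoc, null, null); // Second input document of the batch item
driver.addBatchItem(item);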
The driver offers several variations of the IPipelineDriver.addBatchItem() method to facilitate the creation of the batch. In our case, the pipeline needs one input document per batch item, along with its corresponding output parameters. We can use the following code to add one batch item to the driver:
// Add one batch item to the batch
driver.addBatchItem(new BatchItemContext(
    (new File("myFile.html")).toURI(),     // URI of the input document
    "UTF-8",                               // Default encoding
    "okf_html",                            // Filter configuration of the document
    (new File("myFile.out.html")).toURI(), // Output
    "UTF-8",                               // Encoding for the output
    LocaleId.fromString("en"),             // Source locale
    LocaleId.fromString("fr")              // Target locale
));
We are now ready to execute the pipeline for the given input document. This is done in one call:
// Execute the pipeline for all batch items
driver.processBatch();
When this process is done you should have a new document, myFile.out.html, that is a copy of myFile.html with possibly some small modifications, such as the language declarations changed from en to fr.
Note that you can also run the exact same pipeline on input documents that are in different file formats, as long as you provide the proper filter configuration ID with each one.
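For instance, using the same BatchItemContext constructor as above, a mixed-format batch could look like this (file names are just for illustration):

// Two batch items in different formats, each with its own
// filter configuration ID
driver.addBatchItem(new BatchItemContext(
    (new File("page.html")).toURI(), "UTF-8", "okf_html",
    (new File("page.out.html")).toURI(), "UTF-8",
    LocaleId.fromString("en"), LocaleId.fromString("fr")));
driver.addBatchItem(new BatchItemContext(
    (new File("strings.properties")).toURI(), "UTF-8", "okf_properties",
    (new File("strings.out.properties")).toURI(), "UTF-8",
    LocaleId.fromString("en"), LocaleId.fromString("fr")));
// Both items go through the exact same pipeline
driver.processBatch();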
Now we want to modify the pipeline we created in the previous section so it does something more meaningful than rewriting the input documents. We can do this by adding an extra step between the two we currently have. This step will receive filter events from the first step and send them down to the next step, which will write the output file. The only thing we have to do is write the part that modifies the extracted text we get through the TEXT_UNIT events. Let's create a step that pseudo-translates the extracted text.
This requires creating a new class that implements the IPipelineStep interface. The framework makes things easy by providing the BasePipelineStep class, which you can use to derive your own steps.
There are only a few methods we need to override:
The IPipelineStep.getName() method should return the name of the step. This name is localizable and is used by other applications when they need to associate the step with a visual label. It should be short and descriptive. For example: "Pseudo-Translation".
The IPipelineStep.getDescription() method should return a brief description of what the step does. This text is localizable and is used by other applications when they need to associate the step with a short description. It should be one or two short descriptive sentences. For example: "Pseudo-translates text units content."
Then we need to override whichever event handler methods the step requires. In our case we just need to override one: BasePipelineStep.handleTextUnit().
The code below shows our new step class. It intercepts the TEXT_UNIT events and performs a simple pseudo-translation by replacing some ASCII characters with accented versions of the same characters, so the text "A goose quill is more dangerous than a lion's claw" becomes "A gõõsè qüìll ìs mõrè ðåñgèrõüs thåñ å lìõñ's çlåw".
In order to create the target text in the text unit, the class needs to know what the target language is. A pipeline step publishes the runtime parameters it needs using the standard JavaBean pattern, along with a special Java annotation. In our case, we declare a setTargetLocale() method. The pipeline driver will introspect the steps and provide the proper parameters from the IBatchItemContext interface.
The other parts of the code deal with changing the text unit content itself. See the section Working with Text Units for more details on how to modify text units.
public class PseudoTranslateStep extends BasePipelineStep {

    private static final String OLDCHARS = "aeiouycdn";
    private static final String NEWCHARS = "\u00e5\u00e8\u00ec\u00f5\u00fc\u00ff\u00e7\u00f0\u00f1";

    private LocaleId trgLoc;

    @StepParameterMapping(parameterType = StepParameterType.TARGET_LOCALE)
    public void setTargetLocale (LocaleId targetLocale) {
        trgLoc = targetLocale;
    }

    public String getName () {
        return "Pseudo-Translation";
    }

    public String getDescription () {
        return "Pseudo-translates text units content.";
    }

    protected void handleTextUnit (Event event) {
        TextUnit tu = (TextUnit)event.getResource();
        // Skip text units that are not translatable
        if ( !tu.isTranslatable() ) return;
        // Get or create the target content, copying the source if needed
        TextFragment tf = tu.createTarget(trgLoc, false, IResource.COPY_CONTENT);
        StringBuilder text = new StringBuilder(tf.getCodedText());
        int n;
        for ( int i=0; i<text.length(); i++ ) {
            if ( TextFragment.isMarker(text.charAt(i)) ) {
                i++; // Skip the pair of special characters for inline codes
            }
            else {
                // Replace plain characters with their accented counterparts
                if ( (n = OLDCHARS.indexOf(text.charAt(i))) > -1 ) {
                    text.setCharAt(i, NEWCHARS.charAt(n));
                }
            }
        }
        tf.setCodedText(text.toString());
    }
}
Once we have created our new class, we simply need to add it between the input and output steps of our previous code:
// Add the filter step
driver.addStep(new RawDocumentToFilterEventsStep());
// Add the pseudo-translation step
driver.addStep(new PseudoTranslateStep());
// Add the filter writer step
driver.addStep(new FilterEventsWriterStep());
At first it may seem more complicated to create a new class for each new step instead of working directly in a single class, but the benefits are important: each step defined as a separate class can easily be re-used in different processes.
You should see each step as a component independent of everything else. It should not be filter-specific and should avoid using global parameters. It should, most of the time, not expect to be before or after another specific step. It should also be aware of inline codes, as well as the translate and preserve-whitespace information attached to each text unit. The TextUnit class provides plenty of information you can query: TextUnit.getType(), TextUnit.getName(), TextUnit.getMimeType(), TextUnit.getAnnotation(), TextUnit.getProperty(), etc. Make use of them to drive the different actions performed on the extracted text.
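For example, a handler can query this information to decide whether to touch a given text unit at all. In this small sketch the MIME type value is hypothetical, and the preserveWhitespaces() accessor is assumed; check the TextUnit Javadoc of your release:

protected void handleTextUnit (Event event) {
    TextUnit tu = (TextUnit)event.getResource();
    // Respect the translate information attached to the text unit
    if ( !tu.isTranslatable() ) return;
    // Hypothetical rule: leave text units of a given MIME type alone
    if ( "text/x-sample".equals(tu.getMimeType()) ) return;
    // The preserve-whitespace information can drive formatting choices
    if ( !tu.preserveWhitespaces() ) {
        // ...safe to normalize whitespace here...
    }
    // ...perform the actual modifications...
}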
When a pipeline is executed, the following sequence of events is dispatched:

START_BATCH - The batch starts. This is the opportunity for the steps to initialize themselves as needed.

START_BATCH_ITEM - A new batch item starts. This is the opportunity for the steps to perform any initialization that depends on each batch item.

RAW_DOCUMENT - This is the normal way to start the pipeline. From here, the events sent down the pipeline depend on each step. Some steps, like RawDocumentToFilterEventsStep, may send filter events until the document is completely parsed. Some may take the RAW_DOCUMENT event, modify the document, and send a new RAW_DOCUMENT event down the pipeline. Others, like FilterEventsToRawDocument, may take filter events and convert them into a single RAW_DOCUMENT event sent to the next step.

CUSTOM, CANCEL and NO_OP events may be received at any time. All steps must be capable of handling any event. If a step does not know what to do with a given event, it should simply pass it on without any modification.

END_BATCH_ITEM - The current batch item is done. This is the time to perform any task that works at the batch-item level. For example, a word-counting step would now compute the total word count for all text units found in the input document of this batch item.

END_BATCH - The last batch item of this batch was done; we are ending the batch. This is the time for the steps to trigger any task that works at the batch level. For example, a word-counting step would now compute the total word count for all the input documents of this batch.
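To illustrate batch-level and batch-item-level tasks, here is a sketch of a step that counts text units. It assumes BasePipelineStep exposes handler methods named after each event, following the same pattern as the handleTextUnit() method used earlier; verify the exact names against your release:

public class CountingStep extends BasePipelineStep {

    private int batchCount;
    private int itemCount;

    public String getName () {
        return "Counting";
    }

    public String getDescription () {
        return "Counts text units per batch item and per batch.";
    }

    protected void handleStartBatch (Event event) {
        batchCount = 0; // Initialize when the batch starts
    }

    protected void handleStartBatchItem (Event event) {
        itemCount = 0; // Re-initialize for each batch item
    }

    protected void handleTextUnit (Event event) {
        itemCount++; // Count each extracted text unit
    }

    protected void handleEndBatchItem (Event event) {
        batchCount += itemCount; // Batch-item-level result
    }

    protected void handleEndBatch (Event event) {
        // Batch-level result, available once all items are done
        System.out.println("Text units in this batch: " + batchCount);
    }
}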