Okapi Framework - Developer's Guide
Pipelines
IMPORTANT: This page may not reflect the latest changes made to the pipeline mechanism since the previous release. The pipeline mechanism is under development and its API may change from release to release.
Pipelines are a powerful mechanism to apply a sequence of actions to an input document (or a set of them). They allow you to construct processes customized to specific projects very easily, re-using the same components. For example, many tasks can be broken down into these main parts:
Extract the text >> Apply some changes to the text >> Merge back the modified text into its original format.
With the framework, this type of sequence is implemented using the following interfaces:
IFilter >> IPipelineStep >> IFilterWriter
The pipeline is the glue that puts these parts together and allows you to include as many steps as you need.
A pipeline is represented by the IPipeline interface. The framework offers several implementations of it; the simplest is the Pipeline class.
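For orientation, a pipeline can in principle be driven by hand. The sketch below uses the addStep(), startBatch(), process() and endBatch() methods of IPipeline; as noted above, the API may change from release to release, so treat this as a sketch and verify the methods against the Javadoc of your release:

// Minimal hand-driven pipeline (sketch only)
IPipeline pipeline = new Pipeline();
pipeline.addStep(new RawDocumentToFilterEventsStep());
pipeline.addStep(new FilterEventsWriterStep());
pipeline.startBatch();
// rawDoc would be a RawDocument describing one input file. Note that
// steps creating filters also need a filter configuration mapper
// (discussed below), which a pipeline driver sets up for you.
pipeline.process(rawDoc);
pipeline.endBatch();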
In practice, the easiest way to set up and execute a pipeline is to use a pipeline driver. It is represented by the IPipelineDriver interface and provides all you need to process one or more input documents through a pipeline.
The first step is to create the driver:
// Create a pipeline driver
IPipelineDriver driver = new PipelineDriver();
The next step is to add the different steps you want in the pipeline. In this example we are going to simply extract the translatable text from the original format and re-write it back. These two operations are very common and have corresponding steps already coded for you: the RawDocumentToFilterEventsStep class and the FilterEventsWriterStep class.
// Add the filter step
driver.addStep(new RawDocumentToFilterEventsStep());
// Add the filter writer step
driver.addStep(new FilterEventsWriterStep());
Because our pipeline uses a filter, we need a way to know which filter to use with which input document. This is done through two settings: an IFilterConfigurationMapper needs to be set in the pipeline context, so any step that needs to create a filter for an input document can use it to look up the filter configuration ID of the document and retrieve the filter and the filter's parameters to use; and each input document must carry its filter configuration ID, as shown later when the batch items are created.
Each calling application can provide its own implementation of IFilterConfigurationMapper or, as in this example, directly use the one provided with the library: the FilterConfigurationMapper class.
// Create the filter configuration mapper
IFilterConfigurationMapper fcMapper = new FilterConfigurationMapper();
In a real application, you would use some kind of discovery mechanism to get the different filters available to you and add their default configurations to the mapper. But, if needed, you can also easily hard-code this:
// Fill the mapper with the default configurations of a few filters
fcMapper.addConfigurations("net.sf.okapi.filters.html.HtmlFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.openoffice.OpenOfficeFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.properties.PropertiesFilter");
fcMapper.addConfigurations("net.sf.okapi.filters.xml.XMLFilter");
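Alternatively, a very simple discovery mechanism could read the filter class names from a configuration file. The file name and format below are hypothetical:

// Hypothetical discovery: read one fully-qualified filter class
// name per line from a plain-text file and register each of them
BufferedReader reader = new BufferedReader(new FileReader("filters.txt"));
String line;
while ( (line = reader.readLine()) != null ) {
    line = line.trim();
    if ( line.length() > 0 ) fcMapper.addConfigurations(line);
}
reader.close();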
The last task is to associate the mapper with the pipeline:
// Set the filter configuration mapper
driver.setFilterConfigurationMapper(fcMapper);
Now the driver is all set up to process your documents. Executing the pipeline for a given input document is done by first providing the document and its parameters, and then by invoking the driver.
Usually we have more than one document to process. A set of input documents is called a batch, and both the driver and the pipeline are designed to work with batches. A batch item corresponds to the input for a single execution of the pipeline. It is usually made of a single input document, but some steps may require several input documents per batch item. For example, a step that performs an alignment between a source document and its translation needs two input documents for each batch item (see the sketch after the next paragraph).
Feeding the batch items to the pipeline is done through the IBatchItemContext interface. It provides the methods to access the parameters of one or more input documents per batch item. One important advantage of using an interface for this is that your application can store its input data any way it wants, and can expose it the way the pipeline expects simply by implementing that interface.
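As a sketch of the two-documents-per-item case mentioned above, the BatchItemContext implementation provided with the library can hold more than one input document. The constructor and add() method used here are assumptions to check against the Javadoc of your release:

// A batch item holding a source document and its translation,
// e.g. for a hypothetical alignment step (file names are made up,
// and the BatchItemContext signatures are assumptions)
RawDocument srcDoc = new RawDocument(
    (new File("myFile.html")).toURI(), "UTF-8", LocaleId.fromString("en"));
srcDoc.setFilterConfigId("okf_html");
RawDocument trgDoc = new RawDocument(
    (new File("myFile_fr.html")).toURI(), "UTF-8", LocaleId.fromString("fr"));
trgDoc.setFilterConfigId("okf_html");
BatchItemContext item = new BatchItemContext(
    srcDoc, (new File("myFile.aligned.html")).toURI(), "UTF-8");
item.add(trgDoc, null, null); // Second input document of the batch item
driver.addBatchItem(item);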
The driver offers several variations of the IPipelineDriver.addBatchItem() method to facilitate the creation of the batch. In our case, the pipeline needs one input document per batch item, along with its corresponding output parameters. We can use the following code to add one batch item to the driver:
// Add one batch item to the batch
driver.addBatchItem(new BatchItemContext(
    (new File("myFile.html")).toURI(),     // URI of the input document
    "UTF-8",                               // Default encoding
    "okf_html",                            // Filter configuration of the document
    (new File("myFile.out.html")).toURI(), // Output
    "UTF-8",                               // Encoding for the output
    LocaleId.fromString("en"),             // Source locale
    LocaleId.fromString("fr")              // Target locale
));
We are now ready to execute the pipeline for the given input document. This is done in one call:
// Execute the pipeline for all batch items
driver.processBatch();
When this process is done you should have a new document, myFile.out.html, that is a copy of myFile.html with possibly some small modifications, such as the language declarations changed from en to fr.
Note that you can also run the exact same pipeline on input documents that are in different file formats, as long as you provide the proper filter configuration ID with each one.
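For instance, using the same BatchItemContext constructor as above, a mixed-format batch could look like this (file names are just for illustration):

// Two batch items in different formats, each with its own
// filter configuration ID
driver.addBatchItem(new BatchItemContext(
    (new File("page.html")).toURI(), "UTF-8", "okf_html",
    (new File("page.out.html")).toURI(), "UTF-8",
    LocaleId.fromString("en"), LocaleId.fromString("fr")));
driver.addBatchItem(new BatchItemContext(
    (new File("strings.properties")).toURI(), "UTF-8", "okf_properties",
    (new File("strings.out.properties")).toURI(), "UTF-8",
    LocaleId.fromString("en"), LocaleId.fromString("fr")));
// Both items go through the exact same pipeline
driver.processBatch();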
Now we want to modify the pipeline we created in the previous section so it does something more meaningful than rewriting the input documents. We can do this by adding an extra step between the two we currently have. This step will receive filter events from the first step and send them down to the next step, which will write the output file. The only thing we have to do is write the part that modifies the extracted text we get through the TEXT_UNIT events. Let's create a step that pseudo-translates the extracted text.
This requires creating a new class that implements the IPipelineStep interface. The framework makes things easy by providing the BasePipelineStep class, which you can use to derive your own steps.
There are only a few methods we need to override:
The IPipelineStep.getName() method should return the name of the step. This name is localizable and is used by other applications when they need to associate the step with a visual label. It should be short and descriptive. For example: "Pseudo-Translation".
The IPipelineStep.getDescription() method should return a brief description of what the step does. This text is localizable and is used by other applications when they need to associate the step with a short description. It should be one or two short descriptive sentences. For example: "Pseudo-translates text units content."
Then we need to override whichever event handler methods the step requires. In our case we just need to override one: BasePipelineStep.handleTextUnit().
The code below shows our new step class. It intercepts the TEXT_UNIT events and performs a simple pseudo-translation by replacing some ASCII characters with accented versions of the same characters, so the text "A goose quill is more dangerous than a lion's claw" becomes "A gõõsè qüìll ìs mõrè ðåñgèrõüs thåñ å lìõñ's çlåw".
In order to create the target text in the text unit, the class needs to know what the target language is. A pipeline step publishes the runtime parameters it needs using the standard JavaBean pattern, along with a special Java annotation. In our case, we declare a setTargetLocale() method. The pipeline driver will introspect the steps and provide the proper parameters from the IBatchItemContext interface.
The other parts of the code deal with changing the text unit content itself. See the section Working with Text Units for more details on how to modify text units.
public class PseudoTranslateStep extends BasePipelineStep {

    private static final String OLDCHARS = "aeiouycdn";
    private static final String NEWCHARS = "\u00e5\u00e8\u00ec\u00f5\u00fc\u00ff\u00e7\u00f0\u00f1";

    private LocaleId trgLoc;

    @StepParameterMapping(parameterType = StepParameterType.TARGET_LOCALE)
    public void setTargetLocale (LocaleId targetLocale) {
        trgLoc = targetLocale;
    }

    public String getName () {
        return "Pseudo-Translation";
    }

    public String getDescription () {
        return "Pseudo-translates text units content.";
    }

    protected void handleTextUnit (Event event) {
        TextUnit tu = (TextUnit)event.getResource();
        // Skip text units that are not translatable
        if ( !tu.isTranslatable() ) return;
        // Get or create the target content, copying the source if needed
        TextFragment tf = tu.createTarget(trgLoc, false, IResource.COPY_CONTENT);
        StringBuilder text = new StringBuilder(tf.getCodedText());
        int n;
        for ( int i=0; i<text.length(); i++ ) {
            if ( TextFragment.isMarker(text.charAt(i)) ) {
                i++; // Skip the pair of special characters for inline codes
            }
            else {
                // Replace plain characters with their accented counterparts
                if ( (n = OLDCHARS.indexOf(text.charAt(i))) > -1 ) {
                    text.setCharAt(i, NEWCHARS.charAt(n));
                }
            }
        }
        tf.setCodedText(text.toString());
    }
}
Once we have created our new class, we simply need to add it between the input and output steps of our previous code:
// Add the filter step
driver.addStep(new RawDocumentToFilterEventsStep());
// Add the pseudo-translation step
driver.addStep(new PseudoTranslateStep());
// Add the filter writer step
driver.addStep(new FilterEventsWriterStep());
At first it may seem more complicated to create a new class for each new step instead of working directly in a single class, but the benefits are important: each step defined as a separate class can easily be re-used in different processes.
You should see each step as a component independent of everything else. It should not be filter-specific and should avoid using global parameters. It should, most of the time, not expect to be before or after another specific step. It should also be aware of inline codes, as well as the translate and preserve-whitespace information attached to each text unit. The TextUnit class provides plenty of information you can query: TextUnit.getType(), TextUnit.getName(), TextUnit.getMimeType(), TextUnit.getAnnotation(), TextUnit.getProperty(), etc. Make use of them to drive the different actions performed on the extracted text.
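For example, a handler can query this information to decide whether to touch a given text unit at all. In this small sketch the MIME type value is hypothetical, and the preserveWhitespaces() accessor is assumed; check the TextUnit Javadoc of your release:

protected void handleTextUnit (Event event) {
    TextUnit tu = (TextUnit)event.getResource();
    // Respect the translate information attached to the text unit
    if ( !tu.isTranslatable() ) return;
    // Hypothetical rule: leave text units of a given MIME type alone
    if ( "text/x-sample".equals(tu.getMimeType()) ) return;
    // The preserve-whitespace information can drive formatting choices
    if ( !tu.preserveWhitespaces() ) {
        // ...safe to normalize whitespace here...
    }
    // ...perform the actual modifications...
}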
When a pipeline is executed, the following sequence of events is dispatched:

START_BATCH - The batch starts. This is the opportunity for the steps to initialize themselves as needed.

START_BATCH_ITEM - A new batch item starts. This is the opportunity for the steps to perform any initialization that depends on each batch item.

RAW_DOCUMENT - This is the normal way to start the pipeline. From here, the events sent down the pipeline depend on each step. Some steps, like RawDocumentToFilterEventsStep, may send filter events until the document is completely parsed. Some may take the RAW_DOCUMENT event, modify the document, and send a new RAW_DOCUMENT event down the pipeline. Others, like FilterEventsToRawDocument, may take filter events and convert them into a single RAW_DOCUMENT event sent to the next step.

CUSTOM, CANCEL and NO_OP events may be received at any time. All steps must be capable of handling any event. If a step does not know what to do with a given event, it should simply pass it on without any modification.

END_BATCH_ITEM - The current batch item is done. This is the time to perform any task that works at the batch-item level. For example, a word-counting step would now compute the total word count for all text units found in the input document of this batch item.

END_BATCH - The last batch item of this batch was done; we are ending the batch. This is the time for the steps to trigger any task that works at the batch level. For example, a word-counting step would now compute the total word count for all the input documents of this batch.
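To illustrate batch-level and batch-item-level tasks, here is a sketch of a step that counts text units. It assumes BasePipelineStep exposes handler methods named after each event, following the same pattern as the handleTextUnit() method used earlier; verify the exact names against your release:

public class CountingStep extends BasePipelineStep {

    private int batchCount;
    private int itemCount;

    public String getName () {
        return "Counting";
    }

    public String getDescription () {
        return "Counts text units per batch item and per batch.";
    }

    protected void handleStartBatch (Event event) {
        batchCount = 0; // Initialize when the batch starts
    }

    protected void handleStartBatchItem (Event event) {
        itemCount = 0; // Re-initialize for each batch item
    }

    protected void handleTextUnit (Event event) {
        itemCount++; // Count each extracted text unit
    }

    protected void handleEndBatchItem (Event event) {
        batchCount += itemCount; // Batch-item-level result
    }

    protected void handleEndBatch (Event event) {
        // Batch-level result, available once all items are done
        System.out.println("Text units in this batch: " + batchCount);
    }
}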