Sunday, October 5, 2014

Apache UIMA in a Nutshell

I had to look at the Apache UIMA project at my workplace just to get an idea of what it is and what it really does. But I had to go through the whole UIMA documentation to understand exactly what it is.
So I thought of summarising what I learnt here.

In general, each UIMA component defined in this framework consists of two parts: a descriptor XML and a Java class. The descriptor XML defines the component's configuration: configuration parameters, inputs, outputs, etc. The Java class contains the implementation of the component and is referenced by the descriptor XML.

The UIMA framework defines a type system that every component works with. It also defines a data structure called the Common Analysis Structure (CAS), which is the central data structure through which all UIMA components communicate. There is a native Java interface to the CAS called the JCas. The JCas represents each feature structure as a Java object; for example, a feature structure that identifies persons in a paragraph would be an instance of a Java class Person with getFullName() and setFullName() methods.
So when implementing UIMA components, we have to define the CAS feature structure types that they work with. This is also done via an XML file, called the Type System Descriptor. Java sources for these CAS types can be generated with a tool called JCasGen, and the generated classes can then be used in UIMA component implementations.
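
To make the Person example a bit more concrete, here is a minimal sketch of what such a type looks like from the Java side. It is plain Java and does not use UIMA at all: a class generated by JCasGen would typically extend a UIMA base type and keep its features in the CAS rather than in fields. The class name PersonTypeSketch, the annotate() helper and the getCoveredText(String) signature are my own assumptions for illustration.

```java
// Simplified model only: NOT the code that JCasGen generates.
class PersonTypeSketch {

    // Hypothetical stand-in for a generated Person type.
    static class Person {
        private final int begin;   // offset of the first character of the mention
        private final int end;     // offset just past the last character
        private String fullName;   // the single feature declared for this type

        Person(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        int getBegin() { return begin; }
        int getEnd() { return end; }
        String getFullName() { return fullName; }
        void setFullName(String fullName) { this.fullName = fullName; }

        // Text of the document that this annotation spans.
        String getCoveredText(String documentText) {
            return documentText.substring(begin, end);
        }
    }

    // What an annotator might do on finding a person mention in the text.
    static Person annotate(String documentText, String name) {
        int begin = documentText.indexOf(name);
        if (begin < 0) {
            return null; // the name does not occur in the document
        }
        Person person = new Person(begin, begin + name.length());
        person.setFullName(name);
        return person;
    }

    public static void main(String[] args) {
        String text = "Yesterday John Smith visited the lab.";
        Person person = annotate(text, "John Smith");
        System.out.println(person.getFullName() + " @ " + person.getBegin() + "-" + person.getEnd());
    }
}
```

The point is only that downstream components read and write typed objects with getters and setters instead of parsing raw text again.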

Below I describe each of the UIMA components that the framework provides interfaces for.

Analysis Engine (AE)

An AE is a component that analyzes artifacts (e.g. documents) and extracts information from them. AEs are constructed from building blocks called Annotators. An annotator is a component that contains analysis logic. Annotators analyze an artifact (for example, a text document) and create additional data (metadata) about that artifact. An AE may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE). An aggregate AE can be defined so that it acts as a pipeline, i.e. annotators can be configured to take input from the previous one and provide output to the next. Aggregate AEs can be managed in a more advanced manner with Flow Controllers, which I describe later in this post.
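
The primitive/aggregate idea can be sketched in a few lines of plain Java. This is only a model of the concept, not the UIMA API: Document, Annotator, TokenAnnotator, CapitalizedAnnotator and Aggregate are all names I made up, and a real aggregate is wired together in its descriptor XML rather than in code like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only: these types are hypothetical and are NOT the UIMA API.
class AggregateSketch {

    // Stand-in for the CAS: the document text plus the metadata added so far.
    static class Document {
        final String text;
        final List<String> annotations = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    // Stand-in for an annotator: reads the document and adds metadata to it.
    interface Annotator {
        void process(Document doc);
    }

    // Primitive step 1: records one "token" annotation per whitespace-separated word.
    static class TokenAnnotator implements Annotator {
        public void process(Document doc) {
            for (String word : doc.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    doc.annotations.add("token:" + word);
                }
            }
        }
    }

    // Primitive step 2: builds on the previous step's output and marks capitalized tokens.
    static class CapitalizedAnnotator implements Annotator {
        private static final String PREFIX = "token:";
        public void process(Document doc) {
            List<String> found = new ArrayList<>();
            for (String annotation : doc.annotations) {
                if (annotation.startsWith(PREFIX)
                        && Character.isUpperCase(annotation.charAt(PREFIX.length()))) {
                    found.add("capitalized:" + annotation.substring(PREFIX.length()));
                }
            }
            doc.annotations.addAll(found);
        }
    }

    // Stand-in for an aggregate: runs its delegates in a fixed order.
    static class Aggregate implements Annotator {
        private final List<Annotator> delegates;
        Aggregate(List<Annotator> delegates) { this.delegates = delegates; }
        public void process(Document doc) {
            for (Annotator delegate : delegates) {
                delegate.process(doc);
            }
        }
    }

    static List<String> analyze(String text) {
        Document doc = new Document(text);
        Annotator pipeline = new Aggregate(
                Arrays.<Annotator>asList(new TokenAnnotator(), new CapitalizedAnnotator()));
        pipeline.process(doc);
        return doc.annotations;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Alice met bob"));
    }
}
```

Note that the aggregate itself implements the same interface as its delegates, which is why an aggregate can in turn be nested inside another aggregate.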

Collection Reader

A Collection Reader is responsible for obtaining documents from the collection and returning each document as a CAS. We can write our own Collection Readers by implementing the UIMA framework interface for Collection Readers and configuring it with a descriptor XML.
By implementing this interface we read each artifact and build a CAS from it for the CPE to process. The Collection Reader provides the ability to iterate over a collection of artifacts and build a CAS for each, one at a time.
It can also return progress information, such as how much has been read so far and how much remains to be read.
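
Here is a minimal plain-Java sketch of that iterate-and-report-progress contract. As far as I remember, the method names hasNext(), getNext() and getProgress() mirror those of UIMA's Collection Reader interface, but the signatures here are simplified assumptions (a real reader fills a CAS passed in by the framework and reports progress through dedicated objects), and InMemoryReader and readAll() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified model only: these types are hypothetical and are NOT the UIMA API.
class ReaderSketch {

    // Stand-in for a Collection Reader over an in-memory collection.
    static class InMemoryReader {
        private final List<String> documents;
        private int position = 0;

        InMemoryReader(List<String> documents) { this.documents = documents; }

        boolean hasNext() { return position < documents.size(); }

        // Returns the next document; a real reader would fill a CAS instead.
        String getNext() {
            if (!hasNext()) {
                throw new NoSuchElementException("collection exhausted");
            }
            return documents.get(position++);
        }

        // Progress as "documents read so far / total documents".
        String getProgress() { return position + "/" + documents.size(); }
    }

    // Drains the reader and records the progress reported after each document.
    static List<String> readAll(InMemoryReader reader) {
        List<String> progress = new ArrayList<>();
        while (reader.hasNext()) {
            reader.getNext();
            progress.add(reader.getProgress());
        }
        return progress;
    }

    public static void main(String[] args) {
        InMemoryReader reader =
                new InMemoryReader(Arrays.asList("doc one", "doc two", "doc three"));
        System.out.println(readAll(reader));
    }
}
```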

CAS Consumer

A CAS Consumer receives each CAS after it has been analyzed by the Analysis Engine. CAS Consumers typically extract data from the CAS and persist selected information to aggregate data structures such as search engine indexes or databases.
However, in UIMA version 2 an AE itself can be used to persist analyzed information. Both the CasConsumer and AE interfaces also define batch-level and collection-level processing methods.

Flow Controller

A Flow Controller is a component that plugs into an Aggregate Analysis Engine. When a CAS is input to the Aggregate, the Flow Controller determines the order in which the components of that aggregate are invoked on that CAS.
Flow Controllers may decide the flow dynamically, based on the contents of the CAS. For example, we can develop a Flow Controller that first sends each CAS to a Language Identification Annotator and then, based on the output of that annotator, routes the CAS to an Annotator that is specialized for that particular language.
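
The language-routing example can be sketched as follows. Again this is a plain-Java model and not the UIMA Flow Controller API: the component keys ("languageId", "germanAnnotator", "englishAnnotator"), the feature names and the toy language detection (a handful of German function words) are all assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified model only: these types are hypothetical and are NOT the UIMA API.
class FlowSketch {

    // Stand-in for the CAS: text, a feature map, and a trace of visited components.
    static class Document {
        final String text;
        final Map<String, String> features = new HashMap<>();
        final List<String> trace = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    // Toy language identification: a few German function words mean "de", otherwise "en".
    static void identifyLanguage(Document doc) {
        List<String> germanHints = Arrays.asList("der", "die", "das", "und");
        String language = "en";
        for (String word : doc.text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (germanHints.contains(word)) {
                language = "de";
                break;
            }
        }
        doc.features.put("language", language);
    }

    // The flow decision: inspects the document and names the next component,
    // or returns null when nothing is left to do.
    static String nextComponent(Document doc) {
        if (!doc.features.containsKey("language")) {
            return "languageId";
        }
        if (!doc.features.containsKey("analyzedBy")) {
            return "de".equals(doc.features.get("language")) ? "germanAnnotator" : "englishAnnotator";
        }
        return null;
    }

    // Runs components in the order the flow decision dictates and returns the route taken.
    static List<String> route(String text) {
        Document doc = new Document(text);
        String next;
        while ((next = nextComponent(doc)) != null) {
            doc.trace.add(next);
            if (next.equals("languageId")) {
                identifyLanguage(doc);
            } else {
                doc.features.put("analyzedBy", next); // language-specific work would go here
            }
        }
        return doc.trace;
    }

    public static void main(String[] args) {
        System.out.println(route("Der Hund und die Katze"));
        System.out.println(route("The dog and the cat"));
    }
}
```

The important part is nextComponent(): the decision is taken again after every step, using whatever the previous components have written into the document.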

Collection Processing Engine (CPE)

A CPE constructs a data flow among the UIMA components above. The Collection Reader, AE and CAS Consumer are the components of a CPE, and together they form a flow that reads a document and converts it to a CAS, processes it and adds the results to the CAS, and finally extracts the results from the CAS and persists them as required.
There is a Collection Processing Manager (CPM), which orchestrates the data flow within a CPE, monitors status, optionally manages the life-cycle of internal components and collects statistics. The CPE is configured via an XML descriptor.
There are three deployment modes for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:
  • Integrated (runs in the same Java instance as the CPM)
  • Managed (runs in a separate process on the same machine), and
  • Non-managed (runs in a separate process, perhaps on a different machine)
For both managed and non-managed CAS Processors, the CAS must be transmitted between separate processes and possibly between separate computers. This is accomplished using Vinci, a communication protocol used by the CPM that is provided as part of Apache UIMA.
The UIMA SDK also supports using unmanaged remote services via the SOAP communications protocol.
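
Leaving deployment aside, the read, analyze and consume flow described at the start of this section can be modelled in plain Java roughly like this. It is a sketch under my own assumptions and not the CPE API: Document, analyze(), consume() and run() are hypothetical names, the "analysis" is just a word count, and the in-memory map stands in for a search index or database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model only: these types are hypothetical and are NOT the UIMA API.
class CpeSketch {

    // Stand-in for the CAS: the document text plus one analysis result.
    static class Document {
        final String text;
        int wordCount = -1; // filled in by the analysis step
        Document(String text) { this.text = text; }
    }

    // Stand-in for the Analysis Engine: adds a result to the document.
    static void analyze(Document doc) {
        String trimmed = doc.text.trim();
        doc.wordCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Stand-in for the CAS Consumer: persists selected results to an aggregate structure.
    static void consume(Document doc, Map<String, Integer> index) {
        index.put(doc.text, doc.wordCount);
    }

    // The loop that the CPM would orchestrate: read, analyze, consume, one document at a time.
    static Map<String, Integer> run(List<String> collection) {
        Map<String, Integer> index = new TreeMap<>();
        for (String raw : collection) {
            Document doc = new Document(raw); // the reader's job: one document per item
            analyze(doc);
            consume(doc, index);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b c", "hello")));
    }
}
```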

UIMAFramework

When developing applications, we can instantiate any of the UIMA components described above from its descriptor XML via the UIMA framework.
We can also configure Collection Processing Engine descriptors programmatically; create and manage CAS instances; share CASes across multiple Analysis Engines; save CASes to the file system, etc.
The framework also supports CAS pools and enables multi-threaded application development.

Hope this summary is useful... :)
