Character Code Compilers

The Parabix Character Code Compilers

Basic Concepts

Character Class Bit Streams

Given an input stream of character code units, a ''character class'' bit stream is a stream of bits defined in one-to-one correspondence with the input stream such that one bits mark instances of character code units within the class, and zero bits mark instances of character code units outside the class.

For example, consider the ASCII character class expression [abc] standing for the class comprising ASCII bytes having code unit values 97 (ASCII value for a), 98 (ASCII value for b) or 99 (ASCII value for c). The following example shows the [abc] character class bit stream aligned with an example ASCII input stream.

input:  This is an example ASCII byte stream.  ASCII is an abbreviation.  
[abc]:  ........1....1...........1........1.............1..111....1.......

By convention, zero bits within a character class bit stream are marked with periods, so that the one bits (each marked with the digit 1) stand out.
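The correspondence between input positions and class bits can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `class_bit_stream` is hypothetical, and the per-byte loop shown here is not how Parabix computes the stream.

```python
def class_bit_stream(data: bytes, members: bytes) -> str:
    """Return one marker per code unit: '1' if it is in the class, '.' otherwise."""
    member_set = set(members)  # iterating over bytes yields integer code unit values
    return "".join("1" if unit in member_set else "." for unit in data)


text = b"This is an example ASCII byte stream.  ASCII is an abbreviation.  "
print("input:  " + text.decode("ascii"))
print("[abc]:  " + class_bit_stream(text, b"abc"))
```

The output stream always has exactly one marker per input code unit, which is what makes the two lines align.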

Character Class Expressions

Following traditional regular expression notation, character classes can be defined by listing individual characters and ranges of characters within square brackets. For example, the set of delimiter characters consisting of colon, period, comma, and semicolon may be denoted [:.,;], while the alphanumeric ASCII characters (any uppercase or lowercase ASCII letter or any ASCII digit) are denoted [A-Za-z0-9].
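To make the notation concrete, the following Python sketch expands a simple bracket expression into the set of values it denotes. The helper `parse_class` is hypothetical and handles only literal characters and lo-hi ranges; it is not the Parabix regular expression parser, and it ignores negation, escapes, and named classes.

```python
def parse_class(expr: str) -> set:
    """Expand a simple bracket expression such as "[A-Za-z0-9]" into code point values."""
    if len(expr) < 2 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected a bracketed class expression")
    body = expr[1:-1]
    members = set()
    i = 0
    while i < len(body):
        # A range needs a character on both sides of the hyphen.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            if lo > hi:
                raise ValueError("range bounds are out of order")
            members.update(range(lo, hi + 1))
            i += 3
        else:
            members.add(ord(body[i]))
            i += 1
    return members


delimiters = parse_class("[:.,;]")
alphanumeric = parse_class("[A-Za-z0-9]")
print(sorted(chr(v) for v in delimiters))
print(len(alphanumeric))
```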

Characters vs. Code Units

''Code units'' are the fixed-size units that are used in defining a character encoding system. Very often, 8-bit code units (bytes) are used as the basis of an encoding system. But in some cases, such as the UTF-8 representation of Unicode, multiple code units may be required to define a single character. In UTF-8, characters are encoded using sequences that are either one, two, three, or four code units in length.

At the fundamental level, the Parabix character class compilers operate as compilers for identifying individual code units. Recognizing characters that are composed of sequences of code units requires an additional transformation layered on top of the code unit classes.
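The distinction between characters and code units can be observed directly with Python's built-in UTF-8 encoder. The sample characters below are chosen only for illustration; each is encoded with a different number of code units.

```python
# One sample character for each UTF-8 sequence length (1 to 4 code units).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    units = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(units)} code unit(s): {units.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
```

A code unit class compiler sees each of those bytes as a separate position, so a character such as U+20AC occupies three positions in the input stream.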

Using the Character Code Compilers

The Parabix framework includes a character class compiler that can be dynamically invoked to compile character code unit classes at run-time. However, there is also a static character class compiler written in Python that is sometimes useful and illustrates the basic concept.

The Parabix Character Class Compiler

The Parabix+LLVM toolchain incorporates a just-in-time character class compiler, so that applications can be built even when the definition of a character class is only determined at run-time.
For example, this supports regular expression applications such as icgrep, in which users may provide arbitrary regular expressions involving arbitrary character classes as input.

Construction of Character Class Objects

Character classes may be built up using the operations defined in parabix-devel/include/re/ADT/re_cc.h, as follows.
1. Basic character classes may be constructed from a single codepoint or a codepoint range, using re::makeCC(cp) or re::makeCC(lo_cp, hi_cp), respectively.
2. Character classes may be further constructed as the union of two character classes, using re::makeCC(cc1, cc2).
3. Subtraction and intersection of character classes are also supported.
Character classes may also be constructed by parsing a character class expression as a regular expression object, using the re::parse routine of [source:icGREP/icgrep/icgrep-devel/re/re_parser.h].
In the Parabix+LLVM framework, the current character class objects are based on Unicode codepoints in the range 0 to 0x10FFFF.
1. Character class definitions based on other alphabets could potentially be supported as well; this is future work.
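The construction operations can be modelled in Python to show how classes compose. This is a conceptual model only: `make_cc` is a hypothetical stand-in for the single-codepoint and range forms of re::makeCC, plain frozensets stand in for the C++ character class objects, and the set operators stand in for the union, subtraction, and intersection operations.

```python
from typing import Optional

MAX_CODEPOINT = 0x10FFFF  # upper bound of the Unicode codepoint space


def make_cc(lo: int, hi: Optional[int] = None) -> frozenset:
    """Model a class holding one codepoint, or every codepoint in lo..hi inclusive."""
    hi = lo if hi is None else hi
    if not 0 <= lo <= hi <= MAX_CODEPOINT:
        raise ValueError("codepoints must satisfy 0 <= lo <= hi <= 0x10FFFF")
    return frozenset(range(lo, hi + 1))


lower = make_cc(ord("a"), ord("z"))
digits = make_cc(ord("0"), ord("9"))
underscore = make_cc(ord("_"))

word = lower | digits | underscore                         # union
consonants = lower - frozenset(map(ord, "aeiou"))          # subtraction
hex_lower = word & (make_cc(ord("a"), ord("f")) | digits)  # intersection
```

A materialized set is convenient for a small demonstration; a representation based on codepoint ranges would be a more natural fit for large classes, but that choice is outside the scope of this sketch.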

Compilation of Character Code Unit Classes

When a character class is confined to a single code unit (byte, presently), the compileCC operations of the CC compiler can be used to generate Pablo code, in a manner analogous to the Python character class compiler.
The character code unit compiler can be invoked from within a Pablo kernel to integrate the recognition of a character class into a larger Pablo kernel, or may be directly used to create a kernel specifically for the character classes of interest.
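As a rough illustration of what compiling a code unit class into bitwise logic can look like, the Python sketch below transposes a byte stream into eight bit streams, one per bit of the code unit, and then recognizes [abc] using only AND, OR, and NOT on whole streams. Everything here is an assumption made for the example: the function names are hypothetical, bit streams are modelled as Python integers with bit p standing for input position p, basis stream k holds bit k of each byte with k = 0 as the least significant bit, and the logic is hand-derived rather than actual compileCC output.

```python
from typing import List


def basis_bits(data: bytes) -> List[int]:
    """Transpose a byte stream into 8 bit streams; bit p of basis[k] is bit k of data[p]."""
    basis = [0] * 8
    for pos, unit in enumerate(data):
        for k in range(8):
            if (unit >> k) & 1:
                basis[k] |= 1 << pos
    return basis


def match_abc(basis: List[int], length: int) -> int:
    """Bit stream for [abc] (code units 0x61 to 0x63) computed with bitwise logic only."""
    ones = (1 << length) - 1  # mask that keeps NOT within the stream length

    def inv(stream: int) -> int:
        return ~stream & ones

    b = basis
    # Bits 7..2 must be 011000 for 0x61, 0x62 and 0x63.
    prefix = inv(b[7]) & b[6] & b[5] & inv(b[4]) & inv(b[3]) & inv(b[2])
    # Bits 1..0 must be 01, 10 or 11, that is, not both zero (which excludes 0x60).
    return prefix & (b[1] | b[0])


def render(stream: int, length: int) -> str:
    """Print a bit stream with '1' for one bits and '.' for zero bits."""
    return "".join("1" if (stream >> p) & 1 else "." for p in range(length))


text = b"This is an example ASCII byte stream.  ASCII is an abbreviation.  "
abc_stream = render(match_abc(basis_bits(text), len(text)), len(text))
print(abc_stream)
```

Each bitwise operation in `match_abc` processes every input position at once, which is the property that makes this formulation attractive for wide SIMD registers.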

Compilation of Full UTF-8 Character Classes

Character class objects may include values from the full space of Unicode codepoints, many of which require more than one UTF-8 code unit.
Compiling full Unicode definitions can be performed by the UTF compiler.
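One way to picture the extra step needed for multi-unit characters is to combine code unit streams using a one-position shift. The Python sketch below recognizes the codepoint range U+00E0 to U+00EF, whose UTF-8 encodings are the two-byte sequences C3 A0 through C3 AF. It is a sketch under stated assumptions, not the UTF compiler's actual strategy: the helper names `unit_stream` and `advance` are hypothetical, streams are Python integers with bit p standing for position p, and marking each match at its final code unit is a convention chosen for this example.

```python
def unit_stream(data: bytes, lo: int, hi: int) -> int:
    """Bit stream marking the positions of code units in the range lo..hi."""
    stream = 0
    for pos, unit in enumerate(data):
        if lo <= unit <= hi:
            stream |= 1 << pos
    return stream


def advance(stream: int, length: int) -> int:
    """Move every marker one position forward, dropping any that leave the stream."""
    return (stream << 1) & ((1 << length) - 1)


# U+00E9 and U+00EF are inside the class; U+00F1 (C3 B1) is outside it.
data = "caf\u00e9 \u00f1 na\u00efve".encode("utf-8")

lead = unit_stream(data, 0xC3, 0xC3)   # first code unit of the sequence
trail = unit_stream(data, 0xA0, 0xAF)  # permitted second code units
matches = advance(lead, len(data)) & trail

marked = [p for p in range(len(data)) if (matches >> p) & 1]
print(marked)
```

Longer sequences follow the same pattern, with one further shift and AND per additional code unit.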