# The Parabix Character Code Compilers
## Basic Concepts
### Character Class Bit Streams
Given an input stream of character code units, a *character class* bit stream
is a stream of bits defined in one-to-one correspondence with the input stream
such that one bits mark instances of character code units within the class, and
zero bits mark instances of character code units outside the class.

For example, consider the ASCII character class expression `[abc]`, which stands
for the class comprising the ASCII bytes with code unit values 97 (ASCII value for `a`),
98 (ASCII value for `b`), or 99 (ASCII value for `c`). The following
example shows the `[abc]` character class bit stream aligned with an example
ASCII input stream.
```
input: This is an example ASCII byte stream. ASCII is an abbreviation.
[abc]: ........1....1...........1........1............1..111....1.....
```
By convention, zero bits within a character class bit stream are shown as periods,
so that the one bits (each shown as the digit 1) stand out.
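
The correspondence between input and bit stream can be reproduced with a few lines of Python. This is only a minimal sketch: `class_bit_stream` is a hypothetical helper, not part of Parabix, and it tests each byte in turn rather than using the parallel bit stream methods that Parabix itself employs.

```python
def class_bit_stream(text: str, members: str) -> str:
    """Return one marker per code unit: '1' if the unit is in the class, else '.'."""
    data = text.encode("ascii")             # one byte per ASCII code unit
    wanted = set(members.encode("ascii"))   # e.g. {97, 98, 99} for "abc"
    return "".join("1" if unit in wanted else "." for unit in data)


text = "This is an example ASCII byte stream. ASCII is an abbreviation."
stream = class_bit_stream(text, "abc")
print("input: " + text)
print("[abc]: " + stream)
```

The two printed lines have the same length after their labels, so each marker sits directly under the code unit it describes.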
### Character Class Expressions
Following traditional regular expression notation, character classes can be
defined by listing individual characters and ranges of characters within
square brackets. For example, the set of delimiter characters consisting
of colon, period, comma, and semicolon may be denoted `[:.,;]`, while the
alphanumeric ASCII characters (any uppercase or lowercase ASCII letter
or any ASCII digit) are represented by `[A-Za-z0-9]`.
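
To make the notation concrete, the sketch below expands a simple bracket expression into the set of code unit values it denotes. `parse_class` is a hypothetical function written for this illustration, and it assumes a deliberately restricted syntax: ASCII only, no negation, no escapes. It is not the parser used by Parabix.

```python
def parse_class(expr: str) -> set[int]:
    """Expand a simple bracket expression such as "[A-Za-z0-9]" into code unit values.

    Simplifying assumptions: ASCII only, no negation, no escapes, and '-' acts
    as a range operator only when it sits between two characters.
    """
    if len(expr) < 2 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected a bracketed class expression")
    body = expr[1:-1]
    values: set[int] = set()
    i = 0
    while i < len(body):
        lo = ord(body[i])
        if i + 2 < len(body) and body[i + 1] == "-":
            hi = ord(body[i + 2])
            if hi < lo:
                raise ValueError("range out of order")
            values.update(range(lo, hi + 1))   # inclusive range lo..hi
            i += 3
        else:
            values.add(lo)                     # single listed character
            i += 1
    return values


delimiters = parse_class("[:.,;]")
alphanumerics = parse_class("[A-Za-z0-9]")
print(sorted(delimiters))    # → [44, 46, 58, 59]
print(len(alphanumerics))    # → 62
```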
### Characters vs. Code Units
*Code units* are the fixed-size units used in defining a character
encoding system. Very often, 8-bit code units (bytes) are used as the
basis of an encoding system. In some cases, however, such as the UTF-8 representation
of Unicode, multiple code units may be required to encode a single character.
In UTF-8, characters are encoded using sequences of one, two,
three, or four code units.
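
The four possible sequence lengths can be observed directly with Python's built-in UTF-8 encoder. The sample characters are chosen here purely for illustration.

```python
# One sample character for each UTF-8 sequence length:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    units = ch.encode("utf-8")          # the code units (bytes) for this character
    lengths.append(len(units))
    print(f"U+{ord(ch):04X} -> {len(units)} code unit(s): {units.hex(' ')}")
```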
At the fundamental level, the Parabix character class compilers operate
as compilers for identifying individual code units. Defining characters
that consist of sequences of code units involves an additional transformation
structure.
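
One way to picture such an additional step is to combine two code unit streams: mark the lead byte, advance those marks by one position, and intersect them with the marks for the trailing byte. The sketch below does this for the two-byte UTF-8 sequence C3 A9. It models bit streams as Python integers (bit 0 is the first code unit); the helper names are hypothetical, and this is an illustration of the idea rather than the transformation Parabix actually generates.

```python
def unit_stream(data: bytes, value: int) -> int:
    """Bit i of the result is 1 when data[i] == value (bit 0 = first code unit)."""
    bits = 0
    for i, unit in enumerate(data):
        if unit == value:
            bits |= 1 << i
    return bits


def show(bits: int, length: int) -> str:
    """Render a bit stream with '1' for one bits and '.' for zero bits."""
    return "".join("1" if (bits >> i) & 1 else "." for i in range(length))


data = "caf\u00e9 r\u00e9sum\u00e9".encode("utf-8")   # U+00E9 is the byte pair C3 A9
lead = unit_stream(data, 0xC3)     # code unit class for the lead byte
trail = unit_stream(data, 0xA9)    # code unit class for the trailing byte
e_acute = (lead << 1) & trail      # advance lead marks by one, then intersect

print(show(lead, len(data)))
print(show(trail, len(data)))
print(show(e_acute, len(data)))    # → ....1...1....1
```

In this sketch each occurrence of the character is marked at the position of its final code unit.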
## Using the Character Code Compilers
The Parabix framework includes a character class compiler that can be
dynamically invoked to compile character code unit classes at run time.
However, there is also a [static character class compiler written in Python](StaticCCC)
that is sometimes useful and illustrates the basic concept.
### The Parabix Character Class Compiler
The Parabix+LLVM toolchain incorporates a just-in-time character class compiler, which allows applications
to be built in which the definition of a character class is determined only at run time.
For example, this supports regular expression applications such as icgrep, in which
users may provide arbitrary regular expressions
involving arbitrary character classes as input.
#### Construction of Character Class Objects
1. Character classes may be built up using the operations defined in `parabix-devel/include/re/ADT/re_cc.h`, as follows.
1. Basic character classes may be constructed from a single codepoint or a codepoint range, using `re::makeCC(cp)` or `re::makeCC(lo_cp, hi_cp)`, respectively.
1. Character classes may be further constructed as the union of two character classes using `re::makeCC(cc1, cc2)`.
1. Subtraction and intersection of character classes are also supported.
1. Character classes may also be constructed by parsing a character class expression as a regular expression object, using the `re::parse` routine of [source:icGREP/icgrep/icgrep-devel/re/re_parser.h].
1. In the Parabix+LLVM framework, the current character class objects are based on Unicode codepoints in the range 0 to 0x10FFFF.
1. Character class definitions based on other alphabets could potentially be supported as well; this is future work.
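
The set operations listed above can be modelled in a few lines of Python by treating a character class as a sorted list of inclusive codepoint ranges. Everything in this sketch (`make_cc`, `union`, `intersect`, `subtract`, and the list-of-tuples representation) is a hypothetical stand-in chosen for illustration; it does not reproduce the signatures or internals of `re_cc.h`.

```python
MAX_CP = 0x10FFFF   # upper bound of the Unicode codepoint space


def make_cc(lo, hi=None):
    """A class holding one codepoint (hi omitted) or one inclusive range."""
    hi = lo if hi is None else hi
    if not (0 <= lo <= hi <= MAX_CP):
        raise ValueError("codepoint range out of bounds")
    return [(lo, hi)]


def normalize(ranges):
    """Sort ranges and merge any that overlap or touch."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def union(a, b):
    return normalize(a + b)


def intersect(a, b):
    out = []
    for alo, ahi in a:
        for blo, bhi in b:
            lo, hi = max(alo, blo), min(ahi, bhi)
            if lo <= hi:
                out.append((lo, hi))
    return normalize(out)


def complement(a):
    """All codepoints in 0..MAX_CP that are not in a."""
    out, nxt = [], 0
    for lo, hi in normalize(a):
        if nxt < lo:
            out.append((nxt, lo - 1))
        nxt = hi + 1
    if nxt <= MAX_CP:
        out.append((nxt, MAX_CP))
    return out


def subtract(a, b):
    return intersect(a, complement(b))


def size(a):
    return sum(hi - lo + 1 for lo, hi in a)


letters = union(make_cc(ord("A"), ord("Z")), make_cc(ord("a"), ord("z")))
alnum = union(letters, make_cc(ord("0"), ord("9")))
vowels = normalize([(ord(v), ord(v)) for v in "AEIOUaeiou"])
consonants = subtract(letters, vowels)
print(alnum)              # → [(48, 57), (65, 90), (97, 122)]
print(size(consonants))   # → 42
```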
#### Compilation of Character Code Unit Classes
1. When a character class is confined to a single code unit (presently, a byte), the `compileCC` operations of the CC compiler can be used to generate Pablo code, in a manner analogous to the Python character class compiler.
1. The character code unit compiler can be invoked from within a Pablo kernel to integrate the recognition of a character class into a larger Pablo kernel, or it may be used directly to create a kernel specifically for the character classes of interest.
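
As a rough picture of what compiled code for a single-byte class might compute, the sketch below assumes the input bytes have first been transposed into eight bit streams, one per bit position of each byte, and then evaluates `[abc]` with bitwise logic alone. The Python here is only a model: the function names are hypothetical, the equation was derived by hand for this one class, and the output is not Pablo code.

```python
def transpose(data: bytes) -> list[int]:
    """Return 8 bit streams; stream k has bit i set when bit k of data[i] is set."""
    basis = [0] * 8
    for i, unit in enumerate(data):
        for k in range(8):
            if (unit >> k) & 1:
                basis[k] |= 1 << i
    return basis


def abc_class(basis: list[int], length: int) -> int:
    """[abc] is bytes 0x61..0x63: bits 7..2 equal 011000 and bits 1..0 are not both 0."""
    mask = (1 << length) - 1    # confine the result to the stream length
    b = basis
    high = ~b[7] & b[6] & b[5] & ~b[4] & ~b[3] & ~b[2]
    low = b[1] | b[0]
    return high & low & mask


data = b"a cab is back"
cc = abc_class(transpose(data), len(data))
marks = "".join("1" if (cc >> i) & 1 else "." for i in range(len(data)))
print(data.decode("ascii"))
print(marks)                    # → 1.111....111.
```

Each bitwise operation in `abc_class` acts on every position of the stream at once, which is the property that makes this formulation attractive compared with testing code units one at a time.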
#### Compilation of Full UTF-8 Character Classes
1. Character class objects may include values from the full space of Unicode codepoints.
1. Full Unicode definitions can be compiled by the UTF compiler.