|  |  | # The Parabix Character Code Compilers | 
|  |  |  | 
|  |  | ## Basic Concepts | 
|  |  |  | 
|  |  | ### Character Class Bit Streams | 
|  |  |  | 
|  |  | Given an input stream of character code units, a *character class* bit stream | 
|  |  | is a stream of bits defined in one-to-one correspondence with the input stream | 
|  |  | such that one bits mark instances of character code units within the class, and | 
|  |  | zero bits mark instances of character code units outside the class. | 
|  |  |  | 
|  |  | For example, consider the ASCII character class expression `[abc]` standing | 
|  |  | for the class comprising ASCII bytes having code unit values 97 (ASCII value for `a`), | 
|  |  | 98 (ASCII value for `b`) or 99 (ASCII value for `c`).   The following | 
|  |  | example shows the `[abc]` character class bit stream aligned with an example | 
|  |  | ASCII input stream. | 
|  |  |  | 
|  |  | ``` | 
|  |  | input:  This is an example ASCII byte stream.  ASCII is an abbreviation. | 
|  |  | [abc]:  ........1....1...........1........1.............1..111....1..... | 
|  |  | ``` | 
|  |  |  | 
|  |  | By convention, zero bits within a character class bit stream are marked with periods, | 
|  |  | so that the one bits (each marked with the digit 1) stand out. | 
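The definition above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `char_class_bitstream` is hypothetical and is not part of Parabix, and the stream is rendered as text using the period/1 convention rather than as packed bits.

```python
def char_class_bitstream(data: bytes, members: set) -> str:
    """Return one marker per code unit: '1' if the unit is in the class, '.' otherwise."""
    return "".join("1" if unit in members else "." for unit in data)


text = b"This is an example ASCII byte stream.  ASCII is an abbreviation."
abc = {ord("a"), ord("b"), ord("c")}  # code units 97, 98 and 99

stream = char_class_bitstream(text, abc)
print("input:  " + text.decode("ascii"))
print("[abc]:  " + stream)
```

Because the stream is in one-to-one correspondence with the input, its length always equals the number of input code units.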
|  |  |  | 
|  |  | ### Character Class Expressions | 
|  |  |  | 
|  |  | Following traditional regular expression notation, character classes can be | 
|  |  | defined by listing individual characters and ranges of characters within | 
|  |  | square brackets.   For example, the set of delimiter characters consisting | 
|  |  | of colon, period, comma, and semicolon may be denoted `[:.,;]`, while the | 
|  |  | alphanumeric ASCII characters (any uppercase or lowercase ASCII letter | 
|  |  | or any ASCII digit) are represented as `[A-Za-z0-9]`. | 
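To make the bracket notation concrete, here is a minimal Python sketch that expands such an expression into the set of code unit values it denotes. The function `expand_class` is hypothetical and deliberately limited: it assumes ASCII input, no negation and no escape sequences, and it treats `-` as a range operator only when it sits between two characters. It is not the parser used by Parabix.

```python
def expand_class(expr: str) -> set:
    """Expand a simple bracket expression such as '[A-Za-z0-9]' into code unit values."""
    if len(expr) < 2 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected an expression of the form [...]")
    body = expr[1:-1]
    members = set()
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as A-Z: include both endpoints.
            lo, hi = ord(body[i]), ord(body[i + 2])
            if lo > hi:
                raise ValueError("range bounds out of order")
            members.update(range(lo, hi + 1))
            i += 3
        else:
            # A single listed character.
            members.add(ord(body[i]))
            i += 1
    return members


delimiters = expand_class("[:.,;]")
alphanumerics = expand_class("[A-Za-z0-9]")
print(sorted(delimiters))
print(len(alphanumerics))
```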
|  |  |  | 
|  |  | ### Characters vs. Code Units | 
|  |  |  | 
|  |  | *Code units* are the fixed-size units that are used in defining a character | 
|  |  | encoding system.   Very often, 8-bit code units (bytes) are used as the | 
|  |  | basis of an encoding system.   But in some cases, such as the UTF-8 representation | 
|  |  | of Unicode, multiple code units may be required to define a single character. | 
|  |  | In UTF-8, characters are encoded using sequences that are either one, two, | 
|  |  | three, or four code units in length. | 
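The varying sequence lengths can be observed directly with Python's built-in UTF-8 encoder. The four sample characters are chosen for illustration only.

```python
# One sample character for each possible UTF-8 sequence length:
# U+0041, U+00E9, U+20AC and U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    units = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(units)} code unit(s): {units.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
```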
|  |  |  | 
|  |  | At the fundamental level, the Parabix character class compilers operate | 
|  |  | as compilers for identifying individual code units.   Recognizing characters | 
|  |  | that are composed of sequences of code units requires an additional transformation | 
|  |  | structure that combines the code unit classes. | 
|  |  |  | 
|  |  | ## Using the Character Code Compilers | 
|  |  |  | 
|  |  | The Parabix framework includes a character class compiler that can be | 
|  |  | dynamically invoked to compile character code unit classes at run-time. | 
|  |  | However, there is also a [static character class compiler written in Python](StaticCCC) | 
|  |  | that is sometimes useful and illustrates the basic concept. | 
|  |  |  | 
|  |  |  | 
|  |  | ### The Parabix Character Class Compiler | 
|  |  |  | 
|  |  | The Parabix+LLVM toolchain incorporates a just-in-time character class compiler that lets applications | 
|  |  | be built even when the definition of a character class is known only at run-time. | 
|  |  | For example, this supports regular expression applications such as icgrep, in which | 
|  |  | users may provide arbitrary regular expressions | 
|  |  | involving arbitrary character classes as input. | 
|  |  |  | 
|  |  | #### Construction of Character Class Objects | 
|  |  |  | 
|  |  | Character classes may be built up using the operations defined in `parabix-devel/include/re/ADT/re_cc.h`, as follows. | 
|  |  |  | 
|  |  | 1.  Basic character classes may be constructed from a single codepoint or a codepoint range, using `re::makeCC(cp)` or `re::makeCC(lo_cp, hi_cp)`, respectively. | 
|  |  | 1.  Character classes may be further constructed as the union of two character classes using `re::makeCC(cc1, cc2)`. | 
|  |  | 1.  Subtraction and intersection of character classes are also supported. | 
|  |  | 1.  Character classes may also be constructed by parsing a character class expression as a regular expression object, using the `re::parse` routine of [source:icGREP/icgrep/icgrep-devel/re/re_parser.h]. | 
|  |  | 1.  In the Parabix+LLVM framework, the current character class objects are based on Unicode codepoints in the range 0 to 0x10FFFF. | 
|  |  | 1.  Character class definitions based on other alphabets could potentially be supported as well; this is future work. | 
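The construction operations listed above can be modelled in a few lines of Python. This is a conceptual sketch only, not the C++ interface of `re_cc.h`: it assumes a character class is held as a sorted list of inclusive codepoint ranges, and every name in it (`make_cc`, `union`, `intersect`, `subtract`, `complement`, `normalize`) is hypothetical.

```python
MAX_CP = 0x10FFFF  # largest Unicode codepoint


def make_cc(lo, hi=None):
    """Class holding a single codepoint, or an inclusive codepoint range."""
    hi = lo if hi is None else hi
    if not (0 <= lo <= hi <= MAX_CP):
        raise ValueError("codepoints must satisfy 0 <= lo <= hi <= 0x10FFFF")
    return [(lo, hi)]


def normalize(ranges):
    """Sort ranges and merge any that overlap or are adjacent."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def union(a, b):
    return normalize(a + b)


def intersect(a, b):
    out = []
    for alo, ahi in a:
        for blo, bhi in b:
            lo, hi = max(alo, blo), min(ahi, bhi)
            if lo <= hi:
                out.append((lo, hi))
    return normalize(out)


def complement(a):
    """Every codepoint in 0..MAX_CP that is not in the class."""
    out, next_cp = [], 0
    for lo, hi in normalize(a):
        if next_cp < lo:
            out.append((next_cp, lo - 1))
        next_cp = hi + 1
    if next_cp <= MAX_CP:
        out.append((next_cp, MAX_CP))
    return out


def subtract(a, b):
    return intersect(a, complement(b))


letters = union(make_cc(ord("A"), ord("Z")), make_cc(ord("a"), ord("z")))
letters_or_digits = union(letters, make_cc(ord("0"), ord("9")))
letters_without_a = subtract(letters, make_cc(ord("a")))
print(letters_or_digits)
print(letters_without_a)
```

Keeping the ranges sorted and merged gives each class a single canonical form, so two classes can be compared for equality directly.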
|  |  |  | 
|  |  | #### Compilation of Character Code Unit Classes | 
|  |  | 1.  When a character class is confined to a single code unit (presently, a byte), the `compileCC` operations of the CC compiler can be used to generate Pablo code, in a manner analogous to the Python character class compiler. | 
|  |  | 1.  The character code unit compiler can be invoked from within a Pablo kernel to integrate | 
|  |  | the recognition of a character class into a larger Pablo kernel, or may be directly used | 
|  |  | to create a kernel specifically for the character classes of interest. | 
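The idea of compiling a code unit class into bitwise logic can be illustrated with a naive Python model. The sketch rests on an assumption that the text above does not spell out: the input bytes are first transposed into eight parallel bit streams, one per bit position, and a class is then expressed as AND, OR and NOT operations over those streams. Python integers stand in for the bit streams, all function names are hypothetical, and the logic is not optimized in any way, so it should not be read as the Pablo code that `compileCC` generates.

```python
def transpose(data: bytes):
    """Build 8 bit streams; bit j of basis[i] is bit i of data[j]."""
    basis = [0] * 8
    for pos, unit in enumerate(data):
        for i in range(8):
            if (unit >> i) & 1:
                basis[i] |= 1 << pos
    return basis


def compile_unit(basis, value, mask):
    """Stream with a one bit wherever the code unit equals `value`."""
    result = mask
    for i in range(8):
        # Require bit i to be set or clear, as dictated by `value`.
        result &= basis[i] if (value >> i) & 1 else ~basis[i]
    return result


def compile_class(basis, members, length):
    """OR together the streams of every code unit value in the class."""
    mask = (1 << length) - 1  # restricts results to valid positions
    stream = 0
    for value in members:
        stream |= compile_unit(basis, value, mask)
    return stream


def render(stream, length):
    return "".join("1" if (stream >> pos) & 1 else "." for pos in range(length))


data = b"a cab, a bad cabbie"
basis = transpose(data)
abc_stream = compile_class(basis, {97, 98, 99}, len(data))
print(data.decode("ascii"))
print(render(abc_stream, len(data)))
```

Each bitwise operation processes every input position at once, which is the property that makes this formulation suitable for wide SIMD registers.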
|  |  |  | 
|  |  | #### Compilation of Full UTF-8 Character Classes | 
|  |  | 1.  Character class objects may include codepoints from the full space of Unicode codepoints. | 
|  |  | 1.  Compiling full Unicode definitions can be performed by the UTF compiler. | 
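As a rough illustration of what matching a multi-code-unit character involves, the following Python sketch marks the final code unit of each occurrence of one UTF-8 encoded character. It combines per-code-unit streams with a one-position shift. The helper names are hypothetical, Python integers stand in for bit streams, and the approach is a simplification for a single character; it makes no claim about how the UTF compiler is actually implemented.

```python
def unit_stream(data: bytes, value: int) -> int:
    """Bit stream (as an int) with bit j set when data[j] == value."""
    stream = 0
    for pos, unit in enumerate(data):
        if unit == value:
            stream |= 1 << pos
    return stream


def match_sequence(data: bytes, encoded: bytes) -> int:
    """Mark the final code unit of every occurrence of `encoded` in `data`."""
    matched = unit_stream(data, encoded[0])
    for value in encoded[1:]:
        # Advance each marker by one position, then require the next code unit.
        matched = (matched << 1) & unit_stream(data, value)
    return matched


text = "caf\u00e9 na\u00efve \u00e9t\u00e9"
data = text.encode("utf-8")
e_acute = "\u00e9".encode("utf-8")  # two code units: 0xC3 0xA9

marks = match_sequence(data, e_acute)
positions = [pos for pos in range(len(data)) if (marks >> pos) & 1]
print(positions)
```

The character U+00EF in the sample shares its first code unit (0xC3) with U+00E9, so it is rejected only at the second step; this is why the per-code-unit streams have to be combined rather than used individually.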