cameron · 6ce9e45b

Showing with 11 additions and 63 deletions

Character-Code-Compilers.md Character-Code-Compilers.md +11 -63

--- a/Character-Code-Compilers.md
+++ b/Character-Code-Compilers.md
+# The Parabix Character Code Compilers 
+
+## Basic Concepts 
+
+### Character Class Bit Streams 
+
+Given an input stream of character code units, a *character class* bit stream
+is a stream of bits defined in one-to-one correspondence with the input stream
+such that one bits mark instances of character code units within the class, and 
+zero bits mark instances of character code units outside the class.
+
+For example, consider the ASCII character class expression `[abc]` standing
+for the class comprising ASCII bytes having code unit values 97 (ASCII value for `a`), 
+98 (ASCII value for `b`) or 99 (ASCII value for `c`).   The following
+example shows the `[abc]` character class bit stream aligned with an example
+ASCII input stream.
+
+```
+input:  This is an example ASCII byte stream.  ASCII is an abbreviation.
+[abc]:  ........1....1...........1........1.............1..111....1.....
+```
+
+By convention, zero bits within a character class bit stream are marked with periods,
+so that the one bits (each marked with the digit 1) stand out.
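
As a minimal sketch of the definition, a character class bit stream can be computed naively by testing each code unit for membership in the class. The helper name `char_class_stream` is hypothetical and is not part of Parabix, which computes such streams with bit-parallel logic rather than one byte at a time.

```python
def char_class_stream(data: bytes, members: bytes) -> str:
    """Render the class bit stream for `data`: '1' for members, '.' otherwise."""
    return "".join("1" if unit in members else "." for unit in data)


text = b"This is an example ASCII byte stream.  ASCII is an abbreviation."
stream = char_class_stream(text, b"abc")

# Print the input and its bit stream aligned position by position.
print("input: ", text.decode("ascii"))
print("[abc]: ", stream)
```

The output has exactly one mark per input code unit, so the two printed lines align column by column.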
+
+### Character Class Expressions 
+
+Following traditional regular expression notation, character classes can be
+defined by listing individual characters and ranges of characters within
+square brackets.   For example, the set of delimiter characters consisting
+of colon, period, comma, and semicolon may be denoted `[:.,;]`, while the
+set of alphanumeric ASCII characters (any uppercase or lowercase ASCII letter
+or any ASCII digit) is represented as `[A-Za-z0-9]`.
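
To make the notation concrete, the following sketch expands a bracket expression into the set of code unit values it denotes. The function `expand_class` is a hypothetical illustration, not the Parabix parser; it handles only literal characters and `lo-hi` ranges, with no negation or escapes.

```python
def expand_class(expr: str) -> set:
    """Expand a simple bracket expression such as '[A-Za-z0-9]' into code unit values."""
    if not (expr.startswith("[") and expr.endswith("]")):
        raise ValueError("expected a bracketed class expression")
    body = expr[1:-1]
    members = set()
    i = 0
    while i < len(body):
        # A range occupies three characters: lo, '-', hi.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            if lo > hi:
                raise ValueError("range bounds out of order")
            members.update(range(lo, hi + 1))
            i += 3
        else:
            members.add(ord(body[i]))
            i += 1
    return members


delims = expand_class("[:.,;]")
alnum = expand_class("[A-Za-z0-9]")
print(sorted(chr(c) for c in delims))
print(len(alnum), "alphanumeric code unit values")
```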
+
+### Characters vs. Code Units 
+
+*Code units* are the fixed-size units that are used in defining a character
+encoding system.   Very often, 8-bit code units (bytes) are used as the
+basis of an encoding system.   But in some cases, such as the UTF-8 representation
+of Unicode, multiple code units may be required to define a single character.
+In UTF-8, characters are encoded using sequences that are either one, two, 
+three, or four code units in length.
+
+At the fundamental level, the Parabix character class compilers generate
+code that identifies individual code units.   Defining characters
+that are composed of sequences of code units involves an additional transformation
+structure.
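
The one-to-four code unit lengths of UTF-8 can be observed directly with Python's built-in encoder. This is only an illustration of the character versus code unit distinction; the helper `utf8_code_units` is a name chosen for this example.

```python
def utf8_code_units(ch: str) -> list:
    """Return the UTF-8 code units (byte values) that encode one character."""
    return list(ch.encode("utf-8"))


# One sample character for each possible UTF-8 sequence length.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    units = utf8_code_units(ch)
    hex_units = " ".join(f"{u:02X}" for u in units)
    print(f"U+{ord(ch):04X}: {len(units)} code unit(s): {hex_units}")
```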
+
+## Using the Character Code Compilers
+
+The Parabix framework includes a character class compiler that can be
+dynamically invoked to compile character code unit classes at run-time.
+However, there is also a [static character class compiler written in Python](StaticCCC)
+that is sometimes useful and illustrates the basic concept.
+
+
+### The Parabix Character Class Compiler
+
+The Parabix+LLVM toolchain incorporates a just-in-time character class compiler that allows applications
+to be built when the definition of a character class is determined at run-time.  
+For example, this supports regular expression applications such as icgrep, in which 
+users may provide arbitrary regular expressions
+involving arbitrary character classes as input.
+
+#### Construction of Character Class Objects
+
+ 1.  Character classes may be built up using the operations defined in `parabix-devel/include/re/ADT/re_cc.h`, as follows.
+     1.  Basic character classes may be constructed from a single codepoint or a codepoint range, using `re::makeCC(cp)` or `re::makeCC(lo_cp, hi_cp)`, respectively.
+     1.  Character classes may be further constructed as the union of two character classes using `re::makeCC(cc1, cc2)`.
+     1.  Subtraction and intersection of character classes are also supported.
+ 1.  Character classes may also be constructed by parsing a character class expression as a regular expression object, using the `re::parse` routine of `icGREP/icgrep/icgrep-devel/re/re_parser.h`.
+ 1.  In the Parabix+LLVM framework, the current character class objects are based on Unicode codepoints in the range 0 to 0x10FFFF.
+     1.  Character class definitions based on other alphabets could potentially be supported as well; this is future work.
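
The construction operations listed above can be modelled in a few lines of Python. This is a toy model only, assuming a class is stored as a sorted list of inclusive codepoint ranges; the functions `make_cc`, `union`, `intersect`, and `subtract` are hypothetical stand-ins and do not reproduce the C++ interface in `re_cc.h`.

```python
MAX_CP = 0x10FFFF  # upper bound of the Unicode codepoint space


def make_cc(lo: int, hi: int = None) -> list:
    """Class holding one codepoint, or the inclusive range lo..hi."""
    hi = lo if hi is None else hi
    if not (0 <= lo <= hi <= MAX_CP):
        raise ValueError("codepoints must satisfy 0 <= lo <= hi <= 0x10FFFF")
    return [(lo, hi)]


def union(cc1: list, cc2: list) -> list:
    """Merge two range lists, coalescing overlapping or adjacent ranges."""
    merged = []
    for lo, hi in sorted(cc1 + cc2):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def intersect(cc1: list, cc2: list) -> list:
    """Keep only the codepoints present in both classes."""
    out = []
    for lo1, hi1 in cc1:
        for lo2, hi2 in cc2:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo <= hi:
                out.append((lo, hi))
    return sorted(out)


def subtract(cc1: list, cc2: list) -> list:
    """Remove every codepoint of cc2 from cc1."""
    out = []
    for lo, hi in cc1:
        for lo2, hi2 in union(cc2, []):  # sorted, disjoint ranges
            if hi2 < lo or lo2 > hi:
                continue  # no overlap with what remains of this range
            if lo2 > lo:
                out.append((lo, lo2 - 1))
            lo = hi2 + 1
            if lo > hi:
                break
        if lo <= hi:
            out.append((lo, hi))
    return out


lower = make_cc(ord("a"), ord("z"))
vowels = []
for v in "aeiou":
    vowels = union(vowels, make_cc(ord(v)))
consonants = subtract(lower, vowels)
print(consonants)
```

Storing ranges rather than individual codepoints keeps large classes compact, which matters when a class may span much of the 0 to 0x10FFFF space.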
+
+#### Compilation of Character Code Unit Classes
+ 1.  When a character class is confined to a single code unit (currently a byte), the `compileCC` operations of the CC compiler can be used to generate Pablo code, in a manner analogous to the Python character class compiler.
+ 1.  The character code unit compiler can be invoked from within a Pablo kernel to integrate the recognition of a character class into a larger Pablo kernel, or may be used directly to create a kernel specifically for the character classes of interest.
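
The idea behind compiling a code unit class can be sketched as follows, assuming the input bytes are first transposed into eight basis bit streams (one per bit position), so that a class reduces to bitwise logic over whole streams. Streams are modelled here as arbitrary-precision Python integers, and the names `transpose`, `match_byte`, `compile_class`, and `render` are hypothetical. This is not the Pablo code that `compileCC` emits.

```python
def transpose(data: bytes) -> list:
    """Build eight basis bit streams; bit i of basis[k] is bit k of data[i]."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                basis[k] |= 1 << i
    return basis


def match_byte(basis: list, value: int, mask: int) -> int:
    """AND together each basis stream (or its complement) to match one byte value."""
    stream = mask
    for k in range(8):
        stream &= basis[k] if (value >> k) & 1 else ~basis[k]
    return stream


def compile_class(basis: list, values: bytes, length: int) -> int:
    """OR the per-value streams; a real compiler would share common subexpressions."""
    mask = (1 << length) - 1  # confines complemented streams to the input length
    result = 0
    for value in values:
        result |= match_byte(basis, value, mask)
    return result


def render(stream: int, length: int) -> str:
    """Show a bit stream with '.' for zero bits and '1' for one bits."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))


text = b"ASCII is an abbreviation."
marks = compile_class(transpose(text), b"abc", len(text))
print(text.decode("ascii"))
print(render(marks, len(text)))
```

Each bitwise operation acts on every input position at once, which is where the parallelism comes from. In this sketch the three values of `[abc]` are matched independently, whereas an optimizing compiler could test their shared high-order bits once.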
+
+#### Compilation of Full UTF-8 Character Classes 
+ 1.  Character class objects may include codepoints from the full space of Unicode codepoints.
+ 1.  Full Unicode character classes can be compiled by the UTF compiler.
\ No newline at end of file