UCD_Compiler Work
The UCD_Compiler is an important component for compiling sets of Unicode codepoints into Parabix Pablo code.
It is named UCD Compiler because it is used to compile the various "properties" of the Unicode Character Database (UCD). However, it can be used to compile other Unicode sets as well.
Pablo code produced by the UCD compiler generally works with a basis bits representation of UTF-8 code units (8 bits), producing bit streams that mark each member of a Unicode set at the position of its final UTF-8 code unit.
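To make the output convention concrete, the following is a minimal scalar reference model in Python. It is not how the UCD compiler computes its result (that is done with bitwise Pablo operations on basis bit streams); it only shows which positions end up marked. The function names and the use of a Python integer as a bit stream (bit i corresponds to code unit position i) are assumptions made for illustration.

```python
def final_position_marks(text: str, members: set) -> int:
    """Return a bit stream (as an int) over the UTF-8 encoding of text.

    Bit i is set when code unit i is the final UTF-8 code unit of a
    character whose codepoint belongs to members.
    """
    marks = 0
    pos = 0  # number of UTF-8 code units consumed so far
    for ch in text:
        pos += len(ch.encode("utf-8"))
        if ord(ch) in members:
            marks |= 1 << (pos - 1)  # position of the final code unit
    return marks


def show(marks: int, length: int) -> str:
    """Render a bit stream with position 0 on the left."""
    return "".join("1" if (marks >> i) & 1 else "." for i in range(length))


if __name__ == "__main__":
    text = "a\u00e9b\u20ac"          # a, e-acute (2 bytes), b, euro sign (3 bytes)
    members = {0x00E9, 0x20AC}
    marks = final_position_marks(text, members)
    print(show(marks, len(text.encode("utf-8"))))
```

In this example the two-byte character occupies positions 1 and 2 and the three-byte character occupies positions 4 to 6, so the marks land at positions 2 and 6.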
Input Basis Generalization
Some applications may find it more convenient to deal with different input representations:
(a) UTF-8 code units represented as i8 streams (bytes).
(b) UTF-16 code units represented as a 16xi1 stream set.
(c) UTF-16 code units represented as two 8xi1 stream sets for the high/low bytes of each code unit.
(d) UTF-16 code units represented as a 2xi8 stream set for the high/low bytes of each code unit.
(e) UTF-16BE code units represented as i16 streams (big endian byte ordering).
(f) UTF-16LE code units represented as i16 streams (little endian byte ordering).
(g) UTF-32 code units represented as a 21xi1 stream set.
(h) Various other representations of UTF-32.
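The relationship between a code unit stream and a basis bit stream set can be sketched in Python as follows. This is an illustrative model only: the helper names are hypothetical, bit streams are modeled as Python integers, and the convention that stream k holds bit k of each code unit with k = 0 the least significant bit is an assumption, not necessarily the numbering Parabix uses.

```python
def basis_bits(code_units, width):
    """Transpose code units into `width` bit streams.

    Bit `pos` of streams[k] is bit k of code_units[pos].
    """
    streams = [0] * width
    for pos, unit in enumerate(code_units):
        for k in range(width):
            if (unit >> k) & 1:
                streams[k] |= 1 << pos
    return streams


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def split_high_low(units):
    """Split 16-bit code units into separate high-byte and low-byte streams."""
    return [u >> 8 for u in units], [u & 0xFF for u in units]


if __name__ == "__main__":
    units = utf16_units("a\u20ac")
    wide = basis_bits(units, 16)                    # one 16xi1 stream set
    high, low = split_high_low(units)
    narrow = basis_bits(high, 8), basis_bits(low, 8)  # two 8xi1 stream sets
    print(len(wide), len(narrow[0]), len(narrow[1]))
```

Under these assumptions, the 16xi1 form and the two 8xi1 forms carry the same bits: the low-byte streams equal streams 0 to 7 of the 16-stream set and the high-byte streams equal streams 8 to 15.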
Output Generalization
Rather than producing marks at the final position of each code unit sequence, it may be desirable to produce marks at the initial position. This option may be supported through Pablo LookAhead operations, with the appropriate LookAhead attribute on the basis set used for input: LookAhead(3) for UTF-8, whose sequences span at most four code units, and LookAhead(1) for UTF-16, whose sequences span at most two.
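One way to picture the lookahead idea for UTF-8 is the following Python sketch. It starts from final-position marks and moves them to initial positions, which is a simplification: a real implementation would apply lookahead to the input basis set rather than to an already computed result. Lookahead by n is modeled as a right shift of a Python integer (bit i takes the value from position i + n), the input is assumed to be valid UTF-8, and all helper names are hypothetical.

```python
def byte_class(data: bytes, lo: int, hi: int) -> int:
    """Bit stream marking positions whose byte value lies in [lo, hi]."""
    mask = 0
    for i, b in enumerate(data):
        if lo <= b <= hi:
            mask |= 1 << i
    return mask


def lookahead(stream: int, n: int) -> int:
    """Model of lookahead: position i sees the bit at position i + n."""
    return stream >> n


def initial_marks(data: bytes, final: int) -> int:
    """Move marks from the final to the initial code unit of each sequence."""
    ascii_ = byte_class(data, 0x00, 0x7F)    # 1-byte sequences
    prefix2 = byte_class(data, 0xC2, 0xDF)   # first byte of 2-byte sequences
    prefix3 = byte_class(data, 0xE0, 0xEF)   # first byte of 3-byte sequences
    prefix4 = byte_class(data, 0xF0, 0xF4)   # first byte of 4-byte sequences
    return ((final & ascii_)
            | (lookahead(final, 1) & prefix2)
            | (lookahead(final, 2) & prefix3)
            | (lookahead(final, 3) & prefix4))


if __name__ == "__main__":
    data = "a\u00e9b\u20ac".encode("utf-8")
    final = (1 << 2) | (1 << 6)              # marks at final code units
    print(bin(initial_marks(data, final)))
```

The largest shift needed is 3, for four-byte sequences, which matches the LookAhead(3) bound for UTF-8.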
If-Hierarchy Evaluation and Generalization
The UCD compiler generates code that achieves good performance primarily through an if-hierarchy. This is based on the notion that input data in each block will generally be confined to a small number of consecutive ranges of Unicode code points. The current if-hierarchy is based on a hand-developed initial structure. Studies should be made to assess and/or improve the hierarchy in various cases.
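The effect of an if-hierarchy can be illustrated with a small Python model. This is a sketch under stated assumptions, not the UCD compiler's actual structure: the range boundaries below are hypothetical, a block is modeled as a list of codepoints rather than as bit streams, and the test counter exists only to show how much work is skipped.

```python
from dataclasses import dataclass, field


@dataclass
class RangeNode:
    lo: int
    hi: int
    children: list = field(default_factory=list)


def match_block(block, node, members, stats):
    """Mark positions in block whose codepoint is in members, within node's range.

    A subtree is entered only when the block contains at least one
    codepoint in its range; otherwise its body is skipped entirely.
    """
    stats["tests"] += 1
    in_range = 0
    for i, cp in enumerate(block):
        if node.lo <= cp <= node.hi:
            in_range |= 1 << i
    if in_range == 0:
        return 0                      # guarded body skipped
    if not node.children:             # leaf: do the actual set test
        marks = 0
        for i, cp in enumerate(block):
            if (in_range >> i) & 1 and cp in members:
                marks |= 1 << i
        return marks
    marks = 0
    for child in node.children:
        marks |= match_block(block, child, members, stats)
    return marks


# Hypothetical hierarchy; the real one is hand-developed and differs.
HIERARCHY = RangeNode(0x0, 0x10FFFF, [
    RangeNode(0x0, 0x7F),
    RangeNode(0x80, 0x7FF),
    RangeNode(0x800, 0xFFFF, [RangeNode(0x800, 0x2FFF), RangeNode(0x3000, 0xFFFF)]),
    RangeNode(0x10000, 0x10FFFF),
])

if __name__ == "__main__":
    members = {0x61, 0x4E2D}
    for text in ("abc", "a\u4e2d"):
        stats = {"tests": 0}
        marks = match_block([ord(c) for c in text], HIERARCHY, members, stats)
        print(text, bin(marks), stats["tests"])
```

With this hypothetical hierarchy, an ASCII-only block performs 5 range tests and never enters the subtree for 0x800 to 0xFFFF, while a block that also contains U+4E2D performs 7. Counting tests in this way over representative input is one simple means of comparing candidate hierarchies.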
Ternary Logic
Recent work allows the Parabix system to take advantage of AVX-512 ternary operations for bitwise logic (thanks, Luiz!). Further evaluation and testing of this in the context of the UCD compiler should take place.
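For reference, the semantics of an AVX-512 ternary logic operation (the VPTERNLOG family) can be modeled in Python as below. The 8-bit immediate is a truth table indexed by the corresponding bits of the three operands, with the first operand supplying the most significant index bit. The function name and the 64-bit default width are illustrative assumptions; this models the instruction only and says nothing about how Parabix selects immediates.

```python
def ternary(a: int, b: int, c: int, imm8: int, width: int = 64) -> int:
    """Bitwise three-input logic selected by an 8-bit truth table.

    For each bit position, the bits of a, b and c form a 3-bit index
    (a is the most significant); the result bit is that bit of imm8.
    """
    result = 0
    for i in range(width):
        index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm8 >> index) & 1) << i
    return result


# Immediates for some common three-input functions.
SELECT = 0xCA      # (a & b) | (~a & c)
XOR3 = 0x96        # a ^ b ^ c
MAJORITY = 0xE8    # (a & b) | (a & c) | (b & c)

if __name__ == "__main__":
    a, b, c = 0x0F0F, 0x3333, 0x5555
    print(hex(ternary(a, b, c, SELECT)))
    print(hex(ternary(a, b, c, XOR3)))
    print(hex(ternary(a, b, c, MAJORITY)))
```

Each of these functions would otherwise take two or more two-input operations, which is why expressions with many three-way combinations are natural candidates for evaluation.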