cameron · 89da4403

--- a/UCD:-Unicode-Property-Database-and-Compilers.md
+++ b/UCD:-Unicode-Property-Database-and-Compilers.md
+# UCD: The Unicode Database
+
+The Unicode Consortium defines a database of character properties for Unicode characters,
+as documented in [UAX #44: Unicode Character Database](https://unicode.org/reports/tr44/).
+Parabix has built-in support for these properties.   
+
+## Property Names and Aliases
+
+The enumeration `UCD::property_t` has one enumerator for each property of the Unicode Character Database.
+`PropertyAliases.h` defines this enumeration together with the standard names and aliases of the properties.
+
+## Property Values
+
+For any given property, there is a value of that property for each character in Unicode.
+Many properties are enumerated properties: the value of each character
+under the property is one of a fixed set of named values.   The names of all
+values for enumerated properties are found in `PropertyValueAliases.h`, together with aliases for those names when they exist.
+For example, the `Script` property has enumeration codes such as `Latn`, `Grek` and `Cyrl`, which identify
+the script of each character.
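
The relationship between enumeration codes and their aliases can be sketched as a small lookup table. This is a minimal illustration only: the names `Script`, `ValueAlias`, `kScriptAliases`, `lookupScript` and `scriptCode` are hypothetical and are not taken from `PropertyValueAliases.h`. The long names `Latin`, `Greek` and `Cyrillic` are the standard Unicode aliases for the three codes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the value codes of one enumerated property.
enum Script : int { Latn = 0, Grek, Cyrl, ScriptCount };

// One entry per enumerator: the short code and its long alias.
struct ValueAlias { const char * code; const char * alias; };
static const ValueAlias kScriptAliases[ScriptCount] = {
    {"Latn", "Latin"}, {"Grek", "Greek"}, {"Cyrl", "Cyrillic"}};

// Resolve either spelling to its enumeration code; returns -1 if unrecognized.
inline int lookupScript(const std::string & name) {
    for (int i = 0; i < ScriptCount; ++i) {
        if (name == kScriptAliases[i].code || name == kScriptAliases[i].alias) {
            return i;
        }
    }
    return -1;
}

// Map an enumeration code back to its short name.
inline const char * scriptCode(int value) {
    assert(value >= 0 && value < ScriptCount);
    return kScriptAliases[value].code;
}
```

With this table, `lookupScript("Greek")` and `lookupScript("Grek")` resolve to the same enumerator.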
+
+Binary properties are a special case of enumerated
+properties having the values `Y` and `N` standing for yes or no (true or false).
+
+## Unicode Sets
+
+Parabix describes many of the properties of Unicode characters in terms of UnicodeSet data objects.
+A UnicodeSet is a compact bitset representation of the set of all characters having some
+property-value combination.   For example, the set of Greek characters can be represented as
+the UnicodeSet of all characters whose Script property has the value `Grek`.
+
+UnicodeSet objects provide many of the standard set operations, such as membership tests, union, intersection,
+set difference and so on.   See `unicode/core/unicode_set.h`.
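
To make the set operations concrete, here is a minimal sketch that uses `std::set` as a stand-in for a set of codepoints. It does not use or reproduce the `UnicodeSet` interface; the names `CodepointSet`, `makeRange`, `setUnion`, `setIntersection`, `setDifference` and `contains` are assumptions made for illustration. The actual operations are declared in `unicode/core/unicode_set.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>

// Illustrative stand-in only; the real UnicodeSet is a compact bitset.
using CodepointSet = std::set<std::uint32_t>;

// Build the set of all codepoints in the inclusive range [lo, hi].
inline CodepointSet makeRange(std::uint32_t lo, std::uint32_t hi) {
    assert(lo <= hi && hi <= 0x10FFFF);
    CodepointSet s;
    for (std::uint32_t cp = lo; cp <= hi; ++cp) s.insert(cp);
    return s;
}

inline bool contains(const CodepointSet & s, std::uint32_t cp) {
    return s.count(cp) != 0;
}

inline CodepointSet setUnion(const CodepointSet & a, const CodepointSet & b) {
    CodepointSet r;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::inserter(r, r.end()));
    return r;
}

inline CodepointSet setIntersection(const CodepointSet & a, const CodepointSet & b) {
    CodepointSet r;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::inserter(r, r.end()));
    return r;
}

inline CodepointSet setDifference(const CodepointSet & a, const CodepointSet & b) {
    CodepointSet r;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::inserter(r, r.end()));
    return r;
}
```

A `std::set` stores every member individually, which is why a run-based bitset is the better fit for sets that cover large contiguous stretches of the codepoint space.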
+
+Most Unicode properties have predefined UnicodeSet values built in.   For example, the UnicodeSet for
+Greek characters is defined in `Scripts.h` as follows.
+```
+    /** Code Point Ranges for Grek
+    [0370, 0373], [0375, 0377], [037a, 037d], [037f, 037f], [0384, 0384],
+    [0386, 0386], [0388, 038a], [038c, 038c], [038e, 03a1], [03a3, 03e1],
+    [03f0, 03ff], [1d26, 1d2a], [1d5d, 1d61], [1d66, 1d6a], [1dbf, 1dbf],
+    [1f00, 1f15], [1f18, 1f1d], [1f20, 1f45], [1f48, 1f4d], [1f50, 1f57],
+    [1f59, 1f59], [1f5b, 1f5b], [1f5d, 1f5d], [1f5f, 1f7d], [1f80, 1fb4],
+    [1fb6, 1fc4], [1fc6, 1fd3], [1fd6, 1fdb], [1fdd, 1fef], [1ff2, 1ff4],
+    [1ff6, 1ffe], [2126, 2126], [ab65, ab65], [10140, 1018e],
+    [101a0, 101a0], [1d200, 1d245]**/
+
+
+    namespace {
+    const static UnicodeSet::run_t __grek_Set_runs[] = {
+    {Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
+    {Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
+    {Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
+    {Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
+    {Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
+    const static UnicodeSet::bitquad_t  __grek_Set_quads[] = {
+    0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
+    0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
+    0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
+    0x0000003f};
+    }
+
+    const static UnicodeSet grek_Set{const_cast<UnicodeSet::run_t *>(__grek_Set_runs), 25, 0,
+                                     const_cast<UnicodeSet::bitquad_t *>(__grek_Set_quads), 19, 0};
+```
+In general, these predefined UnicodeSet values are generated automatically by
+scripts written in Python, based on the text files of the Unicode Character Database (UCD).
+For example, `Scripts.h` is generated from the UCD file `Scripts.txt`.
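
The run and quad arrays in the listing can be read as a run-length encoding over 32-bit blocks of the codepoint space. An `Empty` or `Full` run covers that many all-zero or all-one quads, and each quad in a `Mixed` run takes its bits from the next entry of the quad array. The following sketch decodes the data copied from the listing under that reading. The types `Run` and `BitQuad` and the functions `contains` and `isGrek` are simplified stand-ins, not the Parabix implementation. The bit order (bit `i` of quad `q` stands for codepoint `32 * q + i`) is an inference from checking the data against the ranges in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for UnicodeSet::run_t and UnicodeSet::bitquad_t.
enum RunType { Empty, Mixed, Full };
struct Run { RunType type; std::uint32_t length; };  // length is counted in 32-bit quads
using BitQuad = std::uint32_t;

// Runs and quads copied from the Grek listing.
static const Run grekRuns[] = {
    {Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
    {Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
    {Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
    {Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
    {Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
static const BitQuad grekQuads[] = {
    0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
    0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
    0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
    0x0000003f};

// Membership test: walk the runs until reaching the one that covers the quad of cp.
inline bool contains(const Run * runs, std::size_t runCount,
                     const BitQuad * quads, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    const std::uint32_t target = cp / 32;  // index of the quad holding cp
    std::uint32_t quadBase = 0;            // index of the first quad of the current run
    std::size_t mixedSeen = 0;             // explicit quads used by earlier Mixed runs
    for (std::size_t i = 0; i < runCount; ++i) {
        const Run & r = runs[i];
        if (target < quadBase + r.length) {
            if (r.type == Empty) return false;
            if (r.type == Full) return true;
            const BitQuad q = quads[mixedSeen + (target - quadBase)];
            return ((q >> (cp % 32)) & 1u) != 0;
        }
        if (r.type == Mixed) mixedSeen += r.length;
        quadBase += r.length;
    }
    return false;  // cp lies beyond the runs provided
}

inline bool isGrek(std::uint32_t cp) {
    return contains(grekRuns, sizeof(grekRuns) / sizeof(grekRuns[0]), grekQuads, cp);
}
```

The run lengths sum to 34816 quads, which is exactly `0x110000 / 32`, and the `Mixed` runs account for all 19 entries of the quad array. A linear walk is used here for clarity only.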
+
+## Unicode Property Objects
+
+The principal means of accessing information about a UCD property is through
+its `PropertyObject`.   There are several subclasses of `PropertyObject` depending
+on the type of property: binary, enumerated, numeric or string.
+See `PropertyObjects.h` for the definitions of property object types.
+See `PropertyObjectTable.h` for the actual definition of property objects for each UCD property.
+
+For example, to determine the UnicodeSet of Greek characters, one can use the
+operation:
+```
+llvm::cast<UCD::EnumeratedPropertyObject>(UCD::property_object_table[UCD::sc])->GetCodepointSet("Greek");
+```
+
+
+## Unicode Property Resolution
+
+When a regular expression `Name` object refers to a Unicode character property, resolving that name
+into its corresponding UnicodeSet may be performed using operations defined in `re/unicode/resolve_properties.h`.
+
+The `UnicodePropertyKernelBuilder` in `include/kernel/unicode/UCD_property_kernel.h` is useful for constructing property streams
+for any of the defined Unicode properties.   An example of its use for counting
+occurrences of a property within a file may be found in the `ucount` utility
+(see `tools/wc/ucount.cpp`).
+
+## Grapheme Cluster Boundaries
+
+Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character.   For example, a base character such as the letter `a` may be followed by an accent character such as ´ to produce the accented character `á`.   The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem.   The full Unicode rules for this are documented in
+[UAX #29: Unicode Text Segmentation](https://unicode.org/reports/tr29/).
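
As a deliberately simplified illustration of the boundary problem, the sketch below applies only one of the UAX #29 rules: there is no boundary before an Extend character (rule GB9). It further assumes that only the Combining Diacritical Marks block, U+0300 to U+036F, counts as Extend. The helper names `isExtendSketch` and `countClusters` are hypothetical. This is not the logic used by Parabix and it ignores every other rule (Hangul syllables, emoji sequences, regional indicators, CR LF and so on).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: only U+0300..U+036F is treated as Extend in this sketch.
inline bool isExtendSketch(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count clusters over a sequence of codepoints: a new cluster starts at the
// beginning of the text and before every codepoint that is not Extend.
inline std::size_t countClusters(const std::vector<std::uint32_t> & text) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 0 || !isExtendSketch(text[i])) ++clusters;
    }
    return clusters;
}
```

For the sequence `a`, U+0301 (combining acute accent), `b`, this counts two clusters, because the accent attaches to the preceding `a`.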
+
+The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the `gcount` utility (see `tools/wc/gcount.cpp`).
\ No newline at end of file