UCD: Unicode Property Database and Compilers · Changes

Update UCD: Unicode Property Database and Compilers, authored May 22, 2024 by cameron
# UCD: The Unicode Database
The Unicode Consortium defines a database of character properties for Unicode characters,
as documented in [UAX #44: Unicode Character Database](https://unicode.org/reports/tr44/).
Parabix has built-in support for these properties.
## Property Names and Aliases
The enumeration `UCD::property_t` has one member for each property of the Unicode Character Database.
`PropertyAliases.h` defines this enumeration together with the standard names and aliases of the properties.
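
To make the idea of names and aliases concrete, here is a minimal sketch of resolving a property name to an enumeration member. It is not the code in `PropertyAliases.h`: the `Property` enum, `canonicalize`, `resolveProperty` and `fullName` are hypothetical names, and only three properties are listed. The normalization is a subset of the UAX #44 loose matching rule UAX44-LM3: it ignores case, whitespace, underscores and hyphens, and leaves out that rule's handling of an initial `is` prefix.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a small slice of UCD::property_t.
enum class Property { Script, GeneralCategory, Alphabetic, Unknown };

// Normalize a name: drop whitespace, '_' and '-', and lowercase the rest.
std::string canonicalize(const std::string & name) {
    std::string key;
    for (unsigned char c : name) {
        if (std::isspace(c) || c == '_' || c == '-') continue;
        key.push_back(static_cast<char>(std::tolower(c)));
    }
    return key;
}

// Map a short alias or a full name to its enumeration member.
Property resolveProperty(const std::string & name) {
    static const std::unordered_map<std::string, Property> table = {
        {"sc", Property::Script}, {"script", Property::Script},
        {"gc", Property::GeneralCategory}, {"generalcategory", Property::GeneralCategory},
        {"alpha", Property::Alphabetic}, {"alphabetic", Property::Alphabetic},
    };
    const auto it = table.find(canonicalize(name));
    return it == table.end() ? Property::Unknown : it->second;
}

// Standard full name of a property; Unknown has no name.
const char * fullName(Property p) {
    static const char * const names[] = {"Script", "General_Category", "Alphabetic"};
    const auto i = static_cast<std::size_t>(p);
    assert(i < sizeof(names) / sizeof(names[0]) && "no name for Unknown");
    return names[i];
}
```

With this sketch, `resolveProperty("SC")`, `resolveProperty("script")` and `resolveProperty("Script")` all yield the same member.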
## Property Values
For any given property, there is a value of that property for each character in Unicode.
Many of the properties are enumerated properties: the value of the property for each character
is one of a fixed set of named values. The names of all
values for enumerated properties are found in `PropertyValueAliases.h`, together with aliases for those names when they exist.
For example, the `Script` property has enumeration codes such as `Latn`, `Grek` and `Cyrl`, each identifying
the specific script to which a character belongs.
Binary properties are a special case of enumerated
properties, having only the values `Y` and `N`, standing for yes and no (true and false).
## Unicode Sets
Parabix describes many of the properties of Unicode characters in terms of UnicodeSet data objects.
A UnicodeSet is a compact bitset representation defining all the characters having some
property-value combination. For example, the set of Greek characters can be represented as
the UnicodeSet of all characters whose Script property has the value `Grek`.
UnicodeSet objects provide many of the standard set operations, such as membership tests, union, intersection,
set difference and so on. See `unicode/core/unicode_set.h`.
Most Unicode properties have predefined UnicodeSet values built in. For example, the UnicodeSet for
Greek characters is defined in `Scripts.h` as follows.
```
/** Code Point Ranges for Grek
[0370, 0373], [0375, 0377], [037a, 037d], [037f, 037f], [0384, 0384],
[0386, 0386], [0388, 038a], [038c, 038c], [038e, 03a1], [03a3, 03e1],
[03f0, 03ff], [1d26, 1d2a], [1d5d, 1d61], [1d66, 1d6a], [1dbf, 1dbf],
[1f00, 1f15], [1f18, 1f1d], [1f20, 1f45], [1f48, 1f4d], [1f50, 1f57],
[1f59, 1f59], [1f5b, 1f5b], [1f5d, 1f5d], [1f5f, 1f7d], [1f80, 1fb4],
[1fb6, 1fc4], [1fc6, 1fd3], [1fd6, 1fdb], [1fdd, 1fef], [1ff2, 1ff4],
[1ff6, 1ffe], [2126, 2126], [ab65, ab65], [10140, 1018e],
[101a0, 101a0], [1d200, 1d245]**/
namespace {
const static UnicodeSet::run_t __grek_Set_runs[] = {
{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}, {Empty, 201},
{Mixed, 3}, {Empty, 1}, {Mixed, 1}, {Empty, 10}, {Mixed, 1}, {Full, 1},
{Mixed, 2}, {Full, 1}, {Mixed, 3}, {Empty, 9}, {Mixed, 1},
{Empty, 1105}, {Mixed, 1}, {Empty, 686}, {Full, 2}, {Mixed, 2},
{Empty, 1666}, {Full, 2}, {Mixed, 1}, {Empty, 31085}};
const static UnicodeSet::bitquad_t __grek_Set_quads[] = {
0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003, 0x000007c0, 0xe0000000,
0x000007c3, 0x80000000, 0x3f3fffff, 0xaaff3f3f, 0x3fffffff, 0xffdfffff,
0xefcfffdf, 0x7fdcffff, 0x00000040, 0x00000020, 0x00007fff, 0x00000001,
0x0000003f};
}
const static UnicodeSet grek_Set{const_cast<UnicodeSet::run_t *>(__grek_Set_runs), 25, 0,
const_cast<UnicodeSet::bitquad_t *>(__grek_Set_quads), 19, 0};
```
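
The following sketch shows one way to read the run and quad arrays above back into code point ranges. It is not the Parabix implementation: `Run`, `BitQuad`, `decodeRanges` and `grekPrefixRanges` are hypothetical names. It assumes that each quad covers 32 consecutive code points with the least significant bit first, and that each `Mixed` run consumes one quad per unit of length. Both assumptions are inferred from the listing, since they reproduce the ranges in its comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical mirrors of UnicodeSet::run_t and UnicodeSet::bitquad_t.
enum RunType { Empty, Mixed, Full };
struct Run { RunType type; uint32_t length; };  // length counted in quads
using BitQuad = uint32_t;                        // assumed: 32 code points per quad
using Range = std::pair<uint32_t, uint32_t>;     // inclusive [lo, hi]

// Expand runs and quads into a sorted list of inclusive code point ranges.
std::vector<Range> decodeRanges(const std::vector<Run> & runs,
                                const std::vector<BitQuad> & quads) {
    std::vector<Range> ranges;
    std::size_t q = 0;  // index of the next unread Mixed quad
    uint32_t cp = 0;    // code point of the bit currently examined
    for (const Run & run : runs) {
        for (uint32_t i = 0; i < run.length; ++i) {
            BitQuad bits = 0;  // Empty: no members in this quad
            if (run.type == Full) {
                bits = 0xFFFFFFFFu;
            } else if (run.type == Mixed) {
                assert(q < quads.size() && "more Mixed runs than quads");
                bits = quads[q++];
            }
            for (unsigned b = 0; b < 32; ++b, ++cp) {
                if (((bits >> b) & 1u) == 0) continue;
                if (!ranges.empty() && ranges.back().second + 1 == cp) {
                    ranges.back().second = cp;    // extend the open range
                } else {
                    ranges.emplace_back(cp, cp);  // start a new range
                }
            }
        }
    }
    return ranges;
}

// The first four runs and quads of the Grek data above (U+0000..U+03FF).
std::vector<Range> grekPrefixRanges() {
    const std::vector<Run> runs = {{Empty, 27}, {Mixed, 3}, {Full, 1}, {Mixed, 1}};
    const std::vector<BitQuad> quads = {0xbcef0000, 0xffffd750, 0xfffffffb, 0xffff0003};
    return decodeRanges(runs, quads);
}
```

Under these assumptions, `grekPrefixRanges()` yields the first eleven ranges of the comment, from [0370, 0373] through [03f0, 03ff]. The run lengths of the full array sum to 34816 quads, which is exactly 0x110000 code points, and its `Mixed` runs account for 19 quads, matching the length of `__grek_Set_quads`.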
In general, these predefined UnicodeSet values are generated automatically by
Python scripts from the text files of the Unicode Character Database (UCD).
For example, `Scripts.h` is generated from the UCD file `Scripts.txt`.
## Unicode Property Objects
The principal means of accessing information associated with a UCD property is through
the `PropertyObject` for that property. There are several subclasses of `PropertyObject`, depending
on the type of property: binary, enumerated, numeric or string.
See `PropertyObjects.h` for the definitions of property object types.
See `PropertyObjectTable.h` for the actual definition of property objects for each UCD property.
For example, to determine the UnicodeSet of Greek characters, one can use the
operation:
```
llvm::cast<UCD::EnumeratedPropertyObject>(UCD::property_object_table[UCD::sc])->GetCodepointSet("Greek");
```
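
The pattern in that expression (a table indexed by property, a base class, and a checked downcast to the subclass for the property type) can be modelled in a few lines. This is a simplified, hypothetical model and not the definitions in `PropertyObjects.h`: the `Kind` tag, `castTo`, `propertyObject` and the one-entry `property_t` are illustrative names, and the range data is truncated to a few ranges.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<unsigned, unsigned>;  // inclusive code point range

// Base class carrying a kind tag, in the style of LLVM's hand-rolled RTTI.
class PropertyObject {
public:
    enum class Kind { Binary, Enumerated, Numeric, String };
    explicit PropertyObject(Kind k) : kind(k) {}
    virtual ~PropertyObject() = default;
    Kind getKind() const { return kind; }
private:
    const Kind kind;
};

// Enumerated properties map each value name to a set of code points.
class EnumeratedPropertyObject : public PropertyObject {
public:
    explicit EnumeratedPropertyObject(std::map<std::string, std::vector<Range>> v)
    : PropertyObject(Kind::Enumerated), values(std::move(v)) {}
    static bool classof(const PropertyObject * p) {
        return p->getKind() == Kind::Enumerated;
    }
    // Throws std::out_of_range for an unknown value name.
    const std::vector<Range> & GetCodepointSet(const std::string & value) const {
        return values.at(value);
    }
private:
    std::map<std::string, std::vector<Range>> values;
};

// Checked downcast in the spirit of llvm::cast: asserts on a kind mismatch.
template <typename T>
const T * castTo(const PropertyObject * p) {
    assert(T::classof(p) && "property object has a different kind");
    return static_cast<const T *>(p);
}

// Hypothetical one-entry property enumeration and object table.
enum property_t { sc, propertyCount };

const PropertyObject * propertyObject(property_t p) {
    // Truncated illustrative data, not the full Script property.
    static const std::map<std::string, std::vector<Range>> scriptValues = {
        {"Grek", {{0x370, 0x373}, {0x375, 0x377}}},
        {"Latn", {{0x41, 0x5a}, {0x61, 0x7a}}},
    };
    static const EnumeratedPropertyObject script(scriptValues);
    static const PropertyObject * const table[propertyCount] = {&script};
    return table[p];
}
```

In this model, `castTo<EnumeratedPropertyObject>(propertyObject(sc))->GetCodepointSet("Grek")` plays the role of the expression above. The cast is needed because only the enumerated subclass offers a lookup by value name.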
## Unicode Property Resolution
When a regular expression `Name` object refers to a Unicode character property, resolving that name
into its corresponding UnicodeSet may be performed using operations defined in `re/unicode/resolve_properties.h`.
The `UnicodePropertyKernelBuilder` in `include/kernel/unicode/UCD_property_kernel.h` is useful for constructing property streams
for any of the defined Unicode properties. An example of its use for counting
occurrences of a property within a file may be found in the `ucount` utility;
see `tools/wc/ucount.cpp`.
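
As a point of reference for what such a count means, the sketch below counts the code points of a UTF-8 string that fall within a set of ranges. It is a plain scalar loop, not the bit-parallel kernel pipeline that `ucount` builds, and it does not use the Parabix API: `inSet` and `countProperty` are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<uint32_t, uint32_t>;  // inclusive code point range

// Linear membership test over a list of ranges.
bool inSet(const std::vector<Range> & ranges, uint32_t cp) {
    for (const Range & r : ranges) {
        if (r.first <= cp && cp <= r.second) return true;
    }
    return false;
}

// Count the code points of `utf8` that lie in `ranges`.
// Assumes well-formed UTF-8; no validation beyond a length check.
std::size_t countProperty(const std::string & utf8, const std::vector<Range> & ranges) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= utf8.size() && "truncated UTF-8 sequence");
        // Payload bits of the lead byte, then six bits per continuation byte.
        uint32_t cp = len == 1 ? lead : lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        }
        if (inSet(ranges, cp)) ++count;
        i += len;
    }
    return count;
}
```

For example, counting against the single range [03a3, 03e1] in the text `αβγ abc` gives 3, since only the three Greek letters fall inside it.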
## Grapheme Cluster Boundaries
Grapheme clusters are sequences of Unicode codepoints that are generally considered together to represent one logical character. For example, a base character such as the letter `a` may be followed by a combining accent such as the acute accent ´ to produce the accented character `á`. The task of separating a stream of characters into grapheme clusters is a text segmentation problem known as the grapheme cluster boundary problem. The full Unicode rules for this are documented in
[UAX #29: Unicode Text Segmentation](https://unicode.org/reports/tr29/).
The logic for computing grapheme cluster boundaries with Parabix methods is illustrated by the `gcount` utility; see `tools/wc/gcount.cpp`.
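
To illustrate the kind of rule involved, here is a scalar sketch that counts grapheme clusters using only a small subset of UAX #29: no break between CR and LF, a break after any control character, and no break before a combining mark. It is not the logic of `gcount`. The names `isExtend`, `isControlOrNewline` and `countClusters` are hypothetical, and the `Extend` class is approximated by the Combining Diacritical Marks block U+0300..U+036F alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for Grapheme_Cluster_Break=Extend:
// only the Combining Diacritical Marks block.
bool isExtend(uint32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// CR, LF and the other C0/C1 control characters.
bool isControlOrNewline(uint32_t cp) {
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

// Count clusters: every code point starts a new cluster unless a
// no-break rule joins it to the previous code point.
std::size_t countClusters(const std::vector<uint32_t> & cps) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        assert(cps[i] <= 0x10FFFF && "not a Unicode code point");
        bool joined = false;
        if (i > 0) {
            const uint32_t prev = cps[i - 1];
            if (prev == 0x0D && cps[i] == 0x0A) {
                joined = true;   // CR followed by LF stays together
            } else if (isExtend(cps[i]) && !isControlOrNewline(prev)) {
                joined = true;   // a combining mark attaches to its base
            }
        }
        if (!joined) ++clusters;
    }
    return clusters;
}
```

With this sketch, the sequence `a`, U+0301, `b` counts as two clusters: the combining acute accent joins the preceding `a`. The complete rule set also covers Hangul syllables, emoji sequences, regional indicators and more, which this subset ignores.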