Commit 75a71fba authored by aytang

added comments in source code and updated implementation guide

parent 84fa6f0e
......@@ -18,29 +18,32 @@ radicalgrep
`radicalgrep.cpp` is the main framework for the Radical Grep program. The auxiliary functions and radical-set maps can be found in `radical_interface.h`.
### **kRSKangXi.h**
The Radical Count program uses the kRSKangxi property to distinguish all 214 radicals in the [Kangxi Radical System](https://en.wikipedia.org/wiki/Kangxi_radical). In `kRSKangxi.h`, we have a unicode set for each radical, where each set contains the codepoint ranges for the Chinese characters with the corresponding radical. We generated this header file by using [unihan-scripts](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/tree/delta-radicalgrep/unihan-scripts), based off of [Unihan_RadicalStrokeCounts.txt](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/blob/delta-radicalgrep/unihan-scripts/Unihan/Unihan_RadicalStrokeCounts.txt) of the Unihan database.
Radical Grep uses the kRSKangxi property to distinguish all 214 radicals in the [Kangxi Radical System](https://en.wikipedia.org/wiki/Kangxi_radical). In `kRSKangxi.h`, we have a unicode set for each radical, where each set contains the codepoint ranges for the Chinese characters with the corresponding radical. We generated this header file by using [unihan-scripts](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/tree/delta-radicalgrep/unihan-scripts), based off of [Unihan_RadicalStrokeCounts.txt](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/blob/delta-radicalgrep/unihan-scripts/Unihan/Unihan_RadicalStrokeCounts.txt) of the Unihan database.
## **radical_interface.h & radical_interface.cpp**
The `radical_interface` files defines the namespace `BS` and the corresponding functions and variables related to radical grep.
The `radical_interface` files define the namespace `BS` and the functions and variables used in Radical Grep.
Members of class `UnicodeSetTable`:
* `map<string, const UCD::UnicodeSet*>_unicodeset_radical_table`:
This is a map that lists all 214 radicals and their corresponding Unicode set there were predefined from [kRSKangXi.h](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/blob/delta-radicalgrep/include/unicode/data/kRSKangXi.h). This is not used in the current iteration, but will be implemented later on.
This is a map that lists all 214 radicals and their corresponding Unicode sets that were predefined in [kRSKangXi.h](https://cs-git-research.cs.surrey.sfu.ca/cameron/parabix-devel/blob/delta-radicalgrep/include/unicode/data/kRSKangXi.h). Th
* `map<string, const UCD::UnicodeSet*> radical_table`:
Instead of a numeric key, the actual Kangxi radical character is used as the key and mapped to its corresponding Unicode set. Note that one Unicode set may be shared by several radical characters (e.g. 水 and 氵 both map to set 85).
* `get_uset()`:
This function maps the inputted radical to the corresponding UnicodesSet predefined in `radical_table`. If the program is in index mode (`-i`), the function looks for the requested radical in `_unicodeset_radical_table` and checks if the input is valid. In the case of an invalid input, an error message will appear and terminate the program.
This function maps the inputted radical to the corresponding UnicodeSet predefined in `radical_table`. If the program is in index mode (`-i`), the function looks for the requested radical in `_unicodeset_radical_table` and checks if the input is valid. In mixed mode (`-m`), the function searches for the radical in both tables. In the case of an invalid input, the function reports an error and terminates the program.
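
The lookup order described for `get_uset()` can be sketched as follows. This is a minimal, self-contained illustration and not the code from `radical_interface.cpp`: `RadicalSetId`, `index_table`, `character_table` and `lookup_radical()` are hypothetical names, the tables hold only a few sample entries, and an invalid input yields an empty `std::optional` (C++17) where the real function calls `llvm::report_fatal_error`.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for UCD::UnicodeSet: just the Kangxi radical number.
using RadicalSetId = int;
using RadicalTable = std::map<std::string, RadicalSetId>;

// Sample entries only; the real tables cover all 214 radicals.
static const RadicalTable index_table = {{"9", 9}, {"61", 61}, {"85", 85}};
static const RadicalTable character_table = {
    {"亻", 9}, {"心", 61}, {"水", 85}, {"氵", 85}};

// Looks up `key` in `table`; returns std::nullopt when it is absent.
static std::optional<RadicalSetId> find_in(const RadicalTable& table,
                                           const std::string& key) {
    auto it = table.find(key);
    if (it == table.end()) return std::nullopt;
    return it->second;
}

// Index mode consults only the index table, mixed mode tries the index
// table first and then the character table, and the default mode
// consults only the character table.
std::optional<RadicalSetId> lookup_radical(const std::string& radical,
                                           bool indexMode, bool mixedMode) {
    if (indexMode) return find_in(index_table, radical);
    if (mixedMode) {
        if (auto hit = find_in(index_table, radical)) return hit;
        return find_in(character_table, radical);
    }
    return find_in(character_table, radical);
}
```

With these sample tables, `lookup_radical("氵", false, false)` and `lookup_radical("水", false, false)` resolve to the same set, while `lookup_radical("氵", true, false)` is empty because index mode accepts only numeric keys.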
Members of class `RadicalValuesEnumerator`:
* `parse_input()`:
This function parses the inputted radical expression (e.g. 氵_ or 氵_子_ ) and stores it in a vector `radical_list`. The variable `radical_num` Store the number of inputted radical(s).
This function parses the inputted radical expression (e.g. 氵_ or 氵_子_ ) and stores it in a vector `radical_list`. In alt mode (`-alt`), further parsing must be done. Radicals that are not enclosed in braces are put in the storage buffer vectors `zi` and `zi2`. The radicals inside the braces are sent to `reParse()` for further processing.
* `reParse()`:
This function is used in alt mode and tokenizes the radicals enclosed in braces. When given a radical expression of `X_Y_{A/B}_`, `reParse()` tokenizes `{A/B}` with '/' as the delimiter. `A` and `B` are pushed into the vector `reTemp`.
* `createREs()`:
This function finds the inputted radical from `radical_list`, and searches for it by invoking `get_uset()`.
This function takes the radicals that have been parsed from `parse_input()` and `reParse()`, and returns a vector `REs`. `REs` is a regular expression that represents the inputted radical expression, and contains "alt" nodes of each radical character that were retrieved from `radical_list`, `reTemp`, `zi` and `zi2`.
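
The two-level tokenization described above (split on `_`, then split a braced group on `/`) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `parse_input()`/`reParse()` code: `split()` and `parse_radical_expression()` are hypothetical helpers, and a single nested vector stands in for the `radical_list`, `reTemp`, `zi` and `zi2` buffers.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on `delim`, dropping empty tokens.
// Splitting on bytes is safe here because '_', '/', '{' and '}' are ASCII
// and never occur inside a multi-byte UTF-8 sequence.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> tokens;
    std::stringstream ss(text);
    std::string token;
    while (std::getline(ss, token, delim)) {
        if (!token.empty()) tokens.push_back(token);
    }
    return tokens;
}

// Returns one entry per character position; each entry lists the radicals
// allowed at that position (one radical, or several for a {A/B} group).
std::vector<std::vector<std::string>> parse_radical_expression(
        const std::string& expr) {
    std::vector<std::vector<std::string>> positions;
    for (const std::string& token : split(expr, '_')) {
        const bool braced = token.size() >= 2 &&
                            token.front() == '{' && token.back() == '}';
        if (braced) {
            // Strip the braces and split the alternatives on '/'.
            positions.push_back(split(token.substr(1, token.size() - 2), '/'));
        } else {
            positions.push_back({token});
        }
    }
    return positions;
}
```

For the expression `亻_衣_{生/亅}_`, this sketch yields three positions: `{亻}`, `{衣}` and `{生, 亅}`. Each position would then correspond to one "alt" node in the vector returned by `createREs()`.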
## **radicalgrep.cpp**
......@@ -59,26 +62,39 @@ This file is the main framework of Radical Grep. The LLVM input parser takes in
* `allfiles`: Stores the filepaths. When a file has finished being processed, it gets popped from the vector so that a new file can be looked at.
* `indexMode`: An optional command flag; indicates if radical indices are being used.
* `radicalREs`: Stores the return value of `generateREs()`.
* `color`: Command options for colourization.
* `matchFound`: Indicates if matching line(s) have been found in the files.
* `radicalREs`: Stores the return value of `generateREs()`.
### Command Line Flags
* `indexMode`: Indicates if radical indices are being used.
* `mixMode`: Indicates if radical indices and radical characters are being used in conjunction.
* `altMode`: Indicates if alternative character options are provided.
* `Color`: Command options for colourization.
* `LineNumberOption`: Prints out the line count of the file for each outputted line.
* `WithFilenameOption`: Prints out file name for each outputted line.
* `CLKCountingOption`: Prints out the runtime of the search.
### Functions
* `generateREs(std::string input_radical)`: This function parse the input and gets the REs by invoking `createREs()`.
* `generateREs(std::string input_radical)`: This function parses the input and returns the regular expression of `input_radical`.
* `setColoring()`: Defined in file `grep_engine.cpp`; it changes the terminal's text colour to red for the characters with the corresponding radicals.
* `initFileResult(allfiles)`: Defined in file `grep_engine.cpp`; this is a construction and initialization function. Like the name suggests, it initializes the input paths, path size and so on.
* `initREs(radicalREs) `: Defined in file `grep_engine.cpp`; this is also a construction and initialization function. It takes care of all Unicode related tasks of the REs provided by `generateREs()`.
* `initREs(radicalREs)`: Defined in file `grep_engine.cpp`; this is also a construction and initialization function. It takes care of all Unicode-related tasks of the regular expression provided by `generateREs()`.
* `grepCodeGen()`: Defined in file `grep_engine.cpp`; this is the main function of Radical Grep. It is a code generation function; which returns the number of equivalent characters found.
* `grepCodeGen()`: Defined in file `grep_engine.cpp`; this is the main function of Radical Grep. It generates the grep pipeline.
* `searchAllFiles()`: Defined in file `grep_engine.cpp`; this function searches all the files for matches at the same time. If any matches are found, it returns true; otherwise, it returns false.
**Authored by Team Delta:** Anna Tang, Lexie Yu (Yu Ruonan), Pan Chuwen
**Authored by Team Delta:** Anna Tang, Lexie Yu (Yu Ruo Nan), Pan Chu Wen
**Last Updated:** 2020/06/01
**Last Updated:** 2020/06/12
......@@ -10,17 +10,18 @@ using namespace BS;
using namespace UCD::KRS_ns;
namespace BS
{
{
//A functor used to invoke get_uset() in createREs()
static UnicodeSetTable ucd_radical;
const UCD::UnicodeSet&& UnicodeSetTable::get_uset(string radical, bool indexMode, bool mixedMode) //Map the input radical to the corresponding UnicodeSet predefined in kRSKangXi.h
const UCD::UnicodeSet&& UnicodeSetTable::get_uset(string radical, bool indexMode, bool mixedMode)
{
if (indexMode) { //search using the index (e.g. 85_)
if(_unicodeset_radical_table.find(radical) != _unicodeset_radical_table.end())
return std::move(*_unicodeset_radical_table[radical]);
else
llvm::report_fatal_error("A radical set for this input does not exist.\nEnter an integer in [1,214], followed by _.");
} else if (mixedMode) {
} else if (mixedMode) { //search using radical characters and the index (e.g. 氵_85_)
if (_unicodeset_radical_table.find(radical) != _unicodeset_radical_table.end())
return std::move(*_unicodeset_radical_table[radical]);
else if (radical_table.find(radical) != radical_table.end())
......@@ -35,9 +36,10 @@ namespace BS
}
}
//Search for the results by making CCs of each radical and pushing them into the vector REs
std::vector<re::RE*> RadicalValuesEnumerator::createREs(bool indexMode, bool mixMode, bool altMode)
{
//REs stores the regular expression nodes for each character and gets returned to the main function.
//temp and temp0 are storage buffers.
std::vector<re::RE*> REs;
std::vector<re::RE*> temp;
std::vector<re::RE*> temp0;
......@@ -68,7 +70,7 @@ namespace BS
}
}
else if (position > 0 && position < c1-1)
else if (position > 0 && position < c1-1)
{
for (std::size_t i = 0; i < zi.size(); i++)
{
......@@ -110,8 +112,16 @@ namespace BS
}
}
else
else //non alt mode inputs.
{
//Suppose we have an inputted radical expression like this: 亻_心_ ,
//and we want to return a regular expression representing that input.
//After parsing, radical_list will look like this: {亻,心}
//The first run of the loop looks like this: REs[0] = (亻|亻)
//It makes an "alt" node, similar to the alt syntax in regex.
//Second run: REs[1] = (心|心)
//After both characters have been processed, REs = (亻|亻)(心|心)
//This means that the first character must have the 亻 radical, and the second character must have the 心 radical.
for (std::size_t i = 0; i < radical_list.size(); i++)
{
temp.push_back(re::makeCC(UCD::UnicodeSet(ucd_radical.get_uset(radical_list[i], indexMode, mixMode))));
......@@ -137,8 +147,10 @@ namespace BS
c1++;
}
position=0;
//Tokenize input_radical, with '_' as the delimiter.
while (getline(ss, temp, '_'))
{ //tokenize the input
{
if (altMode)
{
/* As an example, say we have a radical expression of X_Y_{A/B}_.
......
......@@ -18,6 +18,7 @@
#include <re/toolchain/toolchain.h>
#include <unicode/data/kRSKangXi.h>
//Adapted from icgrep
enum ColoringType {alwaysColor, autoColor, neverColor};
extern ColoringType ColorFlag;
extern bool LineNumberFlag;
......@@ -36,24 +37,35 @@ namespace BS
class UnicodeSetTable
{
public:
const UCD::UnicodeSet&& get_uset(string radical, bool indexMode, bool mixedMode); //Map the input radical to the corresponding UnicodeSet predefined in kRSKangXi.h
//Map the input radical to the corresponding UnicodeSet predefined in kRSKangXi.h
const UCD::UnicodeSet&& get_uset(string radical, bool indexMode, bool mixedMode);
private:
static map<string, const UCD::UnicodeSet*> _unicodeset_radical_table;
static map<string, const UCD::UnicodeSet*> radical_table; //The map list all kinds of radicals and their corresponding UnicodeSet prodefined in kRSKangXi.h
static map<string, const UCD::UnicodeSet*> mixed_table;
//This map contains all 214 radical indices, which are mapped to their corresponding UnicodeSet predefined in kRSKangXi.h
static map<string, const UCD::UnicodeSet*> _unicodeset_radical_table;
//This map lists all radical characters and their corresponding UnicodeSet predefined in kRSKangXi.h
static map<string, const UCD::UnicodeSet*> radical_table;
};
class RadicalValuesEnumerator
{
public:
std::vector<re::RE*> createREs(bool indexMode, bool mixedMode, bool altMode); //Search for the results
void parse_input(string input_radical, bool altMode); //Search for the results by making CCs of each radical and pushing them the vector REs
void reParse(string expr); //For -re mode; tokenizes {X/Y}
//Creates the regular expression of the input and returns it in the form of a vector
std::vector<re::RE*> createREs(bool indexMode, bool mixedMode, bool altMode);
//Parses the inputted radical expression and stores the radicals in radical_list
void parse_input(string input_radical, bool altMode);
//For -alt mode; tokenizes expressions that are in the form of {X/Y}
void reParse(string expr);
private:
std::vector<string> radical_list; //Store the input radical(s)
std::vector<string> reTemp; //For -alt mode; stores the tokenized radicals in reParse()
std::vector<string> zi; //For -alt mode; stores the non-changed radical in rePare() (e.g. zi would store 亻 and 衣 of 亻_衣_{生/亅})
//Stores the parsed radicals from input
std::vector<string> radical_list;
//For -alt mode; stores the tokenized radicals from reParse()
std::vector<string> reTemp;
//For -alt mode; stores the non-changed radical in reParse()
//(e.g. zi would store 亻 and 衣 of 亻_衣_{生/亅})
std::vector<string> zi;
//Similar to the zi vector, acts as another storage buffer for alt mode
std::vector<string> zi2;
//For -alt mode; these variables keep track of positions of certain characters
int position;
int c1;
};
......
......@@ -52,28 +52,32 @@ using namespace pablo;
using namespace kernel;
using namespace codegen;
static cl::OptionCategory radicalgrepFlags("Command Flags", "Options for Radical Grep"); //The command line
static cl::opt<std::string> input_radical(cl::Positional, cl::desc("<Radical Index>"), cl::Required, cl::cat(radicalgrepFlags)); //The input radical(s)
static cl::list<std::string> inputfiles(cl::Positional, cl::desc("<Input File>"), cl::OneOrMore, cl::cat(radicalgrepFlags)); //search for multiple input files is supported
static cl::opt<bool> indexMode("i", cl::desc("Use radical index instead of the radical character to perform search.\n Link to Radical Indices: https://www.yellowbridge.com/chinese/radicals.php"), cl::init(false), cl::cat(radicalgrepFlags));
//category for Radical Grep specific cmd line flags
static cl::OptionCategory radicalgrepFlags("Command Flags", "Options for Radical Grep");
//Input; the radical expression & file(s) to search
static cl::opt<std::string> input_radical(cl::Positional, cl::desc("<Radical Index>"), cl::Required, cl::cat(radicalgrepFlags));
static cl::list<std::string> inputfiles(cl::Positional, cl::desc("<Input File>"), cl::OneOrMore, cl::cat(radicalgrepFlags));
//Radical Grep Input Flags; index mode, mixed mode, and alt mode
static cl::opt<bool> indexMode("i", cl::desc("Use radical index instead of the radical character to perform search.\n Link to Radical Indices: https://www.yellowbridge.com/chinese/radicals.php"), cl::init(false), cl::cat(radicalgrepFlags));
static cl::opt<bool> mixMode("m", cl::desc("Use both radical character and radical index to perform search."), cl::init(false), cl::cat(radicalgrepFlags));
static cl::opt<bool> altMode("alt", cl::desc("Use regular expressions to search for multiple phrases."), cl::init(false), cl::cat(radicalgrepFlags));
//Adpated from grep_interface.cpp
//Adapted from grep_interface.cpp; icgrep output flags - colourization, line number, file name, runtime
ColoringType ColorFlag;
static cl::opt<ColoringType, true> Color("c", cl::desc("Set the colorization of the output."),
static cl::opt<ColoringType, true> Color("c", cl::desc("Set the colorization of the output."), //Turn on/off colourization
cl::values(clEnumValN(alwaysColor, "always", "Turn on colorization when outputting to a file and terminal"),
clEnumValN(autoColor, "auto", "Turn on colorization only when outputting to terminal"),
clEnumValN(neverColor, "never", "Turn off output colorization")
CL_ENUM_VAL_SENTINEL), cl::cat(radicalgrepFlags), cl::location(ColorFlag), cl::init(neverColor));
bool LineNumberFlag, WithFilenameFlag, CLKCountingFlag;
static cl::opt<bool, true> LineNumberOption("n", cl::location(LineNumberFlag), cl::desc("Show the line number with each matching line."), cl::cat(radicalgrepFlags));
static cl::opt<bool, true> WithFilenameOption("h", cl::location(WithFilenameFlag), cl::desc("Show the file name with each matching line."), cl::cat(radicalgrepFlags));
static cl::opt<bool, true> CLKCountingOption("clk", cl::location(CLKCountingFlag), cl::desc("Show the runtime of the function."), cl::cat(radicalgrepFlags));
std::vector<fs::path> allfiles; //Store all path of files
static cl::opt<bool, true> LineNumberOption("n", cl::location(LineNumberFlag), cl::desc("Show the line number with each matching line."), cl::cat(radicalgrepFlags));
static cl::opt<bool, true> WithFilenameOption("h", cl::location(WithFilenameFlag), cl::desc("Show the file name with each matching line."), cl::cat(radicalgrepFlags));
static cl::opt<bool, true> CLKCountingOption("clk", cl::location(CLKCountingFlag), cl::desc("Show the runtime of the function."), cl::cat(radicalgrepFlags));
std::vector<fs::path> allfiles; //Stores all the inputted file paths
std::vector<re::RE*> generateREs(std::string input_radical, bool altMode); //This function parse the input and get the results
std::vector<re::RE*> generateREs(std::string input_radical, bool altMode);
int main(int argc, char* argv[])
{
......@@ -87,30 +91,43 @@ int main(int argc, char* argv[])
CPUDriver pxDriver("radicalgrep");
allfiles=argv::getFullFileList(pxDriver, inputfiles);
//If > 1 files are inputted, the file name will automatically be shown next to each matching line.
if ((allfiles.size() > 1)) WithFilenameFlag = true;
//Adapted from icgrep.cpp; the Parabix Grep engine
std::unique_ptr<grep::GrepEngine> grep;
grep = make_unique<grep::EmitMatchesEngine>(pxDriver);
//For the -clk flag, the runtime starts from here.
long begintime;
if(CLKCountingFlag)
begintime=clock();
auto radicalREs=generateREs(input_radical, altMode); //get the results
if(CLKCountingFlag) begintime=clock();
//generateREs() takes input_radical and parses the expression into a regular expression for processing.
//For each inputted radical, the respective radical set is retrieved and made into a node.
//Following the pattern of the input_radical, this function will return a regular expression in the form of a vector.
auto radicalREs=generateREs(input_radical, altMode);
//If enabled, line count and file name will be shown in the output.
if (WithFilenameFlag) grep->showFileNames();
if (LineNumberFlag) grep->showLineNumbers();
//Turn on colorization if specified by the user
if ((ColorFlag == alwaysColor) || ((ColorFlag == autoColor) && isatty(STDOUT_FILENO))) grep->setColoring();
grep->initFileResult(allfiles); //Defined in file grep_engine, Initialize results of each file
grep->initREs(radicalREs); //Defined in file grep_engine, Initialize the output
grep->grepCodeGen(); //Return the number of the result
const bool matchFound=grep->searchAllFiles(); //Return if there have found any result, if yes, return true, else return false
//if there does not exist any results
//Defined in file grep_engine.cpp and adapted from icgrep.cpp
//Initialize inputted files and the radical regular expression returned from generateREs,
//Create the grep pipeline and search for the radicals in a parallel fashion.
//searchAllFiles() returns true if lines with matching radicals were found
//and false otherwise; the result is stored in matchFound.
grep->initFileResult(allfiles);
grep->initREs(radicalREs);
grep->grepCodeGen();
const bool matchFound=grep->searchAllFiles();
//If no matches were found, print a message saying so.
if (!matchFound) cout << "No matches are found for " << input_radical << endl;
//Endpoint of measuring the runtime.
if(CLKCountingFlag)
{
long endtime=clock();
......