Last edited by lperesde Jun 16, 2022

Bracket Matching

The bracket matching problem deals with parentheses (), square brackets [], braces {} and other syntactic elements that provide for nested syntactic structures with balanced sets of delimiters.

To tackle bracket matching with Parabix methods, a key concept is that of a Nesting Depth BixNum: a representation of the nesting depth of syntactic elements at each position in an input stream. The following example shows an input stream with square brackets and braces, the corresponding nesting depth at each position in the data stream, and the nesting depth BixNum (ND) as a set of 3 parallel bit streams (giving a 3-bit number at each character position).

Bracket Stream   ...[..{.}..{...[.[]..]..}...]...
Nesting Depth    00011122211222233443332221111000
ND BixNum[2]     .................11.............
ND BixNum[1]     ......111..111111..111111.......
ND BixNum[0]     ...111...11....11..111...1111...
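A sequential model makes the correspondence concrete: the depth at each position can be computed by a scan, and the BixNum is simply the bitwise transposition of those depth values. A minimal Python sketch (nesting_depths and bixnum are illustrative helpers, not part of Parabix):

```python
def nesting_depths(stream):
    # '.' stands for any non-bracket character, as in the diagram above.
    # Convention matching the example: an opening delimiter takes the new
    # (incremented) depth; a closing delimiter keeps the depth it closes.
    depths, depth = [], 0
    for ch in stream:
        if ch in "[{":
            depth += 1
            depths.append(depth)
        elif ch in "]}":
            depths.append(depth)
            depth -= 1
        else:
            depths.append(depth)
    return depths

def bixnum(depths, bits):
    # Stream k of the BixNum marks the positions whose depth has bit k set.
    return ["".join("1" if (d >> k) & 1 else "." for d in depths)
            for k in range(bits)]

stream = "...[..{.}..{...[.[]..]..}...]..."
depths = nesting_depths(stream)
nd = bixnum(depths, 3)
```

Printing `depths` and the three strings in `nd` reproduces the Nesting Depth and ND BixNum rows of the diagram above.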

NestingDepth Kernel

The NestingDepth kernel is a Parabix kernel (to be implemented) that can compute the nesting depth of characters in an input stream, given two character class bit streams L identifying all the left or opening delimiters and R identifying all the right or closing delimiters. For our example bracket stream above, L and R are shown below.

Bracket Stream   ...[..{.}..{...[.[]..]..}...]...
L                ...1..1....1...1.1..............
R                ........1.........1..1..1...1...

The NestingDepth kernel also takes a parameter 'MaxDepth' identifying the maximum nesting depth expected for an input source. The number of bit streams in the nesting depth BixNum must equal ceil(log2(MaxDepth + 1)).
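For instance, MaxDepth = 4 means the depths 0 through 4 must be representable, requiring ceil(log2(5)) = 3 bit streams, which matches the 3-bit BixNum in the example above. In Python:

```python
import math

def nd_streams(max_depth):
    # Number of bit streams needed to represent depths 0..max_depth.
    return math.ceil(math.log2(max_depth + 1))
```

So nd_streams(4) and nd_streams(7) both give 3, while nd_streams(8) requires a fourth stream.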

Syntax Validation Using Nesting Depth

Consider the problem of validating JSON array syntax (where whitespace may be freely included between tokens):

array ::= '[' [value {',' value}] ']' 
object ::= '{' [string ':' value {',' string ':' value}] '}' 
value ::= object | array | numeral | string | 'true' | 'false' | 'null'

Validating non-nested JSON

The top-level JSON grammar is given by JSON ::= value, so in some instances - a bare numeral, string, 'true', 'false' or 'null' - there is no nesting at all. See an example below:

// string.json

"This is valid JSON"
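For reference, Python's standard json module agrees that a bare scalar is a complete JSON document:

```python
import json

# A bare string, numeral, or keyword is a valid top-level JSON value.
assert json.loads('"This is valid JSON"') == "This is valid JSON"
assert json.loads('42') == 42
assert json.loads('null') is None
```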

Thus, when we apply the NestingDepth kernel to this stream, we get the output:

Bracket Stream   ....................
Nesting Depth    00000000000000000000
ND BixNum[2]     ....................
ND BixNum[1]     ....................
ND BixNum[0]     ....................

Assume that valueToken is a bit stream that marks the end position of any legal JSON numeral, string or keyword; for the example above, it looks like:

valueToken       ...................1.
EOFbit           ....................1

Then we can check that non-nested values are valid as follows:

  otherND = bnc.UGT(ND, 0)
  zeroND = bnc.EQ(ND, 0);
  begin = ~Advance(<1>, 1)
  valueAtZero = valueToken & zeroND
  stopAtEOF = ~EOFbit

  // If we have simple value at depth 0, we cannot have any other token
  firstValue = ScanTo(begin, valueAtZero)
  nonNestedValue = ScanTo(Advance(firstValue, 1), anyToken)
  errValue = ScanThru(Advance(nonNestedValue, 1), stopAtEOF)

  // If we have any symbol, we cannot have any value at depth 0
  firstSymbol = ScanTo(begin, symbols)
  valueAtZeroAfterSymbol = ScanTo(Advance(firstSymbol, 1), valueAtZero)
  errSymbol = ScanThru(Advance(valueAtZeroAfterSymbol, 1), stopAtEOF)

  // EOFbit is always at depth 0, otherwise we have unmatched parens
  errEOF = EOFbit & otherND

  // combine all possible errors at depth zero
  errSimpleValue = errValue | errSymbol | errEOF
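The scanning primitives used above can be modeled on Python integers, where bit i of an integer stands for stream position i. This is only a sketch of the standard Parabix definitions (in Pablo these are block-at-a-time SIMD operations with carry chaining, not arbitrary-precision integers):

```python
def Advance(m, n=1):
    # Move each marker bit forward by n positions (bit i = stream position i).
    return m << n

def ScanThru(marker, charclass):
    # Carry each marker through a run of class positions, landing on the
    # first position past the run: (m + c) & ~c.
    return (marker + charclass) & ~charclass

def ScanTo(marker, target, mask):
    # Carry each marker forward to the next target position; mask bounds
    # the stream, so a failed scan falls off the end and is discarded.
    return ScanThru(marker, ~target & mask) & target

def bits(s):
    # Parse a '1'/'.' diagram string; position 0 is the leftmost character.
    return sum(1 << i for i, ch in enumerate(s) if ch == '1')

# begin = ~Advance(<1>, 1): a single marker at position 0.
begin = 1
mask = (1 << 21) - 1
valueAtZero = bits("...................1.")
firstValue = ScanTo(begin, valueAtZero, mask)
```

Here the single marker at position 0 (begin) carries forward through every non-target position and lands exactly on the valueToken position of the earlier example.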

Validating arrays

Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrak, RBrak, LBrace, RBrace, DQuote, and Comma are streams marking the positions of the JSON [, ], {, }, " and , tokens respectively.

Suppose that a BixNum ND has been computed as the nesting depth involving the left and right delimiter sets consisting of square brackets and braces.

Then the validation of array syntax at each nesting depth d can be determined as follows.

Each array will be a span from an opening '[' to its closing ']'. Between these brackets will be other tokens as well as nested arrays and objects. However, nested arrays and objects can be easily computed as those spans of elements having a nesting depth greater than d. Using a BixNum compiler bnc, the spans may be computed as bnc.UGT(ND, d). Tokens at depth d can be identified as those at positions identified by the stream bnc.EQ(ND, d).
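These comparisons can be modeled bitwise as well. In this sketch, bnc_EQ and bnc_UGT are illustrative stand-ins for the BixNum compiler's EQ and UGT operations, with streams as Python integers (bit i = position i):

```python
def bits(s):
    # Parse a '1'/'.' diagram string; position 0 is the leftmost character.
    return sum(1 << i for i, ch in enumerate(s) if ch == '1')

def bnc_EQ(nd, d, mask):
    # Positions whose depth equals the constant d: every BixNum bit
    # must match the corresponding bit of d.
    result = mask
    for k, stream in enumerate(nd):
        result &= stream if (d >> k) & 1 else ~stream & mask
    return result

def bnc_UGT(nd, d, mask):
    # Positions whose depth is strictly greater than d, scanning from the
    # high bit down: greater as soon as the depth has a 1 where d has a 0
    # while all higher bits agree.
    gt, eq = 0, mask
    for k in reversed(range(len(nd))):
        if (d >> k) & 1:
            eq &= nd[k]
        else:
            gt |= eq & nd[k]
            eq &= ~nd[k] & mask
    return gt

# The 3-bit BixNum from the example stream above (bit 0 first).
nd = [bits("...111...11....11..111...1111..."),
      bits("......111..111111..111111......."),
      bits(".................11.............")]
mask = (1 << 32) - 1

atDepth2 = bnc_EQ(nd, 2, mask)   # tokens at depth d = 2
nested2 = bnc_UGT(nd, 2, mask)   # spans nested deeper than depth 2
```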

The following code can be used to validate all arrays at depth d, determining whether there are any errors after the opening LBrak or after any value or Comma.

  zeroND = bnc.EQ(ND, 0)
  validEndValues = (valueToken | RBrak | RBrace) & ~zeroND
  validEndValuesMinusStr = validEndValues & ~DQuote

  // Validate that every value that is not a string is followed either
  // by a comma or by a closing bracket or brace.
  // Strings are a special case, checked in the JSONParserObj kernel
  afterToken = Advance(validEndValuesMinusStr, 1)
  tokenNext = ScanThru(afterToken, whitespace)
  errAfterValue = tokenNext & ~(Comma | RBrak | RBrace | zeroND)
  //
  // Every Comma must be followed by a value
  advComma = Advance(Comma, 1)
  validBeginValues = (valueToken | LBrak | LBrace) & ~zeroND
  errAfterComma = ScanTo(advComma, anyToken) & ~validBeginValues
  
  for d in 1...MaxDepth
    atDepth = bnc.EQ(ND, d)
    nested = bnc.UGT(ND, d)
    arrayStart = atDepth & LBrak
    arrayEnd = ScanThru(arrayStart, nested | (atDepth & ~ (RBrak | RBrace)))
    errorAtEnd = arrayEnd & RBrace
    //
    // After the LBrak we must have either a value or an RBrak.
    scanArrAnyToken = ScanTo(Advance(arrayStart, 1), anyToken)
    errAfterLBrak = scanArrAnyToken & ~ (nested | (valueToken & atDepth) | RBrak)

Validating objects

Assume we have two bit streams for tokens: valueToken (explained above) and anyToken, a bit stream marking the end position of any legal or illegal token. Assume also that LBrace, RBrace, DQuote, Colon and Comma are streams marking the position of JSON {, }, ", : and , tokens respectively.

  str = valueToken & DQuote
  zeroND = bnc.EQ(ND, 0)

  // process strings as both key and value
  validStr = str & ~zeroND
  afterTokenStr = Advance(validStr, 1)
  tokenNextStr = ScanThru(afterTokenStr, whitespace)
  errAfterValueStr = tokenNextStr & ~(Comma | Colon | RBrace | RBrak | zeroND)

  // Every Colon must be followed by a value
  validBeginValues = (valueToken | LBrak | LBrace) & ~zeroND
  advColon = Advance(Colon, 1)
  errAfterColon = ScanTo(advColon, anyToken) & ~validBeginValues

  for d in 1...MaxDepth
    atDepth = bnc.EQ(ND, d)
    nested = bnc.UGT(ND, d)
    objStart = atDepth & LBrace
    objEnd = ScanThru(objStart, nested | (atDepth & ~ (RBrak | RBrace)))
    objSpan = ExclusiveSpan(objStart, objEnd)
    errorAtEnd = objEnd & RBrak
    //
    // Every Colon must be within an object span
    errInColon = (Colon & atDepth) & ~objSpan
    //
    // Every Comma must be followed by a key string
    strAtDepth = str & atDepth
    advComma = Advance(Comma & atDepth & objSpan, 1)
    errAfterComma = ScanTo(advComma, anyToken) & ~ strAtDepth
    //
    // After the LBrace we must have either a value or an RBrace.
    scanObjStartAnyToken = ScanTo(Advance(objStart, 1), anyToken)
    errAfterLBrace = scanObjStartAnyToken & ~ (nested | (valueToken & atDepth) | RBrace)

CFG Kernel (note: this doc isn't ready)

The CFG kernel is a Parabix kernel (to be implemented) that can validate input against a context-free grammar, given a set of rules in its own syntax (close to EBNF), a BixNum representing the nesting depth of its tokens (see the NestingDepth kernel), and a StreamSet representing all valid markers.

enum Token {
   lCurly,
   rCurly,
   lBracket,
   rBracket,
   colon,
   str,
   value
};

// TODO: This is just an initial idea and it wouldn't work. Need to improve that.
CFGRule * valueRule = CFG::match(Token::value);
CFGRule * arrRule   = CFG::matchBracket(
   Token::lBracket,
   { Token::value },
   Token::rBracket
);
CFGRule * objRule   = CFG::matchBracket(
   Token::lCurly,
   { Token::str, Token::colon, Token::value },
   Token::rCurly
);
