# Changeset 4471 for docs/Working/icGrep

Ignore:
Timestamp:
Feb 6, 2015, 9:29:26 PM (4 years ago)
Message:

Unicode matching process description and illustration.

Location:
docs/Working/icGrep
Files:
2 edited

### Legend:

Unmodified
 r4464 The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size. \subsection{UTF-8 Advance using ScanThru} Each bit position in the character class bitstream of a single byte ASCII character marks either the location of, or the absence of the search character. To match the location of a character the current position of the cursor is checked to see if the bit is set and then the cursor is advanced by one position. To match the position of a multibyte search character the procedure is different. For multibyte UTF-8 characters of length \verb:k:, it is the last \verb:(k-1):th byte of the multibyte sequence in the bitstream that marks the character's location. Figure~\ref{fig:multibytesequence} illustrates the process of matching a character class of a three byte multibyte character.  The locations of the first two bytes of each character in the character class CC have been marked with zeros while the bitstream $M_1$ marks the current cursor positions. To match multibyte characters, first a \emph{nonfinal} helper bitstream must be formed.  The \emph{Nonfinal} bitstream is formed by marking the locations of the first bytes of two byte sequences, the first two bytes of three byte sequences, and the first three bytes of any four byte sequences. The \verb:ScanThru(current, nonfinal): operation is then applied, in order to advance all of the current cursor positions to the locations of the \verb:(k-1):th final character positions.  To find any matches the result is then compared with the bits that are set in the UTF-8 character class bitstream. After this, the cursor is advanced by one position to be ready for the next matching operation. \subsection{Match Unicode Characters} After the transformation, we can then generate the character class bitstreams with our character class compiler. As shown in figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes. To better demonstrate the process, we use $ni3$, $hao$ and $men$ to represent these characters. $CC_{\mbox{ni3}}$ is the bitstream that marks character $ni3$ and $CC_{\mbox{hao}}$ is the bitstreams that marks character stream $hao$. To match a two utf8 character sequence $ni3hao$, we need to first create a $Initial$ steams that marks the first byte of all the valid characters. We also need to generate a $NonFinal$ streams that marks every bytes of all the multibyte characters except the last byte. Using $Initial$ to ScanThru $NonFinal$, we get the bitstream $M_2$ which marks the positions of the last byte of every character. An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character. As illustrated by $Adv$, we find two matches for $ni3$ and from these two positions we can start the matching process for the next character $hao$. The final result stream shows 1 match for Multibyte Sequence ni3hao. \begin{figure}[tbh] \begin{center} %\begin{tabular}{cr}\\ \begin{tabular}{rclr} \\ $CC$ & \verb001...001.........\\ $M_1$ & \verb1.....1....1......\\ $nonfinal$ & \verb11....11..........\\ $T_1 = ScanThru(M_1, nonfinal)$ & \verb..1.....1.........\\ $T_2 = CC \land T_1$ & \verb..1.....1.........\\ $M_2 = Advance(M_1)$ & \verb...1.....1........ \begin{tabular}{cr}\\ input data  & \verbni3hao(Hello),ni3men(You),\\ $CC_{\mbox{ni3}}$ & \verb..1.............1.........\\ $CC_{\mbox{hao}}$ & \verb.....1....................\\ $Initial$ & \verb1..1..111111111..1..111111\\ $NonFinal$ & \verb11.11.........11.11.......\\ $M_1 = ScanThru(Initial, NonFinal)$ & \verb..1..111111111..1..1111111\\ $Adv = Advance(M_1 \land CC_{\mbox{ni3}})$ & \verb...1.............1........\\ $M_2 = ScanThru(Initial \land Adv, NonFinal)$ & \verb.....1.............1......\\ $match = M_2 \land CC_{\mbox{hao}}$ & \verb.....1.................... \end{tabular} \end{center} \caption{Processing of a Multibyte Sequence} \label{fig:multibytesequence} \end{figure} \subsection{MatchStar for Unicode character classes} Figure~\ref{fig:multibytesequence_matchstar} shows how the MatchStar operation can be used to find all matches of a multibyte UTF-8 sequence.   The problem is to find all matches to the character class CC that can be reached from the current cursor positions in $M_1$. First we form two helper bitstreams \emph{initial} and \emph{nonfinal}.  The initial bitstream marks the locations of all single byte characters and the first bytes of all multibyte characters.  Any full match to a multibyte sequence must reach the initial position of the next character.  The nonfinal bitstream consists of all positions except those that are final positions of UTF-8 sequences.   It is used to "fill in the gaps" in the CC bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.  In the figure, the gaps in CC are filled in by a bitwise-or with the nonfinal bitstream to produce $T_1$.   This is then used as the basis of the MatchStar operation to yield $T_2$.  We then filter these results using the initial bitstream to produce the final set of complete matches in $M_2$. \begin{figure}[tbh] \begin{center} %\begin{tabular}{cr}\\ \begin{tabular}{rclr} \\ $CC$ & \verb001001001.........\\ $M_1$ & \verb1...........1..1..\\ $initial$ & \verb1..1..1..1..1..1..\\ $nonfinal$ & \verb11.11.11.11.11.11.\\ $T_1 = nonfinal \lor CC$ & \verb11111111111.11.11.\\ $T_2 = MatchStar(M_1, T_1)$ & \verb111111111111......\\ $M_2 = T_2 \land initial$ & \verb1..1..1..1........\\ \end{tabular} \end{center} \caption{Processing of MatchStar for a Multibyte Sequence} \label{fig:multibytesequence_matchstar} \caption{Processing of a Multibyte Sequence ni3hao} \label{fig:multibytesequence} \end{figure}