# Changeset 4494 for docs

Ignore:
Timestamp:
Feb 11, 2015, 5:55:14 AM (4 years ago)
Message:

Various tweaks

Location:
docs/Working/icGrep
Files:
4 edited

### Legend:

Unmodified
 r4493 {\em final} positions.   Thus the one bits of a Unicode character class stream are not necessarily contiguous.  This in turn means that the addition operation within the MatchStar means that the carry propagation within the MatchStar operation may terminate prematurely. In order to remedy this problem, \icGrep{} forms two helper bitstreams \emph{Initial} and \emph{NonFinal}.  The {\em Initial} bitstream marks the locations of all single byte characters and the first bytes of all multibyte characters.  Any full match to a multibyte sequence must In order to remedy this problem, \icGrep{} again uses the two helper bitstreams \emph{Initial} and \emph{NonFinal}.   Any full match to a multibyte sequence must reach the initial position of the next character. The {\em NonFinal} bitstream consists of all positions except those that are final positions of UTF-8 sequences. It is used to fill in the gaps'' in the CC bitstream so that the that are final positions of UTF-8 sequences. It is used to fill in the gaps'' in the CC bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.  In this way, matching of an arbitrary Unicode character class $C$ (with a 1 bit set at final positions of any members of the class), can be implemented using ${\mathit{MatchStar}(M, C | \mathit{NonFinal})}$. $C$ (with a 1 bit set at final positions of any members of the class), can be implemented using ${\mathit{MatchStar}(M, C |\mathit{NonFinal})}$. \paragraph{Predefined Unicode Classes.} \icGrep{} employs a set of predefined bitstreams that are precompiled \icGrep{} employs a set of bitstreams that are precompiled into the executable.  These include all bitstreams corresponding to Unicode property expressions such as \verb:\p{Greek}:. Each category potentially contains  many code points, so we further devised an \emph{If Hierarchy} optimization for these properties. This optimization applies whenever an input document contains codepoints from only a small number of writing systems -- the normal case. The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class equations for observed code point ranges. The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search. As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of code points are selected. Each property potentially contains many code points, so we further embed the calculations within an if hierarchy.   Each if-statement within the hierarchy determines whether the current block contains any codepoints at all in a given Unicode range.   At the outer level, the ranges are quite coarse, becoming successively refined at deeper levels.  This technique works well when input documents contain long runs of text confined to one or a few ranges. %\subsection{Character Class Intersection and Difference}