Changeset 4494


Ignore:
Timestamp:
Feb 11, 2015, 5:55:14 AM (4 years ago)
Author:
cameron
Message:

Various tweaks

Location:
docs/Working/icGrep
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/background.tex

    r4491 r4494  
    5555and shifting operations as well as an operation for finding all matches to a
    5656character class repetition known as MatchStar.  At each step of the
    57 process a {\em marker} bit stream identifying the full set of positions
     57process a {\em marker} bit stream identifies the full set of positions
    5858within the input data stream that match the regular expression to this point.
    5959
  • docs/Working/icGrep/evaluation.tex

    r4488 r4494  
    8282\end{tikzpicture}
    8383\end{center}
    84 \caption{Comparative Matching Performance}\label{fig:property_test}
     84\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
    8585\end{figure}
    8686
  • docs/Working/icGrep/unicode-re.tex

    r4493 r4494  
    7474{\em final} positions.   Thus the one bits of a Unicode character
    7575class stream are not necessarily contiguous.  This in turn
    76 means that the addition operation within the MatchStar
     76means that the carry propagation within the MatchStar
    7777operation may terminate prematurely.
    7878
    79 In order to remedy this problem, \icGrep{} forms two helper bitstreams
    80 \emph{Initial} and \emph{NonFinal}.  The {\em Initial} bitstream marks the
    81 locations of all single byte characters and the first bytes of all
    82 multibyte characters.  Any full match to a multibyte sequence must
     79In order to remedy this problem, \icGrep{} again uses the two helper bitstreams
     80\emph{Initial} and \emph{NonFinal}.   Any full match to a multibyte sequence must
    8381reach the initial position of the next character. 
    8482The {\em NonFinal} bitstream consists of all positions except those
    85 that are final positions of UTF-8 sequences. 
    86  It is used to ``fill in the gaps'' in the CC bitstream so that the
     83that are final positions of UTF-8 sequences.
     84It is used to ``fill in the gaps'' in the CC bitstream so that the
    8785 MatchStar addition can move through a contiguous sequence of one
    8886 bits.  In this way, matching of an arbitrary Unicode character class
    89  $C$ (with a 1 bit set at final positions of any members of the
    90  class),
    91 can be implemented using ${\mathit{MatchStar}(M, C |
    92   \mathit{NonFinal})}$.
     87 $C$ (with a 1 bit set at final positions of any members of the class),
     88can be implemented using ${\mathit{MatchStar}(M, C |\mathit{NonFinal})}$.
    9389
    9490\paragraph{Predefined Unicode Classes.}
    95 \icGrep{} employs a set of predefined bitstreams that are precompiled
     91\icGrep{} employs a set of bitstreams that are precompiled
    9692into the executable.  These include all bitstreams corresponding
    9793to Unicode property expressions such as \verb:\p{Greek}:.
    98 Each category potentially contains  many code points, so we further
    99 devised an \emph{If Hierarchy} optimization for
    100 these properties.
    101 This optimization applies whenever an input document contains
    102 codepoints from only a small number of writing systems -- the normal case.
    103 The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
    104 equations for observed code point ranges.
    105 The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search.
    106 As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of
    107 code points are selected.
     94Each property potentially contains many code points, so we further
     95embed the calculations within an if hierarchy.   Each if-statement
     96within the hierarchy determines whether the current block contains
     97any codepoints at all in a given Unicode range.   At the outer
     98level, the ranges are quite coarse, becoming successively refined
     99at deeper levels.  This technique works well when input documents
     100contain long runs of text confined to one or a few ranges.
    108101
    109102%\subsection{Character Class Intersection and Difference}
Note: See TracChangeset for help on using the changeset viewer.