public interface ITokenizer
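Token types are constants with the TT_ prefix, e.g. TT_TERM. Additional token flags are constants with the TF_ prefix, e.g. TF_SEPARATOR_SENTENCE.
See Also: TokenTypeUtils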
Modifier and Type | Field and Description |
---|---|
static short | TF_COMMON_WORD - The current token is a common word. |
static short | TF_QUERY_WORD - The current token is part of the query. |
static short | TF_SEPARATOR_DOCUMENT - Current token is a document separator (never returned from parsing). |
static short | TF_SEPARATOR_FIELD - Current token separates document's logical fields. |
static short | TF_SEPARATOR_SENTENCE - Current token is a sentence separator. |
static short | TF_TERMINATOR - Current token terminates the input (never returned from parsing). |
static int | TT_ACRONYM |
static int | TT_BARE_URL |
static int | TT_EMAIL |
static int | TT_EOF - Indicates the end of the token stream. |
static int | TT_FILE |
static int | TT_FULL_URL |
static int | TT_HYPHTERM |
static int | TT_NUMERIC |
static int | TT_PUNCTUATION |
static int | TT_TERM |
static int | TYPE_MASK |
Modifier and Type | Method and Description |
---|---|
short | nextToken() - Returns the next token from the input stream. |
void | reset(Reader reader) - Resets the tokenizer to process new data. |
void | setTermBuffer(MutableCharArray array) - Sets the current token image to the provided buffer. |
static final int TYPE_MASK
static final int TT_TERM
static final int TT_NUMERIC
static final int TT_PUNCTUATION
static final int TT_EMAIL
static final int TT_ACRONYM
static final int TT_FULL_URL
static final int TT_BARE_URL
static final int TT_FILE
static final int TT_HYPHTERM
static final int TT_EOF
static final short TF_SEPARATOR_SENTENCE
static final short TF_SEPARATOR_DOCUMENT
static final short TF_SEPARATOR_FIELD
static final short TF_TERMINATOR
static final short TF_COMMON_WORD
static final short TF_QUERY_WORD
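The constant names (TYPE_MASK, TT_, TF_) suggest that a single token value carries a type in some bits and flags in others. The following is a minimal sketch of decoding such a value, assuming that layout. The numeric values are illustrative placeholders rather than the ones defined by ITokenizer, and TokenBits is a hypothetical helper class; TokenTypeUtils is the place to look for the library's own equivalents.

```java
public class TokenBits {
    // Placeholder values for illustration only; the real ones are defined by ITokenizer.
    public static final int TYPE_MASK = 0x000F;
    public static final int TT_TERM = 0x0001;
    public static final int TT_PUNCTUATION = 0x0003;
    public static final short TF_SEPARATOR_SENTENCE = 0x0100;
    public static final short TF_COMMON_WORD = 0x1000;

    /** Extracts the type bits (a TT_ constant) from a raw token value. */
    public static int type(short rawToken) {
        return rawToken & TYPE_MASK;
    }

    /** Checks whether the given TF_ flag is set on a raw token value. */
    public static boolean hasFlag(short rawToken, short flag) {
        return (rawToken & flag) != 0;
    }

    public static void main(String[] args) {
        // A punctuation token that also marks a sentence boundary, e.g. a full stop.
        short period = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
        System.out.println(type(period) + " "
            + hasFlag(period, TF_SEPARATOR_SENTENCE) + " "
            + hasFlag(period, TF_COMMON_WORD));
        // prints "3 true false"
    }
}
```

Since the value of TT_EOF is not given here, it seems safest to compare the raw return value of nextToken() against TT_EOF before applying any mask.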
void reset(Reader reader) throws IOException
Parameters: reader - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
Throws: IOException
short nextToken() throws IOException
Returns: the token type, as defined by TT_TERM and other constants, or TT_EOF when the end of the data stream has been reached.
Throws: IOException
See Also: TokenTypeUtils
void setTermBuffer(MutableCharArray array)
Parameters: array - buffer in which the current token image should be stored
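The three methods are typically used together: reset the tokenizer on a reader, supply a buffer for token images, then call nextToken() until it returns TT_EOF. The sketch below illustrates that call sequence under several assumptions: MiniTokenizer is a hypothetical stand-in for ITokenizer, StringBuilder replaces MutableCharArray, WhitespaceTokenizer is a toy implementation, and the constant values are placeholders.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    // Placeholder values for illustration only; the real ones are defined by ITokenizer.
    public static final int TT_TERM = 1;
    public static final int TT_NUMERIC = 2;
    public static final int TT_EOF = -1;

    /** Hypothetical stand-in that mirrors the three ITokenizer methods. */
    public interface MiniTokenizer {
        void reset(Reader reader) throws IOException;
        short nextToken() throws IOException;
        void setTermBuffer(StringBuilder buffer); // StringBuilder replaces MutableCharArray here
    }

    /** Toy implementation: whitespace-separated chunks; all-digit chunks are numeric. */
    public static class WhitespaceTokenizer implements MiniTokenizer {
        private Reader reader;
        private StringBuilder image = new StringBuilder();

        @Override public void reset(Reader reader) { this.reader = reader; }
        @Override public void setTermBuffer(StringBuilder buffer) { this.image = buffer; }

        @Override
        public short nextToken() throws IOException {
            int c = reader.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = reader.read(); // skip leading whitespace
            }
            if (c == -1) {
                return (short) TT_EOF;
            }
            image.setLength(0); // the buffer holds only the current token image
            boolean digitsOnly = true;
            while (c != -1 && !Character.isWhitespace(c)) {
                image.append((char) c);
                digitsOnly &= Character.isDigit(c);
                c = reader.read();
            }
            return (short) (digitsOnly ? TT_NUMERIC : TT_TERM);
        }
    }

    /** Collects "image/type" pairs until nextToken() reports TT_EOF. */
    public static List<String> tokenize(MiniTokenizer tokenizer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder image = new StringBuilder();
        // The tokenizer does not close the reader, so the caller does.
        try (Reader reader = new StringReader(text)) {
            tokenizer.reset(reader);
            tokenizer.setTermBuffer(image);
            short token;
            while ((token = tokenizer.nextToken()) != TT_EOF) {
                result.add(image + "/" + token);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize(new WhitespaceTokenizer(), "carrot 42 search"));
        // prints [carrot/1, 42/2, search/1]
    }
}
```

Reusing one buffer across calls avoids allocating a new object per token, which is presumably the motivation for setTermBuffer; the image must therefore be read or copied before the next call to nextToken().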