Text configuration object settings

SQL Anywhere provides two default text configuration objects: default_char, for use with non-NCHAR data, and default_nchar, for use with NCHAR data. For information about their settings, see Default text configuration objects.

The following table explains text configuration object settings and how they impact what is indexed and how a full text search query is interpreted. For examples of text configuration objects and their impact on text indexes and full text searching, see Example text configuration objects.

  • Term breaker algorithm (TERM BREAKER)   The TERM BREAKER setting specifies the algorithm to use for breaking strings into terms. The choices are GENERIC (the default) for storing terms, or NGRAM for storing n-grams. An n-gram is a group of characters of length n where n is the value of MAXIMUM TERM LENGTH.

    Regardless of the term breaker you specify, the database server records in the text index the original positional information for the terms when they are inserted into the text index. In the case of n-grams, the positional information of the n-grams is stored, not the positional information for the original terms.

    TERM BREAKER impact on the text index

    GENERIC text index   When building a GENERIC text index (the default), groups of alphanumeric characters appearing between non-alphanumeric characters are processed as terms by the database server. After the terms have been defined, terms that fall outside the term length settings, and terms found in the stoplist, are counted but not inserted in the text index.

    Performance on GENERIC text indexes can be faster than on NGRAM text indexes. However, you cannot perform fuzzy searches on GENERIC text indexes.

    NGRAM text index   When building an NGRAM text index, the database server treats as a term any group of alphanumeric characters between non-alphanumeric characters. Once the terms are defined, the database server breaks the terms into n-grams. In doing so, terms shorter than n, and n-grams that are in the stoplist, are discarded.

    For example, for an NGRAM text index with MAXIMUM TERM LENGTH 3, the string 'my red table' is represented in the text index as the following n-grams: red tab abl ble.

    TERM BREAKER impact on query terms

    GENERIC text index   When querying a GENERIC text index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing query terms to terms in the text index.

    NGRAM text index   When querying an NGRAM text index, terms in the query string are processed in the same manner as if they were being indexed. Matching is performed by comparing n-grams from the query terms to n-grams from the indexed terms.
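
    The n-gram example above can be reproduced with a short Python model. This is only an illustrative sketch, not the database server's implementation: the function names break_terms and ngrams are hypothetical, and treating "alphanumeric" as Unicode letters and digits (excluding the underscore) is an assumption.

```python
import re


def break_terms(text):
    """Split text into runs of alphanumeric characters.

    Assumption: "alphanumeric" is modeled as Unicode letters and digits;
    the server's exact character classes may differ.
    """
    return re.findall(r"[^\W_]+", text)


def ngrams(text, n):
    """Break every term of text into overlapping n-grams of length n."""
    grams = []
    for term in break_terms(text):
        # A term shorter than n produces an empty range, so it is discarded.
        for start in range(len(term) - n + 1):
            grams.append(term[start:start + n])
    return grams


print(break_terms("my red table"))  # ['my', 'red', 'table']
print(ngrams("my red table", 3))    # ['red', 'tab', 'abl', 'ble']
```

    The term my is shorter than 3 characters and so contributes no n-grams, which matches the example in the text.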

  • Minimum term length setting (MINIMUM TERM LENGTH)   The MINIMUM TERM LENGTH setting specifies the minimum length, in characters, for terms inserted in the index or searched for in a full text query. MINIMUM TERM LENGTH is not relevant for NGRAM text indexes.

    MINIMUM TERM LENGTH has special implications for prefix searching. See Prefix searching.

    The value of MINIMUM TERM LENGTH must be greater than 0. If you set it higher than MAXIMUM TERM LENGTH, then MAXIMUM TERM LENGTH is automatically adjusted to be equal to MINIMUM TERM LENGTH.

    The default for MINIMUM TERM LENGTH is taken from the setting in the default text configuration object, which is typically 1. See Default text configuration objects.

    MINIMUM TERM LENGTH impact on the text index

    GENERIC text index   For GENERIC text indexes, the text index will not contain terms shorter than MINIMUM TERM LENGTH.

    NGRAM text index   For NGRAM text indexes, this setting is ignored.

    MINIMUM TERM LENGTH impact on query terms

    GENERIC text index   When querying a GENERIC text index, query terms shorter than MINIMUM TERM LENGTH are ignored because they cannot exist in the text index.

    NGRAM text index   The MINIMUM TERM LENGTH setting has no impact on full text queries on NGRAM text indexes.
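
    As a rough illustration of the rules above, the following Python sketch models how MINIMUM TERM LENGTH filters terms for a GENERIC text index and how raising it above MAXIMUM TERM LENGTH adjusts the maximum. The function names are hypothetical, the alphanumeric test is an assumption, and the sketch deliberately ignores the maximum length and the stoplist when filtering.

```python
import re


def generic_terms(text, min_len=1):
    """Return the terms of text that are at least min_len characters long.

    Models only the MINIMUM TERM LENGTH filter for a GENERIC text index;
    the maximum length and the stoplist are not applied here.
    """
    terms = re.findall(r"[^\W_]+", text)  # assumption: letters and digits only
    return [term for term in terms if len(term) >= min_len]


def set_minimum_term_length(new_min, current_max):
    """Return (minimum, maximum) after setting a new minimum.

    If the new minimum is higher than the current maximum, the maximum is
    raised to match it, as described in the text.
    """
    if new_min <= 0:
        raise ValueError("MINIMUM TERM LENGTH must be greater than 0")
    return new_min, max(current_max, new_min)


print(generic_terms("a red table", min_len=3))  # ['red', 'table']
print(set_minimum_term_length(25, 20))          # (25, 25)
```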

  • Maximum term length setting (MAXIMUM TERM LENGTH)   The MAXIMUM TERM LENGTH setting is used differently depending on the term breaker algorithm.

    The value of MAXIMUM TERM LENGTH must be less than or equal to 60. If you set it lower than the MINIMUM TERM LENGTH, then MINIMUM TERM LENGTH is automatically adjusted to be equal to MAXIMUM TERM LENGTH.

    The default for this setting is taken from the setting in the default text configuration object, which is typically 20. See Default text configuration objects.

    MAXIMUM TERM LENGTH impact on the text index

    GENERIC text index   For GENERIC text indexes, MAXIMUM TERM LENGTH specifies the maximum length, in characters, for terms inserted in the text index.

    NGRAM text index   For NGRAM text indexes, MAXIMUM TERM LENGTH determines the length of the n-grams that terms are broken into. An appropriate choice of length for MAXIMUM TERM LENGTH depends on the language. Typical values are 4 or 5 characters for English, and 2 or 3 characters for Chinese.

    MAXIMUM TERM LENGTH impact on query terms

    GENERIC text index   For GENERIC text indexes, query terms longer than MAXIMUM TERM LENGTH are ignored because they cannot exist in the text index.

    NGRAM text index   For NGRAM text indexes, query terms are broken into n-grams of length n, where n is the same as MAXIMUM TERM LENGTH. Then, the database server uses the n-grams to search the text index. Terms shorter than MAXIMUM TERM LENGTH are ignored because they will not match the n-grams in the text index.
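
    The different roles of MAXIMUM TERM LENGTH can be sketched in Python as follows. This is a hedged model rather than the server's logic: the function names are hypothetical, the alphanumeric test is an assumption, and the lower bound on the maximum is inferred from the rule that MINIMUM TERM LENGTH must stay greater than 0.

```python
import re

MAX_ALLOWED = 60  # upper bound stated for MAXIMUM TERM LENGTH


def set_maximum_term_length(new_max, current_min):
    """Return (minimum, maximum) after setting a new maximum.

    If the new maximum is lower than the current minimum, the minimum is
    lowered to match it, as described in the text.
    """
    # Lower bound is an inference: the adjusted minimum must remain above 0.
    if not 0 < new_max <= MAX_ALLOWED:
        raise ValueError("MAXIMUM TERM LENGTH must be between 1 and 60")
    return min(current_min, new_max), new_max


def usable_query_terms(query, term_breaker, max_len):
    """Return the query terms that are not ignored because of max_len."""
    terms = re.findall(r"[^\W_]+", query)  # assumption: letters and digits
    if term_breaker == "GENERIC":
        # Terms longer than the maximum cannot exist in the text index.
        return [term for term in terms if len(term) <= max_len]
    if term_breaker == "NGRAM":
        # Terms shorter than n cannot be broken into any n-gram.
        return [term for term in terms if len(term) >= max_len]
    raise ValueError("term_breaker must be 'GENERIC' or 'NGRAM'")


print(set_maximum_term_length(3, 5))                   # (3, 3)
print(usable_query_terms("my red table", "NGRAM", 3))  # ['red', 'table']
```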

  • Stoplist setting (STOPLIST)   The STOPLIST setting specifies the terms that must not be indexed.

    The default for this setting is taken from the setting in the default text configuration object, which typically has an empty stoplist. See Default text configuration objects.

    STOPLIST impact on the text index

    GENERIC text index   For GENERIC text indexes, terms that are in the stoplist are not inserted into the text index.

    NGRAM text index   For NGRAM text indexes, the text index will not contain the n-grams formed from the terms in the stoplist.

    STOPLIST impact on query terms

    GENERIC text index   For GENERIC text indexes, query terms that are in the stoplist are ignored because they cannot exist in the text index.

    NGRAM text index   Terms in the stoplist are broken into n-grams and the n-grams are used for the stoplist. Likewise, query terms are broken into n-grams and any that match n-grams in the stoplist are dropped because they cannot exist in the text index.

    Carefully consider whether you want to put terms in your stoplist. In particular, do not include words that contain non-alphanumeric characters such as apostrophes or dashes, because these characters act as term breakers. For example, the word you'll (which must be specified as 'you''ll') is broken into you and ll, and both are stored in the stoplist as separate terms. Subsequent full text searches for 'you' or 'they''ll' are negatively impacted because the terms you and ll are excluded from both the text index and the query.

    Stoplists in the case of NGRAM text indexes can cause unexpected results because the stoplist that is stored is actually in n-gram form, not the actual stoplist terms you specified. For example, in an NGRAM text index where MAXIMUM TERM LENGTH is 3, if you specify STOPLIST 'there', the following n-grams are stored as the stoplist: the her ere. This impacts the ability to query for any terms that contain the n-grams the, her, and ere.

    Note

    The restrictions that apply to string literals also apply to stoplists. For example, apostrophes must be escaped. For more information on formatting string literals, see String literals.

    The Samples directory contains sample code that loads stoplists for several languages. These sample stoplists are recommended for use only on GENERIC text indexes. For the location of the Samples directory, see Samples directory.
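
    The two stoplist pitfalls described above (term breaking inside stoplist words, and n-gram stoplists) can be illustrated with a small Python model. It is a sketch under stated assumptions, not the server's implementation: all function names are hypothetical, matching is case-sensitive here for simplicity, and "alphanumeric" is modeled as Unicode letters and digits.

```python
import re


def break_terms(text):
    """Split text into runs of letters and digits (assumed term breaker)."""
    return re.findall(r"[^\W_]+", text)


def to_ngrams(terms, n):
    """Break each term into overlapping n-grams; shorter terms yield none."""
    return [t[i:i + n] for t in terms for i in range(len(t) - n + 1)]


def generic_stoplist(stoplist):
    """Stoplist entries as stored for a GENERIC text index."""
    return set(break_terms(stoplist))


def ngram_stoplist(stoplist, n):
    """Stoplist entries as stored for an NGRAM text index."""
    return set(to_ngrams(break_terms(stoplist), n))


def surviving_query_ngrams(query, stoplist, n):
    """Query n-grams left after dropping those found in the n-gram stoplist."""
    stop = ngram_stoplist(stoplist, n)
    return [g for g in to_ngrams(break_terms(query), n) if g not in stop]


# The SQL literal 'you''ll' denotes the string you'll.
print(sorted(generic_stoplist("you'll")))          # ['ll', 'you']
print(sorted(ngram_stoplist("there", 3)))          # ['ere', 'her', 'the']
print(surviving_query_ngrams("other", "there", 3))  # ['oth']
```

    In this model, a query for other against an NGRAM text index with STOPLIST 'there' keeps only the n-gram oth, which shows how an n-gram stoplist can affect terms that were never listed in it.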

See also