GE provides built-in inverted index support for substring queries on cell field values. GE only indexes the cell fields that are marked with the [index] attribute in TSL.
Only cell fields whose type is string or a collection of strings can be
indexed. There are two cases. 1) The type of the cell field is string. During
substring query processing, a cell is matched if the value of its indexed
field contains the queried substring. 2) The cell field is a collection of
strings, e.g., List<List<string>>. During substring query processing, a cell
is matched as long as any string in the collection contains the queried
substring.
Index Declaration
To make a field indexed, add an [index] attribute to its declaration in TSL.
The [index] attribute is only valid for fields whose type is string or a collection of strings.
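As an illustration of the declaration described above, the following TSL fragment is a minimal sketch. The cell name MyCell and the field names Name, Tags, and Age are hypothetical; only the [index] attribute and the two indexable types (string and a collection of strings) come from the text.

```tsl
cell MyCell
{
    [index]
    string Name;          // hypothetical field: an indexed string

    [index]
    List<string> Tags;    // hypothetical field: an indexed collection of strings

    int Age;              // hypothetical field: not a string type, so no [index]
}
```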
It is possible to declare an index on a nested field, for example on the data
field of a struct named leaf. Because there is no way to globally identify
such a struct in GE, the [index] attribute on struct leaf alone does not
produce a valid index. Only when such a struct is contained in a cell, say
root, does root.leaf.data become an accessible cell field, and a valid index
is built for it. An indexed field may also be included in a substructure of a
substructure: cell.inner_1.inner_2. ... .leaf.data is indexable.
If a substructure with one or more indexed fields is included in multiple cell structs, an index is built for each of these cell types, and the indexes are independent of each other.
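The nested case might be declared as in the sketch below. The names leaf, root, and data follow the text. The field name leaf inside the cells is an assumption, chosen so that the path reads root.leaf.data, and the second cell other_root is hypothetical, added only to show one struct shared by two cell types.

```tsl
struct leaf
{
    [index]
    string data;    // the index is declared on the nested field
}

cell root
{
    leaf leaf;      // assumed field name, so that the path is root.leaf.data
}

cell other_root     // hypothetical second cell: it gets its own, independent index
{
    leaf leaf;
}
```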
Substring Queries
For an indexed field, we can issue a substring query against it by calling
Index.SubstringQuery with a field identifier and a query string. A field
identifier, such as Index.root.leaf.data, is an automatically generated nested
class defined within the Index class. It specifies which cell field we are
going to query against: Index.root.leaf.data identifies the data field in the
leaf struct within the root cell.
Index.SubstringQuery(Index.root.leaf.data, "query string") returns a list of
cell ids. The root.leaf.data field of each corresponding cell contains the
query string.
The method Index.SubstringQuery also accepts a sequence of query strings.
Given a sequence of query strings q1, q2, ..., qn, this method performs a
wildcard search with the *q1*q2*...*qn* pattern. It means that the strings
q1, q2, ..., qn are substrings of a string, in the order specified in the
sequence.
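To make the meaning of the *q1*q2*...*qn* pattern concrete, here is a minimal, self-contained C# sketch of the matching rule on a single string. It is not GE's implementation and does not touch the index. The class and method names are hypothetical, and the ordinal (case-sensitive) comparison is an assumption that may differ from the index's actual behavior.

```csharp
using System;

public static class WildcardSemantics
{
    // Returns true when every query string occurs in text, left to right and
    // without overlapping, which is what the pattern *q1*q2*...*qn* expresses.
    public static bool MatchesInOrder(string text, params string[] queries)
    {
        int position = 0;
        foreach (string query in queries)
        {
            // Assumption: ordinal, case-sensitive comparison.
            int found = text.IndexOf(query, position, StringComparison.Ordinal);
            if (found < 0)
            {
                return false;
            }
            // Continue searching after the end of this match.
            position = found + query.Length;
        }
        return true;
    }
}
```

Under this rule, MatchesInOrder("graph engine index", "graph", "index") is true, while the same call with the two query strings swapped is false, because "graph" does not occur after "index".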
Index Update
If the cells are continuously updated, the changes made to the indexed fields may not be immediately reflected in the index. That is, a substring query may return outdated results. To rule out false positives (previously matched cells that no longer match), we can check the field values again after getting the matched cell ids. False negatives, i.e., cells that should match but are not yet included in the index, are harder to address.
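The re-check for false positives could look like the following self-contained C# sketch. All names here are hypothetical. In a GE application, matchedIds would be the result of Index.SubstringQuery and readField would load the current field value from the cell; that wiring is an assumption and is left to the caller.

```csharp
using System;
using System.Collections.Generic;

public static class IndexResultFilter
{
    // Keeps only the ids whose current field value still contains the query.
    // readField is a hypothetical callback that returns the current value of
    // the indexed field for a cell id, or null if the cell no longer exists.
    public static List<long> DropFalsePositives(
        IEnumerable<long> matchedIds,
        Func<long, string> readField,
        string query)
    {
        var confirmed = new List<long>();
        foreach (long id in matchedIds)
        {
            string current = readField(id);
            if (current != null && current.Contains(query))
            {
                confirmed.Add(id);
            }
        }
        return confirmed;
    }
}
```

This only removes stale matches. It cannot recover cells that the index has not picked up yet.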
The indexes are updated periodically by the system. To manually update an
index, we can call Index.UpdateSubstringIndex() on a specific field identifier.
LINQ Integration
The inverted index subsystem is integrated with LINQ. In a LINQ query over a
selector, GE translates String.Contains on an indexed field into inverted
index queries. The same rule applies to IEnumerable<string>.Contains. Refer to
the LINQ section for more details.