Profiling Engine SDK
Query Language Summary

Introduction

This manual is a brief summary of the query language used by the Lextek Profiling Engine. The manual is designed to outline the structure of the language itself. If you are unfamiliar with programming languages and search engines, we suggest you go through the Tutorial manual. That manual will walk you through conducting searches and designing more complex queries.

The Onix Query language is based on regular programming languages that you may already be familiar with. The only difference from languages like C is that logical operations don't operate purely on single values, but on lists of records. Consider the following C code:

A = 1;

B = 0;

C = A & B;    // C now holds the value of 0

Contrast this with the following query from the Profiler.

A = 'apple';  // A now holds the LIST of all record 

              // numbers containing the word apple



B = 'orange'; // B now holds the LIST of all record 

              // numbers containing the word orange



C = A & B;    // C now holds the LIST of all record 

              // numbers that have both apple and orange

There are a few other differences from C in this code fragment. We'll deal with those a little later. For now we just want to emphasize that unlike many query languages, the Lextek Profiling Engine's query language is oriented around writing complex queries. By following the design of a programming language rather than a series of "macros" we are able to make the Profiler more flexible and easily allow further expansion in the future.


1.0 - Language Basics

The Onix Query language consists of a series of statements. Each statement is a query expression, a variable being set to a query expression, or the definition of a function containing other statements. Query expressions are like the search queries in simple search tools. They consist of a series of operands (terms) combined with operators (symbols representing logical operations). A statement can span multiple lines and is terminated by a semi-colon character (;), just as in C or C++. To send several statements to the query processor you simply create a string with a semi-colon separating each statement.

Sample query consisting of multiple statements:

a = 'apple' & ( 'oranges' | 'orange' );

c = a | 'pear';

To have Onix return the results of a statement to you, you must make the last statement an expression. The results of this expression will then be stored in what is called a vector, returned by prProcessQuery. (See the API Function Summary for more information on SDK calls.) A vector is simply a list of "hits." Each hit is a document and word number with the weight of that hit. Each hit represents a document you've indexed that matches your query.

Generally each query will represent a category or concept you've created. So each hit tells you those documents that fit your category and how well they fit your category.

 

2.0 - Terms

Terms are the basic unit of the Profiling Engine. Each term is something that you indexed using prIndexWord or prIndexWordSpecial. Technically they need not be words. For instance some applications index trigrams (3-character sequences) to analyze texts rather than indexing whole words. Further, some programmers add control characters to flag words as having special meanings. In general though, most terms are words that have been indexed in a document.

The simplest query possible is to look for a single term. The following returns all documents that contain the word cat.

'cat';

As with all queries we terminate it with a semi-colon. (See 1.0 - Language Basics.) Any alpha-numeric term can be specified by simply enclosing it within single quotes.

Terms can also be specified in a hexadecimal form. The hexadecimal form consists of a 0x followed by hexadecimal numbers representing each byte in the term. For instance the query above could also have been written as

0x636174;
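Since the hexadecimal form is just the bytes of the term written out two hex digits at a time, the conversion is mechanical. A minimal Python sketch (the helper name is ours, not part of the SDK):

```python
def term_to_hex(term: str) -> str:
    """Encode a query term in the Profiler's hexadecimal form:
    '0x' followed by two hex digits for each byte of the term."""
    return "0x" + term.encode("utf-8").hex()

print(term_to_hex("cat"))  # -> 0x636174
```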

Note that the format of terms is entirely controlled by the indexing process. Whatever is passed in to prIndexWord or prIndexWordSpecial is a term. Your program has total control and flexibility over what is or isn't a term. It is therefore important that you remember to match what you use in your queries with what you used in your indexing.

 

3.0 - Basic Operators

The basic operators that are available in most search engines are also available in the Profiler.

 

 Operator Example Returns
 or 'apple' | 'pear'; records where either term occurs
 and 'apple' & 'pear'; records where both terms occur
 not 'apple' ! 'pear'; records where the first term occurs but the second does not
 xor 'apple' ^ 'pear'; records where either term occurs but not both
 near 'apple' / 5 'pear'; records where the two terms occur within a certain number of words of each other
 phrase " 'john' 'smith' "; records where the phrase occurs (each term adjacent to each other and in the order provided)
 phrase < 'john' 'smith' >; records where the phrase occurs (each term adjacent to each other and in the order provided) alternative form
 before 'apple' before 'pear'; records where the first term occurs before the second term
 after 'apple' after 'pear'; records where the first term occurs after the second term

Most of these operators are fairly self-explanatory. The only unusual ones are the phrase operators and the near operator.

For the phrase operator you can enclose terms between quote characters " " or alternatively angle brackets < >. It is, however, important that you still enclose each term within single quotes. A common mistake is to type "first second" rather than "'first' 'second'" as is required. Remember that the single quotes translate ASCII text into the format that the Profiler understands.

The near operator is one of the most useful operators as it lets you specify that a term is significant only if it occurs near another term. To specify how near they must be you simply use the forward slash character / followed by the number of words they must occur within. Thus if I wanted to find the term apple within five words of pear I'd type the example given above. Note that the near operator doesn't care what order the words occur in. It simply checks to see how close they are.
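Conceptually, the proximity test works like the following sketch, which compares the recorded word positions of each term pairwise (the position lists and the brute-force scan are for illustration only; the engine performs this matching internally):

```python
def near(max_dist, positions_a, positions_b):
    """Order-independent proximity test: True when any occurrence of the
    first term falls within max_dist words of any occurrence of the second."""
    return any(abs(a - b) <= max_dist
               for a in positions_a
               for b in positions_b)

# 'apple' at word 10 and 'pear' at word 14 are within 5 words
print(near(5, [10], [14, 40]))  # True
print(near(5, [10], [40]))      # False
```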

You can group operators using parentheses. These allow you to specify the order in which operators are evaluated. For example the following query matches all documents with both pear and orange occurring, as well as any document with apple occurring.

'apple' | ( 'pear' & 'orange');

 

4.0 - Weights

Weights are values that are associated with each term or query result. You can specify the weight of a term or query result by enclosing the value within brackets. If the term or result already has a weight, specifying a new weight replaces the old one.

Weights have a meaning for ranking hits. Generally weights take a value between 0 and 1. Weights of 0 are discarded as they are equivalent to "not present" or "not significant." Weights of 1 mean "is the category" or "is a member." Values between 0 and 1 usually are taken to represent the degree to which a term or query is a member of a concept, or else the probability that it is a member of a concept or category. Technically though the value can mean whatever you wish. There really isn't any requirement that weights be between 0 and 1. However, the meaning of values over 1 is hard to interpret. Usually values that don't range between 0 and 1 are normalized so that they can represent a probability. The Lextek Profiler has several methods for normalizing weights which are discussed later.

Examples of assigning weights.

'apple' [.4];

( 'apple' & 'pear' ) [.4];

'pear' [.9] | 'apple' [.3];

If you don't specify a weight a term defaults to a weight of 1. This is because a weight of 1 represents present or 100% probability.

Note that when you specify a weight it applies to the term or expression to its left. In the third example above the weight of .3 is applied to the term 'apple' and not the entire expression. To modify the entire expression we must enclose it in parentheses as in the second example.

 

5.0 - More Operators

The Profiling query language supplies many more operators than the basic ones given above. Because there are so many, we can't use symbolic notation for all of them. Instead they are specified using a functional form. Using these operators is akin to calling a function in C or other programming languages. Even the basic operators listed in section 3.0 are available in a functional form.

Most of the extra operators function equivalently to our basic operators, but add advanced ranking methods. Ranking methods determine how to deal with terms or subqueries that have different weights. The simplest method is to ignore their weights. Other methods take the maximum or minimum weight. More advanced methods utilize statistical modeling or even vector space modeling. We'll go through each of the methods with a very brief summary along with the functions that use that method.

5.1 - Existential Operators

Existential operators simply ignore underlying weights. They care only whether a term exists in a document or not. Thus they either return a weight of 0 (not present) or 1 (present). They don't deal with uncertainty or probability.

 Operators Example Description
Phrase " 'eat' 'apples' " Finds terms occurring as a phrase.
Phrase < 'eat' 'apples' > Alternative form of phrase.
Phrase phrase( 'eat', 'apples' ) Alternative form of phrase.
Near 'eat' / 5 'apples' Finds terms within n words of each other
Near near( 5, 'eat', 'apples' ) Alternative form of near.
Before 'eat' before 'apples' First term occurs before second
After 'eat' after 'apples' First term occurs after second
All all( 'pear', 'apple', 'orange' ) All terms must occur (like an and)
Any any( 'pear', 'apple', 'orange' ) Any of the terms could occur (like an or)
AtLeast atleast( 2, 'pear', 'apple', 'orange') At least n terms must occur in the document. (like an or)
AtMost atmost( 2, 'pear', 'apple', 'orange') No more than n terms can occur in the document. (like an or)
Ordered ordered( 'pear', 'apple', 'orange' ) All the terms must occur and they must occur in the order listed. (like an and)
Ordered_Near ordered_near( 3, 'pear', 'apple', 'orange' ) All terms must occur in the order listed and must be within n words.

5.2 - Fuzzy Logic Operators

Fuzzy Logic operators use the rules of fuzzy logic to determine resultant weights. Operators that require all terms to be present (i.e. operators that work like an and) return the maximum weight of the terms. Operators that don't require all terms to be present (i.e. operators that work like an or) return the minimum weight.

 Operators Example Description
And 'apple' & 'orange' Returns max weight
And and( 'apple', 'orange' ) Alternative form of and. Returns max weight.
Or 'apple' | 'orange' Returns min weight
Or or( 'apple', 'orange' ) Alternative form of or. Returns min weight
Xor 'apple' ^ 'orange' Returns min weight
Xor xor( 'apple', 'orange' ) Alternative form of xor. Returns min weight
Max max( .5, 'apple', 'orange' ) Like an or, but ignores those terms or subqueries with a weight more than n. It is like setting a threshold. Often this is used with a single term or subquery.
Min min( .5, 'apple', 'orange' ) Like an or, but ignores those terms with weights less than n. It is like setting a threshold. Often this is used with a single term or subquery.





    

5.3 - Probabilistic Operators

Probabilistic operators treat weights as probabilities. To calculate the weight of an operator, the probabilities of the individual terms are combined: multiply the probability of each term not occurring, then subtract that product from one. Thus if I have a term with a weight of .4 and consider that weight as a probability, the probability of that term not belonging is .6. So if I have three terms with weights of .5, .3 and .9 the combined weight is 1 - ( (1-.5)*(1-.3)*(1-.9) ) = .965. This is one of the more commonly used ranking methods. Probabilistic operators usually take the same name as the basic operator but with an r (for ranked) prefix.
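The calculation can be sketched in a few lines of Python (the function name is ours, for illustration):

```python
def r_or_weight(weights):
    """Probabilistic combination: multiply the probabilities that each
    term does NOT belong, then subtract that product from one."""
    not_belonging = 1.0
    for w in weights:
        not_belonging *= (1.0 - w)
    return 1.0 - not_belonging

print(r_or_weight([0.5, 0.3, 0.9]))  # ~0.965, as in the example above
```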

 Operators Example Description
r_and r_and( 'apple', 'orange' ) All terms must occur
r_or r_or( 'apple', 'orange' ) Any term can occur
value value( 'apple', 'orange') Alternative form of r_or. (Used because it is a natural term for the statistics - what is returned is the value of all the terms occurring together)
r_phrase r_phrase( 'eat', 'apples' ) Terms occurring as a phrase.
r_near r_near( 7, 'apples', 'oranges' ) Terms within n words of one another.
r_ordered r_ordered( 'apple', 'orange' ) Terms must occur in the order listed
r_ordered_near r_ordered_near( 5, 'apple', 'orange' ) Terms within n words of each other and in the order listed
r_atleast r_atleast(2, 'apple', 'pear', 'car') At least n of the terms must occur, otherwise like an or
r_atmost r_atmost(2, 'apple', 'pear', 'car') At most n of the terms can occur, otherwise like an or

In addition to the above operators there is one additional operator that uses a slightly different probabilistic calculation. The bayesian method is very similar to the probabilistic method, except that it uses a form of statistics called Bayesian Statistics. Bayesian Statistics differ from regular statistics in that they deal with how likely something is based upon incomplete information. The calculation is to take the probability you think something will occur and divide that by the probability you think it won't occur plus the probability you think it will. In practice, if you need to use the bayesian operator you'll already understand how it works.

bayesian bayesian( 'apple', 'pear', 'car') Any term can occur

 

5.4 - P-Norm Operators

P-norm operators are a little difficult to explain. Basically they treat weights like vectors in space (if you remember your physics or vector algebra). Each term represents a unique direction, and the weight represents the distance along that direction. The p-norm allows you to pick the kind of measurement used for the total distance of all the terms together. Each p-norm calculation requires what is called a p-value, which specifies what kind of distance calculation is used. A p-value of 2 is like calculating a traditional vector distance. A p-value of 1 simply averages the weights. Higher p-values make the calculation behave more and more like a fuzzy logic calculation. So in a sense a p-norm calculation allows you to specify how much your resulting weight behaves like a distance and how much like a min/max.

Even though P-norm calculations are difficult to understand, in practice they are the most effective ranking method available. They are a little more computationally intensive, however, which can be significant if speed is a factor and you are doing thousands of calculations. Since it can be difficult to grasp how P-norm operators work conceptually, you should try various p-values to see which gives you the best results pragmatically. After you tune your queries in this way you'll find that P-norm calculations can really help the accuracy of your categorizing.

The calculation is to take the weights to the power of p, take the average, and then take the pth root. If you are familiar with vector algebra you can see how this is like a geometric distance. As with fuzzy logic calculations, a slightly different form is used for operators that are like an or as opposed to those like an and. Operators that are like an and use the probability of each term not occurring (1 - weight). Operators that are like an or use the weight itself.

The first parameter passed to p-norm operators is the p-value. If the operators take further numeric parameters (as with p_near, for example) those other values follow the p-value.
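The calculation described above can be sketched as follows. The final complement in the and-like case (turning the combined "not occurring" figure back into an "occurring" weight) is our reading of the description and should be treated as an assumption:

```python
def p_or_weight(p, weights):
    """p-norm or: the p-th root of the average of the weights raised to p."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def p_and_weight(p, weights):
    """p-norm and: the same calculation over the complements (1 - w),
    complemented back, so a missing term (w = 0) drags the result down."""
    return 1.0 - p_or_weight(p, [1.0 - w for w in weights])

print(p_or_weight(1, [0.5, 0.3]))   # p = 1 simply averages the weights
print(p_or_weight(64, [0.5, 0.3]))  # a high p approaches max(0.5, 0.3)
```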

 Operators Example Description
p_and p_and( 2.5, 'apple', 'orange' ) All terms must occur
p_or p_or( 4, 'apple', 'orange' ) Any term can occur
p_phrase p_phrase( 3, 'eat', 'apples' ) Terms occurring as a phrase.
p_near p_near( 2.5, 7, 'apples', 'oranges' ) Terms within n words of one another.
p_ordered p_ordered( 3, 'apple', 'orange' ) Terms must occur in the order listed
p_ordered_near p_ordered_near( 4, .5, 'apple', 'orange' ) Terms within n words of each other and in the order listed
p_atleast p_atleast(4, 2, 'apple', 'pear', 'car') At least n of the terms must occur, otherwise like an or
p_atmost p_atmost(4, 2, 'apple', 'pear', 'car') At most n of the terms can occur, otherwise like an or

 

5.5 - MMM Operators

The mixed minimum and maximum method (MMM for short) for ranking is actually a variation of the fuzzy logic method. For fuzzy logic we take the maximum weight of the underlying terms and subqueries if we require all of them to occur. If we only require some to occur we use the minimum. What the MMM method does is take the opposing weight and use it to bias the fuzzy logic weight. In a sense the MMM takes a value, called the m-value, which specifies how much an and behaves like an or and vice versa. The m-value is the ratio between those two contributions. An m-value of .5 means that the and part and the or part contribute equally to the final answer.

From testing, the most effective m-values for m_and range from .5 to .8. For m_or we've found that an m-value of .2 to .5 works best. However you should do some testing to see what value matches your data and categories best. As with the p-norm ranking method, the MMM method requires a little trial and error, but can be very effective at improving your profiling efficiency.
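Reading the m-value as a linear mixing ratio between the and-like and or-like fuzzy values gives the sketch below. Both the linear interpolation and the orientation of the blend are assumptions on our part; the manual only states that m is the ratio between the two contributions:

```python
def m_and_weight(m, weights):
    """MMM and: blend this manual's and-like value (the max) with the
    or-like value (the min), using m as the mixing ratio. The linear
    blend is an assumption for illustration."""
    return m * max(weights) + (1.0 - m) * min(weights)

def m_or_weight(m, weights):
    """MMM or: the same blend with the roles of min and max reversed."""
    return m * min(weights) + (1.0 - m) * max(weights)

print(m_and_weight(0.5, [0.2, 0.8]))  # 0.5 -- both parts contribute equally
```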

 Operators Example Description
m_and m_and( .6, 'apple', 'orange' ) All terms must occur
m_or m_or( .2, 'apple', 'orange' ) Any term can occur

 

5.6 - Variable Distance Operators

Some operators explicitly depend upon how far apart terms are. For some applications terms which occur close to one another are far more significant than terms which are farther apart. The variable distance method of ranking compares some maximum distance to the distance between terms and uses that as a guide for assigning a rank between 0 and 1.
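The manual does not give the exact distance-to-weight formula, so the linear falloff below is purely illustrative of the idea (closer terms rank nearer 1, terms at the limit rank nearer 0):

```python
def v_near_weight(max_dist, dist):
    """Hypothetical variable-distance rank: close to 1 for adjacent terms,
    falling off linearly to 0 at max_dist words apart. The real formula
    is internal to the Profiler; this sketch is an assumption."""
    if dist > max_dist:
        return 0.0
    return 1.0 - dist / max_dist

print(v_near_weight(10, 2))   # 0.8
print(v_near_weight(10, 12))  # 0.0 -- too far apart
```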

 Operators Example Description
v_near v_near( 10, 'apple', 'orange' ) Terms must be within n words
v_ordered_near v_ordered_near( 20, 'apple', 'orange' ) Terms must be within n words and in the order presented

 

5.7 - Normalization Operators

Sometimes the weights assigned to terms arise outside the Profiler. They might come from user feedback, from document similarity calculations, or even from statistics in a dictionary. Consider, for example, a mail router that routes mail based upon how well it fits categories supplied by a user. Some applications provide a checklist to see how relevant a document really was. Based upon what the user tells them, the weights of certain categories are adjusted so that only appropriate messages make it through. Those calculations may result in weights becoming larger than 1. (For instance a term that already has a weight of 1 which you want to make more significant.)

The process of taking a range of values and assigning them new values between 0 and 1 is called normalization. There are many approaches to normalizing ranked results. The simplest, although least accurate, is to simply take the highest value and make that 1. The problem with that method is that a collection of somewhat accurate data is treated the same as a collection of very accurate data. Sadly this is the typical method most ranking schemes use - often in a fashion that doesn't let programmers adjust the data themselves. Sometimes this renormalization is entirely hidden from the programmer, leading to misleading statistics and weights.

The most accurate method of normalizing weights is to look at all the data and utilize features of the data itself to get an idea of the "typical" value. The most widely used method is called the RMS method. This method, familiar to people in physics or engineering, is to take the average of the square of each value and then take the square root of that average. To normalize, you then divide each value by this RMS figure. Any value that would be greater than 1 is set to 1. In this way the values most likely to be the sought-for hits are treated as most significant. This helps eliminate misleading data and results in a ranking that is most accurate.

Since the RMS method doesn't meet everyone's needs, we've also included a traditional ranking which simply divides all the values by the maximum value, as well as a custom normalization method that lets you specify what value is the "maximum" value.
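The RMS and maximum-based normalizations described above can be sketched as follows (function names are ours, mirroring the operators):

```python
import math

def rms_normalize(weights):
    """'normalize': divide each weight by the RMS (square root of the
    mean of the squares) of all the weights, capping results at 1."""
    rms = math.sqrt(sum(w * w for w in weights) / len(weights))
    return [min(w / rms, 1.0) for w in weights]

def max_normalize(weights):
    """'maxnormalize': divide each weight by the largest weight."""
    top = max(weights)
    return [min(w / top, 1.0) for w in weights]

weights = [0.6, 0.8]
print(rms_normalize(weights))  # RMS is sqrt(0.5); the 0.8 hit caps at 1.0
print(max_normalize(weights))  # each weight divided by 0.8
```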

 Operators Example Description
normalize normalize( 'apple' ) Each hit's weight is divided by the RMS of all the weights in the subquery. Any resultant weights greater than 1 are set to 1
maxnormalize maxnormalize( 'apple' ) Each hit's weight is divided by the maximum of all the weights in the subquery. Any resultant weights greater than 1 are set to 1
mynormalize mynormalize( .7, 'apple' ) Each hit's weight is divided by n. Any resultant weights greater than 1 are set to 1
complement complement( 'apple' ) Each hit's weight is set to 1 - its weight. Effectively this transforms all weights from the probability something is a member of a concept or category to the probability it is not a member.

 

5.8 - Other Operators

In addition to the various operators listed above, we have some specialized operators that don't bear a clear connection to any ranking scheme or to our basic operators.

The first group of operators are like an if statement in programming, but which functions on hits rather than a normal condition. These fit fairly specialized needs, but allow you to return terms or subqueries based upon the results of other subqueries. Usually these operators are parts of fairly complex queries.

 Operators Example Description
gate gate( 'apple', 'pie', 'cherry') Any records with term 1 return the hits for term 2 in that record, otherwise they return the hits for term 3 in that record. You can leave out term 3 if you wish.
iif iif( 'apple', 'pie', 'cherry' ) If term 1 has at least one hit, then return the hits for term 2, otherwise return the hits for term 3. You can leave out term 3 if you wish.

The main difference between gate and iif is that gate works on the record level. If a record contains the first term, then it returns the second term's hits for that record, otherwise the third term's. Basically it is an "if-statement" that is evaluated for each and every record. The iif statement works over the entire set of records being profiled. Thus if the first term has a hit in any record it returns the hits for term 2, otherwise term 3. The reason we've supplied both forms is that many clients break up documents into multiple records while profiling, with each record having a special meaning. While this use of the Profiler isn't typical, for those who need the power it comes in very handy.
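The record-level versus set-level distinction can be sketched as follows, modelling each result vector as a dict from record number to that record's hits (a simplification we use for illustration; real vectors come from prProcessQuery):

```python
def gate(hits_a, hits_b, hits_c=None):
    """Record-level if: records containing term 1 contribute term 2's
    hits for that record; all other records contribute term 3's hits."""
    hits_c = hits_c or {}
    result = {}
    for rec in set(hits_b) | set(hits_c):
        if rec in hits_a:
            if rec in hits_b:
                result[rec] = hits_b[rec]
        elif rec in hits_c:
            result[rec] = hits_c[rec]
    return result

def iif(hits_a, hits_b, hits_c=None):
    """Set-level if: if term 1 has any hit anywhere, return all of
    term 2's hits, otherwise all of term 3's hits."""
    return hits_b if hits_a else (hits_c or {})

# record 1 contains term 1; records 2 and 3 do not
print(gate({1: [5]}, {1: [10], 2: [20]}, {2: [30], 3: [40]}))
```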

 

6.0 Comments

To enable you to document your queries and add notes, the language allows the use of C and C++ style comments.

Everything within /* */ is treated as a comment and the comments can span numerous lines. This form of comments is useful for commenting out blocks of code or writing long explanations.

Everything between the double slash // and the end of the line is also a comment. This is useful for putting short notes after a statement.

 

7.0 Variables

Variables are used to hold the results of a subquery. It is important to note that they exist only to hold results. If you index more documents, the variables will not update to reflect the new documents. If you need variables to reflect the current state of the index you should use functions, discussed below.

Variables are most often used for clarity or to hold intermediate steps when conducting interactive uses of the Profiler. They work exactly the same as variables in a traditional programming language. Because of the nature of queries, we find that you will rarely need to use variables. They exist mainly for special circumstances.

To set a variable you simply type the name of the variable followed by an equals sign and the subquery you wish to assign to the variable. The name of a variable starts with a letter followed by any combination of alphanumeric characters and the underscore character. Variable names are not case sensitive. Thus BOBs_Query and Bobs_Query are equivalent.

Examples of variables and their use

a = 'apple';  

b = a | 'pear';   // equivalent to b = 'apple' | 'pear';

a;                // returns value of a to prProcessQuery

It is important to remember that in this version of the Profiler, all variables are global. In practice this shouldn't be a problem. However when you use variables within functions you should remember that those variable names are not local to that function.

 

8.0 Functions

Functions are very similar to variables, except that they reflect the current state of the index. Thus if you reset the Profiler and re-index some new documents, the information returned by a function will reflect the new documents. In a sense a function is basically a named query.

It is important to use functions whenever possible. Not only do they make your queries easier to read and understand, but the Profiler also uses them to optimize your queries. The more functions your query uses, the more the Profiler can optimize it.

A function name must start with a letter followed by any combination of alpha-numeric characters and the underscore character. The names are not case sensitive. If you attempt to create a function with a name that has already been used an error will be returned.

To define a function you type the function name followed by an equals sign. After the equals sign you put the subquery between braces { }. Each statement in the subquery must end with a semicolon. The function must also terminate with a semi-colon.

Examples of functions

my_function = { 'apple'; };  // returns all documents with apple



other_function = {

        fruit = 'apple' | 'cherry';  // uses variable 

        NEAR( 5, fruit, pie );       // function returns this 

                                     // value

        };



caller_funct = {

        other_function & 

           ( 'baking' | 'bake' | 'bakes' );

        };

It is important to remember that variables are not local to functions. While most functions typically consist of a single operator that then calls other subqueries, a function can consist of more than one statement. In this case only the last statement returns a value; the return values of any other statements are ignored.