Profiling Engine SDK
Product Overview

Our goal with the Lextek Profiling Engine was to develop a tool optimized for the needs of the categorization, profiling, and routing markets. The engine includes significant new technologies, some of which are designed to allow reuse of the kinds of queries that categorizers and routers typically run. This manual covers the aims of the product, its capabilities, and strategies for using the Profiler in your own services and technologies. These strategies are meant to give you ideas for integrating the Profiler into your existing projects, to suggest new features and improvements you can add to your services and applications, and perhaps even to show ways of leveraging your existing technologies into new markets.


Introduction

Traditionally, routing and categorization products have used either full-featured indexing engines or simple character scanners to analyze documents. Simple scanners look for patterns in text as the text is read in character by character. These scanners are often appealing in terms of cost, but they quickly become unwieldy and difficult to maintain. Further, they typically lack the power and flexibility needed for most types of analysis. For example, the complex relationships between the terms in a document simply can't be uncovered easily with these kinds of scanners. While useful for some types of analysis, they can't answer the needs of modern routing and categorization services. The usual alternative, a traditional indexing engine, faces the problem that such indexers were primarily designed for purposes different from those of the categorization industry.
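To make the maintenance problem concrete, here is a minimal sketch (ours, not Lextek's code) of such a character scanner in C++. It counts literal patterns as characters stream in; anything beyond literal matching, such as relationships between terms, would require hand-written state for each new rule, which is exactly where scanners like this become unwieldy.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Patterns the scanner watches for; every new rule means another
    // entry here and more special cases in the loop below.
    const std::vector<std::string> patterns = {"stock", "merger"};
    std::vector<int> hits(patterns.size(), 0);

    std::size_t maxLen = 0;
    for (const auto& p : patterns) maxLen = std::max(maxLen, p.size());

    std::string window;                       // the last maxLen characters seen
    char c;
    while (std::cin.get(c)) {                 // character-by-character read
        window.push_back(c);
        if (window.size() > maxLen) window.erase(0, 1);
        for (std::size_t i = 0; i < patterns.size(); ++i) {
            const auto& p = patterns[i];
            if (window.size() >= p.size() &&
                window.compare(window.size() - p.size(), p.size(), p) == 0)
                ++hits[i];                    // pattern ends at this character
        }
    }
    for (std::size_t i = 0; i < patterns.size(); ++i)
        std::cout << patterns[i] << ": " << hits[i] << "\n";
}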

Traditional indexers are designed to create large, static, disk-based indexes. Because they are optimized for making numerous queries against essentially the same data, they can be inefficient for the types of queries profilers use. The methodologies for creating an index designed to handle static searches over megabytes or even gigabytes of data result in relatively slow indexing speeds. Traditional indexers also often add overhead for features that categorizers rarely need. Routers and categorizers usually need only small, temporary indexes to determine the nature of the document being analyzed at that moment. In addition to indexing speed and overhead, traditional indexers are usually designed with interactive searching in mind. They thus focus on the types of queries that end users are most likely to make. Even when they add advanced functionality, these indexers are designed with less emphasis on large, complex queries and query reuse.

To deal with the very different needs of the categorization and routing markets, the Lextek Profiling Engine was designed to create efficient memory-based indexes. Most routers, rather than indexing hundreds or thousands of documents at a time, index only a few. Our Profiler includes in the indexing process only those features these services need, improving speed and reducing overhead. Because the indexes will not be reused, we can eliminate compression, encryption, and the technologies designed to deal with the problems of disk access. Minimizing disk access while stepping through an index is a complex problem for programmers writing indexers: much of the time spent querying a traditional index goes to searching through the index on disk, both waiting for the disk to transfer the information and computing where the desired data is. In general, accessing a disk is several orders of magnitude slower than going directly to memory. Since profiling applications don't need disk indexes, we can avoid all of these design compromises. By keeping our index in memory, we can access it dramatically faster.
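As an illustration of the difference, here is a minimal sketch of a per-document, in-memory inverted index of the kind described above. The structure and names are ours for illustration, not the SDK's: the index is built fresh for each document, lives entirely in RAM with no compression or disk layout, and is discarded when the document has been profiled.

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using PostingList = std::vector<std::size_t>;   // word positions in the doc

std::unordered_map<std::string, PostingList> indexDocument(const std::string& text) {
    std::unordered_map<std::string, PostingList> index;
    std::istringstream words(text);
    std::string word;
    for (std::size_t pos = 0; words >> word; ++pos)
        index[word].push_back(pos);             // pure in-memory append
    return index;
}

int main() {
    auto idx = indexDocument("shares of the fund rose as the fund reported gains");
    for (std::size_t p : idx["fund"]) std::cout << "fund @ " << p << "\n";
    // The whole index lives in RAM and dies with the document, so a
    // query is just a hash lookup: no seeks, no decompression.
}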

Effectively, what people in the categorization and routing industries do is create concepts or categories and compare documents against them. Applications in these industries then take action based upon which of these "ideas" a document matches. The action might be to send a warning to a system administrator running a firewall, to delete a harmful email, or simply to label the document as belonging to a specific category. Consider, for example, an application that analyzes news and forwards a copy to people who own stock in the companies the news concerns. To do this, the application must have concepts representing various aspects of each stock, along with a measure of how closely a document matches those concepts. Each document is viewed in terms of all of these concepts and how closely it matches each one.
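A sketch of that matching loop might look like the following. The concept definitions, weights, and threshold here are invented for illustration; the Profiler's own query language is far richer than a weighted bag of terms.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct Concept {
    std::string name;
    std::map<std::string, double> termWeights;  // terms that signal the concept
    double threshold;                           // score needed to trigger action
};

double score(const Concept& c, const std::string& doc) {
    double s = 0;
    std::istringstream words(doc);
    std::string w;
    while (words >> w) {
        auto it = c.termWeights.find(w);
        if (it != c.termWeights.end()) s += it->second;
    }
    return s;
}

int main() {
    std::vector<Concept> concepts = {
        {"acme-stock", {{"ACME", 2.0}, {"shares", 1.0}, {"earnings", 1.0}}, 3.0},
    };
    std::string doc = "ACME shares jumped after strong earnings";
    // Every document is scored against every concept; the action here
    // is simply routing, but it could be flagging, deleting, or labeling.
    for (const auto& c : concepts)
        if (score(c, doc) >= c.threshold)
            std::cout << "route to subscribers of " << c.name << "\n";
}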

This is quite different from what most indexers are designed to do. Most indexers are used to compare very simple queries against numerous documents and return the relevant ones. The emphasis is on the documents, not the concepts. Even the complex queries used with traditional indexers are rarely reused over and over again. The main use of a traditional indexer is to search for only a handful of partially related terms; according to many studies, the majority of queries are at most a few lines long. In contrast, our clients in the categorization industry have queries over a megabyte in size. These queries are often made up of dozens or even hundreds of complex sub-queries, and each sub-query may in turn be used by hundreds of other queries. Usually the real value that categorization companies provide to their customers is the creation of these queries representing concepts, categories, or ideas. Thus these companies need very robust and powerful queries.

Lextek, in consultation with several of our clients, has attempted to meet these needs by designing a query language optimized for extremely complex queries. The query part of the SDK has, in effect, become a full programming language, giving you immense power over your analysis. You can create named sub-queries that can be called over and over again within the Profiler, and the Profiler exploits this reuse to dramatically improve speed. Not only can you reuse your queries, but your query results can also be stored and reused, so that once a particular query has been computed it need not be computed again. Most importantly, you can take these reusable queries and analyze new documents with them; the reusable queries then update to reflect the new documents you are examining.

In practice this means that you can pre-compute your categories or concepts and then start profiling documents. Contrast this with most indexers, where you have to load the same queries over and over again. Likewise, you can keep libraries of the queries, ideas, and concepts you have developed, and have smaller queries draw on these libraries as needed. This lets you leverage your existing research and development for any query you might need. Our clients have found that this can open new doors for the value they add to any indexing engine.
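The sketch below illustrates the underlying technique, named sub-queries whose per-document results are cached so that a sub-query shared by many top-level queries is evaluated only once per document. All of the names and the caching scheme here are our illustration, not the Profiler's actual API, which is described under API Functions and Query Language.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

class QueryEnv {
public:
    using Query = std::function<bool(QueryEnv&)>;

    void define(const std::string& name, Query q) { queries_[name] = std::move(q); }

    // Evaluate a named query against the current document, reusing the
    // cached result if this sub-query already ran for this document.
    bool eval(const std::string& name) {
        auto hit = cache_.find(name);
        if (hit != cache_.end()) return hit->second;
        bool result = queries_.at(name)(*this);
        cache_[name] = result;
        return result;
    }

    void newDocument(std::string text) { doc_ = std::move(text); cache_.clear(); }
    bool contains(const std::string& term) const { return doc_.find(term) != std::string::npos; }

private:
    std::unordered_map<std::string, Query> queries_;
    std::unordered_map<std::string, bool> cache_;   // per-document results
    std::string doc_;
};

int main() {
    QueryEnv env;
    // Build the "library" of named concepts once...
    env.define("tech-sector", [](QueryEnv& e) { return e.contains("software") || e.contains("chip"); });
    env.define("layoffs",     [](QueryEnv& e) { return e.contains("layoffs"); });
    // ...and compose top-level queries from the shared sub-queries.
    env.define("tech-bad-news", [](QueryEnv& e) { return e.eval("tech-sector") && e.eval("layoffs"); });
    env.define("tech-any-news", [](QueryEnv& e) { return e.eval("tech-sector"); });

    // The same query library then profiles a stream of documents;
    // "tech-sector" runs once per document no matter how many
    // top-level queries reference it.
    for (const std::string& doc : {"chip maker announces layoffs", "bakery opens downtown"}) {
        env.newDocument(doc);
        std::cout << doc << " -> bad-news=" << env.eval("tech-bad-news")
                  << " any-news=" << env.eval("tech-any-news") << "\n";
    }
}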

Next: Query Capabilities