Select Performance AND Optimization OR Optimization
By John Brunett, Senior Software Engineer
The Synergy/DE Select class is a very powerful tool, but if you’re not careful, it can be a performance bottleneck. One of the advantages of a 3GL like Synergy/DE over a 4GL relational database is the performance potential of a more direct or intentional I/O. The Synergy/DE Select is simply an added layer above a flat or ISAM file that retrieves data based on a set of relational criteria. SQL can spend a significant amount of time analyzing those criteria against database cardinality (or projected number of rows) and then make a best guess as to how it should retrieve the data. The Synergy/DE Select quickly analyzes those criteria using the specified relational qualities against existing keys without knowledge of their cardinality. This is where the intentional part comes in: Your job is to construct a Where expression that is optimized specifically for your data. The purpose of optimization is to reduce the overall number of records/rows read to the absolute minimum, nearing the total number of selected matches.
First, if you’re using Select on a relative file or an ISAM file with no key references, you might as well just use a READS directly. The best performance is achieved when the Select is optimized. Currently, the following optimization levels can be viewed in DBG_SELECT output or identified by examining the IsOptimized Select property:
0 – Not optimized
1 – Head optimized
2 – Tail optimized
3 – Head and tail optimized
The best form of optimization on a Select occurs when you can isolate your selection between the lowest relation (head optimization) and the highest relation (tail optimization) to examine the fewest rows over the entire dataset. This includes comparing a key field for equality, e.g., (K1 == somevalue), or a narrowing relation such as (K1 >= lowvalue .AND. K1 <= highvalue), both of which provide level 3 optimization. Finding keys that establish this optimization holds the highest precedence in the Select optimizer. However, the Select doesn’t know your data. If your data layout would encounter fewer rows against a key that wasn’t chosen, you can override the Select’s key choice by explicitly specifying Where.Keynum with the better key.
The next best form of optimization is either head or tail optimization where you have isolated all possible selections to either the upper end or the lower end of a particular key value. This usually takes the form of positive relation (K1 > lowvalue), which tells the optimizer where to begin, or negative relation (K1 <= highvalue), which tells the optimizer where to stop. In either case, the objective is to reduce the number of rows fetched and compared.
When not optimized, the Select has to read every record in the file, which you should try to avoid, especially on large files. For a frequently used query, consider adding an appropriate key that would provide an optimized selection path. If you can’t do that, another performance-boosting alternative is to try adding OrderBy (either Ascending or Descending, not AscendingKey or DescendingKey) against one of the fields referenced in the Where expression. This form of OrderBy can process a large file much more quickly because it performs block I/O instead of individual record I/O. In a standalone test against a 3,000,000-record (477 MB) file, using OrderBy on a non-optimized Select resulted in a nine-fold performance improvement.
One of the most important things to understand in ensuring your Select has the best optimization is your use of ANDs and ORs and the order of parentheses. If K1 represents a key field and F1 a non-key field, the expression (K1 == V1 .OR. F1 == V2) or a general use of OR does not optimize. Even though the equality of K1 can be optimized, the equality of F1 requires all rows in the file to be checked. Beginning in Synergy/DE version 10.3.1, a new form of optimization was added for certain expressions containing OR. When the terms on either side of an OR operator reference the same key for equality and all other OR expressions do the same, and as long as there are no AND operators on the same leg of the expression tree, Select can perform a multi-pass level-3 optimization for each OR’d set of terms. This happens automatically when using the Where.In method. When this occurs, the OR optimization takes precedence over any potential AND optimization. The expression (K1 == V1 .OR. K1 == V2) with OR optimization generates two level-3 passes on key 1 and returns V1 + V2 rows. In addition, the expression ((K1 == V1 .OR. K1 == V2) .AND. F1 == V3) with OR optimization performs the same two passes on key 1 but only returns rows whose field F1 matches V3, returning V1 & V3 + V2 & V3 rows.
Recently, we discovered that choosing the OR optimization over an alternative AND optimization doesn’t always generate the best performance. The expression
((K1 == V1 .OR. K1 == V2) .AND. K2 == V3)
was optimizing at level 3 for key K1, causing a two-pass scan across the V1 and V2 rows. Upon examination, we found the number of V1 + V2 rows on K1 vs. the number of V3 rows on K2 was over 230:1, which caused a noticeable lag. Starting in version 10.3.3b, explicitly choosing a different key by using Where.Keynum now allows for the intentional override of the OR optimization with an AND optimization specified by the key. So, by changing the expression to
(Where.Keynum(2) .AND. (K1 == V1 .OR. K1 == V2) .AND. K2 == V3)
the optimizer is able to optimize on K2 at level 3. Also, starting in version 10.3.3c, for an expression containing both ORs and ANDs when the ORs can’t be optimized, we now automatically check for AND optimization. If an AND operator is at the top of the expression tree and only ANDs are present on one side of the tree, we can optimize for AND automatically (without the need for an explicit Where.Keynum):
((K1 == V1 .OR. F1 == V2) .AND. (K2 == V3 .AND. F2 == V4))
In this example, OR optimization cannot occur because K1 and F1 are either different keys or non-key fields. However, the AND operator is at the top of the expression tree and only AND operators appear on the other side, so AND optimization automatically occurs.
To ensure your Select has the best optimization, using DBG_SELECT can be a valuable tool. Simply setting DBG_SELECT to 1 (and optionally setting DBL_SELECT_FILE to a filename) will produce output that shows each Select by source line number, whether it is optimized, and by which key. It also shows actual performance against your existing data in the form of total rows fetched vs. total number of rows available, with greater disparity being better. In addition, you can check the exact order of operation in your expression by matching up parentheses. When not explicitly specified, the expression processor adds missing parentheses for every term and then displays the result in the DBG_SELECT output. If you find that your intention was not exactly the same as that chosen by the expression processor, you can add your own parentheses to force the specific order of operation you intended. Especially in the OR + AND situation, you may find that an OR operator is at the top of the tree (indicated in the DBG_SELECT output), in which case you could try adding parentheses around the OR operator to force an AND operator to the top for optimization to occur.
(K1 == V1 .OR. K1 == V2 .AND. K2 == V3)
— OR+AND (OR on top), optimization is cancelled
((K1 == V1 .OR. K1 == V2) .AND. K2 == V3)
— AND is at the top, OR or AND optimization is possible
It’s obvious these two expressions wouldn’t produce the same results, but I’ve included them as an illustration of an OR-topped vs. an AND-topped expression tree.
Knowing your data is a good way to ensure the best performance from Select. So, with a little care you can ensure AND optimization or OR optimization for the best Select performance for your application.