Select Join Optimization: Tag, You’re It

By John Brunett, Senior Software Engineer

In Synergy Version 10.3.3, we added initial support for Join. You can find an article I wrote about Join back in the July 28, 2016, issue of Synergy-e-News entitled “Join the Select Who Select Using Join” that describes our somewhat esoteric/non-conventional implementation of Join. The last article I wrote, “Select Performance AND Optimization OR Optimization,” published in the March 26, 2018, Synergy-e-News, described how to optimize your Select. Now I’m focusing on how to optimize your Joins.

Like Select, which uses the (Where) expression to optimize access, Join uses the (On) expression. If you’re already familiar with the (Where) expression, the (On) expression looks nearly identical with a few exceptions.

When optimizing for a single table Select, the (Where) expression is examined for optimal performance by attempting to find the best key to use. Each field in the expression is matched against defined keys in the table’s file. If no keys are matched, no optimization occurs. However, for each field that does match a key, the best key (using implicit key selection or best association—EQ or narrowing) is eventually chosen. With a Join, the (On) expression uses the same implicit key selection, but if no keys are matched, the join cannot occur.

The primary function of a single Join is to match a row (record) from one outer table (file) with one or more rows in another inner table. With our implementation of Join, this requires a matching key on the inner table. Allowing non-optimized (or non-key) lookups on an inner table would require table scans against each row in the outer table, producing an undesirable amount of I/O—effectively that of a cross-join or Cartesian product. By restricting a Join to only keys on the inner table, the Join always produces some form of optimization and protects you from undesirable results.

Similar to (Where) optimization, the (On) expression is examined for a best matching key on the inner table. By convention, we require the first part of the (On) expression to be expressed as an “equi-join.” Consider the equi-join as one or more (.EQ./.EQS.) operations separated by (.AND.), with each referencing fields on the outer table against fields on the inner table that represent a key. Since many keys are made up of multiple field segments, the optimizer will stitch together those fields that are contiguous, hence constructing more key material that results in fewer I/O operations (and is more optimized). Any expression following the equi-join part is simply a filter condition on the row.

As with the (Where) expression, the (On) expression uses the following order of precedence for implicit key selection (as excerpted from the Synergy DBL Language Reference):

  • A key that does not allow duplicates, where all key segments make up the entire length of the key, determines the optimized key. When more than one of these exist, the one whose key segments match the order specified in the Where or On clause takes precedence.
  • A key with the most contiguous segments determines the optimized key. When more than one of these exist, the one whose key segments match the order specified in the Where or On clause takes precedence.
  • A key with the longest single key segment determines the optimized key.

In counting segments, the Select defines a segment as either a physical key segment (a key segment defined by a file) or a pseudo-segment (a partial key or key section). For example, given an inner table with the following keys and record layout:

      (Primary key 0: Start=1, Length=20)
      (Alternate key 1: Start=1:16, Length=6:5)
(Alternate key 2: Start=15:1:16, Length=1:6:5)
      record myRec
            PartNo,a4
            CamNo ,a2
            Date  ,a8
            RecCd ,a1
            SeqNo ,a5

and the (On) expression “equi-join” below:

(On)(A.PartNo.EQ.myRec.PartNo.AND.A.CamNo.EQ.myRec.CamNo.AND.A.SeqNo.EQ.myRec.SeqNo)

the 11-byte alternate key 1 will be chosen as the optimal key, even though the 20-byte primary key is longer. This happens because only the first contiguous part (or one pseudo-segment) of the primary key can be used, whereas two contiguous segments of key 1 can be used.

One additional (On) optimization that is not available in (Where) is the use of tags. Tags are a structure attribute of Synergy/DE Repository that have a fixed value and are frequently added as key segments. Join tags are fields specified in the filter condition section following the join condition, compared for equality against a literal or value that didn’t come from the outer table. Combined with the equi-join key segments, the tag(s) in this section may result in choosing a better key, which should result in fewer I/O operations.

In the example keys and record layout above, the following (On) statement would choose key 2 with the addition of the tag:

(On)(A.PartNo.EQ.myRec.PartNo.AND.A.CamNo.EQ.myRec.CamNo.AND.A.SeqNo.EQ.myRec.SeqNo.AND.A.lot1.GT.0
.AND.myRec.RecCd.EQ."1")

Explicitly choosing a key in an (On) expression using the Keynum() method was not offered in the initial Join release because it didn’t seem necessary. However, after some live field analysis, it was determined to be a very useful feature. It has been added as a new Join feature in the Synergy/DE 11 release. As with (Where), simply begin the (On) expression with On.Keynum(n) and pass the key index you’d like to use.

The example below will force the use of the primary key instead of the implicit key that would otherwise be chosen:

(On)(On.Keynum(0).AND.A.PartNo.EQ.myRec.PartNo.AND.A.CamNo.EQ.myRec.CamNo.AND.A.SeqNo.EQ.myRec.SeqNo)

Beware: Unlike a single table Select using (Where), you’re not allowed to choose just any key—it must be a key that matches at least one segment specified in the equi-join part of the (On) expression.

Due to .NET compiler compatibility, Keynum() won’t be available until 11.1.1. However, a workaround is available for Synergy .NET only. Use the syntax “new On(23, n)” in place of On.Keynum(n), like this:

(On)(new On(23,0).AND.A.PartNo.EQ.myRec.PartNo.AND.A.CamNo.EQ.myRec.CamNo.AND.
A.SeqNo.EQ.myRec.SeqNo)

Also added to the latest release is additional DBG_SELECT=1 logging to help better optimize Join as well as Select. Using the example above, logging will now show the following:

From(table1.ism) As a
Join From(table2.ism) As b
On(a.A(1:4).EQ.b.A(1:4).AND.a.A(5:2).EQ.b.A(5:2)).AND.a.A(7:5).EQ.b.A(16:5)
Where()
Non-optimized key: 0
Inner table (b)
   Implicit key selection via optimization: (See LRM)
       Key: Ordered nodup contig segs/Nodup contig segs/Ordered contig segs/Contig segs/Contig size
       0: 0  0  2  2  6
      >1: 0  0  3  3  11
       2: 0  0  0  0  0
   Optimized key:1 with segment ranges: (1:4) (5:2) (16:5)
Total rows fetched from table a: 900 out of 900
Total rows fetched from table b: 180 out of 180
Total rows joined: 90

And the following, with the addition of a tag field:

From(table1.ism) As a
Join From(table2.ism) As b
On((a.A(1:4).EQ.b.A(1:4).AND.a.A(5:2).EQ.b.A(5:2)).AND.a.A(7:5).EQ.b.A(16:5)).AND.b.A(15:1).EQ."1"
Where()
Non-optimized key: 0
Inner table (b)
   Implicit key selection via optimization: (See LRM)
     Key: Ordered nodup contig segs/Nodup contig segs/Ordered contig segs/Contig segs/Contig size
       0: 0  0  2  2  6
       1: 0  0  3  3  11
      >2: 1* 0  3  4  12 *includes 1 join-tag
   Optimized key:2 with segment ranges: Tag(15:1) (1:4) (5:2) (16:5) 
Total rows fetched from table a: 900 out of 900
Total rows fetched from table b: 18 out of 180
Total rows joined: 9

And the following, with Keynum:

From(table1.ism) As a
Join From(table2.ism) As b
On(Keynum(0).AND.((a.A(1:4).EQ.b.A(1:4).AND.a.A(5:2).EQ.b.A(5:2)).AND.a.A(7:5).EQ.b.A(16:5)).AND.b.A(15:1).EQ."1"))
Where()
Non-optimized key: 0
Inner table (b)
   Explicit key selection: Keynum(0)
   Optimized key:0 with segment ranges: (1:4) (5:2)
Total rows fetched from table a: 900 out of 900
Total rows fetched from table b: 269 out of 180
Total rows joined: 9

Select and Join may not be able to pick the best keys based on your data, so knowing your data and directing the optimization is very important. The best key resulting in the fewest I/O operations is really… well, key. Whether it’s simply a single field matching a key, or a group of fields matching key segments with a tag, in the end, you’re it.