System Administration

Manage Duplicate Detection

Duplicate Detection is a validation rule included with the Stewardship Tier that uses search engine indexes to find possible duplicate records. Unlike most validation rules, which are hard coded views residing in the database, Stewardship Tier dynamically builds a derived table at runtime to perform Duplicate Detection. The derived table is treated much the same way as any user-defined validation rule. The results of the query display the same as any other validation results.

The Duplicate Detection validation rule runs as a foreground event, triggered when:

A record is created or modified on a page where Duplicate Detection is enabled for a column(s)

NOTE: If both search and duplicate detection are used, the user can enable a validation as a page event that runs in the foreground to check for duplicates when a single record is added. Refer to Configure an Index, Add a Column to an Index and Register a Validation Rule to a Page for more information.

A user manually triggers an event that contains the validation rule

NOTE: The validation can be assigned to any event, with any severity or priority. Add a Duplicate Detection validation rule to the OnValidate event as a warning. Refer to Create Events for more information.

The Duplicate Detection validation rule is available for use on a page if the following criteria have been met:

The page must be indexed.
Duplicate detection must be enabled on one or more index columns.

Refer to Build indices for a Data Source for Search and Duplicate Detection and Enable Duplicate Detection for more information.

NOTE: There is a distinction between Duplicate Detection, configured in System Administration, and Bulk Duplicate Detection performed in Common. Bulk Duplicate Detection is an extension of the Duplicate Detection capability. Bulk Duplicate Detection is a background event that detects duplicates for each record in a table. An Administrator or Designer runs the Bulk Duplicate Detection process in Common. The results are stored in a Stewardship Tier-created table where a client can use them.

This section contains these topics:

Enable Duplicate Detection
Search for Duplicates
Determine the Duplicate Detection Threshold
Define an Index for Duplicate Detection
Control Display and Limit Possible Candidates

Enable Duplicate Detection

Before Duplicate Detection can be enabled, an index must be created. Users can perform Duplicate Detection using the same index that is used for searching or an alternate index. The index must have at least one column on which Duplicate Detection has been enabled.

Refer to Build indices for a Data Source for Search and Duplicate Detection, Define an Index for Duplicate Detection and Add a Column to an Index for more information.

After Duplicate Detection is enabled, a validation rule named Duplicate Detection displays in the View list box on the Page Validation Rules page’s Vertical View.

To enable duplicate detection for a column:

Select Admin > Data Sources from the Navigation pane.
Click the Index icon for the data source.
Click the Columns icon for the indexed table.
Click Edit for the column.
Click Duplicate Detection check box.
Click Save.

Any number of columns can be used for duplicate detection. For best results, use only one or two columns. Limiting the number of words that are compiled into the search string increases performance. The more words searched, the greater the noise in the search results. Refer to Determine Duplicate Detection Threshold for more information.

Search for Duplicates

Duplicate records are searched the same way a key word is searched. The only difference is the search string is constructed with Duplicate Detection.

To build the search string, the Stewardship Tier extracts the values from each of the columns that have been enabled for Duplicate Detection. The words are concatenated together and separated by the OR logical operator. The search is then performed with this search string just as if it had been manually entered.

For example, assume a record is being added to the Customers table. Duplicate Detection has been enabled for the CompanyName column. The user supplies a company name of Eastern Vacuum Mart. The system converts the company name to the search string Eastern OR Vacuum OR Mart. All records matching this search string with a rank higher than the Duplicate Detection Threshold value is returned. The record itself is explicitly excluded and the user cannot see the recently-added record (Eastern Vacuum Mart).

Refer to Determine the Duplicate Detection Threshold for more information.

Determine the Duplicate Detection Threshold

The Search Threshold is used when users perform a search on a page. In contrast, the Duplicate Detection Threshold is used when searching the index for duplicate records.

If the duplicate detection process finds a match for a record whose quality is below the cut-off threshold (expressed as a percentage value), the match is not stored in the #Duplicate table.

Both are set on the Index Specifications page. Refer to Configure an Index for more information about setting thresholds.

In general, the Duplicate Detection Threshold should be set higher than the Search Threshold. The higher the number, the less likely false positives display. The fewer false positives, the more accurate the results are.

The number of columns used to construct the Duplicate Detection search string should affect the threshold setting. If Duplicate Detection has been enabled on one or two columns, then the default value of 50.00% is appropriate. The more columns searched, the longer the search string and the lower the threshold setting. If Duplicate Detection has been enabled on five columns, a rank of 20.00% may be more effective.

The number of words used to construct the duplicate detection search string affects the Duplicate Detection Threshold. If duplicate detection is enabled on five columns, but three of the columns are usually NULL when the record is created, then the default value of 50.00% may still be effective. In contrast, if duplicate detection is enabled on two columns that normally contain ten words each, then the lower value may be necessary.

Define an Index for Duplicate Detection

By default, the Duplicate Detection validation rule uses the search index. A page can be explicitly associated with an index for searching and with another index for Duplicate Detection. This association is useful when records are added to one table; however, duplicates need to be checked for in another table.

NOTE: When performing Duplicate Detection against an index of another table, the view must contain the primary key columns as defined at the index level. If there are no keys defined at the index level, then the view must contain the source table’s primary keys.

To define the index for Duplicate Detection:

Click Admin > WebApps on the Navigation pane.
Click the Pages icon for a webapp.
Click Vertical View for a page.
Click the Search Options tab.
Click Edit.

View the field descriptions for the Pages page Vertical View

Select an option from Duplicate Detection Search ID list box.
Click Save.

Control Display and Limit Possible Candidates

With the results of Duplicate Detection validation, the primary key in the Duplicate Detection Validation fields and any fields for which Duplicate Detection has been enabled display. To include or exclude specific columns, create a view containing the primary key columns and any additional columns that should display to the user.

Once created, the view needs to be registered in Stewardship Tier.

To register the view:

Click Admin > WebApps on the Navigation pane.
Click the Pages icon for a webapp.
Click the Events icon for a page.
Click the Validation Rules icon for an event.
Click Vertical View for a rule.
Click Edit.
Select view from Duplicate Detection View list box.
Click Save.

When the Duplicate Detection Validation is triggered, the columns in the view display rather than the default columns selected by Stewardship Tier. In addition to controlling which columns display to the user, the Duplicate Detection view can be used to filter out records. This filtering process allows the developer to control which records are candidates for Duplicate Detection.

By way of example, an index may contain customer records from around the world, but only checking for Duplicate Detection in the United States is necessary. A Where Clause can be added to the Duplicate Detection view. The Where clause selects only those customers located in the United States by performing an Inner Join to the Duplicate Detection view at runtime. Therefore, any records not present in the view are, by definition, excluded from the validation result set.