Common
Find Duplicates Overview
NOTE: There is a distinction between Duplicate Detection, configured in System Administration, and Bulk Duplicate Detection performed in Common. Bulk Duplicate Detection is an extension of the Duplicate Detection capability. The Duplicate Detection validation rule runs as a foreground event, triggered when:
- A record is created or modified on a page where Duplicate Detection is enabled for a column(s)
- A user manually triggers an event that contains the validation rule
Bulk Duplicate Detection is a background event that detects duplicates for each record in a table. An Administrator or Designer runs the Bulk Duplicate Detection process in Common.
This use case covers the following topics:
- Configure Duplicates Parameters
- Create Object Views
- Register Objects
- Configure Columns
- Find Duplicates
- Run Post Duplicate Detection Process
- Resolve Duplicate Candidates
- View Results
- Review Duplicate Records
- Review Non Duplicate Records
Configure Duplicates Parameters
Parameters are Duplicate-specific settings that must be configured per installation. Parameters are delivered with recommended default values. Review these values and make any changes necessary for the installation.
NOTE: Additional settings for Duplicate Detection are configured in System Administration. Refer to Manage Duplicate Detection in System Administration for more information.
To configure duplicate parameters:
- Select Common > Configuration > Modules > Parameters-Duplicates in the Navigation pane.
-
Click Edit .
View the field descriptions for the Parameters-Duplicates page
- Update all fields if default values are not applicable.
- Click Save.
Create Object Views
The first step in setting up duplicate detection is to create a view that checks the object for duplicate records.
Create a view in the Common database that references an object (vendor, customer, material, etc.), a key and all relevant columns for Duplicate Detection. The view must include a single key field (use concatenation if more than one field to create the key exists in the table). Up to five columns can be used for duplicate comparison.
Register Objects
Objects are views or tables that contain data analyzed for duplication.
To register an object:
-
Select Common > Analyze in the Navigation pane.
NOTE: The Analyze page displays data sources registered in System Administration that are not added to the Ignore Data Sources page and to which the logged in user has security to view. Data sources are analyzed for database object changes, table statistics and duplicate records.
-
Click Duplicates for DATA SOURCE ID.
NOTE: If no records exist, the page automatically displays in add-mode. Otherwise, click Add.
- Enter an object name in Object field.
- Select a view from View Name list box. This is the view created in Create Object Views.
-
Update Search ID field if default value is not applicable.
NOTE: The Search ID is the name of the search table that controls which record pairs in the source data are stored as a duplicate. The Stewardship Tier is delivered with a default search table, DSPCommon.ttDuplcate, that has been set up for the BDD process.
NOTE: If a search table other than DSPCommon.ttDuplicate is to be used, it must be set up in System Administration.
-
Update Non Searchable Characters field if default value is not applicable.
NOTE: The Non Searchable Characters field controls characters excluded from the duplicate detection search.
-
Select ID from Stop List Id list box to control list of words ignored during duplicate detection process.
NOTE: The Stop List ID field indicates a list of words ignored during the duplicate detection search. Stop Lists are managed in System Administration.
-
Update Search Threshold field if default value is not applicable.
NOTE: The Search Threshold field controls the level to ignore false positives.
-
Update Duplicate Detection Threshold field if default value is not applicable.
NOTE: The Duplicate Detection Threshold field controls the weight percent of the calculated value for matched words. Words that match carry more weight than words that sound alike.
-
Update Synonym Weight field if default value is not applicable.
NOTE: The Synonym Weight field controls the weight value of synonym matches.
-
Update Sound Ex Weight field if default value is not applicable.
NOTE: The Sound Ex Weight field controls the percentage of combined calculated value for words found within the search (number of words found plus the number of words that sound alike divided by the total number of words). Words that match carry more weight than words that sound alike.
-
Select ID from Custom Sound Ex Function ID list box if a custom SQL Server Sound Ex function is used.
NOTE: Update Index Batch Size field if default value is not applicable. The Index Batch Size field controls the number of records to process in one pass through the data
-
Update Duplicate Detection Batch Size field if default value is not applicable.
NOTE: The Duplicate Detection Batch Size field controls the number of records queued up in the duplicate detection process. This field allows a subset of large files to be processed and at the same time, limits the resources required.
-
Update Word Ratio Threshold field if default value is not applicable.
NOTE: The Word Ratio Threshold field controls the number of words in each duplicate pair. A value less than 50% marks a duplicate value for removal.
- Click Remove Blank Lines check box to enable it, removing blank lines from the HTML formatted output, which displays on the Candidates page.
- Click Unicode Separate Characters check box to indicate Unicode characters (double-byte) are included.
- Click Save.
Configure Columns
Once the object is registered in Common, the way in which columns display on the Results page can be configured, such as bolding a field name or adding a page break.
To configure view columns:
- Click Analyze in the Navigation pane.
- Click Duplicates for Data Source ID.
- Select an object.
-
Click Columns on Page toolbar.
NOTE: All columns are pulled in by the duplicates view. The Order is determined in the view.
-
If records display on the page, the page opens in Add mode. Otherwise, click Add.
View the field descriptions for the Object Columns page
NOTE: The KEY check box is enabled for the key column of the table.
-
Click DUPLICATE DETECTION check box for up to five columns.
NOTE: If more than five columns are required for duplicate detection, use a view to concatenate some columns from the source table together to create only five duplicate detection columns. For example, to fit an address with a phone number into five columns, combine City, Region and Zip into a single field called CityStateZip.
-
Configure how the data displays in the Results page using the following settings:
- Show Name – Displays the column name.
- Bold– Displays the column in bold. It is recommended that only the main column (name, material description, etc.) is bold.
- Line Break – Places a line break after the column. This feature allows the five fields to be formatted to fit nicely in the Results page. For example, a line break would appear after the Name and Address fields, but not after the City and Region fields. The later fields are typically displayed on the same line in an address (City State Zip).
- Click Vertical View for a column to configure further.
-
Click Edit on Page toolbar.
View the field descriptions for the Object Columns page's Vertical View
- Select a value from the Dictionary ID list box to associate a dictionary (that contains a list of words and synonyms) with the column, if applicable.
- Click Save.
-
Configure the sensitivity of duplicate detection matches using the following settings:
- Must Match – Requires two records to match exactly in order to be considered a duplicate, regardless of the boaRank, which is the match value returned by the bulk duplicate detection. Must Match is only available if Duplicate Detection is enabled for the column.
- Allow Synonym – Indicates that two records may be duplicates if the field is not an exact match, but instead is a match based on a field dictionary synonym, for example, ST = STREET. The Allow Synonym check box is only available if a dictionary for the column is specified and if Must Match is enabled.
Allow Soundex – Indicates that two records may be a duplicate if the field is a Soundex match rather than an exact match. (A Soundex match is a coding standard used for indexing words by sound.)
NOTE: Any duplicate pair that does not match (either exactly, as a synonym or as a Soundex) is not a duplicate detection candidate.
Find Duplicates
Once the duplicates view has been registered and the columns for the Results page are configured, duplicates can be found.
To find duplicates:
- Click Analyze in the Navigation pane.
- Click Duplicates for Data Source ID.
- Click Build for Object; a confirmation message displays.
-
Click Ok.
NOTE: The build process searches the Data Source for potential duplicates. Only one object can be built at a time.
- Click Results on Page toolbar to view duplicate candidates.
Run Post Duplicate Detection Process
In the event the Duplicate Detection process does not finished due to a time out, run the Post Duplicate Detection process to present duplicate candidates.
To run the post duplicates detection process:
- Click Analyze in the Navigation pane.
- Click Duplicates for Data Source ID.
- Click Vertical View for Object.
- Click Actions tab.
- Click Post Process; a confirmation message displays.
- Click Ok.
Resolve Duplicate Candidates
Once duplicate records are identified, which are referred to as unresolved candidates at this point, they must be manually reviewed to determine what is an actual duplicate record and what is a unique record.
To resolve duplicate candidates:
- Click Analyze in the Navigation pane.
- Click Duplicates for Data Source ID.
-
Click Unresolved Candidates for Object.
- Decide if a duplicate pair should use the original value, use the duplicate value or is a non duplicate.
-
Options for resolving duplicate candidates are:
- Merge Left– Click to keep the original value and discard the duplicate value.
- Not Duplicate – Click if the value is not a duplicate, but rather a unique value.
- Merge Right – Click to use the duplicate value and discard the original value.
NOTE: Use caution when deciding what is a duplicate record. A duplicate source record can be used over another source record. A duplicate ERP record can be used over a duplicate source record. But do not use a duplicate source record over an ERP record – this can be accommodated by adding a NoChange column to the duplicate detection view and setting the value to 1.
The following example illustrates the results of NoChange = 1 on the SAP account. It is not possible to merge the SAP record into the Oracle record.
CREATE VIEW dDup.dbo.tvDEMO_NoChange_FLAGSel as
SELECT ObjectID, Name, Street, City, Region, PostalCode, Phone, zSource,
(select 1 where zSource = ‘SAP’) as NoChange
FROM AddressServer.dbo.ttADRC
WHERE ObjectType = ‘Vendor’
View Results
After duplicates have been identified and resolved, review results to verify the duplicate resolution is accurate.
To view results:
- Click Analyze in the Navigation pane.
- Click Duplicates for Data Source ID.
- Select an object.
-
Click Results on Page toolbar.
-
Review the RESULT STATUS to determine if there are unresolved candidates that must be reviewed. Options are:
- Green– All duplicates have been resolved.
- Yellow– Ten or fewer duplicates have been resolved.
- Red – No duplicates have been resolved or if there are no records returned.
NOTE: If there are unresolved candidates, click Objects icon on the Page toolbar and resolve the duplicate candidates.
NOTE: To allow business users to see the results of the Duplicate Detection process:
- Add the business user to the Common WebApp, then add the user to the Analyze WebApp Group. Refer to Assign Users to WebApp Groups in System Administration for more information.
- Assign a user-specific security definition to each business user. Add the Analyze data source(s) that are used in the Duplicate Detection process as a key to each user’s security definition. Refer to Establish User-specific Security Definitions in System Administration for more information.
Review Duplicate Records
Review duplicate records to decide if the record is a true duplicate, if the record must be reset or if the record is not a duplicate.
To review duplicate records:
-
From the Results page, click Duplicates.
-
Review the duplicate record pairs. Options are:
- Leave the record pair as is if the Obsolete Object and Actual Object are duplicates.
- Click Reset to reset the record pair as an unresolved candidate so it can be reanalyzed. This action removes the record pair from the Results page. If that record pair is in the current set of source data, it re-appears on the Candidates page.
- Click Mark As Non Duplicate to mark the Obsolete Object and Actual Object as non-duplicate records.
Review Non Duplicate Records
Review duplicate records to decide if the record is a true non duplicate or if the record must be reset.
To review non duplicate records:
-
From the Results page, click Non Duplicates icon.
-
Review the duplicate record pairs. There are two options:
- Leave the record pair as is if the Obsolete Object and Actual Object are duplicates.
- Click Reset icon to reset the record pair as an unresolved candidate so it can be reanalyzed. This action removes the record pair from the Results page. If that record pair is in the current set of source data, it re-appears on the Candidates page.