Duplicate Record Remover Help

 

STEP 5: Select fields to be used for examination process

 

In this step, you select the fields that the fuzzy-matching engine will use during its examination process to find records that are similar to each other.

 

There are some important factors to consider when selecting which fields to use for comparison:

1)    The time the examination process will take:  The more fields you select, the longer the fuzzy-matching engine will take to identify duplicates, because there is more data to examine.  This is only likely to be a problem if you have a large number of records (100,000 or more) or you are running the tool on slow hardware.  Other factors that affect how long the process takes include how much data is in each field and what else the machine is doing while the examination process is underway.

As a general rule, if you have a large dataset, select as few fields as you can without compromising the accuracy of the end result.  Some experimental testing may be necessary to establish the right fields to examine.  Click here for more information on ‘Why the examination process takes so long’.
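To get a feel for why fewer fields help on large datasets, the following is a rough sketch only. It assumes, purely for illustration, that every record is compared with every other record on every selected field; the tool's actual engine may work differently, and the function name is hypothetical.

```python
def estimated_field_comparisons(record_count: int, field_count: int) -> int:
    """Rough work estimate, ASSUMING an all-pairs comparison on every selected field.

    This is an illustrative model, not a description of the tool's engine.
    """
    # Number of distinct record pairs: n * (n - 1) / 2
    pair_count = record_count * (record_count - 1) // 2
    return pair_count * field_count


# Compare the estimated work for 100,000 records with 4 fields and with 12 fields.
few_fields = estimated_field_comparisons(100_000, 4)
many_fields = estimated_field_comparisons(100_000, 12)
print(f"4 fields:  {few_fields:,} field comparisons")
print(f"12 fields: {many_fields:,} field comparisons")
```

Under this assumption the work grows in direct proportion to the number of fields, but with the square of the number of records, which is why field selection matters most on large datasets.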

2)    Reducing False Duplicates (or mistaken matches):  False Duplicates occur when the tool finds a high degree of similarity between two records that are not actual duplicates.  They add time to manual processing, because you have to click Ignore Match for more records than should be necessary.  Ways to keep false duplicates to a minimum include:

a.    Only select fields that have a high degree of uniqueness in their data, such as Names, Addresses, Phone Numbers and Email Addresses.  These fields should be unique to each record, as they are typically not shared between multiple people (although this depends on the nature of your data).  If you have many records for people who share a company name and address, you may want to exclude these fields, as they will increase the similarity score between records and return more false duplicates.

b.    Avoid fields that share the same value across large numbers of records (e.g. ‘Status’ or ‘Customer Type’, where there is a limited list of valid values in the entire database, so many records share the same value).  When such fields are used for comparison, they give the fuzzy-matching engine a false sense of similarity between records, increasing the likelihood of false duplicates.
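One way to check how unique a field is before selecting it is to count its distinct values in an export of your data. The sketch below is not part of the tool; the function name, field names and sample records are hypothetical, and the records are assumed to be available as a list of dictionaries.

```python
def uniqueness_ratio(records, field):
    """Return distinct non-empty values divided by total non-empty values (0.0 to 1.0)."""
    # Normalise values so that case and surrounding spaces do not count as differences.
    values = [(record.get(field) or "").strip().lower() for record in records]
    non_empty = [value for value in values if value]
    if not non_empty:
        return 0.0
    return len(set(non_empty)) / len(non_empty)


# Hypothetical sample data for illustration only.
sample_records = [
    {"Email": "ann@example.com", "Status": "Active"},
    {"Email": "bob@example.com", "Status": "Active"},
    {"Email": "cat@example.com", "Status": "Active"},
    {"Email": "dan@example.com", "Status": "Inactive"},
]

for field_name in ("Email", "Status"):
    print(field_name, uniqueness_ratio(sample_records, field_name))
```

A ratio close to 1.0 suggests the field distinguishes records well. A low ratio, as a field like ‘Status’ would show across a real database, suggests the field is likely to inflate similarity between unrelated records.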

3)    Reducing Missed Duplicates:  Missed duplicates occur when the tool fails to identify real duplicate records as duplicates, because other data in the records makes them look more different than they really are.  Ways to avoid this include:

a.    Avoiding fields that will have different values between true duplicates, for example a date or numeric field such as ‘Sales Value’ or ‘Created Date’, where two true duplicate records are unlikely to have the same value.  Such fields decrease the similarity between true duplicates, increasing the likelihood of missing them.

b.    Avoiding text fields that contain large amounts of text.  For example, a free-text field like ‘Customer Notes’ might contain a large amount of text in one record but not in the other.  This would greatly decrease the similarity between the duplicates, increasing the likelihood of the pair being missed.
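The effect described in point 3 can be illustrated with a small sketch. It does not reproduce the tool's fuzzy-matching engine: it uses Python's standard difflib.SequenceMatcher and a simple average across fields as a stand-in, and the records and the likeness function are hypothetical.

```python
from difflib import SequenceMatcher


def likeness(record_a, record_b, fields):
    """Average per-field text similarity (0.0 to 1.0); a stand-in, not the tool's scoring."""
    scores = [
        SequenceMatcher(None, record_a[field].lower(), record_b[field].lower()).ratio()
        for field in fields
    ]
    return sum(scores) / len(scores)


# Two hypothetical records for the same person, entered twice.
record_a = {
    "First Name": "Jonathan",
    "Last Name": "Smith",
    "Created Date": "2008-01-15",
    "Customer Notes": "Called about renewal; prefers email contact.",
}
record_b = {
    "First Name": "Jonathon",
    "Last Name": "Smith",
    "Created Date": "2009-06-30",
    "Customer Notes": "",
}

names_only = likeness(record_a, record_b, ["First Name", "Last Name"])
with_extra_fields = likeness(
    record_a, record_b, ["First Name", "Last Name", "Created Date", "Customer Notes"]
)
print(f"Names only:            {names_only:.0%}")
print(f"With date and notes:   {with_extra_fields:.0%}")
```

In this sketch, adding ‘Created Date’ and ‘Customer Notes’ pulls the score for a true duplicate down, which is the kind of drop that can cause a real duplicate to be missed.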

As a general rule, we recommend you select all name fields (First Name, Last Name, Middle Name, Title, Initials, Salutation, Nickname, etc.), all Address fields (although you could consider leaving out City and State, as these have a low degree of uniqueness), and all Phone Number and Email fields, as these would typically also be unique to each record (although email addresses containing the same domain name might artificially increase the similarity).

We strongly recommend you do some test runs to see how many false duplicates you get.  Remember that it's normal to see many false duplicates at the lower likeness levels, in the 40%-80% range, but you should see very few in the 80%-100% range.
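If you keep a note of your decisions during a test run, a short script can summarise how many false duplicates fell into each likeness range. This is a sketch only: the input format (a likeness percentage and whether you judged the match to be a true duplicate) is hypothetical, and treating exactly 80% as part of the upper range is an assumption.

```python
def false_duplicates_by_range(reviewed_matches):
    """Count false duplicates in the 40%-80% and 80%-100% likeness ranges."""
    counts = {"40-80": 0, "80-100": 0}
    for likeness_percent, is_true_duplicate in reviewed_matches:
        if is_true_duplicate:
            continue  # only false duplicates are counted
        if 40 <= likeness_percent < 80:
            counts["40-80"] += 1
        elif 80 <= likeness_percent <= 100:  # assumption: 80% counts as the upper range
            counts["80-100"] += 1
    return counts


# Hypothetical review notes: (likeness %, judged to be a true duplicate?)
test_run = [(45, False), (62, False), (78, True), (85, True), (91, False), (99, True)]
print(false_duplicates_by_range(test_run))
```

If the count for the upper range is high, reconsider the selected fields using the guidance above and run the test again.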

 

Next Step: STEP 6: Import and Validate Data

 

 

Related Topics

Setup Wizard

 

Duplicate Record Remover
Copyright (c) 2009 Precision Data, All Rights Reserved.