STEP 5: Select fields to be used for examination process
In this step you select those fields that you want the
fuzzy-matching engine to use during its examination process to find records
that are similar to each other.
There are some important factors to consider when
selecting which fields to use for comparison:

1) The time the examination process will take: The more fields you select,
the longer the fuzzy-matching engine will take to identify duplicates, as
there is more data to examine. This will only be a problem if you have a
large number of records (100,000 or more) or you are running the tool on
slow hardware. Other factors that determine how long the process will take
include how much data is in the fields and what else the machine is expected
to do while the examination process is underway. As a general rule, if you
have a large dataset, you should select as few fields as you can without
compromising the accuracy of the end result. Some experimental testing may
be necessary to establish the right fields to examine. Click here for more
information on ‘Why the examination process takes so long’.

2)
Reducing False Duplicates (or mistaken matches): False duplicates occur when
the tool finds a high degree of similarity between two records that are not
actual duplicates. These add time to manual processing, as you have to click
Ignore Match for more records than you should. Ways to keep false duplicates
to a minimum include:

a. Only select fields that have a high degree of uniqueness in their data,
such as Names, Addresses, Phone Numbers and Email Addresses. These fields
should be unique to each record, as they are typically not shared between
multiple people (although this depends on the nature of your data). If you
have many records of people who share a company name and address, you may
want to exclude these fields, as they will increase the similarity score
between records and return more false duplicates.

b. Avoid fields that share the same value across large numbers of records
(e.g. ‘Status’ or ‘Customer Type’, where the entire database has only a
limited list of valid values, so many records share the same value). When
such fields are used for comparison they again give the fuzzy-matching
engine a false sense of similarity between records, increasing the
likelihood of false duplicates.

3) Reducing Missed Duplicates: Missed duplicates occur when the tool fails
to identify real duplicate records because some other data in the records
makes them look more different than they really are. Ways to avoid this
include:

a. Avoiding fields that will have different values between true duplicates.
For example, a date or numeric field such as ‘Sales Value’ or ‘Created Date’,
where two true duplicate records are unlikely to have the same value. Such
fields decrease the similarity between true duplicates, increasing the
likelihood of missing them.

b. Avoiding text fields that contain large amounts of text. For example, a
free-text field like ‘Customer Notes’ might contain a large amount of text
in one record and not the other. This would greatly decrease the similarity
between duplicates, also increasing the likelihood of a true duplicate being
missed.

As a general rule, we recommend you select all name fields (First Name,
Last Name, Middle Name, Title, Initials, Salutation, Nickname, etc.), all
Address fields (although you could consider avoiding City and State, as
these have a low degree of uniqueness), and all Phone Number and Email
fields, as these would typically also be unique to each record (although
email addresses containing the same domain name might artificially increase
the similarity).

We strongly recommend you do some test runs to see how many false duplicates
you get. Remember, however, that it's normal to see many false duplicates at
the lower likeness levels, in the 40%-80% range, but you should have very
few in the 80%-100% range.
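
The effect of field selection described above can be tried out on a small
sample before a full run. The following is only a rough sketch in Python: it
does not reproduce the tool's own fuzzy-matching engine, whose scoring method
is not described here. It assumes a simple average of per-field string
similarity (using Python's standard difflib module). The sample records and
the helper names field_uniqueness and likeness are made up for illustration.

```python
from difflib import SequenceMatcher


def field_uniqueness(records, field):
    """Fraction of non-empty values in `field` that are distinct (1.0 = all unique)."""
    values = [r[field].strip().lower() for r in records if r.get(field, "").strip()]
    return len(set(values)) / len(values) if values else 0.0


def likeness(a, b, fields):
    """Average per-field string similarity of two records, as a percentage.

    Assumption: a plain average over fields; the real tool may score differently.
    """
    ratios = [
        SequenceMatcher(None, a.get(f, "").lower(), b.get(f, "").lower()).ratio()
        for f in fields
    ]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical sample data: records 0 and 1 are a true duplicate, record 2 is not.
records = [
    {"First Name": "Jon", "Last Name": "Smith", "Email": "jon.smith@example.com",
     "Status": "Active", "Customer Notes": ""},
    {"First Name": "John", "Last Name": "Smith", "Email": "jon.smith@example.com",
     "Status": "Active",
     "Customer Notes": "Called twice about an overdue invoice; prefers email."},
    {"First Name": "Mary", "Last Name": "Jones", "Email": "mjones@example.org",
     "Status": "Active", "Customer Notes": ""},
]

identifying = ["First Name", "Last Name", "Email"]

# A low-uniqueness field such as Status scores lower than Email here.
print("Email uniqueness: ", field_uniqueness(records, "Email"))
print("Status uniqueness:", field_uniqueness(records, "Status"))

# Adding a free-text field lowers the score of a true duplicate pair.
print("True pair:            ", likeness(records[0], records[1], identifying))
print("True pair + Notes:    ", likeness(records[0], records[1],
                                         identifying + ["Customer Notes"]))

# Adding a shared-value field raises the score of two unrelated records.
print("Unrelated pair:         ", likeness(records[0], records[2], identifying))
print("Unrelated pair + Status:", likeness(records[0], records[2],
                                           identifying + ["Status"]))
```

In this sketch, including ‘Customer Notes’ pulls the true duplicate pair
toward a lower likeness level, and including ‘Status’ pushes the unrelated
pair toward a higher one. These are the missed-duplicate and false-duplicate
effects described in points 2) and 3).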
Next Step: STEP 6: Import and Validate Data
Duplicate Record Remover
Copyright (c) 2009 Precision Data, All Rights Reserved.