PART 2: Merging your duplicate customer records
What issues do you need to be aware of when removing duplicate customer database
records by deleting or merging them?
Note for Part 1 - see:
Finding your duplicate customer records.
Duplicate Data Cleaning – Step 2: Removing Duplicates
Once you have found your duplicate records you then face the problem of
merging two records into one without any data-loss. Some of the problems you
might face when merging records include:
Incomplete Data across two records:
Perhaps one record contains the customers fax number, and the other contains
their email address – you cannot simply delete one record or you risk losing one
of these important pieces of information.
The Duplicate Record Remover
automatically brings across any field-level data
from the slave record (the duplicate to be deleted) into the master record
(the record to keep) so such data isn’t lost when a merge takes place.
Conflicting Data:
Perhaps the two records represent the same customer, yet
they both have different phone numbers. Which phone number should be kept?
The Duplicate Record Remover
validation tool presents all the data for each
record side-by-side with full color coding so you can review each match and
see at a glance which fields have conflicting data. You are then able to
directly edit each field as needed to ensure the records are merged
correctly.
Automatic merging rules based on field type:
Some records are clearly duplicates and require no human intervention or processing
(you can set a threshold above which records are automatically merged). However while the
name and address might be 100% the same, some other fields might be
different and require special merging rules to ensure data is not lost.
For example if one record has a ‘Do Not Contact’ flag set to NO, while the
duplicate as it set to YES – you need to determine which setting is given
preference during the merge.
The Duplicate Record Remover allows you to configure how a field will be
merged, so conflicting date field data can select the maximum or minimum
date to keep. Numeric field data can select the sum, min, max or average
between the two records. And text field data can be concatenated together
using various delimiters.
Foreign-Key Table Relationships
Perhaps you have orders attached to each of
the two duplicate records (or some other relational table data that relates
to each customer – as is common in any RDBMS database system). You cannot
simply delete one record as this would break the referential integrity rules
of the database and leave orphaned foreign table records (in this case you
would have orders that didn’t belong to a corresponding customer).
Before deleting the slave duplicate record, you must first update all
foreign-key tables that reference that record – pointing them instead to the
master record that you are keeping. You can then delete the unwanted slave
record without breaking the databases referential integrity and ending up
with invalid data.
The Duplicate Record Remover
provides the maximum amount of flexibility for
updating your master/live database once the duplicates have been identified
and validated. You are able to export a list of edits, updates and deletes
as an XML Change Log or a list of T-SQL statements which can be
programmatically executed against your database – ensuring that each
foreign-key table is updated first.
False Positives
When two records come up with a high degree of likeness,
but are not actual duplicates (for example two people with the same name
living on the same street – they might differ only by a street number so end
up with a high likeness percentage). You don’t want your false positives to
be falsely merged together as one record – effectively causing the deletion
of a valid record.
The Duplicate Record Remover
overcomes the problems of false positives with
an easy to use validation tool that allows a human to quickly and easily
scan and identify any false positives and mark them to be ignored as part of
the validation process.