PART 2: Merging your duplicate customer records

What issues do you need to be aware of when removing duplicate customer database records by deleting or merging them?

Note for Part 1 - see: Finding your duplicate customer records.

Duplicate Data Cleaning – Step 2: Removing Duplicates

Once you have found your duplicate records you then face the problem of merging two records into one without any data-loss. Some of the problems you might face when merging records include:

Incomplete Data across two records:

Perhaps one record contains the customers fax number, and the other contains their email address – you cannot simply delete one record or you risk losing one of these important pieces of information.

The Duplicate Record Remover automatically brings across any field-level data from the slave record (the duplicate to be deleted) into the master record (the record to keep) so such data isn’t lost when a merge takes place.

Conflicting Data:

Perhaps the two records represent the same customer, yet they both have different phone numbers. Which phone number should be kept?

The Duplicate Record Remover validation tool presents all the data for each record side-by-side with full color coding so you can review each match and see at a glance which fields have conflicting data. You are then able to directly edit each field as needed to ensure the records are merged correctly.

Automatic merging rules based on field type:

Some records are clearly duplicates and require no human intervention or processing (you can set a threshold above which records are automatically merged). However while the name and address might be 100% the same, some other fields might be different and require special merging rules to ensure data is not lost.

For example if one record has a ‘Do Not Contact’ flag set to NO, while the duplicate as it set to YES – you need to determine which setting is given preference during the merge.

The Duplicate Record Remover allows you to configure how a field will be merged, so conflicting date field data can select the maximum or minimum date to keep. Numeric field data can select the sum, min, max or average between the two records. And text field data can be concatenated together using various delimiters.

Foreign-Key Table Relationships

Perhaps you have orders attached to each of the two duplicate records (or some other relational table data that relates to each customer – as is common in any RDBMS database system). You cannot simply delete one record as this would break the referential integrity rules of the database and leave orphaned foreign table records (in this case you would have orders that didn’t belong to a corresponding customer).

Before deleting the slave duplicate record, you must first update all foreign-key tables that reference that record – pointing them instead to the master record that you are keeping. You can then delete the unwanted slave record without breaking the databases referential integrity and ending up with invalid data.

The Duplicate Record Remover provides the maximum amount of flexibility for updating your master/live database once the duplicates have been identified and validated. You are able to export a list of edits, updates and deletes as an XML Change Log or a list of T-SQL statements which can be programmatically executed against your database – ensuring that each foreign-key table is updated first.

False Positives

When two records come up with a high degree of likeness, but are not actual duplicates (for example two people with the same name living on the same street – they might differ only by a street number so end up with a high likeness percentage). You don’t want your false positives to be falsely merged together as one record – effectively causing the deletion of a valid record.

The Duplicate Record Remover overcomes the problems of false positives with an easy to use validation tool that allows a human to quickly and easily scan and identify any false positives and mark them to be ignored as part of the validation process.