Duplicate Record Remover Help

 

Why does the Examination Process take so long?

 

An important consideration to remember is that every record in the database is compared with every other record in the database.  This means the processing grunt work required by the fuzzy-matching engine grows quadratically (roughly with the square of the number of records) as the dataset increases in size: doubling the number of records roughly quadruples the number of comparisons.

For example: If you have 100,000 records and add 1 more record to your dataset, you need to do 100,000 more comparisons (that record is compared with every other record in the database).  And if you add 5 more records, you need to do roughly 500,000 more comparisons!  So you can see how strongly the number of records affects how long the examination process takes.
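
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the counting, not the tool's actual matching code: the function names are made up for this example, and it assumes each pair of records is compared exactly once. Under that assumption, adding 5 records to 100,000 costs 500,010 extra comparisons, because the 5 new records are also compared with each other.

```python
def total_comparisons(n: int) -> int:
    """Number of record pairs among n records, assuming each pair is compared once."""
    return n * (n - 1) // 2


def extra_comparisons(existing: int, added: int) -> int:
    """Additional comparisons needed when `added` records join `existing` records."""
    return total_comparisons(existing + added) - total_comparisons(existing)


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen to match the example in the text.
    print(total_comparisons(100_000))      # all pairs among 100,000 records
    print(extra_comparisons(100_000, 1))   # one new record
    print(extra_comparisons(100_000, 5))   # five new records
```

Doubling the dataset from 100,000 to 200,000 records roughly quadruples the total, which is why large datasets take so much longer than their size alone might suggest.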

 

The tool uses SQL Server 2005 Express Edition as the database engine that’s doing most of this grunt work.  This technology was selected for this job because of the highly efficient query optimization technologies already built into it, and because the SQL Server query engine makes full use of parallelism across multiple processors and processor cores and manages memory as efficiently as possible. 

So even though the number of calculations being executed is sometimes extremely high, the selected technology is ideal for making these calculations as quickly and efficiently as possible.

 

Does it matter what machine it's run on?

Most certainly yes!  When you have a high number of records, the faster the machine and the more memory it has, the quicker the examination process will run.  We recommend that large datasets (50,000+ records) be left to run overnight or even over the weekend.

 

Does it run faster on a multi-processor machine?

The basic install of the tool installs SQL Server 2005 Express Edition.  The limitations of the Express Edition of SQL Server mean the examination process will only take advantage of the first 1 GB of memory and the first physical processor (although it will use multiple cores within that single processor).

This can prove to be a major limitation when examining large datasets (100,000+ records) as more processors and more memory can significantly reduce the time the examination process takes to run.

Therefore, for those who are on a Gold-Level support contract, we provide the ability to connect to an external SQL Server, which would typically be a full version running on a dedicated database server with multiple processors and several gigabytes of memory.  The examination process will run significantly faster on such a machine.

Please contact us if you want to discuss this option.

 

Related Topics

Introduction

 

Duplicate Record Remover
Copyright (c) 2009 Precision Data, All Rights Reserved.