WindowsDevCenter.com
oreilly.comSafari Books Online.Conferences.

advertisement


AddThis Social Bookmark Button

Ferreting Out Near-Identical Records in Access

by Ken Bluttman, author of Access Hacks
05/24/2005

Working with lists of contacts is a common database activity. As a list grows, the probability of duplicate records increases. They could be exact duplicates, which are easy to find by running a duplicates query. In Access, the Find Duplicates Query Wizard is great for finding exact duplicates based on selected fields. This wizard is available in the New Query dialog box. On the Queries tab, click on the New button and the dialog appears; see Figure 1.

But what about records that are not absolutely identical but are duplicates nonetheless? For example, one record has John Smith as the contact and another has Johnny Smith. These could be two records for the same person, but Access can't resolve them as duplicates since they are not identical.

In this article I show you how to ferret out these near-identical records. The approach works by identifying whether the text value of a field in a record is found in a larger text string of the same field in another record. This approach will find that Johnny and John are indeed duplicates.

Figure 1
Figure 1. The Find Duplicates Query Wizard will lead you through the process of identifying exact matches in your Access database

Proximate Matching with the InStr Function

When working with lists of contacts, most often there are separate fields for the first name and last name. Usually it is the first name that is entered with variation. Last names don't change often. My first name could be entered into a database table as Ken, Kenneth, K., and so on.

The method is to uncover these variations in one field while keeping another field fixed to a value. For example, take John Smith and look for variations of the first name while keeping the last name fixed to Smith. Your data may have other fields that are relevant to your needs, and you can apply the same approach to work with your fields.

Figure 2 shows a simple form with check boxes that indicate which field(s) must match. Although the form allows both the first name and last name to be either checked or unchecked, the best approach is for one to be checked and not the other. This will become clear as the example is explained.

Figure 2
Figure 2. Specify whether the first name or the last name field must match in your search

Figure 3 shows my example data. I have just two fields: FirstName and LastName. The records are sorted by last name. As is evident, there are complete duplicates and near duplicates.

Figure 3
Figure 3. An example database with a scattering of duplicate and near-duplicate entries

The method I am about to explain uses the InStr function to find when the value in one field is inside the value of the same field, in another record. On the form, I checked that the last name must match and the first name must not match (see Figure 2). Therefore, the process will see whether the first name, per record, is found in inside the first name field of any other record that has the same last name.

Didi will not be found in Deidre, even though the last name is the same. On the other hand, Marl will be found in Marleen, and Sondie will be found in Sondie H.

The method does not work perfectly, as in the case of Didi not found in Deidre because of the spelling. It is even questionable whether those are two records for the same person. Having worked on numerous projects of this type, I can state there is no foolproof method for uncovering data that is close but not exact. Consider this one more utility for the toolbox. Let's look at what happens when the Process button is clicked on. A code routine runs that uses ADO and recordsets to work with the data. Here is the code in the button's click event, followed by a detailed discussion:


Private Sub cmdProcess_Click()
 On Error GoTo err_end
 Dim conn As ADODB.Connection
 Set conn = CurrentProject.Connection
 Dim ssq1 As String
 Dim ssq2 As String
 Dim ssq3 As String
 Dim rs As New ADODB.Recordset
 Dim rs2 As New ADODB.Recordset
 Dim F As Boolean ' First Name
 Dim L As Boolean 'Second Name
 
 'save checkbox values
 F = False
 L = False
 If Me.chkFirstName = True Then F = True
 If Me.chkLastName = True Then L = True
 
 'clear out the results table
 conn.Execute "Delete * From Results"

 'open recordset on data table
 ssql = "Select * From Data1 Order By LastName"
 rs.Open ssql, conn, adOpenKeyset, _
    adLockOptimistic
 Do Until rs.EOF
   'build SQL to open inner recordset
   ssql2 = "Select * From Data1 Where "
  
   If F = False And L = False Then
     ssql2 = ssql2 & "FirstName<>'" & _
       rs.Fields("FirstName") & _
       "' And LastName<>'" & _
       rs.Fields("LastName") & "'"
   End If
  
   If F = False And L = True Then
     ssql2 = ssql2 & "FirstName<>'" & _
     rs.Fields("FirstName") & _
     "' And LastName='" & _
     rs.Fields("LastName") & "'"
   End If
   
   If F = True And L = False Then
     ssql2 = ssql2 & "FirstName='" & _
       rs.Fields("FirstName") & _
       "' And LastName<>'" & _
       rs.Fields("LastName") & "'"
   End If
   
   If F = True And L = True Then
     ssql2 = ssql2 & "FirstName='" & _
       rs.Fields("FirstName") & _
       "' And LastName='" & _
       rs.Fields("LastName") & "'"
   End If
   
   'open inner recordset
   rs2.Open ssql2, conn, adOpenKeyset, _
      adLockOptimistic
   Do Until rs2.EOF
     'test if value from one recordset field
     'is in the same field of the other recordset
        If (InStr(1, rs.Fields("FirstName"), _
          rs2.Fields("FirstName") > 0) Or _
          (InStr(1, rs2.Fields("FirstName"), _
          rs.Fields("FirstName"))) > 0) And _
          ((InStr(1, rs.Fields("LastName"), _
          rs2.Fields("LastName")) > 0) Or _
         (InStr(1, rs2.Fields("LastName"), _
         rs.Fields("LastName"))) > 0) Then
          
         'if values match on at least one field,
         'write a result record in Results
         ssql3 = "Insert Into Results Values ("
         ssql3 = ssql3 & "'" & _
              rs.Fields("FirstName") & "', "
         ssql3 = ssql3 & "'" & _
              rs.Fields("LastName") & "', "
         ssql3 = ssql3 & "'" & _
              rs2.Fields("FirstName") & "', "
         ssql3 = ssql3 & "'" & _
              rs2.Fields("LastName") & "')"
         conn.Execute ssql3
      End If
        
   rs2.MoveNext
   Loop
   rs2.Close

 rs.MoveNext
 Loop

 rs.Close
 Set rs = Nothing
 Set rs2 = Nothing
 conn.Close
 MsgBox "done"
 Exit Sub
err_end:
MsgBox Err.Description
End Sub

When the Process button on the form is clicked on, some initialization takes place, most notably the Boolean variables F and L are set to True or False, depending on the check boxes on the form. For instance, if the "Last Name must match" check box on the form is checked, then the variable L is set to True. Here's how:


 F = False
 L = False
 If Me.chkFirstName = True Then F = True
 If Me.chkLastName = True Then L = True

The Data1 table is opened as a recordset. The entire table is looped through. As each individual record is addressed, a second recordset is opened based on the same Data1 table, but this time criteria is applied based on the values in the current record of the first (outer) recordset. Equality or inequality operators are applied based on the check boxes on the form. For example, to match on last name and not on first name, the SQL statement used to open the second (inner) recordset looks like this:


Select * From Data1 Where 
FirstName <> rs.Fields("FirstName") And
LastName = rs.Fields("LastName")

Used with real values, the SQL statement comes out like this:


Select * From Data1 Where FirstName<>'Sondie' 
And LastName='Armstrong' 

The first name Sondie and the last name Armstrong are the values in the current record of the outer recordset. The SQL statement that incorporates these values is used to open the inner recordset. The inner recordset is looped through, and each of those records is tested to see whether its field values are found within the associated fields in the current record of the outer recordset. In fact, the test is done both ways. For example, the first-name value of the outer recordset is tested to see whether it exists within the first-name value of the inner recordset, and the reverse is performed—the first-name value of the inner recordset is tested to see whether it is a substring within the first-name value of the outer recordset:


   If (InStr(1, rs.Fields("FirstName"), _
          rs2.Fields("FirstName") > 0) Or _
          (InStr(1, rs2.Fields("FirstName"), _
          rs.Fields("FirstName"))) > 0) And _
          ((InStr(1, rs.Fields("LastName"), _
          rs2.Fields("LastName")) > 0) Or _
         (InStr(1, rs2.Fields("LastName"), _
         rs.Fields("LastName"))) > 0) Then

Taking real values from the data Sondie H. will not be found to exist in Sondie, but Sondie will be found to exist in Sondie H. The lengthy If statement that uses the InStr function applies this cross-check to both the first name and the last name, as a coding convenience. In the example, the last name must match (as checked on the form), and the SQL statement that opened the inner recordset already ensures that the last names match. But it was easier to create a single test that accommodates all variations of the check boxes on the form. The fact that the InStr function is testing whether one string exists inside another, when both are the same, still returns a true result.

When the InStr functions return a true result to the If statement, then a SQL Insert statement is assembled to put the record from the outer recordset and the record from the inner recordset together into a single record in a table named Results:


ssql3 = "Insert Into Results Values ("
         ssql3 = ssql3 & "'" & _
              rs.Fields("FirstName") & "', "
         ssql3 = ssql3 & "'" & _
              rs.Fields("LastName") & "', "
         ssql3 = ssql3 & "'" & _
              rs2.Fields("FirstName") & "', "
         ssql3 = ssql3 & "'" & _
              rs2.Fields("LastName") & "')"
         conn.Execute ssql3

Figure 4 shows how the Results table has filled up with matched records.

Figure 4
Figure 4. The Results table shows the matching records

Comparing the records that did get written into the Results table with the records from the Data1 table itself highlights the usefulness of this technique. Didi Aboud and Deidre Aboud were not found to be essential duplicates. They may be so, but Didi was not found to a substring in Deidre, and Deidre was not found to be a substring in Didi.

Identical duplicates such as Wendy Abrams and Thomas Babcock are not identified either. The way the check boxes on the form were set in this example required that the first names do not match. Therefore, exact duplicates like these are not returned as records in the inner recordset.

The technique has correctly identified Marleen Adams and Marl Adams as essential duplicates. This holds true as well for the Sondie Armstrong and Sondie H. Armstrong records, as well as the Russ Askenburg and Russel Askenburg records.

At the bottom of Figure 3 are records for William Beauchamp and Dawn Beauchamp. These were not flagged as near duplicates, and I think we can agree that these are records for two distinct people who happen to have the same last name.

You can apply the technique shown in this article to your data. I built the form and the code around the testing of first and last names because this a common issue when identifying duplicates. The technique can be applied to other fields; you know your data best and on which fields to apply the testing.

This technique is just one way to work with identifying near-identical data. It is not foolproof, but for the most part it will help to identify a majority of near duplicates.

Ken Bluttman has written many articles and computer books, including Excel Charts for Dummies and O'Reilly's Access Hacks. His technical chops include Microsoft Office, XML, VBA, VB.NET, SQL Server, and assorted web technologies.

Access Hacks

Related Reading

Access Hacks
Tips & Tools for Wrangling Your Data
By Ken Bluttman

Return to the Windows DevCenter.