Duplicate Report System

Overview

The Duplicate Report system identifies potential duplicate subscriptions by analyzing shipping addresses and customer data. It processes subscriptions asynchronously and logs potential duplicates for review.

Key Components

Detection Methods

  1. Address-Based Detection
    • Uses Soundex algorithm for phonetic matching
    • Compares shipping address keys
    • Identifies similar addresses that might represent the same location
  2. Multiple Self Subscriptions
    • Identifies when a single user has multiple active subscriptions
    • Excludes gifted and comped subscriptions from consideration

Processing Logic

  1. Subscription Status Filtering
    • Only processes subscriptions with statuses:
      • Pending
      • Active
      • On-hold
      • Pending-cancel
  2. Batch Processing
    • Processes subscriptions in batches of 50
    • Marks subscriptions as checked after processing
    • Handles asynchronous processing to prevent timeout

Duplicate Logging

  1. Log Entry Types
    • Standard duplicates (address matches)
    • Multiple-self subscriptions
    • Each entry includes:
      • Dupe key (soundex:address_key)
      • Original subscription ID
      • Duplicate subscription ID
      • Status
  2. Status Tracking
    • candidate: Potential duplicate awaiting review
    • multiple-self: Same customer with multiple subscriptions
    • ignored: Manually marked as not a duplicate
    • merged: Subscriptions have been combined

Database Structure

Duplicate Log Table

CREATE TABLE duplicate_log (
    candidate_id bigint(20) NOT NULL auto_increment,
    dupe_key varchar(128),
    status varchar(40),
    subscription_id bigint(20),
    duplicate_id bigint(20),
    updated datetime,
    PRIMARY KEY (candidate_id)
)

Known Limitations

  1. Address matching may produce false positives for:
    • Multiple units in same building
    • Business addresses
    • Similar street names
  2. Cannot detect duplicates with significantly different address formats
  3. Multiple-self detection doesn’t consider historical subscriptions

Usage

The system:

  1. Runs as an asynchronous report
  2. Processes unverified subscriptions
  3. Logs potential duplicates
  4. Allows manual review and resolution
  5. Updates subscription status after processing

Data Flow

  1. Fetch unprocessed subscriptions
  2. Generate address keys and soundex codes
  3. Compare against existing subscriptions
  4. Check for multiple subscriptions per user
  5. Log potential duplicates
  6. Mark subscriptions as processed