Duplicate Report System
Overview
The Duplicate Report system identifies potential duplicate subscriptions by analyzing shipping addresses and customer data. It processes subscriptions asynchronously and logs potential duplicates for review.
Key Components
Detection Methods
- Address-Based Detection
  - Uses Soundex algorithm for phonetic matching
  - Compares shipping address keys
  - Identifies similar addresses that might represent the same location
- Multiple Self Subscriptions
  - Identifies when a single user has multiple active subscriptions
  - Excludes gifted and comped subscriptions from consideration
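As an illustration of the address-based method, here is a minimal sketch of a dupe key in the soundex:address_key form. The soundex() function is standard American Soundex. Everything else is an assumption made for the example, not the system's actual code: the helper names address_key() and dupe_key(), the choice to apply Soundex to the letters of the first address line, and the choice to build the address key from the house number and postcode.

```python
import re

# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text: str) -> str:
    """Return the 4-character American Soundex code for the letters in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def address_key(line1: str, postcode: str) -> str:
    """Hypothetical address key: house number digits plus normalized postcode."""
    number = "".join(re.findall(r"\d+", line1))
    return f"{number}-{re.sub(r'[^0-9a-z]', '', postcode.lower())}"


def dupe_key(line1: str, postcode: str) -> str:
    """Hypothetical dupe key in the soundex:address_key form."""
    return f"{soundex(line1)}:{address_key(line1, postcode)}"
```

Under these assumptions, "12 Main Street" and "12 Mane St." in the same postcode produce the same key, so the two subscriptions would be flagged as candidates. The same property explains why nearby addresses with similar-sounding street names can collide.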
Processing Logic
- Subscription Status Filtering
  - Only processes subscriptions with statuses:
    - Pending
    - Active
    - On-hold
    - Pending-cancel
- Batch Processing
  - Processes subscriptions in batches of 50
  - Marks subscriptions as checked after processing
  - Runs asynchronously so that large runs do not hit a timeout
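The filtering and batching rules above can be sketched as follows. This is an illustrative outline, not the system's code: the lowercase status slugs, the "checked" flag on each subscription record, and the function names eligible(), batches(), and process_all() are all assumptions.

```python
from itertools import islice

# Assumed slugs for the four statuses listed above.
ELIGIBLE_STATUSES = {"pending", "active", "on-hold", "pending-cancel"}
BATCH_SIZE = 50


def eligible(subscriptions):
    """Yield subscriptions that have an eligible status and are not yet checked."""
    for sub in subscriptions:
        if sub["status"] in ELIGIBLE_STATUSES and not sub.get("checked", False):
            yield sub


def batches(items, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_all(subscriptions, handle):
    """Run `handle` on every eligible subscription, one batch at a time."""
    processed = 0
    for batch in batches(eligible(subscriptions)):
        for sub in batch:
            handle(sub)
            sub["checked"] = True  # hypothetical flag: never picked up again
            processed += 1
    return processed
```

In a real deployment each batch would more likely be a separate asynchronous job, so that no single request has to process the full set. The loop here only shows the batch boundaries and the checked flag.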
Duplicate Logging
- Log Entry Types
  - Standard duplicates (address matches)
  - Multiple-self subscriptions
  - Each entry includes:
    - Dupe key (soundex:address_key)
    - Original subscription ID
    - Duplicate subscription ID
    - Status
- Status Tracking
  - candidate: Potential duplicate awaiting review
  - multiple-self: Same customer with multiple subscriptions
  - ignored: Manually marked as not a duplicate
  - merged: Subscriptions have been combined
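One way to model a log entry and its status values in code is sketched below. The class names DupeStatus and DupeLogEntry and the resolve() helper are hypothetical. The transition rule (only candidate or multiple-self entries can be resolved, and only to ignored or merged) is an assumption inferred from the status descriptions above, not documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class DupeStatus(str, Enum):
    CANDIDATE = "candidate"
    MULTIPLE_SELF = "multiple-self"
    IGNORED = "ignored"
    MERGED = "merged"


# Assumption: these two statuses are the ones still awaiting a decision.
OPEN_STATUSES = {DupeStatus.CANDIDATE, DupeStatus.MULTIPLE_SELF}


@dataclass
class DupeLogEntry:
    dupe_key: str          # soundex:address_key
    subscription_id: int   # original subscription
    duplicate_id: int      # suspected duplicate
    status: DupeStatus = DupeStatus.CANDIDATE


def resolve(entry: DupeLogEntry, outcome: DupeStatus) -> DupeLogEntry:
    """Move an open entry to a final status after manual review."""
    if entry.status not in OPEN_STATUSES:
        raise ValueError(f"entry already resolved as {entry.status.value}")
    if outcome in OPEN_STATUSES:
        raise ValueError(f"{outcome.value} is not a resolution")
    entry.status = outcome
    return entry
```

For example, resolve(entry, DupeStatus.IGNORED) records that a reviewer decided the pair is not a duplicate, and a second call on the same entry raises ValueError.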
Database Structure
Duplicate Log Table
CREATE TABLE duplicate_log (
    candidate_id    bigint(20) NOT NULL auto_increment,
    dupe_key        varchar(128),
    status          varchar(40),
    subscription_id bigint(20),
    duplicate_id    bigint(20),
    updated         datetime,
    PRIMARY KEY (candidate_id)
);
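To show how the table might be written to and read from, the sketch below uses an in-memory SQLite database so it can run anywhere. The column types are adapted to SQLite, so this DDL is not the schema above verbatim, and the helper names log_duplicate() and pending_review() are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# SQLite adaptation of the duplicate_log table (types differ from the original DDL).
conn.execute("""
    CREATE TABLE duplicate_log (
        candidate_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        dupe_key        TEXT,
        status          TEXT,
        subscription_id INTEGER,
        duplicate_id    INTEGER,
        updated         TEXT
    )
""")


def log_duplicate(conn, dupe_key, subscription_id, duplicate_id, status="candidate"):
    """Insert one log row and return its candidate_id."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute(
        "INSERT INTO duplicate_log "
        "(dupe_key, status, subscription_id, duplicate_id, updated) "
        "VALUES (?, ?, ?, ?, ?)",
        (dupe_key, status, subscription_id, duplicate_id, now),
    )
    conn.commit()
    return cur.lastrowid


def pending_review(conn):
    """Return rows still awaiting manual review, oldest first."""
    rows = conn.execute(
        "SELECT candidate_id, dupe_key, subscription_id, duplicate_id "
        "FROM duplicate_log WHERE status = ? ORDER BY candidate_id",
        ("candidate",),
    )
    return rows.fetchall()
```

Parameterized queries are used throughout so that dupe keys derived from customer-entered addresses are never interpolated into SQL text.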
Known Limitations
- Address matching may produce false positives for:
  - Multiple units in same building
  - Business addresses
  - Similar street names
- Cannot detect duplicates with significantly different address formats
- Multiple-self detection doesn’t consider historical subscriptions
Usage
The system:
- Runs as an asynchronous report
- Processes subscriptions that have not yet been checked
- Logs potential duplicates
- Allows manual review and resolution
- Marks each subscription as checked after processing
Data Flow
1. Fetch unprocessed subscriptions
2. Generate address keys and soundex codes
3. Compare against existing subscriptions
4. Check for multiple subscriptions per user
5. Log potential duplicates
6. Mark subscriptions as processed
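The flow above can be sketched end to end for a single batch. This is a simplified illustration under several assumptions: subscriptions are plain dicts with hypothetical id, user_id, gifted, and comped fields; key_fn stands in for whatever produces the soundex:address_key value; existing is assumed to be already filtered to eligible statuses; the earlier subscription is treated as the original; and multiple-self entries reuse the same dupe key. The function name run_report() is also invented for the example.

```python
from collections import defaultdict


def run_report(batch, existing, key_fn):
    """Compare a batch against existing subscriptions and return log entries."""
    key_owner = {}                 # dupe key -> first subscription seen with it
    self_subs = defaultdict(list)  # user id -> non-gifted, non-comped subscription ids

    def remember(sub):
        key_owner.setdefault(key_fn(sub), sub["id"])
        if not (sub.get("gifted") or sub.get("comped")):
            self_subs[sub["user_id"]].append(sub["id"])

    for sub in existing:
        remember(sub)

    log = []
    for sub in batch:
        key = key_fn(sub)

        # Address-based check: another subscription already owns this key.
        original = key_owner.get(key)
        if original is not None:
            log.append({"dupe_key": key, "subscription_id": original,
                        "duplicate_id": sub["id"], "status": "candidate"})

        # Multiple-self check: gifted and comped subscriptions are excluded.
        if not (sub.get("gifted") or sub.get("comped")):
            for other_id in self_subs[sub["user_id"]]:
                log.append({"dupe_key": key, "subscription_id": other_id,
                            "duplicate_id": sub["id"], "status": "multiple-self"})

        remember(sub)
        sub["checked"] = True  # hypothetical processed flag
    return log
```

Because each processed subscription is remembered before the next one is examined, two duplicates that arrive in the same batch are still compared against each other.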