Distinct Function

The Distinct function removes duplicate rows from your data based on specified fields. It ensures each unique combination of key values appears only once in the output.

How It Works

Distinct processes your data by:

  1. Identifying rows with matching values in key fields
  2. Keeping one row per unique combination
  3. Discarding duplicate rows
  4. Preserving all columns from the kept rows

Example:

Input: 100 rows with 30 duplicates
Output: 70 unique rows
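The four steps above can be sketched in plain Python (a hypothetical row-as-dict model; field names are illustrative, not part of the tool):

```python
def distinct(rows, key_fields=None, keep="first"):
    """Keep one row per unique combination of key_fields.

    rows: list of dicts. key_fields=None compares all fields.
    keep: "first" keeps the first occurrence, "last" the last.
    """
    seen = {}
    for row in rows:
        if key_fields is None:
            key = tuple(sorted(row.items()))      # compare every column
        else:
            key = tuple(row[f] for f in key_fields)
        if keep == "first":
            seen.setdefault(key, row)             # first occurrence wins
        else:
            seen[key] = row                       # later occurrences overwrite
    return list(seen.values())                    # all columns of kept rows survive

rows = [
    {"CustomerID": 1, "Name": "Ann"},
    {"CustomerID": 2, "Name": "Bob"},
    {"CustomerID": 1, "Name": "Anne"},
]
print(distinct(rows, ["CustomerID"]))  # 2 rows: Ann (first kept) and Bob
```

Note that the kept row is passed through whole; nothing is merged or aggregated across duplicates.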

Adding a Distinct

  1. Drag Distinct from the Functions section of the Element Panel
  2. Connect it to your data source
  3. Click the Distinct node to configure key fields

Configuration Panel

Key Fields

Specify which fields determine uniqueness:

All Fields

If no key fields are specified, every field is compared:

  • Rows must match in every column to be duplicates
  • Most restrictive deduplication

Selected Fields

Choose specific fields for uniqueness:

Single field:

  • Key: CustomerID
  • Result: One row per unique CustomerID

Multiple fields:

  • Keys: CustomerID, OrderDate
  • Result: One row per unique CustomerID-OrderDate combination
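The difference between a single key and a composite key can be sketched as follows (illustrative data; a Python dict keyed by a value vs. a tuple, where later duplicates overwrite earlier ones):

```python
orders = [
    {"CustomerID": 1, "OrderDate": "2024-01-01"},
    {"CustomerID": 1, "OrderDate": "2024-01-02"},
    {"CustomerID": 1, "OrderDate": "2024-01-02"},  # exact key duplicate
]

# Single key: one row per CustomerID.
by_customer = {o["CustomerID"]: o for o in orders}

# Multiple keys: one row per (CustomerID, OrderDate) combination.
by_combo = {(o["CustomerID"], o["OrderDate"]): o for o in orders}

print(len(by_customer), len(by_combo))  # 1 2
```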

Keep Option

When duplicates exist, the Keep option controls which row is retained:

First

Keeps the first occurrence (default):

  • First row encountered in order
  • Subsequent duplicates discarded

Use when:

  • Order matters and first is preferred
  • Data is sorted chronologically (keep oldest)
  • Original entry is most accurate

Last

Keeps the last occurrence:

  • Earlier duplicates discarded
  • Last row encountered is kept

Use when:

  • Data is sorted chronologically (keep newest)
  • Latest update is most accurate
  • Recent values override older ones
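The First/Last distinction can be sketched with plain dicts (illustrative rows; a dict comprehension naturally keeps the last value written per key):

```python
rows = [
    {"ID": 1, "Status": "draft"},
    {"ID": 1, "Status": "final"},
]

# Keep Last: a dict comprehension keeps the last row seen per key.
last = list({r["ID"]: r for r in rows}.values())

# Keep First: reverse the input so the first occurrence wins, then restore order.
first = list({r["ID"]: r for r in reversed(rows)}.values())[::-1]

print(first[0]["Status"], last[0]["Status"])  # draft final
```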

Output Fields

All fields from the kept row are included in output:

  • Key fields are included
  • Non-key fields from the kept row are preserved
  • No aggregation occurs (use Group By for that)

Common Use Cases

Remove Exact Duplicates

Scenario: Data has complete duplicate rows

Configuration:

  • Key Fields: (All fields - leave empty)
  • Keep: First

Result: One copy of each unique row

Unique Customers

Scenario: Customer list has repeat entries

Configuration:

  • Key Fields: CustomerID
  • Keep: First

Result: One row per customer

Latest Record Per Entity

Scenario: Keep most recent record for each item

Preparation: Sort by ModifiedDate DESC

Configuration:

  • Key Fields: ItemID
  • Keep: First (after sort, first = most recent)

Result: Most recently modified record per item
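The sort-then-keep-first pattern can be sketched as (hypothetical records; ISO date strings sort correctly as text):

```python
records = [
    {"ItemID": "A", "ModifiedDate": "2024-01-05", "Qty": 3},
    {"ItemID": "A", "ModifiedDate": "2024-03-01", "Qty": 7},
    {"ItemID": "B", "ModifiedDate": "2024-02-10", "Qty": 1},
]

# Sort newest-first, then keep the first row seen per ItemID.
records.sort(key=lambda r: r["ModifiedDate"], reverse=True)
latest = {}
for r in records:
    latest.setdefault(r["ItemID"], r)

print(latest["A"]["Qty"])  # 7 (the 2024-03-01 record)
```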

Unique Email Addresses

Scenario: Mailing list has duplicate emails

Configuration:

  • Key Fields: Email
  • Keep: First

Result: One row per unique email

Unique Combinations

Scenario: Find unique product-customer pairs

Configuration:

  • Key Fields: ProductID, CustomerID
  • Keep: First

Result: One row per product-customer combination

Distinct vs. Group By

Aspect         | Distinct               | Group By
---------------|------------------------|---------------------------
Purpose        | Remove duplicates      | Aggregate data
Output columns | All from kept row      | Grouping + aggregates
Calculations   | None                   | SUM, COUNT, AVG, etc.
One row per    | Unique key combination | Unique key combination
Non-key fields | From kept row          | Dropped (unless aggregated)

Use Distinct when:

  • You want to remove duplicates
  • You need to keep original field values
  • No calculations are needed

Use Group By when:

  • You need to aggregate values
  • Calculating counts, sums, averages
  • Creating summary statistics
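The contrast can be sketched side by side (illustrative orders; Distinct keeps an original row intact, Group By computes a new value):

```python
orders = [
    {"CustomerID": 1, "Amount": 10},
    {"CustomerID": 1, "Amount": 25},
    {"CustomerID": 2, "Amount": 5},
]

# Distinct: one original row per CustomerID; Amount is whatever the kept row had.
kept = {}
for o in orders:
    kept.setdefault(o["CustomerID"], o)

# Group By: aggregate Amount per CustomerID instead of keeping any one row.
totals = {}
for o in orders:
    totals[o["CustomerID"]] = totals.get(o["CustomerID"], 0) + o["Amount"]

print(kept[1]["Amount"], totals[1])  # 10 35
```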

Preprocessing for Distinct

Normalize Before Deduplicating

Transform to standardize values before Distinct:

Case normalization:

Email = LOWER(${Email})
Name = UPPER(${Name})

Whitespace removal:

CustomerName = TRIM(${CustomerName})

Format standardization:

Phone = REPLACE(REPLACE(${Phone}, "-", ""), " ", "")

Then apply Distinct on normalized fields.
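The normalize-then-deduplicate pattern can be sketched as (hypothetical field names; `strip`/`lower`/`replace` mirror TRIM, LOWER, and REPLACE above):

```python
def normalize(row):
    # Standardize case, whitespace, and phone format before comparing.
    return {
        **row,
        "Email": row["Email"].strip().lower(),
        "Phone": row["Phone"].replace("-", "").replace(" ", ""),
    }

contacts = [
    {"Email": " Ann@Example.com ", "Phone": "555-123 4567"},
    {"Email": "ann@example.com",   "Phone": "5551234567"},
]
unique = {}
for c in map(normalize, contacts):
    unique.setdefault(c["Email"], c)   # Distinct on the normalized Email

print(len(unique))  # 1
```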

Sort Before Distinct

Control which duplicate is kept:

Keep newest:

[Source] → [Sort: ModifiedDate DESC] → [Distinct: ID, Keep First]

Keep highest value:

[Source] → [Sort: Score DESC] → [Distinct: UserID, Keep First]

Keep by priority:

[Source] → [Transform: Priority = CASE(Source, "Primary", 1, "Secondary", 2, 3)]
→ [Sort: Priority ASC]
→ [Distinct: RecordID, Keep First]
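The keep-by-priority flow can be sketched as (illustrative records; the `rank` map plays the role of the CASE transform, with 3 as the fallback):

```python
records = [
    {"RecordID": 7, "Source": "Secondary"},
    {"RecordID": 7, "Source": "Primary"},
]

# Map Source to a priority, sort ascending, then keep the first row per RecordID.
rank = {"Primary": 1, "Secondary": 2}
records.sort(key=lambda r: rank.get(r["Source"], 3))
kept = {}
for r in records:
    kept.setdefault(r["RecordID"], r)

print(kept[7]["Source"])  # Primary
```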

Complex Deduplication

Multi-Stage Deduplication

When different duplicate types exist:

Stage 1: Exact duplicates

[Source] → [Distinct: All Fields] → [Step 1 Result]

Stage 2: Logical duplicates

[Step 1 Result] → [Transform: Normalize fields]
→ [Distinct: NormalizedKey]
→ [Final Result]

Fuzzy Matching

Distinct only handles exact matches. For near-duplicates:

  1. Transform to create matching keys
  2. Apply Distinct on transformed keys

Example: Similar company names

Transform: MatchKey = REPLACE(UPPER(TRIM(${CompanyName})), " ", "")
// "Acme Corp", "ACME CORP", " Acme Corp " all become "ACMECORP"

Distinct on: MatchKey
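The MatchKey transform translates directly (same TRIM/UPPER/REPLACE chain; the first spelling seen is the one kept):

```python
def match_key(name):
    # Mirrors REPLACE(UPPER(TRIM(name)), " ", "")
    return name.strip().upper().replace(" ", "")

names = ["Acme Corp", "ACME CORP", " Acme Corp "]
unique = {}
for n in names:
    unique.setdefault(match_key(n), n)  # keep the first spelling seen

print(unique)  # {'ACMECORP': 'Acme Corp'}
```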

Prioritized Deduplication

Keep the "best" record based on criteria:

[Source] → [Transform: Calculate priority score]
→ [Sort: ID, PriorityScore DESC]
→ [Distinct: ID, Keep First]

Priority Transform example:

Priority = (
IF(${HasEmail}, 10, 0) +
IF(${HasPhone}, 5, 0) +
IF(${IsVerified}, 20, 0) +
IF(${Source} == "Primary", 15, 0)
)
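The score-sort-keep pipeline can be sketched as (hypothetical fields matching the transform above; booleans multiply cleanly as 0/1 in Python):

```python
def priority(r):
    # Mirrors the Priority transform: weighted completeness score.
    return (10 * r["HasEmail"] + 5 * r["HasPhone"]
            + 20 * r["IsVerified"] + 15 * (r["Source"] == "Primary"))

rows = [
    {"ID": 1, "HasEmail": True, "HasPhone": False, "IsVerified": False, "Source": "Secondary"},
    {"ID": 1, "HasEmail": True, "HasPhone": True,  "IsVerified": True,  "Source": "Primary"},
]

# Sort best-first, then keep the first (highest-scoring) row per ID.
rows.sort(key=priority, reverse=True)
best = {}
for r in rows:
    best.setdefault(r["ID"], r)

print(priority(best[1]))  # 50
```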

NULL Handling

NULLs in Key Fields

When key fields contain null:

  • All nulls are treated as equal
  • Multiple null rows become one
  • NULL matches NULL

To handle differently:

Transform before Distinct:
KeyField = COALESCE(${KeyField}, "NO_VALUE")
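The null-matches-null behavior can be sketched with `None` keys (illustrative rows; Python's `None` compares equal to itself just as the tool treats NULLs):

```python
rows = [
    {"Key": None, "V": 1},
    {"Key": None, "V": 2},   # collapses into the row above
    {"Key": "x",  "V": 3},
]
unique = {}
for r in rows:
    unique.setdefault(r["Key"], r)   # None matches None

print(len(unique), unique[None]["V"])  # 2 1
```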

NULLs in Non-Key Fields

Non-key field nulls don't affect deduplication:

  • Kept row may have nulls
  • Consider which row to keep based on data completeness

Performance Considerations

Early Deduplication

Apply Distinct early to reduce data volume:

More efficient:

[Entity] → [Distinct: CustomerID] → [Transform] → [Merge] → [Output]

Less efficient:

[Entity] → [Transform] → [Merge] → [Distinct: CustomerID] → [Output]

Key Field Count

Fewer key fields = more duplicates removed:

  • Each additional key field reduces duplicate matches
  • Use minimum fields needed for your definition of "unique"

Large Datasets

For very large datasets:

  • Filter first to reduce volume
  • Consider if source-level deduplication is possible
  • Monitor memory usage

Troubleshooting

No Duplicates Removed

Possible causes:

  1. Data is already unique

    • Verify expected duplicates exist
    • Preview source data
  2. Wrong key fields

    • Check field selection
    • Consider whether comparing all fields is appropriate
  3. Values differ unexpectedly

    • Check for whitespace differences
    • Check case sensitivity
    • Check data types

Debugging:

[Source] → [GroupBy: Key Fields, COUNT] → [Filter: COUNT > 1]

This shows which combinations have duplicates.
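That debugging flow is equivalent to counting rows per key and filtering to counts above one, e.g. (illustrative data):

```python
from collections import Counter

rows = [
    {"Email": "a@x.com"}, {"Email": "b@x.com"}, {"Email": "a@x.com"},
]
counts = Counter(r["Email"] for r in rows)          # GroupBy: Email, COUNT
dupes = {k: n for k, n in counts.items() if n > 1}  # Filter: COUNT > 1

print(dupes)  # {'a@x.com': 2}
```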

Wrong Row Kept

Possible causes:

  1. Keep First vs Keep Last incorrect

    • Verify which option is selected
  2. Source order not as expected

    • Add explicit Sort before Distinct
  3. Sort order incorrect

    • Verify sort direction (ASC vs DESC)

Too Many Rows Removed

Possible causes:

  1. Too few key fields

    • Add additional fields to key
    • More fields = fewer matches
  2. Overly aggressive normalization

    • Transform is creating too many matches
    • Check normalization logic

Memory Issues

Symptoms: Slow or failing on large data

Solutions:

  • Filter source data first
  • Apply at database level if possible
  • Process in batches

Examples

Mailing List Cleanup

Goal: Unique contacts by email

Flow:

[All Contacts] → [Transform: Email = LOWER(TRIM(${Email}))]
→ [Distinct: Email, Keep First]
→ [Output: Cleaned Mailing List]

Customer Master

Goal: Latest customer record

Flow:

[Customer Records] → [Sort: CustomerID, LastModified DESC]
→ [Distinct: CustomerID, Keep First]
→ [Output: Current Customer Master]

Product Catalog

Goal: One row per SKU with all data

Flow:

[Product Data] → [Distinct: SKU]
→ [Output: Product Catalog]

Event Deduplication

Goal: Remove duplicate event submissions

Flow:

[Event Submissions] → [Distinct: EventID, UserID, Timestamp]
→ [Output: Unique Events]

Invoice Processing

Goal: Unique invoices, preferring matched status

Flow:

[All Invoices] → [Transform: Priority = IF(${Status}=="Matched", 1, 2)]
→ [Sort: InvoiceNo, Priority ASC]
→ [Distinct: InvoiceNo, Keep First]
→ [Output: Unique Invoices]

Next Steps