/

/

How to Clean Dirty Data in SQL: 5 Functions Every Analyst Needs

Content

How to Clean Dirty Data in SQL: 5 Functions Every Analyst Needs

How to Clean Dirty Data in SQL: 5 Functions Every Analyst Needs

How to Clean Dirty Data in SQL: 5 Functions Every Analyst Needs

Why Dirty Data Is Costing You More Than You Think

Bad data is not just annoying — it is expensive. Poor data quality costs the US economy trillions annually. For analysts, dirty data means wrong dashboards, failed reports, and business decisions built on a shaky foundation.

Dirty data comes in many forms: null values where there should be numbers, extra whitespace padding string fields, inconsistent category labels like "NY", "New York", and "new york" living in the same column, or numeric data accidentally stored as text. Before you can analyze anything, you need to clean dirty data in SQL — directly at the source.

This tutorial walks through five essential SQL functions every analyst should know for SQL data cleaning.

Function 1: TRY_CAST and TRY_CONVERT — Handle Type Mismatches Safely

One of the most common dirty data problems is a column that should hold numbers but contains text. Casting these values directly with CAST() will throw an error the moment it hits a non-numeric string.

TRY_CAST (SQL Server, Azure SQL) returns NULL instead of throwing an error when conversion fails.

Real-World Scenario

Your orders table has a revenue column stored as VARCHAR. Some rows contain '1250.00', others contain 'N/A' or 'pending'.

-- With TRY_CAST, bad values become NULL:
SELECT order_id,
  TRY_CAST(revenue AS DECIMAL(10,2)) AS revenue_clean
FROM orders;

-- In PostgreSQL, use a CASE WHEN workaround:
SELECT order_id,
  CASE WHEN revenue ~ '^[0-9]+(\.[0-9]+)?$'
    THEN revenue::DECIMAL(10,2) ELSE NULL
  END AS revenue_clean
FROM orders;

Function 2: COALESCE and ISNULL — Replace Null Values

Null values are silent data quality killers. Aggregations skip them, joins exclude them. COALESCE returns the first non-null value from a list of arguments.

Real-World Scenario

A customer table has phone and alt_phone columns with many nulls.

SELECT customer_id, name,
  COALESCE(phone, alt_phone, 'No phone on file') AS best_contact
FROM customers;

-- Fill nulls before aggregation:
SELECT AVG(COALESCE(session_duration_seconds, 0)) AS avg_session
FROM user_sessions;

Function 3: TRIM, LTRIM, RTRIM — Eliminate Invisible Whitespace

Extra spaces are invisible in most UIs but devastating for string comparisons and GROUP BY queries. A customer named ' Acme Corp' will never match 'Acme Corp' in a join.

SELECT TRIM(company_name) AS company_name_clean,
  COUNT(*) AS record_count
FROM crm_contacts
GROUP BY TRIM(company_name)
ORDER BY record_count DESC;

-- Combine with LOWER() for full normalization:
SELECT LOWER(TRIM(company_name)) AS company_name_normalized
FROM crm_contacts;

Function 4: REPLACE and TRANSLATE — Strip Unwanted Characters

REPLACE swaps one substring for another. TRANSLATE maps individual characters to replacements in a single pass.

Real-World Scenario

Phone numbers stored as '(555) 123-4567' need to be plain digits: '5551234567'.

-- REPLACE chained:
SELECT REPLACE(REPLACE(REPLACE(REPLACE(
  phone, '(', ''), ')', ''), ' ', ''), '-', '') AS phone_digits
FROM contacts;

-- Strip currency formatting:
SELECT CAST(REPLACE(REPLACE(price_text, '$', ''), ',', '')
  AS DECIMAL(10,2)) AS price_numeric
FROM product_listings;

Function 5: CASE WHEN for Data Standardization

CASE WHEN is the right tool whenever you need to normalize inconsistent categorical values.

Real-World Scenario

Your orders table has a status column with values like 'complete', 'Completed', 'DONE', 'shipped', 'CANCELLED', 'canceled'.

SELECT order_id,
  CASE
    WHEN LOWER(TRIM(status)) IN ('complete','completed','done','finished')
      THEN 'Completed'
    WHEN LOWER(TRIM(status)) IN ('shipped','in transit','dispatched')
      THEN 'Shipped'
    WHEN LOWER(TRIM(status)) IN ('cancelled','canceled','void','refunded')
      THEN 'Cancelled'
    ELSE 'Unknown'
  END AS status_clean
FROM orders;

Putting It All Together: A Combined Cleaning Query

SELECT order_id,
  INITCAP(LOWER(TRIM(customer_name))) AS customer_name,
  TRY_CAST(REPLACE(REPLACE(TRIM(revenue_raw), '$', ''), ',', '')
    AS DECIMAL(10,2)) AS revenue,
  REPLACE(REPLACE(REPLACE(REPLACE(
    TRIM(phone), '(',''), ')',''), '-',''), ' ','') AS phone,
  COALESCE(NULLIF(TRIM(region), ''), 'Unknown') AS region,
  CASE
    WHEN LOWER(TRIM(status)) IN ('complete','completed','done') THEN 'Completed'
    WHEN LOWER(TRIM(status)) IN ('shipped','dispatched') THEN 'Shipped'
    WHEN LOWER(TRIM(status)) IN ('cancelled','canceled','void') THEN 'Cancelled'
    ELSE 'Review Needed'
  END AS status
FROM raw_orders WHERE order_id IS NOT NULL;

How AI Speeds Up SQL Data Cleaning

Writing data cleaning queries from scratch is repetitive work. You know what needs to happen but translating it into the correct SQL syntax for your specific database takes time.

AI2SQL removes the friction. Instead of looking up whether your database uses TRY_CAST or a regex workaround, you describe the problem in plain English:

"Clean the phone column in raw_orders by removing parentheses, dashes, and spaces, then cast revenue_raw to a decimal after stripping dollar signs and commas."

AI2SQL generates the correct, dialect-specific SQL instantly — so you spend your time validating and running queries instead of writing boilerplate.

Conclusion

The five functions covered here — TRY_CAST for safe type conversion, COALESCE for null handling, TRIM for whitespace, REPLACE/TRANSLATE for character stripping, and CASE WHEN for standardization — cover the vast majority of real-world SQL data cleaning scenarios. Build them into your workflow, layer them into views or CTEs, and your downstream analysis will be far more reliable.

Ready to generate your own cleaning queries without memorizing every syntax variant? Try AI2SQL free and describe your dirty data problem in plain English.

Share this

More Articles

More Articles

More Articles