SayPro Data Cleaning Checklist

Written by Tsakani Stella Rikhotso in SayPro Human Capital Works

✅ SayPro Data Cleaning Checklist

📅 Before Cleaning: Preparation

Confirm all raw data files have been received (Excel, CSV, system exports).
Back up the original data files.
Ensure consistent formatting across datasets (e.g., column names, units, date formats).
Review the data dictionary or variable definitions for alignment.
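The backup step can be sketched in Python as a copy followed by a checksum comparison, so that a corrupted copy is caught before cleaning starts. This is only an illustration: the function names, the `backup` folder, and the sample file `raw_export.csv` are hypothetical and are created in a temporary directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(source, backup_dir):
    """Copy a raw file into backup_dir and confirm the copy is identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target

# Demonstration with a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as workdir:
    raw_file = Path(workdir) / "raw_export.csv"
    raw_file.write_text("Participant ID,Date,Region\nP001,2025-06-01,Gauteng\n")
    copy_path = back_up(raw_file, Path(workdir) / "backup")
    backup_matches = sha256_of(raw_file) == sha256_of(copy_path)
```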

🔍 Step 1: Structural Checks

Are all required columns present (e.g., Participant ID, Date, Region)?
Are there any extra or duplicated columns?
Are the headers clearly labeled and consistent?
Remove blank rows and stray leading or trailing spaces.
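As a rough illustration of these structural checks, the sketch below works on a header and rows already read into Python lists. The required column list comes from the examples above; the function name and sample data are assumptions made for the example.

```python
from collections import Counter

# Required columns, taken from the examples in this checklist.
REQUIRED_COLUMNS = ["Participant ID", "Date", "Region"]

def check_structure(header, rows):
    """Return (missing columns, duplicated columns, rows without blank lines)."""
    # Trim stray whitespace from header labels before comparing.
    clean_header = [name.strip() for name in header]
    missing = [c for c in REQUIRED_COLUMNS if c not in clean_header]
    duplicated = [name for name, n in Counter(clean_header).items() if n > 1]
    # Trim every cell, then drop rows in which all cells are empty.
    trimmed = [[cell.strip() for cell in row] for row in rows]
    non_blank = [row for row in trimmed if any(row)]
    return missing, duplicated, non_blank

# Hypothetical sample: one duplicated column and one blank row.
header = ["Participant ID", " Date", "Region", "Region"]
rows = [
    ["P001", "2025-06-01", " Gauteng ", "Gauteng"],
    ["", "", "", ""],
    ["P002", "2025-06-02", "Limpopo", "Limpopo"],
]
missing, duplicated, cleaned_rows = check_structure(header, rows)
```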

👥 Step 2: Duplicate Check

Check for and remove exact duplicate rows.
Check for partial duplicates (e.g., same name and date but different ID).
Confirm which duplicate to retain based on accuracy or timestamp.
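One possible way to automate the duplicate check is sketched below. It assumes each record carries an `updated` timestamp and that the most recent entry is the one to keep; both the field names and that retention rule are assumptions, and in practice partial duplicates may need manual review before anything is dropped.

```python
from datetime import datetime

# Hypothetical records; "updated" is an assumed last-modified timestamp.
records = [
    {"id": "P001", "name": "Thabo M", "date": "2025-06-01", "updated": "2025-06-03T09:00:00"},
    {"id": "P001", "name": "Thabo M", "date": "2025-06-01", "updated": "2025-06-03T09:00:00"},  # exact duplicate
    {"id": "P009", "name": "Thabo M", "date": "2025-06-01", "updated": "2025-06-05T14:30:00"},  # partial duplicate
    {"id": "P002", "name": "Lerato K", "date": "2025-06-01", "updated": "2025-06-02T08:15:00"},
]

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def keep_latest_per_person(rows):
    """Among rows sharing name and date, keep the most recently updated one."""
    latest = {}
    for row in rows:
        key = (row["name"], row["date"])
        stamp = datetime.fromisoformat(row["updated"])
        if key not in latest or stamp > datetime.fromisoformat(latest[key]["updated"]):
            latest[key] = row
    return list(latest.values())

unique_rows = drop_exact_duplicates(records)
deduplicated = keep_latest_per_person(unique_rows)
```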

❌ Step 3: Missing Data

Identify missing values in key fields (e.g., age, region, attendance).
Flag rows with incomplete data for follow-up or exclusion.
Apply the agreed-upon method for handling missing data:
- Impute (mean, median, or most frequent category)
- Leave blank (if non-critical)
- Remove (if data is unreliable)
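The three options above might look like this in plain Python. The choice of key fields, the use of the median for age, and the decision to drop rows without a region are assumptions for the example, not a prescribed SayPro rule.

```python
from statistics import median

# Hypothetical records with gaps in key and non-key fields.
rows = [
    {"id": "P001", "age": 22, "region": "Gauteng", "notes": ""},
    {"id": "P002", "age": None, "region": "Limpopo", "notes": "late entry"},
    {"id": "P003", "age": 30, "region": None, "notes": ""},
    {"id": "P004", "age": 19, "region": "Gauteng", "notes": None},
]
KEY_FIELDS = ["age", "region"]  # assumed key fields

def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

# Flag rows with incomplete key fields for follow-up.
flagged = [r["id"] for r in rows if any(is_missing(r[f]) for f in KEY_FIELDS)]

# Impute: replace missing ages with the median of the observed ages.
observed_ages = [r["age"] for r in rows if not is_missing(r["age"])]
age_median = median(observed_ages)
imputed = [dict(r, age=age_median if is_missing(r["age"]) else r["age"]) for r in rows]

# Remove: drop rows that still lack a region.
# Leave blank: the non-critical "notes" field is not touched.
complete = [r for r in imputed if not is_missing(r["region"])]
```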

📐 Step 4: Data Type & Format Validation

Ensure numbers are stored as numeric values and dates are in the correct format (e.g., YYYY-MM-DD).
Standardize text entries (e.g., “gauteng” → “Gauteng”).
Check dropdown values against validation lists (e.g., Gender = Male/Female/Other only).
Ensure consistent units and scales (e.g., all scores on a 1–5 scale).
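A minimal validation sketch for these checks is shown below. It assumes values arrive as text (as they would from a CSV export) and uses hypothetical field names `date`, `score`, and `gender`; the allowed gender values and the 1–5 score range come from the examples above.

```python
from datetime import datetime

ALLOWED_GENDERS = {"Male", "Female", "Other"}

def validate_row(row):
    """Return the names of fields that fail type or format validation."""
    errors = []
    # Dates must follow the year-month-day pattern.
    try:
        datetime.strptime(row["date"], "%Y-%m-%d")
    except ValueError:
        errors.append("date")
    # Scores must be numeric and fall on the 1-5 scale.
    try:
        score = float(row["score"])
        if not 1 <= score <= 5:
            errors.append("score")
    except ValueError:
        errors.append("score")
    # Dropdown values must come from the validation list.
    if row["gender"] not in ALLOWED_GENDERS:
        errors.append("gender")
    return errors

# Hypothetical rows as they might arrive from a CSV export.
rows = [
    {"id": "P001", "date": "2025-06-01", "score": "4", "gender": "Female"},
    {"id": "P002", "date": "01/06/2025", "score": "four", "gender": "F"},
    {"id": "P003", "date": "2025-06-03", "score": "7", "gender": "Other"},
]
problems = {}
for row in rows:
    errors = validate_row(row)
    if errors:
        problems[row["id"]] = errors
```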

🔢 Step 5: Logical Consistency Checks

Verify date sequences (e.g., Start Date is before End Date).
Ensure ages fall within expected bounds (e.g., 15–35 for youth programs).
Confirm all attendance and module completions match program schedules.
Cross-check region and district combinations (e.g., each district belongs to the region recorded for it).
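The date, age, and region checks can be expressed as simple rules, as in the sketch below. The lookup table is a small, incomplete illustration and the field names are assumed; a real check would use the program's own reference list of regions and districts.

```python
from datetime import date

# Illustrative, incomplete lookup of districts per region (an assumption).
DISTRICTS_BY_REGION = {
    "Gauteng": {"Johannesburg", "Tshwane"},
    "Limpopo": {"Vhembe", "Mopani"},
}

def logical_issues(row, min_age=15, max_age=35):
    """Return a list of logical inconsistencies found in one record."""
    issues = []
    if row["start_date"] >= row["end_date"]:
        issues.append("start date not before end date")
    if not min_age <= row["age"] <= max_age:
        issues.append("age out of range")
    if row["district"] not in DISTRICTS_BY_REGION.get(row["region"], set()):
        issues.append("district does not match region")
    return issues

good = {"start_date": date(2025, 3, 1), "end_date": date(2025, 6, 30),
        "age": 24, "region": "Gauteng", "district": "Tshwane"}
bad = {"start_date": date(2025, 7, 1), "end_date": date(2025, 6, 30),
       "age": 41, "region": "Limpopo", "district": "Tshwane"}

good_issues = logical_issues(good)
bad_issues = logical_issues(bad)
```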

🧩 Step 6: Categorical Data Standardization

Standardize categories (e.g., “Yes”, “yes”, “Y” → “Yes”).
Remove typos and inconsistent spellings (e.g., “Freestate” → “Free State”).
Apply consistent naming conventions for activities or modules.
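A mapping table is one simple way to standardize categories. The sketch below uses the examples from this step; the mapping contents and function name are assumptions, and values that are not in the mapping are left unchanged and collected for manual review rather than guessed.

```python
# Assumed mappings from known variants (lowercased) to the standard label.
YES_NO_MAP = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}
REGION_MAP = {"freestate": "Free State", "free state": "Free State", "gauteng": "Gauteng"}

def standardize_column(values, mapping):
    """Return (standardized values, values that need manual review)."""
    cleaned, unmapped = [], []
    for value in values:
        # Collapse repeated spaces, trim the ends, and ignore case.
        key = " ".join(value.split()).lower()
        if key in mapping:
            cleaned.append(mapping[key])
        else:
            cleaned.append(value)   # keep the original value
            unmapped.append(value)  # and flag it for review
    return cleaned, unmapped

answers, unknown_answers = standardize_column(["Yes", "yes", " Y ", "maybe"], YES_NO_MAP)
regions, unknown_regions = standardize_column(["Freestate", "gauteng", "Free  State"], REGION_MAP)
```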

📉 Step 7: Outlier Detection

Identify and review outliers (e.g., ages over 50, scores above 5).
Investigate whether outliers are valid or entry errors.
Correct, explain, or remove extreme outliers based on context.
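Outliers can be flagged either with fixed limits (such as the 1–5 score scale) or with a statistical rule. The sketch below shows both; the interquartile range (IQR) rule is a common convention introduced here for illustration, not a method named in this checklist, and the sample values are invented. Flagged values should still be reviewed before any correction or removal.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values more than k times the IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical values for illustration.
ages = [18, 21, 22, 24, 25, 27, 29, 63]
scores = [3, 4, 5, 2, 7, 1]

age_outliers = iqr_outliers(ages)                        # statistical rule
score_outliers = [s for s in scores if not 1 <= s <= 5]  # fixed limits
```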

🗂️ Step 8: Documentation

Log all cleaning actions in a Data Cleaning Log (what was changed, why, and by whom).
Record assumptions made during cleaning (e.g., assumptions about missing values).
Save a cleaned version of the dataset with a new filename and version number.
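A Data Cleaning Log can be as simple as a CSV file with one row per change. The column names below are an assumed layout covering what was changed, why, and by whom; they are not an official SayPro template.

```python
import csv
import io
from datetime import date

# Assumed log layout: one row per change.
LOG_FIELDS = ["date", "record_id", "field", "old_value", "new_value", "reason", "changed_by"]

def log_change(log, record_id, field, old_value, new_value, reason, changed_by, when=None):
    """Append one cleaning action to the in-memory log."""
    log.append({
        "date": (when or date.today()).isoformat(),
        "record_id": record_id,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "changed_by": changed_by,
    })

cleaning_log = []
log_change(cleaning_log, "P003", "region", "Freestate", "Free State",
           "Spelling standardized", "Data officer", when=date(2025, 6, 10))

# Write the log as CSV text; a real workflow would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=LOG_FIELDS)
writer.writeheader()
writer.writerows(cleaning_log)
log_csv = buffer.getvalue()
```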

🧪 Step 9: Final Quality Review

Conduct a peer review or second-check of the cleaned data.
Run basic summary statistics to confirm data integrity (e.g., totals, averages).
Validate a random sample against original sources if needed.
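The summary statistics and the random sample check might be sketched as follows. The records, field names, and sample size are hypothetical, and the fixed seed is only there so that a reviewer can reproduce the same sample.

```python
import random
from statistics import mean

# Hypothetical raw source (keyed by participant ID) and cleaned dataset.
raw = {
    "P001": {"age": 22, "sessions": 8},
    "P002": {"age": 30, "sessions": 10},
    "P003": {"age": 19, "sessions": 6},
}
cleaned = [
    {"id": "P001", "age": 22, "sessions": 8},
    {"id": "P002", "age": 30, "sessions": 10},
    {"id": "P003", "age": 19, "sessions": 6},
]

# Basic summary statistics to compare against expected totals.
summary = {
    "rows": len(cleaned),
    "mean_age": mean(r["age"] for r in cleaned),
    "total_sessions": sum(r["sessions"] for r in cleaned),
}

# Draw a reproducible random sample and compare it with the raw source.
rng = random.Random(42)
sample_ids = rng.sample(sorted(raw), k=2)
cleaned_by_id = {r["id"]: r for r in cleaned}
mismatches = [
    pid for pid in sample_ids
    if any(raw[pid][field] != cleaned_by_id[pid][field] for field in raw[pid])
]
```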

🔒 Step 10: Secure Storage

Store the cleaned dataset in the designated SayPro shared folder or platform.
Name the file according to the agreed naming convention (e.g., SayPro_YouthData_Cleaned_June2025_v2.xlsx).
Archive both raw and cleaned data securely for traceability.
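File names that follow the example pattern can be generated and versioned automatically. The sketch below infers the pattern from the single example above (dataset, month and year, version number); the helper names are hypothetical and the actual convention may include other parts.

```python
import re
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Pattern inferred from the example file name in this checklist.
NAME_PATTERN = re.compile(
    r"^SayPro_(?P<dataset>\w+?)_Cleaned_(?P<period>[A-Za-z]+\d{4})_v(?P<version>\d+)\.xlsx$"
)

def cleaned_filename(dataset, when, version):
    """Build a file name such as SayPro_YouthData_Cleaned_June2025_v2.xlsx."""
    period = f"{MONTHS[when.month - 1]}{when.year}"
    return f"SayPro_{dataset}_Cleaned_{period}_v{version}.xlsx"

def next_version(existing_names, dataset, period):
    """Return one more than the highest version already saved, or 1 if none."""
    versions = [
        int(m.group("version"))
        for m in map(NAME_PATTERN.match, existing_names)
        if m and m.group("dataset") == dataset and m.group("period") == period
    ]
    return max(versions, default=0) + 1

existing = [
    "SayPro_YouthData_Cleaned_June2025_v1.xlsx",
    "SayPro_YouthData_Cleaned_June2025_v2.xlsx",
]
new_name = cleaned_filename("YouthData", date(2025, 6, 30),
                            next_version(existing, "YouthData", "June2025"))
```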

📎 Optional: Tools to Support Cleaning

Excel (filters, data validation, conditional formatting)
Power Query (for merging, transforming, and cleaning large datasets)
Python/Pandas or R (for automated cleaning workflows)
SayPro M&E Dashboard (for integrated data checks)
