SayPro Data Cleaning Checklist

📅 Before Cleaning: Preparation

  • Confirm all raw data files have been received (Excel, CSV, system exports).
  • Back up the original data files.
  • Ensure consistent formatting across datasets (e.g., column names, units, date formats).
  • Review the data dictionary or variable definitions for alignment.
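
One way to back up a raw file before touching it is sketched below. The function names and the checksum comparison are illustrative assumptions, not an established SayPro procedure.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_raw_file(source, backup_dir):
    """Copy a raw data file into backup_dir and verify the copy."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    # Refuse to overwrite an earlier backup of the same file.
    if target.exists():
        raise FileExistsError(f"Backup already exists: {target}")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup does not match original: {target}")
    return target
```

Refusing to overwrite an existing backup is a deliberate choice here: the first copy of the raw file is the one worth protecting.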

🔍 Step 1: Structural Checks

  • Are all required columns present (e.g., Participant ID, Date, Region)?
  • Are there any extra or duplicated columns?
  • Are the headers clearly labeled and consistent?
  • Remove any blank rows or extra spacing.
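
The structural checks above could be automated roughly as follows. This is a minimal sketch that assumes the data has been read into a header list and a list of rows (for example with Python's csv module); the required column names are taken from the examples in this step.

```python
# Assumed required columns, based on the examples in this checklist.
REQUIRED_COLUMNS = {"Participant ID", "Date", "Region"}


def check_structure(header, rows):
    """Report missing or duplicated columns and drop fully blank rows."""
    cleaned_header = [h.strip() for h in header]
    missing = sorted(REQUIRED_COLUMNS - set(cleaned_header))
    duplicated = sorted({h for h in cleaned_header if cleaned_header.count(h) > 1})
    # A row counts as blank when every cell is empty or whitespace only.
    kept = [[cell.strip() for cell in row] for row in rows
            if any(cell.strip() for cell in row)]
    return {"missing": missing, "duplicated": duplicated, "rows": kept}
```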

👥 Step 2: Duplicate Check

  • Check for and remove exact duplicate rows.
  • Check for partial duplicates (e.g., same name and date but different ID).
  • Confirm which duplicate to retain based on accuracy or timestamp.
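
As a sketch of the two duplicate checks, assuming each row is a dictionary of strings and that the fields "Name", "Date", and "Participant ID" exist (these field names are assumptions for illustration):

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def find_partial_duplicates(rows, match_on=("Name", "Date"), id_field="Participant ID"):
    """Group rows that share the match_on fields but carry different IDs."""
    groups = {}
    for row in rows:
        # Compare case-insensitively and ignore stray whitespace.
        key = tuple(row[field].strip().lower() for field in match_on)
        groups.setdefault(key, []).append(row)
    return [group for group in groups.values()
            if len({r[id_field] for r in group}) > 1]
```

Partial duplicates are only reported, not removed, because deciding which record to retain needs a human judgment about accuracy or timestamp.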

Step 3: Missing Data

  • Identify missing values in key fields (e.g., age, region, attendance).
  • Flag rows with incomplete data for follow-up or exclusion.
  • Apply agreed-upon method for handling missing data:
    • Impute (mean or median for numeric fields, most common category for categorical fields)
    • Leave blank (if non-critical)
    • Remove (if data is unreliable)
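
Flagging and imputing can be sketched as below, assuming rows are dictionaries of strings where an empty string means "missing". Median imputation is used only as one example of an agreed-upon method; the field names are hypothetical.

```python
from statistics import median


def is_blank(row, field):
    """True when the field is absent, None, or whitespace only."""
    return not (row.get(field) or "").strip()


def flag_missing(rows, key_fields):
    """Return (row_index, [missing field names]) for incomplete rows."""
    flagged = []
    for index, row in enumerate(rows):
        missing = [field for field in key_fields if is_blank(row, field)]
        if missing:
            flagged.append((index, missing))
    return flagged


def impute_median(rows, field):
    """Return new rows with blanks in a numeric field filled by the median."""
    values = [float(r[field]) for r in rows if not is_blank(r, field)]
    if not values:
        return list(rows)  # nothing to base an imputation on
    fill = str(median(values))
    return [{**r, field: fill} if is_blank(r, field) else r for r in rows]
```

The imputation returns new rows rather than editing in place, so the flagged originals remain available for follow-up.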

📐 Step 4: Data Type & Format Validation

  • Ensure numbers are numeric and dates are in correct format (e.g., YYYY-MM-DD).
  • Standardize text entries (e.g., “gauteng” → “Gauteng”).
  • Check dropdown values against validation lists (e.g., Gender = Male/Female/Other only).
  • Ensure consistent units (e.g., all scores on 1–5 scale).
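
A possible shape for these validations is sketched here. The accepted input date formats (including the day-first reading of dd/mm/yyyy) and the field names "Date", "Score", and "Gender" are assumptions for the example.

```python
from datetime import datetime

ALLOWED_GENDER = {"Male", "Female", "Other"}


def to_iso_date(text, input_formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in input_formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def validate_row(row):
    """Return the names of fields that fail type or format checks."""
    problems = []
    if to_iso_date(row["Date"]) is None:
        problems.append("Date")
    try:
        if not 1 <= float(row["Score"]) <= 5:
            problems.append("Score")  # numeric but outside the 1-5 scale
    except ValueError:
        problems.append("Score")  # not numeric at all
    # title() turns "female" into "Female" before the list check.
    if row["Gender"].strip().title() not in ALLOWED_GENDER:
        problems.append("Gender")
    return problems
```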

🔢 Step 5: Logical Consistency Checks

  • Verify date sequences (e.g., Start Date is before End Date).
  • Ensure age range falls within expected bounds (e.g., 15–35 for youth programs).
  • Confirm all attendance and module completions match program schedules.
  • Cross-check regional and district combinations.
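
The checks in this step might look like the sketch below. The field names and the small region-to-district lookup are illustrative only; a real lookup would come from the program's own reference list.

```python
from datetime import date

# Illustrative subset of valid region/district pairs, not a complete list.
VALID_DISTRICTS = {
    "Gauteng": {"Johannesburg", "Tshwane"},
    "Free State": {"Mangaung"},
}


def check_logic(row, min_age=15, max_age=35):
    """Return a list of logical inconsistencies found in one row."""
    problems = []
    start = date.fromisoformat(row["Start Date"])
    end = date.fromisoformat(row["End Date"])
    if start >= end:
        problems.append("Start Date is not before End Date")
    if not min_age <= int(row["Age"]) <= max_age:
        problems.append("Age outside expected range")
    if row["District"] not in VALID_DISTRICTS.get(row["Region"], set()):
        problems.append("District does not belong to Region")
    return problems
```

The function assumes dates are already in YYYY-MM-DD form, so it belongs after the format validation in Step 4.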

🧩 Step 6: Categorical Data Standardization

  • Standardize categories (e.g., “Yes”, “yes”, “Y” → “Yes”).
  • Remove typos and inconsistent spellings (e.g., “Freestate” → “Free State”).
  • Apply consistent naming conventions for activities or modules.
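
Standardization of this kind is often handled with a lookup table. The mappings below are a small assumed sample built from the examples in this step.

```python
YES_NO_MAP = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}
REGION_MAP = {
    "freestate": "Free State",
    "free state": "Free State",
    "gauteng": "Gauteng",
}


def standardize(value, mapping):
    """Map a raw entry to its standard label; unknown values pass through trimmed."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(value.split()).lower()
    return mapping.get(key, value.strip())
```

Values that are not in the mapping are returned unchanged (apart from trimming) so they can be reviewed rather than silently altered.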

📉 Step 7: Outlier Detection

  • Identify and review outliers (e.g., ages over 50, scores above 5 on a 1–5 scale).
  • Investigate whether outliers are valid or entry errors.
  • Correct, explain, or remove extreme outliers based on context.
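
The checklist does not prescribe a detection method. As one common option, the sketch below uses the interquartile range (IQR) rule, with the usual multiplier of 1.5 as an assumed default.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


ages = [18, 19, 21, 22, 24, 25, 27, 62]
suspect = iqr_outliers(ages)  # flags 62 for review
```

Flagged values still need the manual review described above, since a statistical outlier can be a valid record.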

🗂️ Step 8: Documentation

  • Log all cleaning actions in a Data Cleaning Log (what was changed, why, and by whom).
  • Record assumptions made during cleaning (e.g., assumptions about missing values).
  • Save a cleaned version of the dataset with a new filename and version number.
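
A Data Cleaning Log can be as simple as a CSV file with one row per change. The column names below are an assumed layout covering what was changed, why, and by whom; the example entry is hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

LOG_FIELDS = ["timestamp", "row_id", "field", "old_value",
              "new_value", "reason", "changed_by"]


def start_log(stream):
    """Write the header row and return a writer for later entries."""
    writer = csv.DictWriter(stream, fieldnames=LOG_FIELDS)
    writer.writeheader()
    return writer


def log_change(writer, row_id, field, old_value, new_value, reason, changed_by):
    """Append one cleaning action to the log."""
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "row_id": row_id,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "changed_by": changed_by,
    })


# Example with an in-memory buffer; a real log would be an open file.
buffer = io.StringIO()
writer = start_log(buffer)
log_change(writer, "P001", "Region", "Freestate", "Free State",
           "Spelling standardized", "A. Reviewer")
```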

🧪 Step 9: Final Quality Review

  • Conduct a peer review or second-check of the cleaned data.
  • Run basic summary statistics to confirm data integrity (e.g., totals, averages).
  • Validate a random sample against original sources if needed.
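
The summary statistics and random sample could be produced along these lines. The sample size and the fixed seed are assumptions; fixing the seed simply makes the same sample reproducible for a second reviewer.

```python
import random
from statistics import mean


def summarize(rows, field):
    """Return count, total, mean, min, and max for a numeric field."""
    values = [float(r[field]) for r in rows if (r.get(field) or "").strip()]
    if not values:
        return {"count": 0, "total": 0.0, "mean": None, "min": None, "max": None}
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def sample_for_review(rows, k=5, seed=2025):
    """Pick up to k rows at random to compare against original sources."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```

Comparing the summary of the cleaned file with the same summary of the raw file is a quick way to spot rows lost or values shifted during cleaning.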

🔒 Step 10: Secure Storage

  • Store the cleaned dataset in the designated SayPro shared folder or platform.
  • Apply the agreed file naming convention, including the version number (e.g., SayPro_YouthData_Cleaned_June2025_v2.xlsx).
  • Archive both raw and cleaned data securely for traceability.
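
Versioned filenames in the style of the example above can be generated rather than typed by hand. This sketch infers the pattern from that single example, so treat the exact pattern as an assumption.

```python
import re


def cleaned_filename(dataset, month_year, version, extension="xlsx"):
    """Build a name such as SayPro_YouthData_Cleaned_June2025_v2.xlsx."""
    return f"SayPro_{dataset}_Cleaned_{month_year}_v{version}.{extension}"


def next_version(existing_names, dataset, month_year):
    """Return one more than the highest version already saved (1 if none)."""
    pattern = re.compile(
        rf"^SayPro_{re.escape(dataset)}_Cleaned_{re.escape(month_year)}_v(\d+)\.\w+$"
    )
    versions = [int(m.group(1)) for name in existing_names
                if (m := pattern.match(name))]
    return max(versions, default=0) + 1
```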

📎 Optional: Tools to Support Cleaning

  • Excel (filters, data validation, conditional formatting)
  • Power Query (for merging, transforming, and cleaning large datasets)
  • Python/Pandas or R (for automated cleaning workflows)
  • SayPro M&E Dashboard (for integrated data checks)
