SayPro Data Cleaning Checklist
📅 Before Cleaning: Preparation
- Confirm all raw data files have been received (Excel, CSV, system exports).
- Back up the original data files.
- Ensure consistent formatting across datasets (e.g., column names, units, date formats).
- Review the data dictionary or variable definitions for alignment.
🔍 Step 1: Structural Checks
- Are all required columns present (e.g., Participant ID, Date, Region)?
- Are there any extra or duplicated columns?
- Are the headers clearly labeled and consistent?
- Remove any blank rows or extra spacing.
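The structural checks above can be sketched in plain Python. This is a minimal illustration, not SayPro's actual tooling: the function name `check_structure` and the sample rows are made up for the example, and the required columns are taken from the examples in the checklist.

```python
# Required columns, taken from the examples in the checklist.
REQUIRED_COLUMNS = {"Participant ID", "Date", "Region"}


def check_structure(header, rows):
    """Return (missing columns, duplicated columns, rows with blank rows removed)."""
    # Strip stray spacing from headers before comparing them.
    normalized = [h.strip() for h in header]
    missing = sorted(REQUIRED_COLUMNS - set(normalized))
    duplicated = sorted({h for h in normalized if normalized.count(h) > 1})
    # A row is blank when every cell is empty or whitespace only.
    non_blank = [r for r in rows if any(cell.strip() for cell in r)]
    return missing, duplicated, non_blank


# Made-up sample data: one duplicated column and one blank row.
header = ["Participant ID", " Date", "Region", "Region"]
rows = [
    ["P001", "2025-06-01", "Gauteng", "Gauteng"],
    ["", " ", "", ""],
]
missing, duplicated, cleaned = check_structure(header, rows)
```

With this sample, `duplicated` lists the repeated `Region` column and `cleaned` keeps only the first row.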
👥 Step 2: Duplicate Check
- Check for and remove exact duplicate rows.
- Check for partial duplicates (e.g., same name and date but different ID).
- Confirm which duplicate to retain based on accuracy or timestamp.
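As a rough sketch of the duplicate check, the snippet below drops exact duplicates and then groups the remaining records by name and date so that partial duplicates can be reviewed by a person. The field names (`id`, `name`, `date`) and the sample records are assumptions made for illustration.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def find_partial_duplicates(records, keys=("name", "date")):
    """Group records that share the given keys; return groups with more than one record."""
    groups = {}
    for rec in records:
        groups.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Made-up sample records.
records = [
    {"id": "P001", "name": "Thandi M", "date": "2025-06-01"},
    {"id": "P001", "name": "Thandi M", "date": "2025-06-01"},  # exact duplicate
    {"id": "P009", "name": "Thandi M", "date": "2025-06-01"},  # same name and date, different ID
    {"id": "P002", "name": "Sipho K", "date": "2025-06-01"},
]
unique = drop_exact_duplicates(records)
suspects = find_partial_duplicates(unique)
```

The partial duplicates are only flagged, not removed, because deciding which record to retain depends on accuracy or timestamp.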
❌ Step 3: Missing Data
- Identify missing values in key fields (e.g., age, region, attendance).
- Flag rows with incomplete data for follow-up or exclusion.
- Apply the agreed-upon method for handling missing data:
  - Impute (mean, median, or category)
  - Leave blank (if non-critical)
  - Remove (if data is unreliable)
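One way to flag incomplete rows and apply median imputation is sketched below. The helper names, field names, and sample values are hypothetical, and median is only one of the options listed above; which method applies should follow the agreed approach.

```python
from statistics import median


def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def flag_missing(records, key_fields):
    """Return the indexes of records with a missing value in any key field."""
    return [
        i for i, rec in enumerate(records)
        if any(is_missing(rec.get(f)) for f in key_fields)
    ]


def impute_median(records, field):
    """Return new records with missing values in `field` replaced by the median."""
    known = [rec[field] for rec in records if not is_missing(rec.get(field))]
    fill = median(known)
    return [
        dict(rec, **{field: fill}) if is_missing(rec.get(field)) else rec
        for rec in records
    ]


# Made-up sample records.
records = [
    {"age": 20, "region": "Gauteng"},
    {"age": None, "region": "Limpopo"},
    {"age": 30, "region": ""},
]
flagged = flag_missing(records, ["age", "region"])
filled = impute_median(records, "age")
```

The original list is left unchanged, so the flagged rows can still be followed up against the raw values.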
📐 Step 4: Data Type & Format Validation
- Ensure numbers are stored as numeric values and dates use the correct format (e.g., YYYY-MM-DD).
- Standardize text entries (e.g., “gauteng” → “Gauteng”).
- Check dropdown values against validation lists (e.g., Gender = Male/Female/Other only).
- Ensure consistent units and scales (e.g., all scores on a 1–5 scale).
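A small validation sketch for these checks follows, assuming each row arrives as strings (as it would from a CSV export). The field names `date`, `gender`, and `score` are hypothetical; the allowed values and the 1–5 scale come from the examples above.

```python
from datetime import datetime

ALLOWED_GENDERS = {"Male", "Female", "Other"}


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_row(row):
    """Return the names of fields that fail type or format validation."""
    problems = []
    if not is_iso_date(row["date"]):
        problems.append("date")
    if row["gender"] not in ALLOWED_GENDERS:
        problems.append("gender")
    try:
        score = float(row["score"])
    except ValueError:
        problems.append("score")
    else:
        if not 1 <= score <= 5:
            problems.append("score")
    return problems


# Made-up sample rows.
good = validate_row({"date": "2025-06-01", "gender": "Female", "score": "4"})
bad = validate_row({"date": "01/06/2025", "gender": "male", "score": "7"})
```

Here `good` is empty and `bad` names all three fields, which can then be corrected or sent back for follow-up.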
🔢 Step 5: Logical Consistency Checks
- Verify date sequences (e.g., Start Date is before End Date).
- Ensure age range falls within expected bounds (e.g., 15–35 for youth programs).
- Confirm all attendance and module completions match program schedules.
- Cross-check region and district combinations (each district should belong to its stated region).
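The date-sequence and age-range checks could look roughly like this. The record layout is an assumption, and the 15–35 bounds are the youth-program example from the checklist, exposed as parameters so other programs can use different limits.

```python
from datetime import date


def consistency_issues(rec, min_age=15, max_age=35):
    """Return a list of logical problems found in one record."""
    issues = []
    if rec["start_date"] >= rec["end_date"]:
        issues.append("start_date is not before end_date")
    if not min_age <= rec["age"] <= max_age:
        issues.append("age outside expected range")
    return issues


# Made-up sample records.
ok_record = {"start_date": date(2025, 6, 1), "end_date": date(2025, 6, 10), "age": 22}
bad_record = {"start_date": date(2025, 6, 10), "end_date": date(2025, 6, 1), "age": 41}

ok_issues = consistency_issues(ok_record)
bad_issues = consistency_issues(bad_record)
```

Returning a list of messages rather than a single pass/fail flag makes it easier to record why a row was queried.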
🧩 Step 6: Categorical Data Standardization
- Standardize categories (e.g., “Yes”, “yes”, “Y” → “Yes”).
- Remove typos and inconsistent spellings (e.g., “Freestate” → “Free State”).
- Apply consistent naming conventions for activities or modules.
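Category standardization is often handled with a lookup table of known variants. The mapping below is a sketch that covers only the examples given above; a real mapping would be built from the values actually found in the data.

```python
# Known variants (lowercased, single-spaced) mapped to the canonical label.
CANONICAL = {
    "yes": "Yes",
    "y": "Yes",
    "freestate": "Free State",
    "free state": "Free State",
}


def standardize(value, mapping=CANONICAL):
    """Return the canonical label for a known variant, else the trimmed value."""
    key = " ".join(value.split()).lower()
    return mapping.get(key, value.strip())


cleaned = [standardize(v) for v in ["Yes", " yes ", "Y", "Freestate", "Gauteng"]]
```

Values that are not in the mapping are returned trimmed but otherwise unchanged, so unexpected categories stay visible for review instead of being silently rewritten.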
📉 Step 7: Outlier Detection
- Identify and review outliers (e.g., ages over 50, scores above 5).
- Investigate whether outliers are valid or entry errors.
- Correct, explain, or remove extreme outliers based on context.
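A simple range-based outlier flag is sketched below. The upper bounds (age 50, score 5) come from the examples above; the lower bounds and the sample records are assumptions for illustration. The function only reports outliers so that each one can be investigated before anything is corrected or removed.

```python
def flag_outliers(records, field, low, high):
    """Return (index, value) pairs where the value falls outside [low, high]."""
    return [
        (i, rec[field])
        for i, rec in enumerate(records)
        if not low <= rec[field] <= high
    ]


# Made-up sample records.
records = [
    {"age": 22, "score": 4},
    {"age": 61, "score": 3},
    {"age": 19, "score": 9},
]
age_outliers = flag_outliers(records, "age", 15, 50)
score_outliers = flag_outliers(records, "score", 1, 5)
```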
🗂️ Step 8: Documentation
- Log all cleaning actions in a Data Cleaning Log (what was changed, why, and by whom).
- Record assumptions made during cleaning (e.g., assumptions about missing values).
- Save a cleaned version of the dataset with a new filename and version number.
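A Data Cleaning Log can be kept as a simple table. The sketch below records what was changed, why, and by whom, and renders the log as CSV text. The column names and the sample entry are hypothetical, not a prescribed SayPro format.

```python
import csv
import io
from datetime import date

LOG_FIELDS = ["date", "row", "field", "old_value", "new_value", "reason", "changed_by"]


def log_change(log, row, field, old_value, new_value, reason, changed_by, when=None):
    """Append one cleaning action to the log (defaults to today's date)."""
    log.append({
        "date": (when or date.today()).isoformat(),
        "row": row,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "changed_by": changed_by,
    })


def log_to_csv(log):
    """Render the log as CSV text, ready to be saved next to the cleaned dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()


# Made-up sample entry.
cleaning_log = []
log_change(cleaning_log, 14, "region", "Freestate", "Free State",
           "Spelling standardized", "analyst_1", when=date(2025, 6, 30))
csv_text = log_to_csv(cleaning_log)
```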
🧪 Step 9: Final Quality Review
- Conduct a peer review or second-check of the cleaned data.
- Run basic summary statistics to confirm data integrity (e.g., totals, averages).
- Validate a random sample against original sources if needed.
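For the final review, basic summary statistics and a reproducible random sample might be produced as follows. The function names, the `score` field, and the sample records are assumptions; the fixed seed is there only so a second reviewer can draw the same sample.

```python
import random
from statistics import mean


def summarize(records, field):
    """Return count, total, mean, min, and max for a numeric field."""
    values = [rec[field] for rec in records]
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def sample_for_review(records, k, seed=0):
    """Draw up to k records at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Made-up sample records.
records = [{"id": i, "score": s} for i, s in enumerate([3, 5, 4, 4])]
summary = summarize(records, "score")
to_check = sample_for_review(records, 2)
```

Comparing `summary` before and after cleaning is a quick way to spot rows that were dropped or changed unintentionally.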
🔒 Step 10: Secure Storage
- Store the cleaned dataset in the designated SayPro shared folder or platform.
- Apply the agreed file naming convention (e.g., SayPro_YouthData_Cleaned_June2025_v2.xlsx).
- Archive both raw and cleaned data securely for traceability.
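A small helper can keep file names consistent. The pattern below is inferred from the single example file name in this checklist, so treat it as an assumption rather than an official SayPro convention; the function and parameter names are hypothetical.

```python
def cleaned_filename(dataset, month_year, version, ext="xlsx"):
    """Build a name like SayPro_<dataset>_Cleaned_<month_year>_v<version>.<ext>."""
    return f"SayPro_{dataset}_Cleaned_{month_year}_v{version}.{ext}"


name = cleaned_filename("YouthData", "June2025", 2)
```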
📎 Optional: Tools to Support Cleaning
- Excel (filters, data validation, conditional formatting)
- Power Query (for merging, transforming, cleaning large data)
- Python/Pandas or R (for automated cleaning workflows)
- SayPro M&E Dashboard (for integrated data checks)