flowersteam/KidsReflect_data_pipeline

KidsReflect Data Pipeline

Overview

The KidsReflect Data Pipeline is a repository containing:

  1. Raw KidsReflect data (multiple CSV files + a JSON reference file),
  2. Scripts (in both R and Python) to process and clean this data,
  3. A final processed output (KR_preprocessed.csv).

Both the R and Python implementations perform the same sequence of steps, yielding identical results. The differences are purely in the programming language used.

Directory Structure

.
├── README.md                             # Project overview and documentation
├── raw_data/                             # Folder containing raw data and reference files
│   ├── talence_end_mc.csv                # Main raw CSV file
│   ├── KR_BC_session1.csv                # Additional session data file
│   ├── KR_AV_session1.csv                # Additional session data file
│   ├── KR_MC_session1.csv                # Additional session data file
│   └── reference_data.json               # JSON file with reference texts and cues
├── processed_data/                       # Folder containing the final processed output
│   └── KR_preprocessed.csv               # Final processed data
├── KidsReflect_data_pipeline.rmd         # R Markdown file implementing the processing pipeline in R
└── KidsReflect_data_pipeline.ipynb       # Jupyter Notebook implementing the processing pipeline in Python
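
Given this layout, loading the raw files might look like the sketch below. It is only an illustration: the helper name `load_raw_sessions` is hypothetical and not taken from the notebook, and it assumes every CSV has an `_id` column.

```python
from pathlib import Path

import pandas as pd

# File names come from the directory listing; the helper itself is hypothetical.
RAW_FILES = [
    "talence_end_mc.csv",
    "KR_BC_session1.csv",
    "KR_AV_session1.csv",
    "KR_MC_session1.csv",
]


def load_raw_sessions(raw_dir="raw_data"):
    """Read every raw CSV and stack them into a single DataFrame."""
    frames = [pd.read_csv(Path(raw_dir) / name) for name in RAW_FILES]
    combined = pd.concat(frames, ignore_index=True)
    # Collect session identifiers that appear more than once across the files.
    is_duplicate = combined["_id"].duplicated(keep=False)
    duplicated_ids = list(combined.loc[is_duplicate, "_id"].unique())
    return combined, duplicated_ids
```

Returning the duplicated identifiers rather than raising an error leaves the decision about how to handle them to the caller.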

Installation and Dependencies

Depending on whether you plan to run the Python or the R pipeline (or both), you’ll need to install the appropriate libraries.

Python Environment

Install:

pip install pandas jupyter ipykernel

Or, if you use conda, run the following inside your conda environment:

conda install pandas jupyter ipykernel

R Environment

Install:

install.packages("tidyverse")
install.packages("jsonlite")

Description of the Study and Data Collection

KidsReflect is a study designed to encourage children's curiosity by training their metacognitive skills. During each session, a child follows a sequence of 8 cycles. Each cycle:

  • Presents a short text (historical, scientific, or general culture topic),
  • Guides the child through 4 steps to help them pose questions beyond the text’s obvious answers (i.e., “divergent questions”).

The 4 steps in each cycle are based on Murayama's framework and are as follows:

  1. Identify: The child identifies a “knowledge gap” or subtopic they want to explore.
  2. Guess: The child formulates a hypothesis or guess related to that subtopic.
  3. Seek: The child transforms their hypothesis into a question that would elicit the information they lack.
  4. Assess: The child reviews suggested answers and indicates whether the question was resolved.

At each step, children could either write their own response or choose one of three proposed cues.

Data was originally collected via a tablet-based application. If an application bug occurred, an additional session might have been started for the same child, sometimes with a modified ID (e.g., “B2.2” if the first session was “B2”). This can produce repeated or partial sessions.
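
One way to spot such repeated sessions is to reduce each name to a base identifier and group sessions on it. The sketch below assumes that a name starts with a run of letters followed by a run of digits, as in "B2" and "B2.2". The regular expression and the `canonical_name` helper are illustrative assumptions, not the pipeline's actual rule.

```python
import re

import pandas as pd

# Assumed pattern: letters followed by digits at the start of the name.
BASE_NAME = re.compile(r"^\s*([A-Za-z]+\d+)")


def canonical_name(name):
    """Return the leading letters+digits part, or the stripped name if nothing matches."""
    match = BASE_NAME.match(str(name))
    return match.group(1) if match else str(name).strip()


# Made-up example sessions, for illustration only.
sessions = pd.DataFrame({"_id": ["a1", "a2", "a3"], "name": ["B2", "B2.2", "C7"]})
sessions["canonical"] = sessions["name"].map(canonical_name)
# Sessions sharing a canonical name are candidates for merging.
grouped = sessions.groupby("canonical")["_id"].apply(list).to_dict()
```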

The dataset in this repository is extracted from a MongoDB database containing all KidsReflect sessions from [start date] to [end date]. For each session (identified by _id), the raw data includes the child’s entries at each step and iteration.

Data Format

  • Dimensions: The raw dataset (from talence_end_mc.csv alone) has 424 rows × 35 columns, plus additional rows from KR_BC_session1.csv, KR_AV_session1.csv, and KR_MC_session1.csv.
  • Key columns:
    • __v: Non-informative column (always 0).
    • _id: A unique session identifier.
    • name: A kid’s identifier (letters + digits, e.g., "B2"). Not guaranteed unique (multiple sessions or derivatives of a name).
    • data.0.ctrl1, data.0.ctrl2, data.0.det, data.0.exp: The 4 steps (Identify, Assess, Guess, Seek) for iteration 0.
    • data.1.ctrl1 through data.7.exp: The 4 steps for iterations 1 through 7 (up to 8 total iterations).
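
The column naming scheme lends itself to a wide-to-long reshape. The following is a minimal sketch under stated assumptions: `wide_to_long` is a hypothetical helper, the raw field names (ctrl1, ctrl2, det, exp) are kept as-is rather than mapped to step names, and the shift from 0-based to 1-based iteration numbers is assumed from the output description.

```python
import re

import pandas as pd

# Matches column names such as "data.0.ctrl1" or "data.7.exp".
STEP_PATTERN = r"^data\.(\d+)\.(ctrl1|ctrl2|det|exp)$"


def wide_to_long(wide):
    """Turn one row per session into one row per (session, iteration)."""
    step_columns = [c for c in wide.columns if re.match(STEP_PATTERN, c)]
    melted = wide.melt(
        id_vars=["_id", "name"],
        value_vars=step_columns,
        var_name="column",
        value_name="text",
    )
    parts = melted["column"].str.extract(STEP_PATTERN)
    # Raw indices run 0-7; shifting to 1-8 is an assumption.
    melted["Iteration"] = parts[0].astype(int) + 1
    melted["field"] = parts[1]
    return (
        melted.pivot(index=["_id", "name", "Iteration"], columns="field", values="text")
        .reset_index()
        .rename_axis(columns=None)
    )
```

Columns that do not match the pattern, such as `__v`, are simply left out of the result.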

Processing Steps

The pipeline is divided into 10 steps. Steps 1-4 and Step 10 restructure the data (making it analyzable). Steps 5-9 are optional cleaning steps.

Important: In the code, you can set booleans (e.g., RUN_NAN, RUN_DUMMY, RUN_FAKE, RUN_GIBBERISH, RUN_INSUFFICIENT) to True/False to activate or skip each of the cleaning steps (5–9).
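
As a rough illustration of how such flags can gate the optional steps, consider the sketch below. Only the flag names come from this README. The two step functions, the `is_dummy` column, and the `run_cleaning` helper are hypothetical stand-ins for the real implementation.

```python
import pandas as pd

# Flag names come from the README; everything else here is a placeholder.
RUN_NAN = True
RUN_DUMMY = False

STEPS = ["IDENTIFY", "GUESS", "SEEK", "ASSESS"]


def drop_all_nan_sessions(df):
    """Stand-in for step 5: drop sessions with no text in any step of any iteration."""
    row_has_data = df[STEPS].notna().any(axis=1)
    session_has_data = row_has_data.groupby(df["ID"]).transform("any")
    return df[session_has_data]


def drop_dummy_rows(df):
    """Stand-in for step 6: drop rows tagged as dummy (is_dummy is an assumed column)."""
    return df[~df["is_dummy"]]


def run_cleaning(df):
    """Apply each optional step only when its flag is switched on."""
    for enabled, step in [(RUN_NAN, drop_all_nan_sessions), (RUN_DUMMY, drop_dummy_rows)]:
        if enabled:
            df = step(df)
    return df
```

Keeping the flag and the function side by side in one list makes the order of the steps explicit and easy to change.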

Below is a summary of each step:

  1. Data Loading

    • Loads the main CSV (talence_end_mc.csv) plus three additional session files (KR_BC_session1.csv, KR_AV_session1.csv, KR_MC_session1.csv).
    • Concatenates them into one DataFrame.
    • Checks for duplicate _id values.
  2. Reshaping Data (Wide → Long)

    • Drops the __v column (always 0).
    • Renames _id to ID.
    • Converts the data from “wide” (one row per session, multiple columns per iteration) to “long” (one row per iteration).
    • Each row now has columns: ID, name, Iteration, plus the four textual entries: IDENTIFY, GUESS, SEEK, ASSESS.
  3. Integrate Reference Texts

    • Loads a JSON file (reference_data.json) mapping each iteration (1–8) to a short reference text and suggested question “cues.”
    • Merges these references/cues into the DataFrame, adding columns: reference, identify_cues, guess_cues, seek_cues, assess_cues.
  4. Merge Sessions

    • Sometimes, one child’s data is split across multiple _id sessions (due to technical issues or partial retakes).
    • A canonical name is inferred from name (e.g., B2, Be2.2, or Be2bisbe2), so repeated sessions for the same child can be merged.
    • Dummy (filler) rows from partial sessions (with only one-character placeholders) are marked but not yet removed—just tagged.
    • The result is one consolidated DataFrame per child, preserving real data and tagging/merging partial sessions.
  5. Remove Sessions with All-NaN Iterations (optional)

    • Some sessions truly have no user data (all steps in all iterations are NaN).
    • These sessions are removed entirely.
  6. Remove Dummy Rows (optional)

    • Rows identified as “dummy” (i.e., the experimenter skipping previously completed iterations by entering placeholders) are removed.
  7. Remove Fake Subjects (optional)

    • Some sessions were tests or pilots (e.g., subject name = test, demo, or no digit in the name).
    • Removes those “fake” subjects, with an override list in case you want to keep some real participants who happen not to have digits in their name.
  8. Remove Gibberish Rows (optional)

    • Filters out rows that contain only gibberish or unexploitable text, by applying simple heuristics (e.g., numeric-only, extremely few vowels, single-word content, etc.).
  9. Remove Subjects with Insufficient Iterations (optional)

    • Drops any subject who has fewer than 4 valid iterations out of 8 (i.e., less than 50% valid data).
  10. Build Session Composition

    • Creates a sessions_composition column to track the original _ids, the original session_name, and the iteration numbers for each subject.
    • This makes it possible to trace how each final row was built from possibly multiple partial sessions.
    • Final ordering of columns in the output:
      name, Iteration, reference, IDENTIFY, GUESS, SEEK, ASSESS,
      identify_cues, guess_cues, seek_cues, assess_cues, sessions_composition
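
The optional filters in steps 7 to 9 can be pictured with the sketch below. It is a simplified illustration, not the pipeline's code. The function names, the vowel-ratio threshold, the vowel list, and the choice of columns checked for gibberish are all assumptions.

```python
import re

import pandas as pd

STEPS = ["IDENTIFY", "GUESS", "SEEK", "ASSESS"]


def is_fake_subject(name, keep=()):
    """Step 7 idea: test/demo names or names without a digit, unless listed in keep."""
    if name in keep:
        return False
    lowered = str(name).lower()
    return lowered in {"test", "demo"} or not re.search(r"\d", lowered)


def looks_like_gibberish(text, min_vowel_ratio=0.2):
    """Step 8 idea: no letters, a single word, or very few vowels (threshold assumed)."""
    text = str(text).strip()
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters or len(text.split()) < 2:
        return True
    vowels = sum(ch in "aeiouyàâéèêëîïôùûü" for ch in letters)
    return vowels / len(letters) < min_vowel_ratio


def drop_gibberish_rows(df, columns=("IDENTIFY", "GUESS", "SEEK")):
    """Drop rows where every checked column is missing or looks like gibberish."""
    bad = df[list(columns)].apply(
        lambda col: col.map(lambda t: pd.isna(t) or looks_like_gibberish(t))
    ).astype(bool)
    return df[~bad.all(axis=1)]


def drop_sparse_subjects(df, min_valid=4):
    """Step 9 idea: keep subjects with at least min_valid iterations containing any text."""
    valid = df[STEPS].notna().any(axis=1)
    counts = valid.groupby(df["name"]).transform("sum")
    return df[counts >= min_valid]
```

The `keep` argument plays the role of the override list from step 7, for example `is_fake_subject("Alice", keep=("Alice",))` for a real participant whose name has no digit.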
    

How to Run the Pipeline

You can run either the R Markdown (KidsReflect_data_pipeline.rmd) or the Python notebook (KidsReflect_data_pipeline.ipynb). Both do the same thing.

Python

  1. Make sure pandas and jupyter are installed so you can run the notebook.
  2. Before running, optionally toggle the cleaning steps in Step 0 (e.g., RUN_NAN = True, RUN_DUMMY = True, etc.).
  3. Run all cells.
  4. The final CSV will be written to processed_data/KR_preprocessed.csv (unless you alter the path).

R

  1. Make sure you have tidyverse (which includes readr and dplyr) and jsonlite installed.
  2. Similarly, you can toggle specific cleaning steps by editing booleans in the R code.
  3. Knit the R Markdown file.
  4. The final CSV will be written to processed_data/KR_preprocessed.csv (unless you alter the path).

Final Output

  • File: KR_preprocessed.csv

  • Location: processed_data/

  • Format: 12 columns, 929 rows

    Column                Description
    name                  The canonical name of the subject (child).
    Iteration             Iteration number (1–8).
    reference             The short text displayed to the child for that iteration (from reference_data.json).
    IDENTIFY              Child’s response in the Identify step.
    GUESS                 Child’s response in the Guess step.
    SEEK                  Child’s response in the Seek step.
    ASSESS                Child’s response in the Assess step (e.g., “Yes,” “No,” or “I have found my answer”).
    identify_cues         JSON-encoded list of prompts/cues given during the Identify step.
    guess_cues            JSON-encoded list of prompts/cues given during the Guess step.
    seek_cues             JSON-encoded list of prompts/cues given during the Seek step.
    assess_cues           JSON-encoded list of prompts/cues given during the Assess step.
    sessions_composition  A JSON-like structure showing all original session _id values, each _id’s session_name, and iteration(s).
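
Since the cue columns are JSON-encoded, they need decoding after the file is read. A minimal sketch, assuming the cue cells hold valid JSON strings: `load_preprocessed` is a hypothetical helper, and sessions_composition is left as text because it is described only as JSON-like.

```python
import json

import pandas as pd

CUE_COLUMNS = ["identify_cues", "guess_cues", "seek_cues", "assess_cues"]


def load_preprocessed(source="processed_data/KR_preprocessed.csv"):
    """Read the processed CSV and decode the JSON-encoded cue columns into Python lists."""
    df = pd.read_csv(source)
    for column in CUE_COLUMNS:
        # Missing cells stay as NaN; only string cells are parsed.
        df[column] = df[column].map(
            lambda raw: json.loads(raw) if isinstance(raw, str) else raw
        )
    return df
```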
