The KidsReflect Data Pipeline is a repository containing:
- Raw KidsReflect data (multiple CSV files + a JSON reference file),
- Scripts (in both R and Python) to process and clean this data,
- A final processed output (`KR_preprocessed.csv`).
Both the R and Python implementations perform the same sequence of steps, yielding identical results. The differences are purely in the programming language used.
.
├── README.md # Project overview and documentation
├── raw_data/ # Folder containing raw data and reference files
│ ├── talence_end_mc.csv # Main raw CSV file
│ ├── KR_BC_session1.csv # Additional session data file
│ ├── KR_AV_session1.csv # Additional session data file
│ ├── KR_MC_session1.csv # Additional session data file
│ └── reference_data.json # JSON file with reference texts and cues
├── processed_data/ # Folder containing the processed output
│ └── KR_preprocessed.csv # Final processed data
├── KidsReflect_data_pipeline.rmd # R Markdown file implementing the processing pipeline in R
└── KidsReflect_data_pipeline.ipynb # Jupyter Notebook implementing the processing pipeline in Python
Depending on whether you plan to run the Python or the R pipeline (or both), you’ll need to install the appropriate libraries.
Install (Python):
pip install pandas jupyter ipykernel
Or, if you’re using conda, run this in your conda environment:
conda install pandas jupyter ipykernel
Install (R):
install.packages("tidyverse")
install.packages("jsonlite")
KidsReflect is a study designed to encourage children's curiosity by training their metacognitive skills. During each session, a child follows a sequence of 8 cycles. Each cycle:
- Presents a short text (historical, scientific, or general culture topic),
- Guides the child through 4 steps to help them pose questions beyond the text’s obvious answers (i.e., “divergent questions”).
The 4 steps in each cycle are based on Murayama's framework and are as follows:
- Identify: The child identifies a “knowledge gap” or subtopic they want to explore.
- Guess: The child formulates a hypothesis or guess related to that subtopic.
- Seek: The child transforms their hypothesis into a question that would elicit the information they lack.
- Assess: The child reviews suggested answers and indicates whether the question was resolved.
For each step, children could either formulate their own response or use one of the three proposed cues.
Data was originally collected via a tablet-based application. If an application bug occurred, an additional session might have been started for the same child, sometimes with a modified ID (e.g., “B2.2” if the first session was “B2”). This can produce repeated or partial sessions.
The dataset in this repository is extracted from a MongoDB database containing all KidsReflect sessions from [start date] to [end date]. For each session (identified by `_id`), the raw data includes the child’s entries at each step and iteration.
- Dimensions: The raw dataset (from `talence_end_mc.csv` alone) has 424 rows × 35 columns, plus additional rows from `KR_BC_session1.csv`, `KR_AV_session1.csv`, and `KR_MC_session1.csv`.
- Key columns:
  - `__v`: Non-informative column (always 0).
  - `_id`: A unique session identifier.
  - `name`: A child’s identifier (letters + digits, e.g., `"B2"`). Not guaranteed unique (multiple sessions or derivatives of a name).
  - `data.0.ctrl1`, `data.0.ctrl2`, `data.0.det`, `data.0.exp`: The 4 steps (Identify, Assess, Guess, Seek) for iteration 0.
  - `data.1.ctrl1` … `data.7.exp`: The 4 steps for iterations 1 through 7 (up to 8 total iterations).
The pipeline is divided into 10 steps. Steps 1-4 and Step 10 restructure the data (making it analyzable). Steps 5-9 are optional cleaning steps.
Important: In the code, you can set booleans (e.g., `RUN_NAN`, `RUN_DUMMY`, `RUN_FAKE`, `RUN_GIBBERISH`, `RUN_INSUFFICIENT`) to `True`/`False` to activate or skip each of the cleaning steps (5–9).
Below is a summary of each step:
1. Data Loading
   - Loads the main CSV (`talence_end_mc.csv`) plus three additional session files (`KR_BC_session1.csv`, `KR_AV_session1.csv`, `KR_MC_session1.csv`).
   - Concatenates them into one DataFrame.
   - Checks for duplicate `_id` values.
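The loading step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper name `load_raw` is hypothetical, and the only thing assumed about the files is that each one has an `_id` column.

```python
import pandas as pd


def load_raw(paths):
    """Read each CSV and stack them into a single DataFrame.

    `load_raw` is a hypothetical helper name, not taken from the notebook.
    Returns the concatenated frame and the list of `_id` values that
    appear more than once across all files.
    """
    frames = [pd.read_csv(p) for p in paths]
    raw = pd.concat(frames, ignore_index=True)
    # keep=False marks every occurrence of a repeated _id, not just the later ones
    repeated = raw["_id"].duplicated(keep=False)
    dup_ids = raw.loc[repeated, "_id"].unique().tolist()
    return raw, dup_ids
```

With the repository layout shown earlier, this could be called with the four CSV paths under `raw_data/`; a non-empty `dup_ids` would signal sessions that were exported more than once.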
2. Reshaping Data (Wide → Long)
   - Drops the `__v` column (always 0).
   - Renames `_id` to `ID`.
   - Converts the data from “wide” (one row per session, multiple columns per iteration) to “long” (one row per iteration).
   - Each row now has columns: `ID`, `name`, `Iteration`, plus the four textual entries: `IDENTIFY`, `GUESS`, `SEEK`, `ASSESS`.
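One possible way to do the wide-to-long reshaping with pandas is shown below. Two things here are assumptions and should be checked against the notebook: the mapping from raw column suffixes to step names (taken from the order in which the key columns are listed above), and the shift from 0-based raw iterations to the 1–8 numbering used in the output.

```python
import pandas as pd

# ASSUMPTION: suffix-to-step mapping, read off the order of the key-column list.
STEP_NAMES = {"ctrl1": "IDENTIFY", "ctrl2": "ASSESS", "det": "GUESS", "exp": "SEEK"}


def wide_to_long(raw):
    """Turn one row per session into one row per (session, iteration)."""
    df = raw.drop(columns=["__v"], errors="ignore").rename(columns={"_id": "ID"})
    value_cols = [c for c in df.columns if c.startswith("data.")]
    melted = df.melt(id_vars=["ID", "name"], value_vars=value_cols,
                     var_name="column", value_name="text")
    # "data.3.det" -> iteration "3", suffix "det"
    parts = melted["column"].str.extract(r"data\.(\d+)\.(\w+)")
    # ASSUMPTION: raw iterations are 0-based, output iterations are 1-based
    melted["Iteration"] = parts[0].astype(int) + 1
    melted["step"] = parts[1].map(STEP_NAMES)
    # pivot raises if (ID, name, Iteration, step) is not unique, e.g. duplicate _id
    out = melted.pivot(index=["ID", "name", "Iteration"],
                       columns="step", values="text").reset_index()
    out.columns.name = None
    return out[["ID", "name", "Iteration", "IDENTIFY", "GUESS", "SEEK", "ASSESS"]]
```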
3. Integrate Reference Texts
   - Loads a JSON file (`reference_data.json`) mapping each iteration (1–8) to a short reference text and suggested question “cues.”
   - Merges these references/cues into the DataFrame, adding columns: `reference`, `identify_cues`, `guess_cues`, `seek_cues`, `assess_cues`.
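The merge with the reference file might look like the sketch below. The internal layout of `reference_data.json` is not described here, so the structure used in this example (a dict keyed by iteration number, each entry holding a `reference` string and one list per cue column) is purely an assumption, as is the helper name `add_references`.

```python
import json

import pandas as pd

CUE_COLS = ["identify_cues", "guess_cues", "seek_cues", "assess_cues"]


def add_references(long_df, reference):
    """Attach the reference text and cue lists to each iteration row.

    ASSUMED layout of `reference`:
    {"1": {"reference": "...", "identify_cues": [...], ...}, "2": {...}, ...}
    """
    rows = []
    for iteration, entry in reference.items():
        row = {"Iteration": int(iteration), "reference": entry["reference"]}
        for col in CUE_COLS:
            # Store cue lists as JSON strings so they survive a CSV round trip
            row[col] = json.dumps(entry.get(col, []), ensure_ascii=False)
        rows.append(row)
    ref_df = pd.DataFrame(rows)
    # many_to_one: each iteration number must appear only once in the reference
    return long_df.merge(ref_df, on="Iteration", how="left", validate="many_to_one")
```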
4. Merge Sessions
   - Sometimes, one child’s data is split across multiple `_id` sessions (due to technical issues or partial retakes).
   - A canonical name is inferred from `name` (e.g., `B2`, `Be2.2`, or `Be2bis` → `be2`), so repeated sessions for the same child can be merged.
   - Dummy (filler) rows from partial sessions (with only one-character placeholders) are tagged at this stage but not yet removed.
   - The result is one consolidated DataFrame per child, preserving real data and tagging/merging partial sessions.
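As an illustration of the canonical-name idea, the sketch below lowercases the name and keeps only the leading run of letters followed by digits, which strips suffixes such as `.2` or `bis`. This is an assumed rule, not the notebook's: the actual inference may be more involved (for instance in how it relates `B2` to `be2`), and `canonical_name` is a hypothetical helper.

```python
import re


def canonical_name(name):
    """Lowercase a subject name and drop anything after the first letters+digits run.

    ASSUMED rule for illustration only; names that do not start with
    letters followed by digits are returned lowercased and otherwise unchanged.
    """
    cleaned = str(name).strip().lower()
    match = re.match(r"[a-z]+\d+", cleaned)
    return match.group(0) if match else cleaned
```

Applied to the `name` column (for example with `df["name"].map(canonical_name)`), rows from `Be2.2` and `Be2bis` would then share the key `be2` and could be grouped as one child.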
5. Remove Sessions with All-NaN Iterations (optional)
   - Some sessions truly have no user data (all steps in all iterations are `NaN`).
   - These sessions are removed entirely.
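A minimal sketch of this filter, gated by a boolean in the spirit of `RUN_NAN`, is given below. The function name and the `run_nan` parameter are hypothetical; the sketch assumes the long-format columns `ID`, `IDENTIFY`, `GUESS`, `SEEK`, and `ASSESS` described in step 2.

```python
import pandas as pd

STEP_COLS = ["IDENTIFY", "GUESS", "SEEK", "ASSESS"]


def drop_empty_sessions(df, run_nan=True):
    """Remove sessions in which every step of every iteration is NaN."""
    if not run_nan:
        return df
    # True for rows where at least one of the four steps has a value
    row_has_data = df[STEP_COLS].notna().any(axis=1)
    # Broadcast "does this session have any data at all?" back to each row
    session_has_data = row_has_data.groupby(df["ID"]).transform("any")
    return df[session_has_data].reset_index(drop=True)
```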
6. Remove Dummy Rows (optional)
   - Rows identified as “dummy” (i.e., the experimenter skipping previously completed iterations by entering placeholders) are removed.
7. Remove Fake Subjects (optional)
   - Some sessions were tests or pilots (e.g., subject name = `test`, `demo`, or no digit in the name).
   - Removes those “fake” subjects, with an override list in case you want to keep some real participants who happen not to have digits in their name.
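The fake-subject rule can be pictured as a small predicate like the one below. The names `FAKE_NAMES`, `is_fake_subject`, and `keep` are hypothetical, and the exact list of test names used by the pipeline may be longer than the two examples given above.

```python
# ASSUMPTION: only the two example names from the description are listed here.
FAKE_NAMES = {"test", "demo"}


def is_fake_subject(name, keep=frozenset()):
    """Return True for test/pilot names, unless the name is explicitly kept.

    `keep` plays the role of the override list: lowercase names of real
    participants that should survive even without a digit in their name.
    """
    lowered = str(name).strip().lower()
    if lowered in keep:
        return False
    has_digit = any(ch.isdigit() for ch in lowered)
    return lowered in FAKE_NAMES or not has_digit
```

On a DataFrame, rows could then be kept with something like `df[~df["name"].map(is_fake_subject)]`.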
8. Remove Gibberish Rows (optional)
   - Filters out rows that contain only gibberish or unexploitable text, by applying simple heuristics (e.g., numeric-only, extremely few vowels, single-word content, etc.).
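To make the heuristics concrete, here is a rough sketch of a per-entry check covering the three examples named above. The vowel set (which includes some accented vowels), the `min_vowel_ratio` threshold, and the function name are all assumptions chosen for illustration; the pipeline's real rules and thresholds may differ.

```python
# ASSUMPTION: vowel set and threshold are illustrative, not the pipeline's values.
VOWELS = set("aeiouyàâéèêëîïôùûü")


def looks_like_gibberish(text, min_vowel_ratio=0.2):
    """Heuristically flag an entry as gibberish or unexploitable."""
    stripped = str(text).strip()
    if not stripped:
        return True
    # Numeric-only content (spaces ignored)
    if stripped.replace(" ", "").isdigit():
        return True
    letters = [c for c in stripped.lower() if c.isalpha()]
    if not letters:
        return True
    # Extremely few vowels, as in keyboard mashing
    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters)
    if vowel_ratio < min_vowel_ratio:
        return True
    # Single-word content
    return len(stripped.split()) < 2
```

A row would then be dropped only if all of its step entries are flagged, since the step targets rows that contain only gibberish.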
9. Remove Subjects with Insufficient Iterations (optional)
   - Drops any subject who has fewer than 4 valid iterations out of 8 (i.e., less than 50% valid data).
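The threshold in this step can be sketched as a group-wise count. What counts as a "valid" iteration is not spelled out above, so this example assumes an iteration is valid when at least one of its four steps is non-empty; the function and parameter names are hypothetical.

```python
import pandas as pd

STEP_COLS = ["IDENTIFY", "GUESS", "SEEK", "ASSESS"]


def drop_sparse_subjects(df, min_valid=4, run_insufficient=True):
    """Drop subjects with fewer than `min_valid` valid iterations."""
    if not run_insufficient:
        return df
    # ASSUMPTION: an iteration is valid if at least one step has a value
    valid = df[STEP_COLS].notna().any(axis=1)
    # Number of valid iterations per subject, repeated on each of their rows
    n_valid = valid.groupby(df["name"]).transform("sum")
    return df[n_valid >= min_valid].reset_index(drop=True)
```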
10. Build Session Composition
    - Creates a `sessions_composition` column to track the original `_id`s, the original `session_name`, and the iteration numbers for each subject.
    - This helps with traceability of how the final row was built from possibly multiple partial sessions.
    - Final ordering of columns in the output: `name, Iteration, reference, IDENTIFY, GUESS, SEEK, ASSESS, identify_cues, guess_cues, seek_cues, assess_cues, sessions_composition`
You can run either the R Markdown (`KidsReflect_data_pipeline.rmd`) or the Python notebook (`KidsReflect_data_pipeline.ipynb`). Both do the same thing.
- Make sure you have `pandas` and `jupyter` to run the notebook.
- Before running, optionally toggle the cleaning steps in Step 0 (e.g., `RUN_NAN = True`, `RUN_DUMMY = True`, etc.).
- Run all cells.
- The final CSV will be written to `processed_data/KR_preprocessed.csv` (unless you alter the path).
- Make sure you have `tidyverse` (which includes `readr` and `dplyr`) and `jsonlite` to knit the R Markdown file.
- Similarly, you can toggle specific cleaning steps by editing booleans in the R code.
- Knit the R Markdown file.
- The final CSV will be written to `processed_data/KR_preprocessed.csv` (unless you alter the path).
- File: `KR_preprocessed.csv`
- Location: `processed_data/`
- Format: 12 columns, 929 rows

| Column | Description |
| --- | --- |
| `name` | The canonical name of the subject (child). |
| `Iteration` | Iteration number (1–8). |
| `reference` | The short text displayed to the child for that iteration (from `reference_data.json`). |
| `IDENTIFY` | Child’s response in the Identify step. |
| `GUESS` | Child’s response in the Guess step. |
| `SEEK` | Child’s response in the Seek step. |
| `ASSESS` | Child’s response in the Assess step (e.g., “Yes,” “No,” or “I have found my answer”). |
| `identify_cues` | JSON-encoded list of prompts/cues given during the Identify step. |
| `guess_cues` | JSON-encoded list of prompts/cues given during the Guess step. |
| `seek_cues` | JSON-encoded list of prompts/cues given during the Seek step. |
| `assess_cues` | JSON-encoded list of prompts/cues given during the Assess step. |
| `sessions_composition` | A JSON-like structure showing all original session `_id` values, each `_id`’s `session_name`, and iteration(s). |
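
Because the cue columns are described as JSON-encoded lists, they come back from `pandas.read_csv` as plain strings. The sketch below shows one way to decode them after loading. The helper name `load_preprocessed` is hypothetical, and `sessions_composition` is left untouched because it is only described as JSON-like and may not parse with `json.loads`.

```python
import json

import pandas as pd

CUE_COLS = ["identify_cues", "guess_cues", "seek_cues", "assess_cues"]


def load_preprocessed(path_or_buffer):
    """Load the processed CSV and decode the JSON-encoded cue columns."""
    df = pd.read_csv(path_or_buffer)
    for col in CUE_COLS:
        # Missing cells are read as NaN (a float), so fall back to an empty list
        df[col] = df[col].map(lambda s: json.loads(s) if isinstance(s, str) else [])
    return df
```

For example, `load_preprocessed("processed_data/KR_preprocessed.csv")` would return a DataFrame whose cue columns hold Python lists.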