The study has conducted six waves of data collection, spanning 1998 through 2017. The publicly available data from interviews at each wave, as well as from the home visit activities and interviewer observations in Year 3 through Year 15, are described below. This page also provides a brief overview of data available through the restricted-use contract process and of a publicly available file constructed specifically for the Fragile Families Challenge.
The Baseline wave of data collection took place from 1998 to 2000 and includes mother and father core interviews at the birth of the study's "focal child." These interviews were conducted primarily in the hospital shortly following the focal child's birth.
At Baseline and the subsequent five waves, the core phone interviews collected data on parental relationships, parenting, health and health behaviors, family and social support, demographics, housing, use of social programs, and education and employment.
The Year 1 follow-up wave of data collection took place from 1999 to 2001 and includes mother and father core interviews around the time of the focal child's first birthday.
The Year 3 (2001-2003) and Year 5 (2003-2006) follow-up waves included mother and father core interviews, as well as primary caregiver interviews and home visits around the time of the focal child's third and fifth birthdays. The primary caregiver interviews included questions on home life and routines, health and health care, and parenting. During the home visits, interviewers administered assessments such as the Peabody Picture Vocabulary Test (PPVT) and took direct height and weight measurements. Interviewers also observed the home environment (surrounding neighborhood, interior and exterior of house/apartment) and recorded additional information about the parent's and child's affect during the home visit.
The Year 9 (2007-2010) follow-up wave included mother and father core interviews, as well as primary caregiver interviews, home visits, and interviewer observations, similar to the previous two waves. We also conducted a short interview with each focal child around their ninth birthday, collecting information on their relationships with parents and siblings, school connectedness, task completion, self-concept, and home routines.
The Year 15 follow-up wave of data collection took place from 2014 to 2017. Activities included primary caregiver and teen interviews (mostly conducted by phone), as well as home visit activities and interviewer observations conducted with a subset of ~1,000 teens. In addition to re-collecting data on the topics covered in the previous five waves, the phone interviews included new measures of focal children's education and school experiences, risky behaviors such as sexual activity and substance use, peer interactions, and pro-social behaviors. The home visits included height, weight, and waist circumference measurements and interviewer observations of the home environment.
Data files containing measures from the interviews, home visits, and observations are available for download through the OPR Data Archive.
Restricted-Use Contract Data
Additional files are available through our restricted-use contract data application process. These are described below, and more information can be found here.
Residential context files, including:
A geographic file, with the focal child's birth city, mother's and father's state of residence at each interview, and the sampling stratum and primary sampling unit (PSU).
A census tract measures file, with data on demographic, housing, and income characteristics from the U.S. Census for the census tracts of mothers’ and fathers’ residence.
A labor market and macroeconomic file, with data on local employment and national consumer confidence.
An Opportunity Insights file, with county-level measures of intergenerational mobility and characteristics correlated with intergenerational mobility from Chetty and Hendren's Opportunity Insights.
A gun violence file, with incident-level data on time and location of gun violence in 2014 to 2017 from the Gun Violence Archive.
A Uniform Crime Reports file, with county-level crime rates (counts/county population) for all crimes, violent crimes, and property crimes from the FBI Uniform Crime Reports database.
School context files, including:
NCES - School-level characteristics, including school type, pupil-to-teacher ratio, the school’s racial composition, Title I funding, percent of students receiving free or reduced-price lunch, and more, from the National Center for Education Statistics Common Core of Data.
SEDA - School district-level measures from 2009 to 2013 of educational inequality and characteristics correlated with educational inequality from the Stanford Education Data Archive.
CRDC - School- and school district-level measures from 2009, 2011, and 2013 of school resources, disciplinary outcomes, and other characteristics related to school environments from the Civil Rights Data Collection.
Biological and Health files, including:
Medical records data for mothers and children from the birth hospitalization record.
A FFCWS genetic data appendage based on the saliva samples collected from mothers and children at the Year 9 wave.
A genotype array file, with genotype data on focal children. This genetic information can be linked to a limited subset of phenotype variables. To download this file, please request access through the dbGaP data archive here. You do not need to complete a restricted data application with CRCW. (Note: This file cannot be merged with other restricted-use contract data or with the public FFCWS data.)
FF Challenge Files
The FF Challenge data files are associated with the predictive modeling stage of the FF Challenge competition, held in Summer 2017. These files are now being provided so that other data users may replicate and extend what participants did in the Challenge.
The Challenge files include:
- readme.txt - a text file with descriptions of the remaining files
- background.csv - data from birth through Year 9, as a .csv file
- background.dta - data from birth through Year 9, as a Stata .dta file
- codebook_FFChallenge.txt - merged codebook for all waves
- prediction.csv - an example submission that predicts the mean of the training data for all outcomes
- train.csv - outcomes for training observations (half the sample)
- test.csv - outcomes for test observations
- leaderboard.csv - outcomes for observations in the leaderboard set, with missing outcomes imputed (as provided via Codalab)
- leaderboardUnfilled.csv - outcomes for observations in the leaderboard set (not imputed)
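
As a rough illustration of how a mean-of-training-data submission like prediction.csv could be reproduced, the sketch below computes the mean of each outcome column in a training file and assigns those means to every observation. It is a minimal sketch, not the Challenge organizers' code. The identifier column name (`challengeID`), the outcome column names (`gpa`, `eviction`), and the use of `NA` for missing values are assumptions made for illustration; check readme.txt and codebook_FFChallenge.txt for the actual layout.

```python
import csv
import io

ID_COLUMN = "challengeID"  # assumed name of the identifier column


def read_rows(handle):
    """Read an open CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def column_means(rows, id_column=ID_COLUMN, missing=("", "NA")):
    """Mean of each non-ID column, skipping entries treated as missing."""
    if not rows:
        return {}
    means = {}
    for name in rows[0]:
        if name == id_column:
            continue
        values = [float(r[name]) for r in rows if r[name] not in missing]
        means[name] = sum(values) / len(values) if values else None
    return means


def mean_baseline(ids, means, id_column=ID_COLUMN):
    """One prediction row per ID, each outcome filled with its training mean."""
    return [{id_column: i, **means} for i in ids]


# Tiny in-memory stand-in for train.csv; the values are made up.
sample_train = io.StringIO(
    "challengeID,gpa,eviction\n"
    "1,3.0,0\n"
    "2,NA,1\n"
    "3,2.0,NA\n"
)
train_rows = read_rows(sample_train)
means = column_means(train_rows)
predictions = mean_baseline(["1", "2", "3", "4"], means)
```

With the real files, `read_rows` would be called on `open("train.csv", newline="")`, and the list of IDs would come from the identifier column of background.csv.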