deBarbaroCry Corpus

Kaya de Barbaro
Department of Psychology
University of Texas at Austin


Xuewen Yao
Electrical and Computer Engineering
University of Texas at Austin

Mckensey Johnson
Department of Psychology
University of Texas at Austin

Megan Micheletti
Department of Psychology
University of Texas at Austin

Participants: 21
Recordings: 44129 5-second recordings
Type of Study: naturalistic
Location: Austin TX, USA
Media type: audio
DOI: xxx

Media folder

Citation Information

Yao, X., Micheletti, M., Johnson, M., & de Barbaro, K. (in review). Classification of infant crying in real-world home environments using deep learning. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Micheletti, M., Yao, X., Johnson, M., & de Barbaro, K. (in prep). A comparison of automated methods for detecting naturalistic infant crying.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Corpus Description

This dataset was created to support the detection and classification of infant crying and fussing in naturalistic environments. It consists of LENA recordings from 22 infants ranging in age from 1 to 10 months. Parents were instructed to place the LENA recorder in a vest worn by the infant and to record up to 72 hours of audio in their home, including two weeknights and a weekend.

A team of trained research assistants annotated the raw audio data according to best practices in the behavioral sciences (inter-rater reliability kappa score: 0.8469). Crying is typically very loud, rhythmic, harsh, and sudden, and may feature wails or grunts. Fussing, by contrast, is a continuation of negative vocalizations that is less intense than crying; it features larger gaps between vocalizations as well as quick breathing and closed-mouth noises. Annotators were trained to include only those instances of crying that lasted at least 3 seconds and those instances of fussing that lasted at least 5 seconds. Additionally, all neighboring crying and fussing sounds occurring within 5 seconds of one another were combined into a single crying annotation. Note that the labels do not distinguish between fussing and crying; a cry label corresponds to either one.

To facilitate the development of audio recognition models, all fuss/cry episodes were cut into 5-second segments with a 4-second overlap between neighboring segments. An equal length and number of 5-second segments of non-cry data was randomly selected from the same recording. The complete dataset totals 61.3 hours of labelled data, including over seven hours of unique annotated crying.
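The segmentation scheme above (5-second windows with a 4-second overlap, i.e. a 1-second hop between window starts) can be sketched as follows. This is an illustrative reconstruction, not the corpus's actual processing code; the function name and time units are assumptions.

```python
def segment_episode(start, end, win=5.0, hop=1.0):
    """Cut a labelled cry/fuss episode [start, end), in seconds, into
    fixed 5-second windows with a 4-second overlap (1-second hop)."""
    segments = []
    t = start
    while t + win <= end:
        segments.append((t, t + win))
        t += hop
    return segments

# An 8-second episode yields 4 overlapping 5-second segments:
print(segment_episode(0.0, 8.0))
# [(0.0, 5.0), (1.0, 6.0), (2.0, 7.0), (3.0, 8.0)]
```

Note that under this scheme an episode shorter than 5 seconds yields no segments, so short annotated episodes would need padding or a different window policy.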

The audio file names encode the participant number, the sample number, and whether the segment contains crying.
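A file name in this convention could be parsed roughly as below. The exact format is not specified in this description, so the pattern used here (e.g. "P05_0123_cry.wav") is purely hypothetical and would need to be adapted to the actual file names in the media folder.

```python
import re

# Hypothetical naming pattern: participant number, sample number,
# and a cry/notcry label. Adjust to the corpus's real convention.
NAME_RE = re.compile(r"P(?P<participant>\d+)_(?P<sample>\d+)_(?P<label>cry|notcry)\.wav")

def parse_name(filename):
    """Return (participant, sample, is_cry) from an assumed file name."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename}")
    return int(m["participant"]), int(m["sample"]), m["label"] == "cry"

print(parse_name("P05_0123_cry.wav"))  # (5, 123, True)
```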

All infants in this dataset are from Austin, Texas. The sample is generally middle class, with a range of annual family incomes (n=1 under $25,000, n=5 $50,000-$74,999, n=7 $75,000-$99,999, n=9 $100,000+) and above-average maternal education levels (n=2 high school or less, n=4 some college, n=5 college, n=10 graduate degree or higher). Most parents are married (n=20). Infant race is predominantly White (n=13), followed by multiracial (n=7) and Hispanic/Latinx (n=2). All infants heard majority English at home and had no known vision or hearing issues at birth. These data were collected in participants' homes by researchers at the University of Texas at Austin, where the data continue to be analyzed. Further details of the project are available on our lab website. Please contact Kaya de Barbaro directly to discuss further aspects of the sample design, annotation, and analysis.


Thanks to Lara Andres, Nina Nariman, Brooke Banson, and Kara Kaur for their work on this corpus.

Usage Restrictions

Please notify Dr. de Barbaro before using the data.