HomeBank English deBarbaro Cry Corpus

Kaya de Barbaro
Department of Psychology
University of Texas at Austin


Xuewen Yao
Electrical and Computer Engineering
University of Texas at Austin

Mckensey Johnson
Department of Psychology
University of Texas at Austin

Megan Micheletti
Department of Psychology
University of Texas at Austin

Participants: 21
Recordings: 44129 5-second recordings
Type of Study: naturalistic
Location: Austin TX, USA
Media type: audio
DOI: doi:10.21415/Y4J3-QG85

Media folder

Citation Information

Yao, X., Micheletti, M., Johnson, M., Thomaz, E., & de Barbaro, K. (2022) Infant crying detection in real-world environments. Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/9746096

Micheletti, M., Yao, X., Johnson, M., & de Barbaro, K. (2022) Validating a model to detect infant crying from naturalistic audio. Behavior Research Methods. https://doi.org/10.3758/s13428-022-01961-x

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Corpus Description

This dataset was created to support the detection and classification of infant crying and fussing in naturalistic environments. It consists of LENA recordings from 22 infants aged 1 to 10 months. Parents were instructed to place the LENA recorder in a vest worn by the infant and to record up to 72 hours of audio in their home, including two weeknights and a weekend.

A team of trained research assistants annotated the raw audio data according to best practices in the behavioral sciences (inter-rater reliability kappa score: 0.8469). Crying is typically very loud, rhythmic, harsh, and sudden, and may feature wails or grunts. Fussing, on the other hand, is a continuation of negative vocalizations that is less intense than crying. It features larger gaps between vocalizations as well as quick breathing and closed-mouth noises.
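For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators. It is illustrative only: the description above reports a kappa score without naming the variant or the unit of comparison, so the choice of Cohen's kappa, the function name `cohens_kappa`, and the toy labels are all assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (1 = crying, 0 = not crying); not taken from the corpus.
rater_a = [1, 1, 0, 0, 1, 0, 0, 0]
rater_b = [1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(rater_a, rater_b)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because "not crying" dominates naturalistic recordings.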

Cries had a minimum duration of 3 seconds, and cries within 5 seconds of one another were combined. Fusses did not have a minimum duration. All neighboring crying and fussing sounds occurring within 5 seconds of one another were combined. Note that the labels do not distinguish between fussing and crying: a “crying” label corresponds to either. All other sounds and silence were collapsed into a second category labelled “not crying.” To facilitate the development of audio recognition models, all crying episodes were cut into five-second segments (with a four-second overlap between neighboring segments). An equal number of five-second segments of non-cry data was randomly selected from the same recording. The complete dataset totals 61.3 hours of labelled data, with over seven hours of unique annotated crying data. The names of the audio files indicate the participant number, the sample number, and whether the segment is “crying” or “not crying.”
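The merging and segmentation rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function names `merge_episodes` and `segment_episode`, the (start, end) tuple representation, the inclusive treatment of "within 5 seconds", and the decision to drop any remainder shorter than five seconds are assumptions. The one-second hop follows from five-second windows with a four-second overlap.

```python
def merge_episodes(episodes, max_gap=5.0):
    """Merge (start, end) episodes, in seconds, separated by at most max_gap."""
    merged = []
    for start, end in sorted(episodes):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous episode: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def segment_episode(start, end, window=5.0, overlap=4.0):
    """Cut one episode into fixed-length windows; hop = window - overlap."""
    hop = window - overlap
    segments = []
    t = start
    while t + window <= end:
        segments.append((t, t + window))
        t += hop
    return segments


# Toy annotations in seconds; not taken from the corpus.
episodes = [(10.0, 14.0), (17.0, 22.0), (40.0, 47.0)]
merged = merge_episodes(episodes)
segments = [seg for ep in merged for seg in segment_episode(*ep)]

# Consistency check on the stated totals: 44129 five-second files, in hours.
total_hours = 44129 * 5 / 3600
```

Because neighboring segments share four of their five seconds, the total duration of the segments is much larger than the duration of the unique audio they cover, which is why 61.3 hours of labelled data contains only a little over seven hours of unique crying.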

All infants in this dataset are from Austin, Texas. The sample is generally middle class, with a range of annual family incomes (n=1 under $25,000, n=5 $50,000-$74,999, n=7 $75,000-$99,999, n=9 $100,000+) and above-average maternal education levels (n=2 high school or less, n=4 some college, n=5 college, n=10 graduate degree or higher). Most parents are married (n=20). Infant race is predominantly White (n=13), followed by multiracial (n=7) and Hispanic/Latinx (n=2). All infants heard mostly English at home and had no known vision or hearing issues at birth. These data were collected in participants' homes by the University of Texas at Austin, where the data continue to be analyzed. Further details of the project are available on our lab website. Please contact Kaya de Barbaro directly to discuss further aspects of the sample design, annotation, and analysis.


Thanks to Lara Andres, Nina Nariman, Brooke Benson, and Kara Kaur for their work on this corpus.

Usage Restrictions

Please notify Dr. de Barbaro before using the data.