HomeBank Accessing Data

Each HomeBank audio file has an accompanying CHAT (.cha) file. Some datasets have been transcribed in CHAT format, while others have only been automatically diarized (so far using the LENA Pro software) and not transcribed.

Some corpora are available publicly, while others require a password, to protect participant privacy. To learn how to become a HomeBank member and gain access to the password-protected data, see the Membership page. Please remember to follow the TalkBank guidelines for data-sharing and, in the case of the password-protected corpora, the HomeBank membership data use agreement.

Downloading HomeBank data

HomeBank data can be downloaded by finding a relevant corpus on the Corpus List and then clicking on the link that says "Download CHAT transcripts, ITS files, and metadata".

Browsing HomeBank data

HomeBank CHAT transcripts can be viewed while you play the audio in your browser by connecting to the Browsable Database.

Working with transcripts and media locally

  1. Download and install the CLAN program.
  2. From the corpus list, download the transcripts, ITS files and metadata and unzip them.
  3. From the same place, download the media files and place them into the transcript folders.
  4. To open a transcript, double-click it and it will open in CLAN. If there is associated media, you can play it using escape-8 for continuous playback or command-click to play a single utterance.
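
Step 3 can also be done from a terminal. The following is a minimal sketch, assuming the downloaded media folder mirrors the subfolder layout of the unzipped transcript folder. The function name place_media and both folder paths are hypothetical and not part of HomeBank or CLAN.

```shell
#!/bin/sh
# Copy the contents of a downloaded media folder into the unzipped
# transcript folder, keeping the same subfolder layout.
# Assumption: both folders share the same subfolder names.
place_media() {
    media_dir=$1        # hypothetical: where the media files were saved
    transcript_dir=$2   # hypothetical: where the transcripts were unzipped
    mkdir -p "$transcript_dir"
    # "dir/." copies the contents of dir rather than dir itself.
    cp -R "$media_dir"/. "$transcript_dir"/
}

# Example call with hypothetical folder names:
# place_media ./homebank/Password/Warlaumont ./Warlaumont
```

Existing transcript files are left in place; only files with the same name as a media file would be overwritten.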

Downloading Media

If you find it tedious to download media files one by one, you can use wget. For example, to retrieve all the *.mp3 and *.wav audio in the Warlaumont folder, you can run this one-line wget command:

$ wget -c --user=gordon --ask-password -e robots=off -r -l inf --no-remove-listing -nH --no-parent -R 'index.html*' https://media.talkbank.org/homebank/Password/Warlaumont/

In this command, "gordon" stands in for your own HomeBank username. After you enter your password, the files download into a folder called "homebank/Password/Warlaumont" inside the directory where you ran the command. The files within that folder keep the original hierarchical structure. The program will not give you an overall progress bar, but you can check the progress by watching files appear in the folder on your computer.
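
Instead of watching the folder by eye, you could count the audio files that have arrived so far. This is only a sketch: the function name count_media is hypothetical, and it assumes the download folder named above.

```shell
#!/bin/sh
# Print how many *.wav and *.mp3 files exist under a folder (recursively).
count_media() {
    find "$1" -type f \( -name '*.wav' -o -name '*.mp3' \) | wc -l | tr -d '[:space:]'
}

# Example call, run from the directory where wget was started:
# count_media homebank/Password/Warlaumont
```

Running it again after a minute or so shows whether the count is still growing.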

If you want to download only the *.wav files for a single child (for example, participant 0204) in the Warlaumont folder, the command would be:

$ wget -c --user=gordon --ask-password -e robots=off -r -l inf --no-remove-listing -nH --no-parent -R 'index.html*' -A '*.wav' https://media.talkbank.org/homebank/Password/Warlaumont/0204/
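
To fetch several children in one go, the same command can be wrapped in a loop. The sketch below assumes that each participant has a folder named after its ID, as 0204 does above. The names HB_USER, DRY_RUN, participant_url and download_wavs are hypothetical, as is any ID other than 0204.

```shell
#!/bin/sh
# Base URL taken from the single-child example; change it for another corpus.
BASE_URL="https://media.talkbank.org/homebank/Password/Warlaumont"

# Print the folder URL for one participant ID (for example 0204).
participant_url() {
    printf '%s/%s/\n' "$BASE_URL" "$1"
}

# Download the *.wav files for each participant ID given as an argument.
# HB_USER (hypothetical) holds your HomeBank username.
# Set DRY_RUN=1 to print the commands instead of running them.
download_wavs() {
    for id in "$@"; do
        url=$(participant_url "$id")
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "wget -c --user=$HB_USER --ask-password -e robots=off -r -l inf --no-remove-listing -nH --no-parent -R 'index.html*' -A '*.wav' $url"
        else
            wget -c --user="$HB_USER" --ask-password -e robots=off -r -l inf \
                --no-remove-listing -nH --no-parent -R 'index.html*' -A '*.wav' "$url"
        fi
    done
}

# Example call (0205 is a made-up ID for illustration):
# HB_USER=yourname; download_wavs 0204 0205
```

Because --ask-password is kept, wget prompts for the password once per participant folder.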

If you want to download all media from an area that has no password protection, such as the VanDam Public corpus, you could use this form:

$ wget -c -e robots=off -r -l inf --no-remove-listing -nH --no-parent -R 'index.html*' https://media.talkbank.org/homebank/VanDam-5minute/
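
Before running any of these commands, it can help to confirm that wget is available on your system. A minimal sketch follows; the function name have_cmd is hypothetical.

```shell
#!/bin/sh
# Succeed if the named program can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Example call:
# have_cmd wget && echo "wget is installed" || echo "wget is missing"
```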

Installing wget

Installation of wget depends on your system: