Frequently Asked Questions

  1. What kind of corpus is MICASE?
  2. How do you define 'academic speech'?
  3. What are the components of the MICASE archive?
  4. How was the data collected?
  5. What is the breakdown of the corpus contents according to speaker and speech event types?
  6. What kind of transcription and mark-up system are you using?
  7. Who will have access to the data, and how will it be made available?
  8. How can I listen to the sound files?
  9. What kind of research are people going to do with this data and who will it benefit?

1.      The MICASE Corpus:

The MICASE corpus is a spoken language corpus of approximately 1.8 million words (200 hours) focusing on contemporary university speech within the microcosm of the University of Michigan, in Ann Arbor, Michigan. This is a typical large public research university with about 37,000 students, approximately one-third of whom are graduate students. Speakers represented in the corpus include faculty, staff, and all levels of students, and both native and non-native speakers.

2.       Academic Speech:

Academic speech is defined as speech that occurs in academic settings. In other words, it is not pre-defined as something like "scholarly discussion." In academic settings, we might, for example, find such speech acts as jokes, confessions, and personal anecdotes, as well as the more prototypical definitions, explanations and intellectual justifications. Therefore, the real question is how we define "academic setting." We have taken an open yet circumscribed stance on this. The corpus includes the following speech events: small and large lectures (62), public interdisciplinary or departmental colloquia (13), discussion sections (9), student presentations (11), seminars (8), undergraduate lab sessions (8), lab group and other meetings (6), one-on-one tutorials (3), office hours (8), advising consultations (5), dissertation defenses (4), study groups (8), interviews (3), campus/museum tours (2), and service encounters (2). On the other hand, we have excluded certain events that occur on campus but would not be significantly different if they had occurred in other locations. For example, we did not record food-ordering sequences in university food outlets or discussions among those who work in the university's plant or grounds departments. We do not consider these speech events central or particular to a university community's educational mission.
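As a quick check on the figures above, the following sketch tallies the per-type event counts and computes each type's share of the total. The counts are copied from the list in this answer; the shortened dictionary keys and the helper name `event_shares` are illustrative choices, not part of MICASE itself.

```python
# Event counts copied from the list in answer 2 (keys are shortened labels).
EVENT_COUNTS = {
    "small and large lectures": 62,
    "colloquia": 13,
    "discussion sections": 9,
    "student presentations": 11,
    "seminars": 8,
    "undergraduate lab sessions": 8,
    "lab group and other meetings": 6,
    "one-on-one tutorials": 3,
    "office hours": 8,
    "advising consultations": 5,
    "dissertation defenses": 4,
    "study groups": 8,
    "interviews": 3,
    "campus/museum tours": 2,
    "service encounters": 2,
}


def event_shares(counts):
    """Return each event type's fraction of the total number of events."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_events = sum(EVENT_COUNTS.values())
shares = event_shares(EVENT_COUNTS)

print(f"{len(EVENT_COUNTS)} event types, {total_events} events in total")
for name, n in sorted(EVENT_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{name:30s} {n:3d}  {shares[name]:6.1%}")
```

The tally comes to 152 events across fifteen types, which agrees with the fifteen event types mentioned in answer 4 and the 152 XML files mentioned in answer 7.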

3.       Components of the MICASE Archive:

There are two main components.
 
  1. Recordings of speech:
    on Digital Audio Tape, as compressed MPEG Layer 3 (MP3) files, and as WAV files.
  2. Text transcripts:
    created using Author/Editor, an SGML text editor, and saved as ASCII text files. They have since been converted to XML format.

4.       Recording Methodology:

Because MICASE aimed to record a wide range of academic speech, our sampling goals spanned fifteen different types of speech events (see 2. above) and four major academic divisions within those types (humanities, social sciences, physical sciences and engineering, and biological sciences). We adopted stratified random sampling as our preferred sampling method. Each recording is identified by a code made up of the speech event type, a pre-assigned number indicating the academic discipline, two letters representing the majority of participants in the event (e.g. junior undergraduate, senior faculty, staff), and a final three-digit sequence number that tracks the chronological order in which the tape was recorded. For example, transcript number LEL115SU015 is a recording of a large lecture (LEL) in anthropology (115), at the senior undergraduate level (SU), and is the 15th speech event recorded for MICASE.
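The structure of a transcript number can be illustrated with a short parser. This is a minimal sketch, not an official MICASE tool: the field widths (three letters, three digits, two letters, three digits) are an assumption inferred from the single example LEL115SU015, and the names `TranscriptId` and `parse_transcript_id` are hypothetical.

```python
import re
from typing import NamedTuple


class TranscriptId(NamedTuple):
    event_type: str    # e.g. "LEL" for a large lecture
    discipline: str    # pre-assigned discipline number, e.g. "115"
    participants: str  # majority participant level, e.g. "SU"
    sequence: int      # chronological recording number, e.g. 15


# Assumed layout, inferred only from the example LEL115SU015:
# 3 letters + 3 digits + 2 letters + 3 digits.
_ID_PATTERN = re.compile(r"([A-Z]{3})(\d{3})([A-Z]{2})(\d{3})")


def parse_transcript_id(transcript_id: str) -> TranscriptId:
    """Split a transcript number into its four components."""
    match = _ID_PATTERN.fullmatch(transcript_id.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized transcript number: {transcript_id!r}")
    event_type, discipline, participants, sequence = match.groups()
    return TranscriptId(event_type, discipline, participants, int(sequence))


parsed = parse_transcript_id("LEL115SU015")
print(parsed)
```

Mapping the codes to readable labels (for instance, 115 to anthropology) would need the full code tables, which are not reproduced here.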

All recordings were made with a digital audio tape recorder with two external stereo microphones, and at selected events, a video recorder. Two researchers attended most speech events in order to identify speakers and facilitate transcription by taking field notes about nonverbal contextual information; however, in small groups (e.g. advising sessions, office hours, study groups) where an observer's presence would have been intrusive, the research assistants left the room after the equipment was set up. All speech was recorded with written consent from the major speakers and verbal consent from other participants. Demographic information (gender, age group, university position, and native language) was collected from each speaker on a form distributed at the end of each event. The speaker information is included in the header of each transcript and is also entered into a separate database. All DAT recordings were captured and stored as MP3 format sound files for use with our computer transcription program, SoundScriber, and have also been re-digitized as WAV format files and transferred to data CD for archival purposes.

5.      Breakdown of MICASE by speaker and speech event types:

The corpus was designed to be balanced, as much as possible, across several categories of speech events, including monologic and interactive speech and speech events from all of the major academic divisions within the university (with the exception of the professional schools, i.e., medical, dental, business, and law). Furthermore, an attempt was made to get approximately equal amounts of speech from male and female speakers within each academic division. Students and faculty are both represented in the corpus, as are native and non-native speakers. For a detailed breakdown of the word counts and percentages of speech by each category of speaker and within the two major speech event categories, see the MICASE statistics page.

 

6.       Transcription conventions:

Our transcription conventions and markup system are intended to keep the transcripts readable while including enough detail to ensure adequate comprehension from the text of the transcript alone. In practice, we use standard orthography for most words, except in select situations where standard conventions may cause confusion, and for a limited number of lexicalized abbreviations and grammatical constructions (e.g., cuz, gonna, hafta, sorta, and several others). We do not use standard punctuation, but instead mark pauses of varying lengths with commas, periods and ellipses. We also use question marks to identify phrases that function pragmatically as questions.

All backchannel cues and hesitation or filler words were transcribed using a set number of normalized orthographic representations that disregard minor phonetic variations. These, like overlaps and interruptions, are situated in a way that illustrates their sequential occurrence, but still indicates which speaker holds the floor.

We used a customized set of SGML tags adapted from the Text Encoding Initiative (TEI) conventions. Additionally, all of the speaker demographic information and recording information is tagged in the header. Our transcripts were first created using Author/Editor, an SGML text editing program, and have since been converted to XML.

For a complete description of the transcription conventions and mark-up tags, see the transcription conventions page.

7.       Access to MICASE:

MICASE was designed to be freely available to as many researchers and students as possible. To that end, the entire corpus has been accessible on the Web since May 2002, on a site with a searchable interface much like a concordance program. The original online search interface was replaced by a new, enhanced version in May 2007 (see the MICASE Search Engine Homepage). For those who prefer to use the corpus off-line with their own software, the 152 XML files are also available for purchase on CD-ROM or as a downloadable zip file; see Ordering MICASE.
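For readers unfamiliar with concordance programs, the sketch below shows the basic keyword-in-context (KWIC) idea on plain text. It is a generic illustration only and says nothing about how the MICASE search engine is implemented; the function name `kwic` and the sample sentence are invented for the example and are not taken from the corpus.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`.

    Each line shows up to `width` characters of left and right context,
    padded so that the keyword lines up in the same column.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}}[{match.group(0)}]{right:<{width}}")
    return lines


# Invented sample sentence, written in the style of the spelling conventions.
SAMPLE = (
    "okay so um, what we're gonna do today is sorta review, "
    "and then we're gonna look at the data."
)

hits = kwic(SAMPLE, "gonna", width=20)
for line in hits:
    print(line)
```

Running the same function over the text content of the off-line transcript files would first require extracting that text from the XML, a step not shown here.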

 

8.       Availability of Sound Files

As of late fall 2004, many of the MICASE sound files are available in one of two formats. A selection of about 70 files is available on the web as streamed RealAudio files. (These can be found here: http://www.lsa.umich.edu/eli/micase/Audio/index.htm) Another set of files is available for purchase as MP3 audio files on CD-ROM. (Information available at: http://www.lsa.umich.edu/eli/micase/orderaudio.htm) Speaker consent restrictions prevent us from making the entire set of recordings publicly available, and some of the available files have been anonymized by manipulating the relevant segments of the audio while preserving the timing and prosody of the utterance.

In addition, we continue to make some sound files available to bona fide academic researchers by special arrangement. Researchers who have particular speech events that they would like to purchase or listen to may contact us with a specific list of files and we will process such requests on a case-by-case basis as schedules permit.

 

9.       Research Program:

The MICASE team have an evolving research plan which is summarized here. We also have a list of many MICASE-based publications and presentations.