Transcription and Spelling Conventions

A. TRANSCRIPTION AND MARK-UP CONVENTIONS
SGML TAG or SYMBOL
MEANING/DESCRIPTION
APPEARANCE IN ON-LINE TRANSCRIPTS (HTML VERSION)
SPEAKER ID
<U WHO=S1>, <U WHO=S2>, etc. Speaker IDs, assigned in the order they first speak. S1: at the beginning of each turn or interruption/backchannel.
<U WHO=SU>, <U WHO=SU-f>, <U WHO=SU-m> Unknown speaker, without and with gender identified SU:
SU-f, SU-m
<U WHO=SU-1> Probable but not definite identity of speaker SU-1:
<SS> Two or more speakers, in unison (used mostly for laughter) SS:
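As a rough illustration of the speaker ID scheme above, the following Python sketch pulls the WHO values out of a transcript fragment and maps each to the category given in the table. The function names and the category strings are hypothetical, and the regular expression assumes the unquoted `<U WHO=...>` form shown above.

```python
import re


def speaker_ids(sgml: str) -> list[str]:
    """Return the WHO values of <U WHO=...> tags, in order of appearance."""
    return re.findall(r"<U WHO=([\w-]+)>", sgml)


def describe_speaker(who: str) -> str:
    """Map a speaker ID to the category described in the table above."""
    if who == "SS":
        return "two or more speakers in unison"
    if who == "SU":
        return "unknown speaker"
    if who in ("SU-f", "SU-m"):
        return "unknown speaker, gender identified"
    if re.fullmatch(r"SU-\d+", who):
        return "probable but not definite speaker identity"
    if re.fullmatch(r"S\d+", who):
        return "identified speaker"
    raise ValueError(f"unrecognized speaker ID: {who!r}")
```

For example, `speaker_ids("<U WHO=S1>okay<U WHO=SU-f>mhm")` yields the two IDs `S1` and `SU-f`, which `describe_speaker` then labels as an identified speaker and an unknown speaker with gender identified.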
PAUSES
<PAUSE DUR=:05> Pauses of 4 seconds or longer are timed to the nearest second. <P: 05>
, Comma indicates a brief (1-2 second) mid-utterance pause with non-phrase-final intonation contour. ,
. Period indicates a brief pause accompanied by an utterance final (falling) intonation contour; not used in a syntactic sense to indicate complete sentences. .
... Ellipses indicate a pause of 2-3 seconds ...
OVERLAPS
<OVERLAP>...</OVERLAP> This tag encloses speech that is spoken simultaneously, either at the ends and beginnings of turns, or as interruptions or backchannel cues in the middle of one speaker's turn.
All overlaps are approximate and shown to the nearest word; a word is generally not split by an overlap tag.
Text of overlapping speech is in blue.
BACKCHANNEL CUES and FAILED INTERRUPTIONS
Embedded utterance (<U> tag within a <U> tag) Backchannel cues from a speaker who doesn't hold the floor and unsuccessful attempts to take the floor are embedded within the current speaker's turn, and not shown as a separate line/paragraph. [S3: Text of embedded speech is in orange and surrounded by orange square brackets.]
Embedded and overlapped utterance (<OVERLAP> tag within an embedded utterance) Backchannel cues or unsuccessful interruptions that overlap with the main speaker's speech. [S3: Text of embedded speech that is overlapped is in blue and surrounded by orange speaker ID and square brackets.]
LAUGHTER
<EVENT DESC=LAUGH> or <EVENT DESC=LAUGH WHO=S2> All laughter is marked. The speaker ID is not marked if the current speaker laughs. <LAUGH>, <S8 LAUGH>, <SS LAUGH>, etc.
CONTEXTUAL EVENTS
<EVENT DESC="WRITING ON BOARD"> Various contextual (non-speech) events are noted, usually only when they affect comprehension of the surrounding discourse. <WRITING ON BOARD>
<EVENT DESC="APPLAUSE"> <APPLAUSE>
<EVENT DESC="AUDIO DISTURBANCE">, <EVENT DESC="BACKGROUND NOISE"> <AUDIO DISTURBANCE>, <BACKGROUND NOISE>
<EVENT DESC="SOUND EFFECT">, <EVENT DESC="GASP"> <SOUND EFFECT>, <GASP>
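The correspondence between the SGML pause and event tags and their on-line appearance can be sketched in a few lines of Python. This is only an illustration inferred from the examples listed above, not the conversion actually used to produce the HTML transcripts; the function name `to_display` is hypothetical.

```python
import re


def to_display(sgml: str) -> str:
    """Rewrite PAUSE and EVENT tags into the on-line display form."""
    # <PAUSE DUR=:05>  ->  <P: 05>
    text = re.sub(r"<PAUSE DUR=:(\d+)>", r"<P: \1>", sgml)

    def event(match: re.Match) -> str:
        desc = match.group("desc").strip('"')
        who = match.group("who")
        # A speaker ID, when present, precedes the description.
        return f"<{who} {desc}>" if who else f"<{desc}>"

    # DESC may be quoted (multi-word) or bare (single word); WHO is optional.
    pattern = r'<EVENT DESC=(?P<desc>"[^"]*"|\w+)(?: WHO=(?P<who>[\w-]+))?>'
    return re.sub(pattern, event, text)
```

Under these assumptions, `<EVENT DESC="WRITING ON BOARD">` becomes `<WRITING ON BOARD>`, and `<EVENT DESC=LAUGH WHO=S2>` becomes `<S2 LAUGH>`.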
READING PASSAGES
<SEG TYPE="READING">.....</SEG> Used when part of an utterance is read verbatim. <READING>.....</READING>
FOREIGN WORDS
<FOREIGN>.....</FOREIGN> Used for non-English words or phrases. Italics
e.g.: the mother says c'est quoi? and Annika says to parce que eh and then,...
PRONUNCIATION VARIATIONS
<SEG TYPE="PRON" SUBTYPE="/seltik/">Celtic</SEG> Used when an unexpected pronunciation is used that would affect comprehension of the surrounding discourse.
Dialect or other phonological variations are generally not represented.
Pronunciation guide follows the word
e.g.: ...they asked the librarian for pictures of old Celtic <PRON: /seltik/> uniforms the basketball team, and it turns out that the project was he was supposed to find Celtic <PRON: /keltik/> costumes.
<SIC>...</SIC> Used when a speaker makes a mistake without self-correcting, and the error might otherwise appear to be a transcribing error. (sic) follows the word.
e.g.: despite the fact that that was the era of Women's Liberation like i say on the cover of Newsweek, and Gloria Steinman (sic) and uh Betty Friedan...
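The display forms for reading passages, pronunciation guides, and (sic) described above can be approximated with simple substitutions. The Python sketch below is an illustration only: `render_segments` is a hypothetical name, and the patterns assume that the tags are written exactly as in the examples and that a tag of one type is not nested inside another tag of the same type.

```python
import re


def render_segments(sgml: str) -> str:
    """Rewrite PRON, READING, and SIC mark-up into the on-line display form."""
    # <SEG TYPE="PRON" SUBTYPE="/seltik/">Celtic</SEG> -> Celtic <PRON: /seltik/>
    text = re.sub(
        r'<SEG TYPE="PRON" SUBTYPE="([^"]*)">(.*?)</SEG>',
        r"\2 <PRON: \1>",
        sgml,
    )
    # <SEG TYPE="READING">...</SEG> -> <READING>...</READING>
    text = re.sub(
        r'<SEG TYPE="READING">(.*?)</SEG>',
        r"<READING>\1</READING>",
        text,
        flags=re.DOTALL,
    )
    # <SIC>word</SIC> -> word (sic)
    return re.sub(r"<SIC>(.*?)</SIC>", r"\1 (sic)", text)
```

The pronunciation substitution runs first so that a pronunciation guide inside a reading passage is resolved before the enclosing `<SEG>` is matched.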
UNCERTAIN or UNINTELLIGIBLE SPEECH
(xx) Two x's in parentheses indicate one or more words that are completely unintelligible.
e.g.: i don't (xx) whole (xx) analysis it just struck me...
(words) Words surrounded by parentheses indicate the transcription is uncertain.
e.g.: lemme not write it that way (lest it be confused) with C syntax...


NAMES
When participants' names occur in a recording, they are changed to pseudonyms in the transcript, except in the case of most public colloquia (i.e. COL-prefixed files). In some cases, names of non-present people referred to in the recording are also changed. There is no SGML marking for names.



B. SPELLING CONVENTIONS

RULE or GUIDELINE
EXAMPLES
GENERAL Standard orthography is used for most words, even though they may not be fully pronounced, may be pronounced with a foreign accent, etc. In general, phonologically reduced forms are not represented, except as noted below.
CAPITALIZATION Only proper nouns (names, departments, course titles, organizations, etc.) are capitalized (in addition to acronyms; see below).

Neither the beginnings of turns nor the pronoun 'i' are capitalized.

Dr Hales received his M-S and B-S degrees at Stanford in nineteen eighty-two. his PhD at Princeton in eighty-six under the Harold W Dodds Honorific Fellowship...

oh, i i think i know what you're getting to.

FILLED PAUSES, BACKCHANNEL CUES, EXCLAMATIONS, etc. All hesitation and filler words, backchannel cues, and transcribable exclamations are spelled out, as shown on the right. Hesitation/Filler Words/Backchannels:
hm, hm', huh, mm, mhm, uh, um, mkay


Yes/No Responses:
yes: mhm, mm, okey-doke, okey-dokey, uhuh, yeah, yep, yuhuh
no: uh'uh, huh'uh, 'm'm


Exclamations/Doubt/Misc.:
ach, ah, ahah, gee, jeez, oh, ooh, oop, oops, tch, ugh, uh'oh, whoa, yay
CONTRACTIONS and LEXICALIZED REDUCED FORMS All standard contractions of is, am, are, had, have, will, would, and not are represented, including [noun + has been/have been/is]. i'd, i've, i'm, i'll, she's, she'll, he's, they've, etc.
that'll, it'll, there're etc.

Different forms of modals + have are represented. coulda, could've, couldn't, couldn've, couldna, woulda, would've, wouldn't, wouldn've, wouldna, shoulda, should've, shouldn't, shouldn've, shouldna

Lexicalized phonological reductions are limited to those listed on the right. betcha, cuz, 'em (=them), gimme, gotta, hafta, kinda, lookit (as vocative only), lotsa, lotta, oughta, sorta, wanna
ACRONYMS, ABBREVIATIONS, LETTERS AS VARIABLES Acronyms are written in all caps.
Three commonly abbreviated titles are left as abbreviations, but without periods. Dr, Mr, Mrs (not spelled out)
An acronym pronounced as a word is run together as one word. NASA, TOEFL
When an acronym is spelled out letter by letter, it appears in all caps with a hyphen between letters. C-I-A, F-B-I, E-L-I, L-S-and-A
Exception: PhD (no hyphens, no period)

Letters used as variables in math and science are written in all caps with hyphens between modifying or adjoining elements. X-Y axis, N-squared, X-to-the-N-minus-one
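A minimal Python sketch of the acronym rules above may help when normalizing transcripts by hand. Both function names are hypothetical, and the sketch covers only plain letter sequences; forms such as L-S-and-A, where a spoken word appears between the letters, are outside its scope.

```python
def spell_out(acronym: str) -> str:
    """Transcribe an acronym that the speaker says letter by letter."""
    if acronym == "PhD":  # documented exception: no hyphens, no period
        return acronym
    return "-".join(acronym.upper())


def as_word(acronym: str) -> str:
    """Transcribe an acronym that the speaker pronounces as a word."""
    return acronym.upper()
```

Here `spell_out("CIA")` gives `C-I-A`, while `as_word("nasa")` gives `NASA`.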
HYPHENS Standard hyphenation rules, as in the Chicago Manual of Style, apply where they exist. pre-med, pre-calc, pre-law, mid-thirties, mid-nineteen-ninety-nine, pre-Christian, non-Euclidean, non-native
NUMBERS All numbers are fully spelled out as words.
Standard hyphenation rules apply, with some additional guidelines: page numbers, course numbers, and room numbers are all hyphenated.
nineteen ten
nineteen twenty-nine
page one-fifty-seven
Poli Sci one-sixty
room thirty-twelve
REPETITIONS and REPAIRS All repetitions of a word, partial word or phrase are transcribed. it's no longer than a than a, calendar year...

Truncated or cut-off words have a hyphen at the end of the last audible sound/letter. so, come on up, grab yourself a ins- implement of destruction.

An underscore at the end of a word indicates a false start in which a whole word is spoken but then the speaker re-starts the phrase. well, it will be_ it's sort of_ it's a management human-resource kind of job...
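Because truncated words and false starts carry explicit end-of-word markers, they are straightforward to locate in a transcript. The Python sketch below shows one possible way to do so; the pattern and function names are hypothetical, and the patterns assume the marker is followed by whitespace or the end of the text, so that word-internal hyphens (as in human-resource) are ignored.

```python
import re

# A word ending in a hyphen: the speaker cut the word off.
TRUNCATED = re.compile(r"[A-Za-z']+-(?=\s|$)")
# A word ending in an underscore: a whole word spoken, then a re-start.
FALSE_START = re.compile(r"[A-Za-z']+_(?=\s|$)")


def truncated_words(text: str) -> list[str]:
    """Return cut-off words, including the trailing hyphen."""
    return TRUNCATED.findall(text)


def false_starts(text: str) -> list[str]:
    """Return words that end a false start, including the underscore."""
    return FALSE_START.findall(text)
```

Applied to the examples above, `truncated_words` finds `ins-` in the first utterance, and `false_starts` finds `be_` and `of_` in the second.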
FOREIGN WORDS Foreign words are spelled as in the original language when it uses a roman alphabet; otherwise, an approximate phonetic transliteration is used. and see what, the Buddha, was s- was saying um, the tatha, gata, Sanskrit's a really interesting language...
PRONUNCIATION VARIATIONS As mentioned above, minor pronunciation variations are not represented in the spelling, with the exception of the contractions and lexicalized forms listed in this table.