Oral deaf audio MacGyver: identifying speakers

September 23, 2016 – 10:30 am

Being oral deaf is like being MacGyver with audio data, except that the constant MacGyvering is normal since you do it for every interaction of every day. Posting because this seems interesting/useful to other people, although I’m personally still in the “wait, why are people so amused/surprised by this… does not everyone do this, is this not perfectly logical?” stage.

I was explaining how I use my residual hearing to sort-of identify speakers, using faculty meetings as an example. The very short version is that it’s like constructing and doing logic grid puzzles constantly. Logic grid puzzles are ones where you get clues like…

  1. There are five houses.
  2. The Englishman lives in the red house.
  3. The Spaniard owns the dog.
  4. Coffee is drunk in the green house.
  5. The Ukrainian drinks tea.
  6. The green house is immediately to the right of the ivory house.

…and so forth, and you have to figure out what’s going on by making a grid and working out that the Ukrainian can’t possibly live in the green house because they drink tea and the green house person drinks coffee, and so on.
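
For readers who like to see the grid, here is a minimal sketch of that one elimination step in Python. The remaining nationalities and house colours (Norwegian, Japanese, yellow, blue) come from the standard version of the zebra puzzle rather than from the six clues quoted above, and the variable names are made up for illustration.

```python
# Candidate grid: which house colours are still possible for each person.
# The full lists are taken from the standard zebra puzzle (an assumption here;
# only some of them appear in the clues quoted in the post).
colours = {"red", "green", "ivory", "yellow", "blue"}
people = ["Englishman", "Spaniard", "Ukrainian", "Norwegian", "Japanese"]

# Start with every colour possible for every person.
possible = {person: set(colours) for person in people}

# Clue 2: the Englishman lives in the red house, so nobody else does.
possible["Englishman"] = {"red"}
for person in people:
    if person != "Englishman":
        possible[person].discard("red")

# Clues 4 and 5: the green-house person drinks coffee and the Ukrainian
# drinks tea, so the Ukrainian cannot live in the green house.
green_house_drink = "coffee"
ukrainian_drink = "tea"
if ukrainian_drink != green_house_drink:
    possible["Ukrainian"].discard("green")

ukrainian_options = sorted(possible["Ukrainian"])
print(ukrainian_options)  # ['blue', 'ivory', 'yellow']
```

Each clue only crosses out cells; the answer is whatever survives once enough cells are gone.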

Now the long explanation, in the context of being oral deaf. Some background: I’m profoundly deaf, with some low-frequency hearing; I use hearing aids and a hybrid CI (typically the CI plus one hearing aid). Generally speaking, I can’t actually hear enough to identify people through voice alone — but I can say some things about some attributes of their voice. For instance, I can tell (to some approximation) if a singer is in-tune, in-rhythm, and in control of their voice, and I can tell the difference between a low bass and a first soprano… but I wouldn’t be able to listen to a strange song and go “oh, that’s Michael Buble!” (My hearing friends assure me that his voice is quite distinctive.)

However! When I know people and have heard their voice (along with lipreading and context) for a while, I do know that their voices do and don’t have certain attributes I can perceive. And even if I’m not using my residual hearing/audio-related gadgetry to get semantic information (i.e. the words someone is saying) because I have better alternatives in that context (interpretation, captioning) I will still want audio…

…and I will pause for a short sidebar right now, because it might seem, to hearing people, that this is the only logical course of action — that hearing more is always good for understanding more. It isn’t. Extra information is only information if it’s worth the mental effort tradeoff to turn it into useful data; otherwise, it’s noise. It’s the same reason you would probably be happy if the background noise in a loud bar went away while you were talking to your friend. That background noise is “extra data,” but it’s not informative to you and just takes more effort to process it away.

In my case — and the case of my deaf friends who prefer to not use residual hearing when there’s another access option available — we’re patching across multiple languages/modalities on a time delay, and that triggers two competing thought streams. If you want to know what that feels like, try to fluently type a letter to one friend while speaking to another on a different topic. Physically, you can do it — your eyeballs and hands are on the written letter, your ears and mouth are in the spoken conversation — but your brain will struggle. Don’t switch back and forth between them (which is what most people will immediately start to do) — actually do both tasks in parallel. It’s very, very hard. In our case, one stream is lossy auditory English as the speaker utters something, and the other is clear written English or clear ASL visuals some seconds behind it. (Assuming your provider is good. Sometimes this data stream is… less clear and accurate than one might like.) Merging/reconciling the two streams is one heck of a mental load… and since we *can* shut off the lossy auditory English as “noise” rather than “signal,” sometimes we do.

Anyway, back to the main point. Sometimes I don’t want the audio data for semantic purposes — but I want it for some other purposes, so I’ll leave my devices on. Oftentimes, this reason is “I’d like to identify who’s speaking.” Knowing who said what is often just as important as what’s being said, and this is often not information available through that other, more accessible data stream — for instance, a random local interpreter who shows up at your out-of-state conference will have no idea who your long-time cross-institutional colleagues are, so you’ll get something like “MAN OVER THERE [is saying these things]” and then “WOMAN OVER THERE [is saying these things]” and then try to look in that direction yourself for a split-second to see which WOMAN OVER THERE is actually talking.

This is where the auditory data sometimes comes in. I can sometimes logic out some things about speaker identity using my fuzzy auditory sense along with other visually-based data, both in-the-moment and short-term-memorized.

By “fuzzy sense,” I mean that auditorily — sometimes, in good listening conditions — I can tell things like “it’s a man’s voice, almost certainly… or rather, it is probably not a high soprano woman.” By in-the-moment visual data, I mean things like “the person speaking is not in my line of sight right now” and “the interpreter / the few people who are in my line of sight right now are looking, generally, in this direction.” By short-term-memorized visual data, I mean things like “I memorized roughly who was sitting where during the few seconds when I was walking into the room, but not in great detail because I was also waving to a colleague and grabbing coffee at the same time… nevertheless, I have a rough idea of some aspects of who might be where.”

So then I think, automatically, something like this: “Oh, it’s a man now, and not in my line of sight right now, and that leaves two possibilities because I quasi-memorized where everyone was sitting when I walked into the room, so using the process of elimination…”

Again, the auditory part is mostly about gross differences like bass voices vs sopranos in no background noise. Sometimes it’s not only about what I can identify about voice attributes, but also about what I can’t: “I don’t know if this is a man or a woman, but this person is not a high soprano… also, they are not speaking super fast based on the rhythm I can catch. Must not be persons X or Y.”

For instance, at work, I have colleagues whose patterns are…

  1. Slow sounds, many pauses, not a soprano
  2. Super fast, not a bass, no pauses, machine gun syllable patterns
  3. Incredibly variant prosody, probably not a woman but not obviously a bass
  4. Slower cadence and more rolling prosody with pauses that feel like completions of thoughts rather than mid-thought processing (clear dips and stresses at the ends of sentences)
  5. Almost identical to the above, but with sentences that have often not ended, but pauses are occurring and prosodic patterns are repeating and halting and repeating

These are all distinctive fingerprints, to me. Combine them with knowing where people are sitting, and I have decently high confidence in most of my guesses. And then there are people who won’t speak unless I’m actually looking at them or the interpreter or the captioning, and that’s data too. (“Why is it quiet? Oh! Person A is going to talk, and is waiting for me to be ready for them to speak.”)
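
The elimination described above can be loosely sketched as a filter over candidates. This is only an illustration under made-up assumptions: the colleagues "A" through "E", their attributes, and the function name `candidates` are all hypothetical, and the real process is far fuzzier than three yes/no fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Colleague:
    name: str              # hypothetical label, not a real person
    pace: str              # "slow" or "fast", a very coarse rhythm cue
    is_soprano: bool       # about the only pitch distinction that is reliable
    in_my_sightline: bool  # from the rough seating memory


colleagues = [
    Colleague("A", "slow", False, False),
    Colleague("B", "fast", True, False),
    Colleague("C", "slow", False, True),
    Colleague("D", "slow", True, True),
    Colleague("E", "slow", False, False),
]


def candidates(people, *, sounds_soprano: Optional[bool],
               heard_pace: Optional[str], speaker_visible: bool):
    """Keep colleagues consistent with every cue; None means 'can't tell'."""
    return [
        p.name for p in people
        if (sounds_soprano is None or p.is_soprano == sounds_soprano)
        and (heard_pace is None or p.pace == heard_pace)
        and p.in_my_sightline == speaker_visible
    ]


# "Not a high soprano, not speaking super fast, not in my line of sight."
guess = candidates(colleagues, sounds_soprano=False,
                   heard_pace="slow", speaker_visible=False)
print(guess)  # ['A', 'E']
```

No single cue identifies anyone, but each one removes a few names, and what is left is often a short enough list to guess from.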

There’s more to this. Sometimes I’ll look away and guess at what they’re saying because I know their personalities, their interests, what they’re likely to say and talk about, opinions they’re likely to hold… I build Markov models for their sentence structures and vocabularies, and I’m pretty good at prediction… there’s a lot more here, but this is a breakdown of one specific aspect of the constant logic puzzles I solve in my head as a deaf person.
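
For anyone unfamiliar with the term, a Markov model of someone's speech can be as simple as counting which word tends to follow which. The sketch below is a toy bigram version; the sample sentences and the function names `build_bigram_model` and `predict_next` are invented for illustration and are not a claim about how the prediction actually works in practice.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Count, for each word, which words have followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            model[current][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical things one colleague has been heard to say before.
heard_before = [
    "we should pilot this next semester",
    "we should ask the students first",
    "we should pilot it with one section",
]

model = build_bigram_model(heard_before)
prediction = predict_next(model, "should")
print(prediction)  # pilot
```

A model like this guesses "pilot" after "should" simply because that is what this speaker has said most often, which is the same kind of bet as predicting what a familiar colleague is about to say.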

In terms of my pure-tone audiogram, I shouldn’t be able to do what I do, and it’s true: I can’t from in-the-moment audio alone. But combined with a lot of other things, including a tolerance of extreme cognitive fatigue? Maybe. In the “zebra puzzle,” which is where I drew the example logic puzzle clues at the beginning from, there is a series of clues that goes on and on… and then the questions at the end are “who drinks water?” and “who owns the zebra?” Neither water nor the zebra is mentioned in any of the clues above, so the first response might be “what the… you never said anything about… what zebra?” But you can figure it out with logic. Lots of logic. And you have the advantage of knowing that the puzzle is a logic puzzle and that it ought to be solvable, meaning that with logic, you can figure out who owns the zebra. In the real world… nobody tells you something could become a logic puzzle, and you never know if it is solvable. But I try them anyway.

5 Responses to “Oral deaf audio MacGyver: identifying speakers”

  1. What kind of sensors and/or visual aids would you need to not have to cut back on all the effort? A seating chart with little sound-triggered LEDs connected to microphones at each seat? A better seating chart with ASL nicknames for your interpreter to use? People to identify themselves before speaking?

    By Erin D on Sep 23, 2016

  2. I would rather frame this as a behavior change project (design the environment to make dialogue accessible to all by default) rather than an arena for a technological fix (put extra things on the deaf person in order to fit them into an environment that does not include them by design).

    Of course, these aren’t completely separate things. But, for instance… take the same people and the same conversation in the same room, arrange the chairs in a big circle, and grab something tossable and use it as a talking stick (the eraser on the whiteboard! the tennis ball on the end of someone’s walker! the beanbag laptop wristrest that is almost always in my backpack!) and poof, speaker identification for everyone. No blinky light microphone stuffs needed.

    And yeah, I have been known to draw seating maps for name recall for myself, for introductions for new people sitting next to me (when I know the people in the room and they don’t), and for interpreters (who may be part of the previous category).

    By Mel on Sep 23, 2016

  3. The group I’m in has started to hear about ways to be accessible. But if there are clearly written guides, I haven’t found them yet. Someone told us about one idea for deaf, ID/DD, HoH people: to have a talk and slides available in printed format, some in larger type, at an event before it begins. If you have designs for a behavioral change, I hope they can see their way into some document that contains these kinds of changes. It’s emotional and intellectual labor for you but I don’t know who else would know this knowledge. It sucks that the groups that suffer from oppression have to usually create the tools for others to include them.

    By Kevix on Sep 23, 2016

  4. Thanks for deftly detailing the cognitive load. I believed that speech-reading was exhausting; I’ve experienced the lag frustration, but your in-depth description has helped me know it.

    By Jesse the K on Oct 10, 2016

  5. Oh, also: in engineering/computing, the gender cues are usually… not so helpful, because… diversity!

    Interpreter: Man over there says XYZ.
    Interpreter: Man over there says ABC.
    Interpreter: Man over there says 123.

    Me, thinking: They’re all men. Everybody in this room except me is a man. Saying that the speaker was a man adds no additional information as to the identity of who spoke.

    This *also* means that, often, the most efficient way to refer to a person in the room is by saying “the woman” or “the Hispanic one” or “the Black one” or whatever, since chances are that yes, we’re singletons. I am… really torn by this, and generally try to describe people by their clothes/hair rather than assumed gender/race, both because I’m pretty confident I’ll get clothing descriptions right but could totally misgender/guess the wrong race, and because people have a higher degree of freedom to choose their clothing than their gender and race.

    By Mel on Dec 7, 2016
