Complexness

How difficult are different English vowels to learn?

Jan 27, 2025

In my last post I put together some rudimentary analysis of English consonants to see if I could quantify (in an accessible and not overly theoretical way) what it is that makes different sounds more challenging for learners to articulate. I said in that article “I think the vowels will be more complex with a greater degree of susceptibility to perceptual difficulties, so I’ll do these later, hopefully, maybe…eventually”, but in fact they surprised me.

Vowels are inherently difficult to articulate, to perceive and to explain, and their more ‘analogue’ nature guarantees variation. Even so, I realised I could more or less cover their complexity with fewer variables than I needed for the consonants.

Put very simply, a vowel is created by the passage of vibrating air through the mouth, while the shape of the oral cavity is modified to create different ‘configurations’. Each configuration leads to a distinct acoustic signal being created, and this is typically known as the vowel ‘quality’. Vowels in English are distinguished by their quality.

My experience has generally been that there are three key points that impact the difficulty learners will have with a vowel:

Identicality - if the L1 has an almost identical sound it stands to reason that a learner will have little to no difficulty producing the sound (with some caveats…for another article!).
Familiarity - if the L1 typically does not utilise vowels in a similar part of the vowel space (for example low back vowels), these will be difficult to learn to physically articulate, but will not suffer from perceptual overlaps.
Similarity - if the L1 has similar but not identical sounds, or has one vowel where English has multiple in the same ‘zone’, we run into perceptual difficulties. These require both cognitive and articulatory rewiring.

Once I narrowed this down, I started playing around with some numbers. I ended up using the following measures:

What I have called ‘Vowel Density’ - this is the number of other vowels near enough to the vowel in question to cause some perceptual difficulties/overlap. This happened to max out at five, which worked nicely with my other variables, which I had scaled 1-5. (For anyone interested, I counted the vowels within a 275 Hz F1/F2 Euclidean distance of the target vowel. I fully appreciate the inconsistency in assigning a ‘standard’ F1/F2 to these vowels, and that this is a very simplistic metric.) For diphthongs, this measure just looked at the density of the onset (the first vowel quality in the diphthong).
Cross-linguistic rarity - on a scale of 1-5, how many other languages’ phonetic inventories make use of this vowel? The more common, the more likely it will be an identical sound for a learner, and therefore the less chance they will struggle with it.
Spelling variability - on a scale of 1-5, how many different ways of representing this vowel orthographically are there? A score of 5 means there is an increased chance the learner won’t be able to predict the vowel sound from the spelling.
Teaching difficulty - on a scale of 1-5, the extent to which the articulatory complexity (typically the presence of backness and rounding) makes it more difficult for learners, but also taking into consideration that sounds made at the front of the mouth are often easier to learn because you can see what you’re doing and often use your teeth as a guide for your tongue position.
For diphthongs, I also looked at how far the tongue has to travel from the onset to the offset of the vowel - what I’ve called ‘Glide Distance’. This is on a scale of 1-5 (again, based on the F1/F2 Euclidean distance between the onset and offset).
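
For anyone who wants to play with the two distance-based measures, here is a minimal Python sketch of how Vowel Density and Glide Distance could be computed. The formant values in FORMANTS are rounded placeholders chosen purely for illustration, not the figures behind this post, and the function names are my own invention. Only the 275 Hz radius comes from the description above. How the raw glide distance gets mapped onto the 1-5 scale is not shown here, so treat this as a rough outline rather than the exact calculation.

```python
import math

# Illustrative (F1, F2) values in Hz. These are placeholder numbers for the
# sketch, NOT the data used for the scores in this post.
FORMANTS = {
    "i": (270, 2290),
    "ɪ": (390, 1990),
    "ɛ": (530, 1840),
    "æ": (660, 1720),
    "ɑ": (730, 1090),
    "ɔ": (570, 840),
    "ʊ": (440, 1020),
    "u": (300, 870),
    "ʌ": (640, 1190),
}


def vowel_density(target, formants=FORMANTS, radius_hz=275.0):
    """Count the other vowels within radius_hz of the target in F1/F2 space."""
    target_point = formants[target]
    return sum(
        1
        for vowel, point in formants.items()
        if vowel != target and math.dist(target_point, point) <= radius_hz
    )


def glide_distance(onset, offset, formants=FORMANTS):
    """Raw F1/F2 Euclidean distance (Hz) between a diphthong's onset and offset."""
    return math.dist(formants[onset], formants[offset])


if __name__ == "__main__":
    for vowel in FORMANTS:
        print(vowel, vowel_density(vowel))
    # Hypothetical onset/offset pair standing in for a diphthong like /aɪ/.
    print(round(glide_distance("ɑ", "ɪ"), 1))
```

With these placeholder numbers, a vowel sitting in a crowded part of the vowel space picks up a higher density count than one out on the edge, which is the intuition behind the measure. Swapping in a different set of ‘standard’ formants will shift the counts, which is exactly the inconsistency I flagged above.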

Anyway, this is what I came up with for the monophthongs:

I think what this shows neatly is that all the vowels are difficult! But we can also clearly see that the famous ‘schwa’ (/ə/) comes out on top as the most difficult. Central vowels are not common in general, so making sounds in the middle of the mouth is unfamiliar territory for many speakers. The lack of any physiological reference points for the tongue in the middle of the mouth makes it difficult to learn how to articulate, and its proximity to other sounds means there is a risk of perceptual overlap.

The diphthongs painted an interesting picture. I don’t actually know the reason for this, but regardless of how difficult learners find it to articulate the monophthongs, they almost always find it an absolute cinch to produce the diphthongs. My working hypothesis is that the inherently transitory nature of diphthongs places less psychological demand for precision on speakers (that is, you can hide the inaccuracies in the transition). Ultimately, the analysis shows they are, more or less, all equally easy/difficult to articulate, with /oʊ/ arguably a little more difficult:

As with the consonant article, this is not a piece of academic research. While my background is in articulatory and acoustic phonetics, it has been some 10 years since I was deep in research - this is just me rekindling my passion for data, fuelled by the anecdotal evidence gathered in my job coaching English learners.

Let me know what you think, and as always, please consider subscribing!
