Recently, while working on a better English dictionary for ESL learners, I found that I needed the pronunciation of every word in English. That means every form: singular, plural, past tense, and so on.
My approach was to collect text from books, articles and, of course, Wikipedia.
As a result, I extracted over 26 million sentences. These sentences are supposed to be clean (no special characters, no encoding errors).
From these sentences, I ran a job to get all the unique words. The method is quite simple: I split each sentence into single words using the space as a delimiter. Spaces and punctuation are removed, of course.
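Here is a minimal sketch of that step in Python, not the exact job I ran. The function name `unique_words` is made up for illustration, and two details are assumptions: words are lowercased so that "The" and "the" count once, and punctuation is stripped only from the edges of each token, so an apostrophe inside a word like "don't" is kept.

```python
import string


def unique_words(sentences):
    """Collect the unique words from an iterable of sentences."""
    words = set()
    for sentence in sentences:
        # str.split() with no arguments splits on any run of whitespace.
        for token in sentence.split():
            # Assumption: strip punctuation from both ends and lowercase.
            word = token.strip(string.punctuation).lower()
            if word:  # skip tokens that were punctuation only
                words.add(word)
    return words


sample = ["The cats sat.", "The cat sat, too!"]
print(sorted(unique_words(sample)))
# ['cat', 'cats', 'sat', 'the', 'too']
```

Because the function takes any iterable, the sentences can be streamed line by line from a file, and only the set of unique words has to fit in memory, not all 26 million sentences.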
The result is over 1.9 million unique words. As I went through the list, I noticed that a portion of the entries are not real words. Most of these have two letters. They could be abbreviations, ordinal numbers, and so on.
However, all English words should be in this list.
My next step will be to create recordings for these words so learners know how to pronounce them correctly.