Things to keep in mind when using wordlists
lemma – a set of lexical forms having the same stem/root/base and belonging to the same major word class, differing only in inflection and/or spelling
flemma – like a lemma, a set of forms sharing the same stem/root/base, but without distinguishing between word classes
word family – a stem/root/base word together with all its inflected and derived forms
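To see how the three definitions above change what counts as one "word", here is a minimal sketch in Python. The tokens, word-class tags, base forms and family headwords are all hand-assigned for illustration (they are assumptions, not output from a real tagger or from any published list), and the name count_units is hypothetical.

```python
from collections import Counter

# Toy, hand-annotated tokens: (surface form, word class, base form, family headword).
# Every annotation here is an illustrative assumption, not real tagger output.
TOKENS = [
    ("walk", "VERB", "walk", "walk"),
    ("walks", "VERB", "walk", "walk"),
    ("walked", "VERB", "walk", "walk"),
    ("walk", "NOUN", "walk", "walk"),
    ("walks", "NOUN", "walk", "walk"),
    ("walker", "NOUN", "walker", "walk"),
    ("walkers", "NOUN", "walker", "walk"),
]


def count_units(tokens):
    """Count the same tokens under the three counting units."""
    # Lemma: same base form AND same word class.
    lemmas = Counter((base, pos) for _, pos, base, _ in tokens)
    # Flemma: same base form, word class ignored.
    flemmas = Counter(base for _, _, base, _ in tokens)
    # Word family: inflected and derived forms grouped under one headword.
    families = Counter(head for _, _, _, head in tokens)
    return lemmas, flemmas, families


lemmas, flemmas, families = count_units(TOKENS)
print(len(lemmas), len(flemmas), len(families))  # prints: 3 2 1
```

The same seven tokens give three lemmas, two flemmas and one word family, which is why a list's counting unit matters when you compare list sizes or coverage figures.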
1. Generally it has been recommended that lemmas be used for lower proficiency learners and word families for higher proficiency learners.
2. Since words have a variety of meanings, examine: a) the general characteristics of the corpus (large, general corpora allow more meaning variation than smaller, specialized corpora); b) the degree of word tagging used (untagged corpora allow more meaning variation than grammatically and/or semantically tagged corpora); and c) the general characteristics of the words (high-frequency words allow more meaning variation than technical words).
3. Have multi-word units such as phrasal verbs, idioms, etc. been accounted for in your list? If yes, to what extent? If not, are they important for your purposes?
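For point 3, a quick first check is to see how many entries in a list contain more than one word. The sketch below assumes the list is available as plain strings, one entry each; the sample entries and the name multiword_share are hypothetical. It only catches entries written with spaces, so hyphenated compounds or multi-word units folded into single-word entries would need a closer look.

```python
def multiword_share(entries):
    """Return the multi-word entries and their share of all non-empty entries."""
    cleaned = [e.strip() for e in entries if e.strip()]
    # Assumption: an entry counts as multi-word if it contains whitespace.
    multi = [e for e in cleaned if len(e.split()) > 1]
    share = len(multi) / len(cleaned) if cleaned else 0.0
    return multi, share


# Hypothetical sample entries, not taken from any real wordlist.
SAMPLE_LIST = ["give", "give up", "table", "by and large", "run"]

multi, share = multiword_share(SAMPLE_LIST)
print(multi)   # ['give up', 'by and large']
print(share)   # 0.4
```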
Gardner, D. (2007). Validating the construct of word in applied corpus-based vocabulary research: A critical survey. Applied Linguistics, 28(2), 241-265.
(Do read this for more detail on, for example, the nature of learners and the recommendations on multi-word units.)
Also do make sure to read the post by Julie Moore on some things to consider when using wordlists – The wonders and (worries!) of wordlists.
See also Philip Kerr’s post, which lists some nice resources and comments on using wordlists – WordLists
If you know of any publicly available wordlists that you think need adding, do let me know.
You may also like to check a timeline called A (brief) History of (English) WordLists