Character-level language modeling has recently attracted a lot of attention for language understanding tasks. Here I give a brief comparative overview of character-level and word-level language modeling.
Word-level language modeling treats words as the building blocks of textual information. In a semantic space, such as a word embedding space, each word is like a node surrounded by many other words. Feature vectors (word vectors) are generated using term frequency, topic modeling, or word embeddings, so every word has a numerical or vectorized representation that can be fed to a learning model such as a recurrent neural network. The current standard approach is word embeddings: train on a large corpus to build a Word2Vec model, which contains a dictionary mapping each word to a vector that captures its meaning.
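As a minimal sketch of this idea, the example below uses a tiny hand-made embedding table (the vectors are illustrative, not a trained Word2Vec model) to show how a sentence becomes a sequence of vectors that a recurrent model would consume:

```python
import numpy as np

# Hypothetical toy embedding table: vocabulary word -> 4-dim vector.
# A real Word2Vec model would learn these vectors from a large corpus.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0, 0.2]),
    "dog": np.array([0.8, 0.2, 0.1, 0.3]),
    "car": np.array([0.1, 0.9, 0.7, 0.0]),
}

def encode_sentence(words, table):
    """Replace each word with its vector; the resulting sequence of
    vectors is what a learning model (e.g. an RNN) is fed."""
    return np.stack([table[w] for w in words])

seq = encode_sentence(["cat", "dog"], embeddings)
print(seq.shape)  # (2, 4): two words, one 4-dim vector each
```

In a real pipeline the dictionary lookup is the same; only the table is replaced by a trained embedding model.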
Character-level language modeling typically uses a one-hot vector representation for each character, fed to the learning model one character at a time. The word-level syntactic and semantic attributes of the text are deliberately ignored, on the assumption that the model will capture these linguistic attributes on its own. The idea of character-level language modeling comes from signal processing.
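A minimal sketch of the character-level input, assuming a small fixed alphabet of lowercase letters plus the space character:

```python
import numpy as np
import string

# Assumed alphabet: 26 lowercase letters plus space (27 symbols).
alphabet = string.ascii_lowercase + " "
char_to_idx = {c: i for i, c in enumerate(alphabet)}

def one_hot_encode(text):
    """Return a (len(text), len(alphabet)) matrix with one one-hot
    row per character; the model sees raw characters, with no notion
    of word boundaries beyond the space symbol itself."""
    mat = np.zeros((len(text), len(alphabet)))
    for pos, ch in enumerate(text.lower()):
        mat[pos, char_to_idx[ch]] = 1.0
    return mat

x = one_hot_encode("hello world")
print(x.shape)  # (11, 27): eleven characters over a 27-symbol alphabet
```

Each row sums to one, and the full sequence of rows is fed to the model in order, much like samples of a signal.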
The key challenge of character-level language modeling is that it requires a large amount of data and enough training for the model to become smart enough to capture the semantic and syntactic structure of the text, which are the main properties that word-level language modeling provides by construction. It also requires data augmentation (e.g., replacing parts of the text with word synonyms) to avoid generalization errors.
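The synonym-replacement augmentation mentioned above can be sketched as follows. The mini-thesaurus here is hand-made and purely illustrative; a real pipeline might draw synonyms from a resource such as WordNet:

```python
import random

# Hypothetical mini-thesaurus for illustration only.
synonyms = {
    "good": ["great", "fine"],
    "film": ["movie", "picture"],
}

def augment(sentence, table, p=0.5, rng=None):
    """Return a copy of the sentence where each word with a known
    synonym is replaced by one, with probability p."""
    rng = rng or random.Random(0)
    out = []
    for word in sentence.split():
        if word in table and rng.random() < p:
            out.append(rng.choice(table[word]))
        else:
            out.append(word)
    return " ".join(out)

print(augment("a good film", synonyms, p=1.0))
```

Running the augmentation several times over the training corpus yields paraphrased copies of each text, which helps the character-level model generalize beyond exact surface forms.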
On the other hand, the downside of word-level language modeling is that it requires an extra word2vec distributional model. Building this model can take a lot of time and memory.
Character-level language modeling shows superior performance on short-text understanding tasks, such as tweet embedding, depending on how well it is trained.
It is also true that word-level language modeling often imports a giant pretrained Word2Vec model, which cannot represent unknown or misspelled words.
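This out-of-vocabulary problem is easy to illustrate with a toy lookup table (the table and the misspelling below are my own examples): a word-level lookup has no entry for a misspelled word, while a character model can still decompose any string:

```python
# Toy word-vector table, for illustration only.
embeddings = {"language": [0.1, 0.9], "model": [0.8, 0.2]}

def word_vector(word, table):
    """Look up a word; a common fallback is to return None
    (or a shared <UNK> vector) for out-of-vocabulary words."""
    return table.get(word)

print(word_vector("language", embeddings))  # [0.1, 0.9]
print(word_vector("langauge", embeddings))  # None: the misspelling is unknown

# A character-level view, by contrast, can encode any string:
print(list("langauge"))
```

The character model never hits a hard "unknown word" wall; at worst it sees an unusual character sequence.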
So which is best?
I would say that character-level language modeling is closer to general intelligence, assuming the model is smart enough to learn the high-level representations built from characters, e.g., words and their semantic neighborhoods.