Word-level Language Modeling vs Character-level Language Modeling

Character-level language modeling is getting a lot of attention lately for language understanding tasks. Here I give a short comparison of character-level and word-level language modeling.

Word-level language modeling treats words as the building blocks of text. In a semantic space, such as a word embedding space, each word is like a node surrounded by many other words. Whether the feature vectors come from term frequency, topic modeling, or word embeddings, every word gets a numerical (vectorized) representation that can be fed to a learning model such as a recurrent neural network. The current approach to word-level language modeling is to use word embeddings: train a Word2vec model on a large corpus. The trained Word2vec model contains a dictionary in which each word maps to a vector that encodes its meaning.
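As a rough illustration of the word-to-vector dictionary described above, here is a minimal sketch in plain Python. The table `toy_embeddings` and its numbers are made up for this example; they are not the output of a trained Word2vec model, and the helper names are hypothetical.

```python
import math

# Hypothetical 3-dimensional "embeddings"; the values are invented for
# illustration and are NOT trained Word2vec vectors.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.7],
}

def encode_words(sentence, table):
    """Look up one vector per whitespace-separated word (KeyError if unknown)."""
    return [table[word] for word in sentence.lower().split()]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A sentence becomes a sequence of vectors that a model such as an RNN could consume.
word_vectors = encode_words("Cat dog car", toy_embeddings)

# In this toy space, "cat" sits closer to "dog" than to "car".
sim_cat_dog = cosine(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine(toy_embeddings["cat"], toy_embeddings["car"])
```

With real embeddings the table would hold hundreds of dimensions per word, but the lookup step works the same way.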

Character-level language modeling typically uses a one-hot vector for each character and feeds the resulting sequence of vectors to a learning model. The word-level syntactic and semantic attributes of the text are simply ignored, on the assumption that the model will capture these linguistic attributes by itself. The idea of character-level language modeling comes from signal processing.
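The one-hot representation of characters can be sketched as follows. The alphabet here (lowercase ASCII letters plus a space) is an assumption made for the example; a real model would usually include digits, punctuation, and other symbols. Characters outside the alphabet are encoded as all-zero vectors, which is one possible convention among several.

```python
import string

# Assumed alphabet for this sketch: 26 lowercase letters plus the space character.
ALPHABET = string.ascii_lowercase + " "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_chars(text):
    """Return one one-hot vector per character; unknown characters map to all zeros."""
    vectors = []
    for ch in text.lower():
        vec = [0] * len(ALPHABET)
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:
            vec[idx] = 1
        vectors.append(vec)
    return vectors

# Six characters in, six vectors of length 27 out.
encoded = one_hot_chars("hi you")
```

Note that no vocabulary of words is needed: the size of each vector depends only on the size of the alphabet.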

The key challenge in character-level language modeling is that it requires a large amount of data and enough training for the model to become smart enough to learn the semantic and syntactic representations of text that are the main properties of word-level language modeling. It also requires data augmentation (for example, replacing words in the text with their synonyms) to reduce generalization error.
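Synonym-based augmentation of this kind might look like the sketch below. The synonym table `TOY_SYNONYMS`, the function name `augment`, and the `replace_prob` parameter are all hypothetical; in practice the synonyms would come from a thesaurus or a similar resource.

```python
import random

# Hypothetical synonym table, invented for this example.
TOY_SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}

def augment(sentence, synonyms, rng, replace_prob=0.5):
    """Replace each word that has synonyms with a random one, with probability replace_prob."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < replace_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

# A seeded generator keeps the augmentation reproducible.
rng = random.Random(0)
augmented = augment("a good movie", TOY_SYNONYMS, rng, replace_prob=1.0)
```

Each call produces a slightly different version of the same sentence, which gives the model more training examples with the same meaning.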

On the other hand, the downside of word-level language modeling is that it needs an extra Word2vec distributional model. Building this model can take a lot of time and memory.

Character-level language modeling can show superior performance on short text understanding (for example, TweetEmbedding), depending on how well the model has been trained.

It is also true that word-level language modeling often imports a giant pretrained Word2vec model, which cannot represent unknown or misspelled words.
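The unknown-word problem can be shown with a small sketch. The vocabulary `TOY_VOCAB` stands in for the word list of a pretrained model and is made up for this example, as are the helper names.

```python
import string

# Hypothetical stand-in for the vocabulary of a pretrained word-level model.
TOY_VOCAB = {"language", "model"}
# Assumed character inventory for a character-level model.
ALPHABET = set(string.ascii_lowercase)

def unknown_words(text, vocab):
    """Words that a word-level lookup could not represent."""
    return [w for w in text.lower().split() if w not in vocab]

def unknown_chars(text, alphabet):
    """Non-space characters that a character-level encoder could not represent."""
    return [c for c in text.lower() if not c.isspace() and c not in alphabet]

text = "langauge model"  # "langauge" is a deliberate misspelling
missing_words = unknown_words(text, TOY_VOCAB)
missing_chars = unknown_chars(text, ALPHABET)
```

The misspelled word has no entry at the word level, while every one of its characters is still covered by the alphabet.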

So which is better?

I would like to say that character-level language modeling is closer to general intelligence, assuming the model is smart enough to learn the higher-level representations built from characters, e.g. words and their semantic ecology.






