How does vowpal wabbit featurize text?

The details of vowpal wabbit’s (vw) feature representation are specified here, but they can be tricky to grok immediately.

All features in vw are numeric features. It doesn’t do anything special for text or categorical features - they are just numeric features like all other features.

Consider this vw formatted example: |doc this is some text |stats views:10 |type post

The | character indicates a namespace declaration, and the characters immediately after it are the name of the namespace [1]. What follows the namespace declaration is a list of features and their values.

So in this example we have the doc namespace, stats namespace and type namespace.

The core of vw’s feature representation is that every feature is numeric and consists of a name and a value. This is specified in an example as name:value.

If name is specified and value is omitted, then the feature has a default value of 1 [2].

So for our example of |doc this is some text, vw treats doc as a namespace and the other tokens as features. Since no value is specified for each feature, this is the same as |doc this:1 is:1 some:1 text:1. Similarly, |type post is equivalent to |type post:1.

So vw doesn’t do anything special for text features (or categorical features) - these features are treated as numeric features with default value 1, since value was omitted. It’s just the way vw encodes features with defaults for unspecified values that makes it appear as if it supports text features.
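The default-value rule can be sketched in a few lines of Python. This is only an illustration of the input format described above, not vw's actual parser: the function name parse_vw_features is made up for this post, and the sketch ignores labels, namespace weights and other parts of the format.

```python
def parse_vw_features(line):
    """Parse the feature part of a vw-formatted line into {namespace: {feature: value}}.

    Hypothetical helper for illustration only; it is not part of vw.
    """
    namespaces = {}
    # Everything before the first '|' is the label section, which this sketch ignores.
    for chunk in line.split("|")[1:]:
        tokens = chunk.split()
        if not tokens:
            continue
        # The namespace name must directly follow '|'. If there is whitespace
        # after '|', the features are stored here under the empty name "".
        if chunk[0].isspace():
            name, feature_tokens = "", tokens
        else:
            name, feature_tokens = tokens[0], tokens[1:]
        features = namespaces.setdefault(name, {})
        for token in feature_tokens:
            feature, sep, value = token.partition(":")
            # A feature without an explicit value gets the default value 1.
            features[feature] = float(value) if sep else 1.0
    return namespaces


example = "|doc this is some text |stats views:10 |type post"
parsed = parse_vw_features(example)
print(parsed)
```

Parsing |doc this:1 is:1 some:1 text:1 with the same function gives the same doc features, which is the equivalence described above.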

It is also worth understanding how vw encodes features internally. This is useful if you start using ngram features or cross-features, or if you generate a readable model and want to interpret which features are important.

Internally vw stores namespace names as single characters so the above example is equivalent to

|d this:1 is:1 some:1 text:1 |s views:10 |t post

vw hashes features by namespace. So features in different namespaces end up being hashed as different tokens.

Internally vw combines the namespace and feature name before hashing them. The features that get hashed in the above example would be

d^this
d^is 
d^some 
d^text
s^views
t^post

Note that a text token most likely has different hashes depending on which namespace it is in. So the same text token in two different namespaces is treated as two different features.
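To make the hashing step concrete, here is a minimal sketch that maps a namespace^feature string to a slot in a fixed-size weight table. The important caveat is that zlib.crc32 is only a stand-in: vw uses its own murmurhash-based scheme, so the numbers printed here will not match vw's. The function name feature_index is hypothetical. The table size of 2**18 matches vw's default for the -b option.

```python
import zlib

NUM_BITS = 18  # vw's default table size is 2**18 slots (the -b option)


def feature_index(namespace, feature, num_bits=NUM_BITS):
    """Map a namespace/feature pair to a slot in a table of 2**num_bits weights."""
    # Stand-in hash for illustration: vw itself does not use CRC32.
    key = f"{namespace}^{feature}".encode("utf-8")
    return zlib.crc32(key) % (1 << num_bits)


# The same token under two namespaces produces two different keys,
# and so (collisions aside) two different slots.
for ns, feat in [("d", "this"), ("d", "text"), ("t", "text")]:
    print(f"{ns}^{feat} -> {feature_index(ns, feat)}")
```

Because the namespace is part of the hashed key, d^text and t^text are different inputs to the hash, which is why they end up as separate features unless they happen to collide.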

So that’s the default featurization for text. The ability to generate ngrams and skipgrams is optional. If you provide the option --ngram d2, vw will also generate 2-gram features for the doc namespace.

So in addition to the above features it would also generate these additional features and then hash them.

d^this*d^is 
d^is*d^some
d^some*d^text
...
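The 2-gram names listed above can be reproduced with a short sketch. It assumes n-grams are built from adjacent tokens within a single namespace and joined with *, following the notation used in this post. The function name ngrams is hypothetical, and the sketch only builds the names; it does not hash them.

```python
def ngrams(namespace, tokens, n=2):
    """Join each run of n adjacent tokens in one namespace into a single feature name."""
    names = [f"{namespace}^{token}" for token in tokens]
    return ["*".join(names[i:i + n]) for i in range(len(names) - n + 1)]


bigrams = ngrams("d", ["this", "is", "some", "text"])
print(bigrams)  # ['d^this*d^is', 'd^is*d^some', 'd^some*d^text']
```

These names are added on top of the single-token features, not in place of them.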

n-gram features are generated within a namespace. If you generate quadratic features, e.g. with -q dt or --interactions dt, you also end up with features across those 2 namespaces. E.g.:

d^this*t^post
d^is*t^post
...
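The cross-features listed above can be sketched as a product over the features of the two namespaces. The function name quadratic_features and the data layout are made up for this illustration, and the value of each crossed feature is assumed to be the product of the two input values.

```python
from itertools import product


def quadratic_features(left, right):
    """Pair every feature in `left` with every feature in `right`.

    Each argument is a (namespace_char, {feature: value}) tuple.
    """
    left_ns, left_feats = left
    right_ns, right_feats = right
    return {
        # Assumption: the crossed value is the product of the two values.
        f"{left_ns}^{a}*{right_ns}^{b}": va * vb
        for (a, va), (b, vb) in product(left_feats.items(), right_feats.items())
    }


doc = ("d", {"this": 1.0, "is": 1.0, "some": 1.0, "text": 1.0})
post_type = ("t", {"post": 1.0})
crossed = quadratic_features(doc, post_type)
print(list(crossed))
```

With four doc features and one type feature this gives four crossed features. In general the count is the product of the two namespace sizes, so crossing two large text namespaces gets expensive quickly.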

For more on working with text in vw see this repository

And this is a good starting point for getting familiar with vw.

  1. Namespaces allow you to group features together. The main reason for grouping features in namespaces is so that you can easily generate cross-features.

  2. Also, the absence of a feature indicates that the feature has value 0. So for text classification tasks, 0 is assumed for all tokens not listed in an example.