The details of vowpal wabbit’s (vw) feature representation are specified here, but it can be tricky to grok immediately.
All features in vw are numeric features. It doesn’t do anything special for text or categorical features; they are just numeric features like all other features.
Consider this vw formatted example:
|doc this is some text |stats views:10 |type post
The | character indicates a namespace declaration, and the characters after it are the name of the namespace [1]. What follows the namespace declaration is a list of features and their values.
So in this example we have the doc, stats and type namespaces.
The core of vw's feature representation is that every feature is numeric and consists of a name and a value. This is specified in an example as name:value.
If name is specified and value is omitted, then it has a default value of 1 [2].
So for our example of |doc this is some text, we have doc as a namespace and the other tokens as features. Since no value is specified for each feature, this is the same as |doc this:1 is:1 some:1 text:1. Similarly, |type post is equivalent to |type post:1.
vw doesn’t do anything special for text features (or categorical features). These features are treated as numeric features with the default value 1, since the value was omitted. It’s just the way vw encodes features, with defaults for unspecified values, that makes it appear as if it supports text features.
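To make the name:value rule concrete, here is a minimal Python sketch of how the feature part of an example could be read. It is not vw's actual parser: the function name `parse_vw_features` is made up for illustration, and it assumes every namespace is named (as in the example above) and ignores labels, tags and namespace weights.

```python
def parse_vw_features(example):
    """Split the feature part of a vw-formatted line into {namespace: {name: value}}.

    Illustrative only: assumes each '|' is immediately followed by a namespace name.
    """
    namespaces = {}
    # Each '|' starts a new namespace; anything before the first '|' is skipped.
    for chunk in example.split("|")[1:]:
        tokens = chunk.split()
        if not tokens:
            continue
        namespace, features = tokens[0], {}
        for token in tokens[1:]:
            name, sep, value = token.partition(":")
            # An omitted value defaults to 1.
            features[name] = float(value) if sep else 1.0
        namespaces[namespace] = features
    return namespaces


parsed = parse_vw_features("|doc this is some text |stats views:10 |type post")
print(parsed)
```

A token that never appears in the example is simply absent from the dictionary, so looking it up with `parsed["doc"].get("missing", 0.0)` mirrors the idea that unlisted features have value 0.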
It is also worth understanding how vw encodes features internally. This is useful if you start using ngram features or cross-features, or if you generate a readable model and want to interpret which features are important.
vw stores namespace names as single characters so the above example is equivalent to
|d this:1 is:1 some:1 text:1 |s views:10 |t post
vw hashes features by namespace. So features in different namespaces end up being hashed as different tokens.
vw combines the namespace and feature name before hashing them. The features that get hashed in the above example would be
d^this d^is d^some d^text s^views t^post
Note that a text token most likely has different hashes depending on which namespace it is in. So the same text token in two different namespaces is treated as two different features.
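The namespace-then-hash idea can be sketched in a few lines of Python. This is only an illustration of the scheme described above: `zlib.crc32` is a stand-in and not the hash function vw uses, the helper name `hashed_index` is hypothetical, and the 18-bit table size is an assumption made for the example.

```python
import zlib


def hashed_index(namespace, name, bits=18):
    """Return the combined token and a bucket index for one feature.

    Stand-in hashing (CRC32), not vw's real hash; `bits` is an assumed table size.
    """
    token = f"{namespace[0]}^{name}"  # first character of the namespace, then the name
    index = zlib.crc32(token.encode("utf-8")) % (1 << bits)
    return token, index


example = {"doc": ["this", "is", "some", "text"], "stats": ["views"], "type": ["post"]}
tokens = [hashed_index(ns, name)[0] for ns, names in example.items() for name in names]
print(" ".join(tokens))

# The same word in two namespaces yields two different tokens to hash.
print(hashed_index("doc", "post"), hashed_index("type", "post"))
```

Because the namespace is part of the string being hashed, `post` in the doc namespace and `post` in the type namespace are distinct tokens and will usually land in different buckets.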
So that’s the default featurization for text. The ability to generate ngrams and skipgrams is optional. If you provide the option --ngram d2, it will also generate 2-gram features for the doc namespace.
So in addition to the above features it would also generate these additional features and then hash them.
d^this*d^is d^is*d^some d^some*d^text ...
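As a rough sketch of what 2-gram generation within one namespace amounts to, the following Python pairs each token with its right-hand neighbour. The helper `bigram_tokens` is hypothetical and only reproduces the token naming used in this post, not vw's internal implementation.

```python
def bigram_tokens(namespace, words):
    """Build 2-gram tokens from adjacent words of a single namespace (illustrative)."""
    prefix = namespace[0]
    # zip(words, words[1:]) walks over each pair of neighbouring words.
    return [f"{prefix}^{a}*{prefix}^{b}" for a, b in zip(words, words[1:])]


bigrams = bigram_tokens("doc", ["this", "is", "some", "text"])
print(" ".join(bigrams))
```

Four tokens give three adjacent pairs, and all of them stay inside the doc namespace.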
n-gram features are generated within a namespace. If you generate quadratic features, e.g. with -q dt or --interactions dt, you would also end up with features across those 2 namespaces. E.g.:
d^this*t^post d^is*t^post ...
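Cross-features between two namespaces can be pictured as every pairing of a feature from the first namespace with a feature from the second. The sketch below uses a made-up helper, `quadratic_tokens`, and only builds the token names shown above; it does not model how vw computes or hashes the interactions.

```python
from itertools import product


def quadratic_tokens(ns_a, feats_a, ns_b, feats_b):
    """Pair every feature of one namespace with every feature of another (illustrative)."""
    return [f"{ns_a[0]}^{a}*{ns_b[0]}^{b}" for a, b in product(feats_a, feats_b)]


crossed = quadratic_tokens("doc", ["this", "is", "some", "text"], "type", ["post"])
print(" ".join(crossed))
```

With four doc features and one type feature this yields four cross-features; the count grows as the product of the two namespace sizes, which is worth keeping in mind for large vocabularies.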
For more on working with text in vw, see this repository. And this is a good starting point for getting familiar with vw.
[1] Namespaces allow you to group features together. The main reason for grouping features in namespaces is so that you can easily generate cross-features.
[2] Also, the absence of a feature indicates that the feature has value 0. So for text classification tasks, 0 is assumed for all tokens not listed in an example.