Google calendar and a practical example of Information Extraction

Last week I was got an email to my Gmail account about a meeting that I was going to attend. I added it to my calendar and noticed that Gmail extracted some of the information from the email and used it to fill out fields in the new calendar entry. This was pretty exciting as it is a real-life example of how information extraction can be useful. It got me wondering how they are doing the extraction and how Automated IE could be used to improve integration between different applications.

Here is a copy of the mail I received.

HI All,

Just a reminder that the UCDSAC AGM 2006 will be held in Room G24, Ag
UCD at 1930 Tue 25th April.

Looking forward to seeing you there

IE is the task of identifying and extracting predefined items from text. In the case of this example, the predefined items that we might be interested in extracting are the name, time and place of the appointment i.e. what, where, when.

You could approach this by writing a regular expression to match the info you want to extract. Its fairly easy to write a simple regular expression that will extract dates or times, but this approach falls down when there are multiple different dates or times in the document. For example what if the example said “… will be held in Room G24 from 1930-2030 (Can committee members please be there by 1915).” Now we have three times to extract and our IE system need to get the correct one (1930) as the time of the meeting and ignore the others.

So now you could start to write more complex rules and regular expressions that will take all these variations into account. So if a time is preceded by a dash(-) and another time, we should ignore it. But this approach doesn’t scale. It is brittle and time consuming. This is where the automated part of automated IE comes in. Instead of trying to come up with rules for extracting the items we are interested in from the text, we mark the items we want to extract from the text and use a machine learning algorithm to learn the rules for extracting the items.

This is much better because Humans are better at identifying items of interest than coming up with general rules for extracting them, while machine learning algorithms are better at finding general rules for extracting items from text given lots of examples.

Gmail gives me the option of adding this to my calendar so it looks like it recognized that the email discussed an appointment. Maybe is just recognized that the email contained a date and time.

For an appointment, the fields we are interested in extracting are

  • What
  • When
  • Where

Adding it to calendar to my calendar looks like this.

Google gets the When right. This is no surprise, as dates and times are relatively easy to recognize. In fact a regular expression can probably do a good job of identifying these fields (assuming there aren’t multiple and dates times listed in the email in which case context becomes important). For What, it gives the entire message, and where is left blank.

So Google is using some kind of shallow information extraction in that it recognized documents that contain dates and times and it extracts the date and time from it when you add it to your calendar. I wonder how google are generating the patterns for extraction - are they hand-coded regular expressions or is there something deeper going on? It would be nice if it extracted the what and where correctly. The fact that it doesn’t get these fields but it gets the easier when correct would indicate that they are using a bunch of common date patterns.

Here is a brief overview of how Elie, my IE system would approach this problem.

The above email could be marked up and used as an example to a learning algorithm.


HI All,

Just a reminder that the <what>UCDSAC AGM 2006</what> will be held in <where>Room G24, Ag
UCD</where> at <stime>1930</stime> <date>Tue 25th April</date>.

Looking forward to seeing you there


Once the document is marked up, running the above text through Elie’s preprocessor gives the following.

Token Stem POS Chunk Gaz Orth
hi hi NNP stopword WORD,ALL_UPPER
all all DT stopword WORD,CAPITALIZED
, , PUNC,S_CHAR
\n \n SPECIAL
\n \n SPECIAL
just just RB stopword WORD,CAPITALIZED
a a DT NP_s stopword WORD,ALL_LOWER,S_CHAR
reminder remind NN NP_e WORD,LONG,ALL_LOWER
that that IN stopword WORD,ALL_LOWER
the the DT NP_s stopword WORD,ALL_LOWER
ucdsac ucdsac NNP NP_i WORD,LONG,ALL_UPPER
agm agm NNP NP_e WORD,ALL_UPPER
2006 2006 NUM,4_DIGIT_NUM
will will MD VP_s stopword WORD,ALL_LOWER
be be VB VP_i stopword WORD,ALL_LOWER
held held VBN VP_e WORD,ALL_LOWER
in in IN stopword,US_state WORD,ALL_LOWER
room room NNP secondary_address,location WORD,CAPITALIZED
g g stopword WORD,ALL_UPPER,S_CHAR
24 24 NUM,2_DIGIT_NUM,S_NUM
, , PUNC,S_CHAR
ag ag NNP WORD,CAPITALIZED
\n \n SPECIAL
ucd ucd NNS WORD,ALL_UPPER
at at IN stopword WORD,ALL_LOWER
1930 1930 NUM,4_DIGIT_NUM
tue tue NNP WORD,CAPITALIZED
25 25 NUM,2_DIGIT_NUM,S_NUM
th th stopword WORD,ALL_LOWER
april april NNP person_first,date,month WORD,CAPITALIZED
. . PUNC,S_CHAR
\n \n SPECIAL
looking look VBG stopword WORD,LONG,CAPITALIZED
forward forward RB WORD,LONG,ALL_LOWER
to to TO stopword WORD,ALL_LOWER
seeing see VBG stopword WORD,LONG,ALL_LOWER
you you PRP stopword WORD,ALL_LOWER
there there EX stopword WORD,ALL_LOWER

This tokenizes the document and adds some extra information about each token. POS refers to part-of-speech. E.g. NNP is a proper noun. Stem is the root of the word according to Porter’s stemming algorithm. This can be useful for conflating several different by similar words to the same feature. Chunk refers POS chunking - whether a token is part of a noun or verb phrase. E.g. NP_s indicates the start of a noun phrase. Orthographic features record information about how the token occurs in the text. Gazetteer is a set of pre-defined dictionaries such as peoples first names and last names.

These features are then used to build a representation that gives represents contextual information about each token. As an example take the token G, as in “in room G 24”. We encode instances using a fixed window of tokens before and after. So for a window of 1, this token would have features such as token_room_-1 (i.e. token room occurs 1 token previous to the current token), POS_NNP_-1, gaz_location_-1, orth_all-upper_0 (i.e. current token is all upper case), orth_2-digit-num_+1 etc.

Once we have represented all the tokens in this way we have a large number of positive and negative examples of the starts and ends of the fields that we want to extract. We can pass these examples to a learning algorithm (Elie uses SMO which is a Support Vector Machine variation) to learn extraction patterns from our labelled examples. Given a set of good training data the use of general features such as orthographic and gazetteer enables the learner to learn more general extraction patterns. E.g. Instead of learning a pattern such as in->room->g->24, it can learn a pattern such as in->room->[all-caps, single-character]->[2-digit-num] => start_of_where. We can then use these patterns to identify starts and ends of fields that we want to extract and match up starts and ends using a probability histogram from our training data.

The problem with this approach is that is very slow when learning models. It generates an lot of features and the learning algorithm is slow. However you could do the learning offline and then use the patterns that the system learns to do the extraction in real-time.

The advantage is that it is easier for humans to mark items for extraction in text than to come up with good extraction patterns. Once you have a set of good examples marked up you can let the learning algorithm worry about what features to pay attention to and what patterns to use for extraction.