CNNs for NLP

An intuitive look at one 3-letter acronym ML concept applied to another

I’ve always believed that one of the best ways to learn something is to explain it to others. It’s part of why I teach in college. When I started thinking about the merits of a blog, writing about more technical topics as a way to ground concepts for myself was definitely a plus.

This semester, I’m taking CS 287R, a class at Harvard that deals with NLP and, increasingly, deep learning. Aside from this class, there are also

Think of filters as a way to build up higher-level features, which then aid classification at the end. Two properties make this work for images:

Location Invariance

Compositionality

Filters compose local patches of lower-level features into higher-level ones: pixels $\rightarrow$ edges $\rightarrow$ shapes $\rightarrow$ objects.

Input for NLP = sentences or documents represented as a matrix, with one row per word (each row is that word’s embedding vector).
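
As a concrete sketch of that representation, here’s a hypothetical 7-word sentence with made-up random 5-dimensional embeddings, just to show the shapes:

```python
import numpy as np

# Hypothetical example: a 7-word sentence with 5-dimensional word embeddings.
sentence = "i like this movie very much !".split()        # 7 tokens
vocab = {word: i for i, word in enumerate(sentence)}      # toy vocabulary

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 5))        # one 5-d vector per word

# Each row of the input matrix is one word's embedding vector.
sentence_matrix = np.stack([embedding_table[vocab[w]] for w in sentence])
print(sentence_matrix.shape)                              # (7, 5)
```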

For NLP, use filters whose width equals the full width of the input matrix (the embedding dimension). The height / region size is the number of words the filter covers; 2–5 is typical.

For example, a $7 \times 5$ sentence matrix (7 words, 5-dimensional embeddings) convolved with filters of region size 2, 3, and 4 gives feature maps of length 6, 5, and 4:

$$7 \times 5 \rightarrow (2 \times 5,\ 3 \times 5,\ 4 \times 5) \rightarrow (6 \times 1,\ 5 \times 1,\ 4 \times 1)$$

Then max pool $\rightarrow$ take the largest value from each feature map. Then concatenate these into a single feature vector. A final softmax layer takes this vector and uses it to classify.
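
Here’s a minimal PyTorch sketch of that whole pipeline, matching the shapes in the note above: a $7 \times 5$ input, one filter each of height 2, 3, and 4 spanning the full width of 5, max pooling, concatenation, and a softmax over two hypothetical classes. (In practice you’d use many filters per region size, not one; one each just keeps the shapes identical to the example.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 5)   # (batch, channels, words, embedding dim)

# One filter per region size; each spans the full embedding width of 5.
convs = nn.ModuleList([nn.Conv2d(1, 1, kernel_size=(h, 5)) for h in (2, 3, 4)])
classifier = nn.Linear(3, 2)  # 3 pooled features -> 2 hypothetical classes

feature_maps = [conv(x).squeeze(3) for conv in convs]    # lengths 6, 5, 4
pooled = [fm.max(dim=2).values for fm in feature_maps]   # max pool: one value each
features = torch.cat(pooled, dim=1)                      # (1, 3) feature vector
probs = F.softmax(classifier(features), dim=1)           # classify
print([fm.shape[2] for fm in feature_maps], probs)       # [6, 5, 4] and class probabilities
```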

Location invariance and compositionality don’t apply as obviously to language as they do to images, though (you often do care where in the sentence a word appears).

Why CNNs

CNN parameters

Zero Padding

Zero padding can help preserve the shape of the feature map. In general, the output size is $$n_{\text{out}} = \frac{n_{\text{in}} - n_{\text{filter}} + 2\,n_{\text{pad}}}{n_{\text{stride}}} + 1$$
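
A quick sanity check of that formula in plain Python, using the $7 \times 5$ example from earlier:

```python
def conv_output_size(n_in: int, n_filter: int, n_pad: int = 0, n_stride: int = 1) -> int:
    """Output length along one dimension: (n_in - n_filter + 2*n_pad) / n_stride + 1."""
    return (n_in - n_filter + 2 * n_pad) // n_stride + 1

print(conv_output_size(7, 2))           # 6, matching the feature map length above
print(conv_output_size(7, 3, n_pad=1))  # 7: padding of 1 preserves the length for a size-3 filter
```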

Stride size

A larger stride leads to fewer applications of the filter and a smaller output.
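
A tiny illustration of that, plugging different strides into the same output-size formula:

```python
# Output length = (n_in - n_filter + 2*pad) / stride + 1, for a 7-word input and a size-2 filter.
n_in, n_filter, n_pad = 7, 2, 0
for n_stride in (1, 2, 3):
    n_out = (n_in - n_filter + 2 * n_pad) // n_stride + 1
    print(f"stride {n_stride}: feature map length {n_out}")   # 6, 3, 2
```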

Pooling

Pooling layers subsample their input. Most common: apply a max operation to each filter’s feature map. You can also pool over a window rather than the whole map.

Max pooling retains the information of whether a feature appeared in the sentence, while discarding where exactly it appeared.
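
A small sketch of that, using a made-up feature map from one filter slid over a 7-word sentence:

```python
import torch
import torch.nn.functional as F

# Hypothetical feature map (length 6) from one size-2 filter over a 7-word sentence.
feature_map = torch.tensor([0.1, 0.3, 2.7, 0.2, 0.0, 0.4])

# Max pooling over the whole map keeps only the strongest response:
# "did this feature fire anywhere in the sentence?" rather than where.
print(feature_map.max())                                        # tensor(2.7000)

# Pooling over windows instead (size 3 here) keeps coarse position information.
print(F.max_pool1d(feature_map.view(1, 1, -1), kernel_size=3))  # windows -> 2.7 and 0.4
```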

Channels

Different views of the input data, e.g. the (R, G, B) color channels for images. For text, different embeddings of the same sentence could play the same role.
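
For images this just means the input tensor has a channel dimension; here’s a minimal PyTorch sketch with a random RGB input:

```python
import torch
import torch.nn as nn

rgb_image = torch.randn(1, 3, 32, 32)   # (batch, channels = R/G/B, height, width)

# Each of the 8 filters spans all 3 input channels at once.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(conv(rgb_image).shape)            # torch.Size([1, 8, 30, 30])
```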


References:

  1. Understanding Convolutional Neural Networks for NLP, by Denny Britz.
  2. Anecdote shared by Kanjun Qiu, founder at Sourceress.