Notes on NamedTensor
NamedTensor is a PyTorch library that tries to fix some broken parts of Tensor. As part of CS 287R this year, our class gets to be beta-testing guinea pigs for it.\
While I haven’t really worked extensively in PyTorch before, in this post (perhaps the first of many) I’ll try to document notes on Tensors and NamedTensor alternatives. Sources will mainly be the NamedTensor blog post written by Sasha Rush’s group, and the official PyTorch documentation.
Also as a quick note, while I’m posting this live, it’s by no means complete, and the ending quickly degenerates to my usual self-intended note style. That being said, it’s nice to have things up. Hopefully will add to this soon.
Tensor Intro
For some quick background, NamedTensor tries to make working with tensors better. What exactly are tensors? They’re sort of like multi-dimensional arrays, often used in deep learning frameworks like Torch or TensorFlow.
For example, we can do
1import torch
2import numpy as np
3
4ims = torch.tensor(np.load('test*img.npy'))
5ims.shape
torch.Size([b, h, w, c])
noting 4 dimensions corresponding to _batch size*, height, width, channels.
torch.transpose(input, dim0, dim1) $\rightarrow$ Tensor
Broadcasting
Binary operations can be performed on arrays of the same size on an element-by-element basis.
Broadcasting allows these operations to be performed on arrays of different sizes. For example, if we had
import numpy as np
a = np.array([4, 5, 6])
a + 5
This would return array([9, 10, 11])
. Mentally, we can think of broadcasting as converting 5
to the array [5, 5, 5]
, but this doesn’t actually happen in NumPy.
View
View is a method belonging to the torch.Tensor
object that returns a new tensor in a different shape, dictated by its parameters.
The product of dimensions needs to be consistent for both shapes.
Using -1
in place of a number can be used to automatically determine what this dimension value should be. This comes in handy in situations where we might not know how many rows we want (or are too lazy to calculate this beforehand), but do know the number of columns.
Squeeze
Called with torch.squeeze(_input_, _dim_=None, _out=None_) $\rightarrow$ Tensor
Returns a new tensor with all dimensions of size 1 removed. For example, if the input is of shape $(A \times 1 \times B)$, calling squeeze
will return a tensor with shape $(A \times B)$.
If dim
is present, then we only squeeze on that given dimension. For example, if the input is of shape $(A \times 1 \times B)$ and we call squeeze(input, 0)
, this won’t do anything. But calling squeeze(input)
or squeeze(input, 1)
will return the “squeezed” tensor of shape $(A \times B)$.
Squeeze can be undone with unsqueeze
, which takes in an input
but also requires a dim
. The method returns a new tensor with dime
1 inserted at the specific position, but still shares the input data. Example: with x = torch.tensor([1, 2, 3])
, torch.unsqueeze(x, 0)
will output a tensor with dim
$(1 \times 3)$, and torch.unsqueeze(x, -1)
will output a tensor with dim
$(3 \times 1)$.
NamedField
Documentation notes for implementing NamedField
.
NamedField
inherits fromtorchtext.data.Field
- Most (all except names?) of its parameters come from this
The additional paramname
is used to specify anames
property, which might be useful? Defaults to"seqlen",
- Most (all except names?) of its parameters come from this
torchtext.data.Field
According to the docs, this defines a datatype and instructions for converting to Tensor
Vocab
obj defines set of possible values for elements
Parameters:
sequential
: whether datatype should be sequential. Default isTrue
.use_vocab
: whether to use vocab object. If False, data should already be numerical. Default isTrue
.unk_token
: string token to represent words not in the vocab: Default is'<unk>'
, but in the 287 pset is set toNone
.
torchtext.datasets.SST.splits
This should create dataset objects for splits of the SST dataset
- We specify which fields are used for the sentence and labels
- So this determines how we process the dataset
- Params are opinionated for SST, because we only have the sentence and the sentiment / label, so we just need to specify the fields for these
torchtext.data.Field.build_vocab
Constructs the Vocab object for this field from one or more datasets
- Calling
TEXT.build_vocab(train)
goes through all elements in the training set, checks contents for theTEXT
field (sentences), and registers words to the vocab
vocab
is an object used to numericalize a field.
- Holds mapping from word to id in
stoi
attribute, and mapping from id to word initos
attribute.- Can also automatically build an embedding matrix with pretrained embeddings (word2vec)
- Other params:
_max_size
: sets the max size of the vocab. Default:None
.
_min_freq
: sets lower bound for number of times word needs to appear to be considered in the vocab. Default:1
.
_Question: what is vocab?_
This is the set of words we’re allowed to work with
When feeding sentences into the model, we feed a batch of them at a time
- More than one, but all sentences need to be the same size, so any shorter than the longest are padded with
<pad>
Labels denote a 0 for positive and 1 for negative, which is weird.