How can a sentence or a document be converted to a vector?
Solution 1:
1) The skip-gram method: paper here, and the tool that uses it, Google's word2vec.
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph Vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do so: just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and form a vector representation of the whole sentence/paragraph (a minimal averaging sketch follows this list).
5) Using a CNN to summarize documents.
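For item 4, a minimal sketch of the averaging approach, assuming gensim and a pre-trained vector file in word2vec text format ("vectors.txt" is a placeholder path, not a real file):

import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained word vectors (GloVe vectors can be converted to this format).
wv = KeyedVectors.load_word2vec_format("vectors.txt")

def text_vector(tokens):
    # Average the vectors of all in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)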
Solution 2:
It all depends on:
- which vector model you're using
- what the purpose of the model is
- your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
- Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html (see the sketch after this list)
- Wiki2Vec: https://github.com/idio/wiki2vec
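For the Doc2Vec option, a minimal gensim sketch (the corpus and parameters here are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice use your own tokenised documents.
corpus = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(["vegetarians eat vegetables",
                                   "dogs chase cats"])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
vec = model.infer_vector("cats eat vegetables".split())  # fixed-length document vector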
Or you can do what some people do, i.e. sum the vectors of all the content words in a document and normalise the result, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: lines 17-18 of that file are a hack to reduce noise):
import numpy as np

def sent_vectorizer(sent, model):
    # Sum the vectors of all in-vocabulary words, then L2-normalise.
    sent_vec = np.zeros(400)  # assumes 400-dimensional word vectors
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:
            # Skip out-of-vocabulary words instead of failing.
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
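For example, sent_vectorizer("vegetarians eat vegetables".split(), model) returns one fixed-length vector for the sentence. Note that the snippet L2-normalises the summed vector rather than dividing by numw; dividing by numw instead would give the plain mean of the word vectors.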
Solution 3:
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells, comparison here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence and train it to produce the output you want. The activations of the network after it has been fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
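A minimal sketch of this setup, assuming PyTorch, pre-computed word vectors as input, and a binary classification task (all names and sizes here are illustrative):

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_vectors):  # shape: (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(word_vectors)
        sentence_vec = h_n[-1]  # final hidden state = sentence representation
        return self.out(sentence_vec), sentence_vec

# Training against your task's loss is what tunes the representation.
model = SentenceEncoder()
logits, sent_vec = model(torch.randn(8, 12, 100))  # batch of 8 sentences, 12 words each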
Solution 4:
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is typically used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average of all the word vectors in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
We can take the normalised average of the three word vectors, (v(vegetarians) + v(eat) + v(vegetables)) / 3, as the vector representation of the sentence.
The problem is the compositional nature of sentences. If you take the average of the word vectors as above, these two sentences get exactly the same vector representation:
vegetables eat vegetarians .
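A quick toy check of this (made-up 3-dimensional vectors; any real word2vec/GloVe vectors would show the same effect, since averaging discards word order):

import numpy as np

vecs = {"vegetarians": np.array([0.9, 0.1, 0.0]),
        "eat":         np.array([0.0, 0.8, 0.2]),
        "vegetables":  np.array([0.3, 0.0, 0.7])}

def avg_vec(sentence):
    # Average the word vectors, then normalise to unit length.
    v = np.mean([vecs[w] for w in sentence.split()], axis=0)
    return v / np.linalg.norm(v)

print(np.allclose(avg_vec("vegetarians eat vegetables"),
                  avg_vec("vegetables eat vegetarians")))  # True: order is lost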
There is a lot of research in the distributional fashion on learning tree structures through corpus processing, for example Parsing With Compositional Vector Grammars. This video also explains the method.
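For intuition, the core composition step in such tree-structured models can be sketched like this (a toy with random parameters, not the paper's actual compositional vector grammar):

import numpy as np

d = 4  # toy vector dimensionality
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1  # shared composition matrix (learned in practice)
b = np.zeros(d)

def compose(left, right):
    # Merge two child vectors into a parent vector: tanh(W [left; right] + b).
    return np.tanh(W @ np.concatenate([left, right]) + b)

leaf = {w: rng.standard_normal(d) for w in ["vegetarians", "eat", "vegetables"]}
# ((vegetarians eat) vegetables): the tree structure, unlike averaging,
# makes the result depend on word order.
sent_vec = compose(compose(leaf["vegetarians"], leaf["eat"]), leaf["vegetables"])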
Again, I want to emphasise interpretation. These sentence vectors probably have their own meanings in your application. For instance, in the sentiment analysis project at Stanford, the meaning they are seeking is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical debates over whether these are the actual meanings of sentences if you cannot judge the truth conditions (David Lewis, "General Semantics", 1970). That is why there are lines of work focusing on computer vision (this paper or this paper). My point is that it completely depends on your application and your interpretation of the vectors.