How to get vector for a sentence from the word2vec of tokens in sentence
I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence.
There are differet methods to get the sentence vectors :
- Doc2Vec : you can train your dataset using Doc2Vec and then use the sentence vectors.
- Average of Word2Vec vectors : You can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.
- Average of Word2Vec vectors with TF-IDF : this is one of the best approach which I will recommend. Just take the word vectors and multiply it with their TF-IDF scores. Just take the average and it will represent your sentence vector.
There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.
First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by a parse tree of a sentence, using matrix-vector operations. This method works for sentences sentiment analysis, because it depends on parsing.
It is possible, but not from word2vec. The composition of word vectors in order to obtain higher-level representations for sentences (and further for paragraphs and documents) is a really active research topic. There is not one best solution to do this, it really depends on to what task you want to apply these vectors. You can try concatenation, simple summation, pointwise multiplication, convolution etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what fits you best.
It depends on the usage:
1) If you only want to get sentence vector for some known data. Check out paragraph vector in these papers:
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Eprint Arxiv,4:1188–1196.
A. M. Dai, C. Olah, and Q. V. Le. 2015. DocumentEmbedding with Paragraph Vectors. ArXiv e-prints,July.
2) If you want a model to estimate sentence vector for unknown(test) sentences with unsupervised approach:
You could check out this paper:
Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US
3)Researcher are also looking for the output of certain layer in RNN or LSTM network, recent example is:
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195
4)For the gensim doc2vec, many researchers could not get good results, to overcome this problem, following paper using doc2vec based on pre-trained word vectors.
Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
5) tweet2vec or sent2vec .
Facebook has SentEval project for evaluating the quality of sentence vectors.
https://github.com/facebookresearch/SentEval
6) There are more information in the following paper:
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
And for now you can use 'BERT':
Google release the source code as well as pretrained models.
https://github.com/google-research/bert
And here is an example to run bert as a service:
https://github.com/hanxiao/bert-as-service
You can get vector representations of sentences during training phase (join the test and train sentences in a single file and run word2vec code obtained from following link).
Code for sentence2vec has been shared by Tomas Mikolov here. It assumes first word of a line to be sentence-id. Compile the code using
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
and run it using
./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
EDIT
Gensim (development version) seems to have a method to infer vectors of new sentences. Check out model.infer_vector(NewDocument)
method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py