I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary). Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.

I tried w2v_model.getVectors().count() but the result (970) seem too small for my use case. In case it may be relevant, I'm using short-text data and my dataset has tens of millions of messages each having from 10 to 30/40 words. I am using min_count=50.


Not quite sure why you doubt the result of .getVectors().count(), which gives the desired result indeed, as shown in the documentation link you have provided yourself.

Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:

from pyspark.ml.feature import Word2Vec

sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)

So, unsurprisingly, it is

model.getVectors().count()
# 3

and asking for the vectors themselves

model.getVectors().show()

gives

+----+--------------------+
|word|              vector|
+----+--------------------+
|   a|[0.09511678665876...|
|   b|[-1.2028766870498...|
|   c|[0.30153277516365...|
+----+--------------------+

In your case, with min_count=50, every word that appears less than 50 times in your corpus will not be represented; reducing this number will result in more vectors.