Why is the accuracy of this Random Forest sentiment classification so low?

I want to use RandomForestClassifier for sentiment classification. x contains the data as text strings, so I used LabelEncoder to convert the strings to numbers. y contains numeric ratings. My code is this:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

data = pd.read_csv('data.csv')

x = data['Reviews']
y = data['Ratings']

le = LabelEncoder()
x_encoded = le.fit_transform(x)  # every distinct review string becomes one integer

x_train, x_test, y_train, y_test = train_test_split(x_encoded, y, test_size=0.2)

x_train = x_train.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
x_test = x_test.reshape(-1, 1)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

Then I printed out the accuracy like below:

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

And here's the output:

Accuracy: 0.5975

I have read that Random Forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy is much lower than it should be. I have looked for similar questions on Stack Overflow, but I couldn't find a solution to my problem.

Is there any problem in my code using the Random Forest library? Or are there cases where Random Forest should not be used?


Solution 1:

It is not a problem with Random Forests or the library; it is rather a problem with how you transform your text input into a feature vector.

What LabelEncoder does is the following: given some labels like ["a", "b", "c"], it transforms them into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains free text rather than pure categorical labels. That means every review (unless 100% identical to another) is mapped to its own arbitrary integer, which carries no information about the content of the text, so your classifier is essentially guessing at random given that input. You therefore need a different way to transform your textual input into numeric features that Random Forests can work with.
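For illustration, here is roughly what that encoding does to a few made-up review strings (the texts are invented for this example):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
reviews = ["loved it", "awful", "loved it so much", "absolutely awful"]
print(le.fit_transform(reviews))
# [2 1 3 0] -- every distinct string gets its own integer, assigned in
# alphabetical order, so the numbers say nothing about the sentiment

And an unseen review at prediction time would not even have a code; transform would raise an error for it.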

As a simple start, you can try something like TF-IDF or a simple count vectorizer. Both are available in sklearn: https://scikit-learn.org/stable/modules/feature_extraction.html, section 6.2.3, "Text feature extraction". There are more sophisticated ways of transforming texts into numeric vectors, but this should be a good start for understanding what has to happen conceptually.
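As a minimal sketch with TfidfVectorizer on a toy corpus (get_feature_names_out needs sklearn >= 1.0; on older versions it is get_feature_names):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["loved it", "awful", "loved it so much", "absolutely awful"]
vec = TfidfVectorizer()
X = vec.fit_transform(texts)        # sparse matrix, one row per text
print(X.shape)                      # (4, 6) -- one column per distinct word
print(vec.get_feature_names_out())  # ['absolutely' 'awful' 'it' 'loved' 'much' 'so']

Each review becomes a vector of word weights, so reviews that share vocabulary end up with similar vectors, which is exactly the kind of signal a Random Forest can split on.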

One last important note: fit those vectorizers only on the training set, not on the full dataset. Otherwise you might leak information from training into evaluation/testing. A good way of doing this is to build an sklearn Pipeline that consists of a feature transformation step followed by the classifier.
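A minimal sketch of such a pipeline, assuming the same data.csv with Reviews and Ratings columns as in your question:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn import metrics

data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# The vectorizer is fitted on the training fold only; predict() then
# transforms the test texts with the already-learned vocabulary.
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Because fit is called on x_train only, the vocabulary and document frequencies never see the test reviews, and the accuracy you print is an honest estimate.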