Tokenization Error in Sentiment Analysis Code - How Should Contractions Be Handled?
-
Hello,
I'm attempting to build a sentiment analysis model for movie reviews similar to this one, but I'm having trouble with tokenization. Here's an example of my code:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Load the movie reviews dataset
data = pd.read_csv('movie_reviews.csv')

# Preprocess the data
# ... (code for data preprocessing)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data['review'], data['sentiment'], test_size=0.2, random_state=42
)

# Vectorize the text data using TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

# Evaluate the model
accuracy = model.score(X_test_vectorized, y_test)
print(f"Accuracy: {accuracy}")
```
The issue is that my model's accuracy is significantly lower than expected, hovering around 55%. After investigating the tokenization step, I found that contractions such as "don't," "can't," and "won't" are not handled properly: TfidfVectorizer's default token pattern only keeps tokens of two or more word characters, so "don't like" becomes the tokens "don" and "like" (the trailing "t" is dropped), the negation is lost, and the model's performance suffers.
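You can reproduce the splitting directly by pulling the analyzer out of the vectorizer (the default token pattern is `(?u)\b\w\w+\b`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the full preprocessing + tokenization pipeline
# that TfidfVectorizer applies to each document.
analyzer = TfidfVectorizer().build_analyzer()

# The apostrophe breaks the word, and the lone "t" is shorter than
# the two-character minimum, so it disappears entirely.
print(analyzer("don't like"))  # ['don', 'like']
```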
Could you advise me on how to fix this tokenization issue and ensure that contractions are handled properly during text preprocessing, so that I can improve the accuracy of my sentiment analysis model?
Thank you for your assistance!
-
Hi,
This is a Qt forum, so it's not really suited for machine learning questions; you'd be better off asking on a dedicated forum. That said, you should check your preprocessing steps to make sure your data is properly prepared.
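For what it's worth, here is a minimal sketch of one common preprocessing approach: expand contractions with a small replacement map before vectorizing, and let the vectorizer emit bigrams so negations like "not like" survive as features. The contraction map is illustrative only; you'd want to extend it for your corpus.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative contraction map -- extend as needed for your data.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "can not",
    "won't": "will not",
    "isn't": "is not",
    "didn't": "did not",
}

def expand_contractions(text: str) -> str:
    """Lowercase the text and replace known contractions with expanded forms."""
    # A custom preprocessor replaces TfidfVectorizer's default lowercasing,
    # so we lowercase here ourselves.
    text = text.lower()
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

# ngram_range=(1, 2) adds bigrams, so "not like" becomes a single feature.
vectorizer = TfidfVectorizer(preprocessor=expand_contractions, ngram_range=(1, 2))
```

You would then use this `vectorizer` in place of the plain `TfidfVectorizer()` in your script (`fit_transform` on the training reviews, `transform` on the test reviews). But again, a machine learning forum will give you better-informed answers on whether this actually lifts your accuracy.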