
1. The Problem Statement
2. What we were trying to do, solve, or look for
3. Preparing the Environment
4. The Dataset we chose
5. Reading and Examining the Data
6. Cleaning the Data
7. Pycaret
8. Feature Engineering
9. Models tried
10. Insights
a. Comparison to the top-ranking notebook, which has an accuracy of 48%.
11. Visualizing the Insights
12. Limitations
a. PyCaret, cross-validation, one-hot-encoding vectorization, string replacement, etc. all demand substantial computation power and time, and our Google Colab instance struggles to handle these tasks individually. More computation power and time would allow us to further improve our accuracy.
Twitter Gender Classification
Machine learning techniques to predict the user’s gender on Twitter text data
Shreya Bakshi, Jonathan Cope, Preeti Gupta, Gowtami Khambhampati, Preety Pinghal, Sneha Reddy, Danqing Wang
Introduction:
A user profile is a persona for a product or service. Gender profiling of unstructured data has several applications in areas such as marketing, advertising, and recommendation systems. We can segment the data to understand what drives users, how to attract more users, and how users interact with the service. In this project, we use the profile information of users on Twitter to predict their gender.
The Problem Statement:
Given this dataset of Twitter conversations, how well can our proposed model predict the gender of a user based on linguistic cues from textual Twitter data?
Environment:
For this project, we installed the necessary classifiers and packages and imported all the libraries required for visualization, modeling, automation, vectorization, etc. We also pip-installed pycaret and lightgbm.
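The installs mentioned above amount to a couple of pip commands (exact package set assumed; the LGBM package is published on PyPI as lightgbm):

```shell
# PyCaret pulls in scikit-learn and pandas; lightgbm provides the LGBM classifier
pip install pycaret lightgbm
```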

The data was obtained from Kaggle at the link below. https://www.kaggle.com/crowdflower/twitter-user-gender-classification
Read the CSV file:

Shape of the dataset:

The dataset consists of 20,050 rows and 26 columns. Of the 26 columns, we use 25 as predictor variables and 1 as the target variable, ‘gender’.
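Loading and inspecting the file looks roughly like this (a pandas sketch; a small in-memory CSV stands in for the Kaggle file here, and the column names shown are a subset of the real 26):

```python
import io
import pandas as pd

# Stand-in for the Kaggle file; the real call would be something like
# pd.read_csv("gender-classifier.csv", encoding="latin1")
csv_data = io.StringIO(
    "gender,profile_yn,description,text,link_color,fav_number\n"
    "male,yes,coffee lover,good morning,0084B4,12\n"
    "female,yes,artist,new painting up,F5ABB5,7\n"
)
df = pd.read_csv(csv_data)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # column names and their associated data types
```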
Columns and their associated data types:

Analyzing Data source:
Features of dataset:

Exploring Target variable – ‘Gender’:

Null values in the dataset:

Data Cleaning:
Data cleansing is a very crucial step in the overall data preparation process and it is the process of analyzing, identifying, and correcting messy, raw data.
We started our data cleaning by dropping unnecessary features.

Getting rid of gender type – “unknown”:

Getting rid of rows where the column “profile_yn” is “no”:
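The two filtering steps above can be sketched with boolean indexing (toy frame; the column values are taken from the text):

```python
import pandas as pd

df = pd.DataFrame({
    "gender":     ["male", "unknown", "female", "brand"],
    "profile_yn": ["yes",  "yes",     "no",     "yes"],
})

# Drop rows whose gender is "unknown", then rows whose profile_yn is "no"
df = df[df["gender"] != "unknown"]
df = df[df["profile_yn"] == "yes"].reset_index(drop=True)

print(df["gender"].tolist())  # rows that survive both filters
```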

Removing Stop words and cleaning the Text:
Here we use the functions preprocessor(), remove_dup_whitespace(), tokenizer_porter(), clean_tweet(), and has_nan() to clean, stem, and tokenize the text.

Changing the text to lower characters:
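A minimal sketch of that cleaning pipeline using only the standard library (the report's actual helpers also Porter-stem tokens, which would need NLTK; the stop-word set here is a tiny illustrative subset):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "in", "to", "of"}  # tiny subset

def remove_dup_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def clean_tweet(text):
    """Strip URLs, @mentions and punctuation, lowercase, drop stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text)   # URLs and mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # keep letters only
    text = remove_dup_whitespace(text.lower())       # lower-case characters
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_tweet("Check THIS out @user http://t.co/abc #ML rocks!!"))
```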

We have removed the column description_has_nan as it has no significance in predicting the gender which is our goal here.

PyCaret Machine Learning
PyCaret is an automated machine-learning tool in Python that infers what kind of data has been collected (categorical, numerical, boolean, etc.) and gives the user a chance to adjust and tweak the auto-generated assumptions. The tool then trains a variety of models using default hyperparameters and reports results such as accuracy and AUC, and it can be used further to boost, ensemble, and predict.
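The workflow described above boils down to two calls; this is an untested sketch assuming PyCaret's classification module and a prepared DataFrame `data` with the ‘gender’ target:

```python
# Sketch only: assumes `data` is the cleaned DataFrame with a "gender" column.
from pycaret.classification import setup, compare_models

# setup() profiles each column (categorical / numerical / boolean) and lets
# you confirm or override the inferred types before modeling.
exp = setup(data=data, target="gender", session_id=42)

# compare_models() trains the candidate models (LightGBM, SVM, Random Forest,
# Logistic Regression, etc.) with default hyperparameters and ranks them by
# accuracy, AUC and other metrics.
best = compare_models()
```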
On the cleaned data, without much feature engineering, PyCaret compared the following models: Light Gradient Boosting, SVM, Random Forest, Logistic Regression, Ridge, Naive Bayes, Linear Discriminant Analysis, Extra Trees, Gradient Boosting, Quadratic Discriminant Analysis, AdaBoost, Decision Tree, K-Neighbors, and Dummy. At this stage the best accuracy was 58%.

Feature Engineering:
To get the most out of our PyCaret auto-modeling, we had to perform a few feature-engineering tasks. For example, we began by reindexing the data (the index would otherwise skip numbers, which would affect adding the y label back after one-hot encoding). Then we pursued one-hot encoding, starting with 500 columns for the “description” column and 500 for the “text_Cleaned” column. We renamed the second set of columns 500–999 so that they would not clash with the 0–499 set.
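That encoding step can be approximated with scikit-learn's CountVectorizer (binary counts standing in for the one-hot vectorization; 500 features per text column, with the second block's column names offset so they do not clash):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "description":  ["love coffee and cats", "tech news daily"],
    "text_Cleaned": ["good morning world", "breaking news today"],
}).reset_index(drop=True)   # contiguous index before adding the y label back

def encode(series, max_features, offset):
    """Binary bag-of-words; integer column names shifted by `offset`."""
    vec = CountVectorizer(max_features=max_features, binary=True)
    mat = vec.fit_transform(series).toarray().astype("int8")
    return pd.DataFrame(mat, columns=range(offset, offset + mat.shape[1]))

desc_feats = encode(df["description"], 500, 0)      # columns 0..499
text_feats = encode(df["text_Cleaned"], 500, 500)   # columns 500..999
features = pd.concat([desc_feats, text_feats], axis=1)

print(features.shape)
```

With only two toy documents the vocabulary is far smaller than 500, so fewer columns are produced; on the real data each block fills its 500 slots.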

Since PyCaret would not mix the string ‘brand’ in well with the gender integers (0, 1), we translated ‘brand’ to 2. We also converted the table’s datatypes to int8 so that computation would run faster than with the default int64.
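The target mapping is a one-liner; the report only fixes brand = 2, so the 0/1 assignment for male/female below is an assumption:

```python
import pandas as pd

gender = pd.Series(["male", "female", "brand", "female"])

# brand -> 2 per the report; male -> 0 / female -> 1 is assumed here.
# int8 keeps the column small and faster to compute on than int64.
y = gender.map({"male": 0, "female": 1, "brand": 2}).astype("int8")
print(y.tolist())
```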

We then noticed that if we increase our one-hot-encoding vectorizer from 500 to 1,500 features, the accuracy improves from 56% to 63.14% with the Extra Trees classifier.

After this, we improved our accuracy score further by including the Link Color (categorical) data, along with the Favorite Number (numerical) data.

Finally, we decided to go a step further with our link-color data, which was previously in hex format (e.g. #AAFF00). We converted this data into RGB values (red, green, and blue values of 0–255 inclusive). Once we had these values in a usable format, we converted them into color names using PIL’s ImageColor module. If a name included a “light” or “dark” substring, we removed that portion of the name; that way we had multiple values sharing the standard labels “red,” “blue,” “gold,” etc., regardless of whether they were light reds, dark blues, and so on. This also made the data more categorical, and we also included the favorite-color column as categorical data. This brought our accuracy score to 65.99%.
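One way to sketch that color mapping with Pillow's ImageColor module (ImageColor has no built-in reverse lookup, so this takes the nearest named color from its colormap and then strips the “light”/“dark” prefixes as described above):

```python
from PIL import ImageColor

def hex_to_base_color(hex_code):
    """Map a hex color to the nearest CSS color name, collapsing shades."""
    try:
        rgb = ImageColor.getrgb(hex_code)        # "#AAFF00" -> (170, 255, 0)
    except ValueError:
        return "unknown"                         # malformed profile values

    def dist(name):
        named = ImageColor.getrgb(ImageColor.colormap[name])
        return sum((a - b) ** 2 for a, b in zip(named, rgb))

    nearest = min(ImageColor.colormap, key=dist)
    # "darkred" / "lightblue" etc. collapse to their base label
    return nearest.replace("light", "").replace("dark", "")

print(hex_to_base_color("#0000FF"))
```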

Visualizing the Data:
Gender distribution

Subplots for fav_number, retweet_count, tweet_count for all ‘gender’ types:

From the figure above we notice that the retweet_count and tweet_count for brands are higher when compared to others.
Density Graph of Tweet count vs Gender

The above image shows the density for tweet count for genders male, female, and brand. The female gender has a high tweet count density in the dataset.
Visualizing color features
link_color and sidebar_color are two features we are interested in: can they be used to predict gender? For example, what colors do women prefer compared to men?

There is more variation in link_color than in sidebar_color, so people change their link color more often than their sidebar color.
For females, pink and purple seem to be the most popular link colors, whereas males prefer blue, shades of blue, and green.
The choice of link_color overlaps more between brands and males than between brands and females.
Text length vs Gender

The above image compares text lengths and how they differ among the genders and brands. The data roughly follows a normal distribution here.
Word cloud of “text” column based on gender:
Word Cloud of Gender: Male

Word Cloud of Gender: Female

High-frequency words:

Label Encoding:

Splitting train and test data:
The data is split into 70% training and 30% test sets.
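The split itself, using scikit-learn with a toy feature matrix standing in for the encoded data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10) % 2              # toy binary labels

# 70% train / 30% test; random_state pins the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(len(X_train), len(X_test))
```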

Training data on different ML models:
We implemented different prediction models using both text and non-text columns, without involving any sentiment analysis of the textual data.
Features used for the predictions:
Link Color: It has non-text data and indicates the link color on the profile
Description: The user’s profile description
Text: Text of a random one of the user’s tweets
We found these attributes to provide useful information regarding gender classification.
Label Encoder: Before prediction, we encoded the target column as 0/1.
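With scikit-learn's LabelEncoder that encoding looks like this (classes are assigned alphabetically, so in this toy example female maps to 0 and male to 1):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(["male", "female", "female", "male"])

print(list(le.classes_))  # the label-to-integer ordering
print(list(y))            # the encoded target column
```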
Predictions using non-text column- Link Color
Twitter allows customizing and personalizing the account by changing the colors of the links or the sidebars, and we expect people from different genders to have different behaviors in how they personalize their page.
Hence, we have used link color as the feature for this prediction.

Accuracy Scores

Logistic Regression gave the highest accuracy, followed by Multinomial Naive Bayes. SGD has the lowest scores, and AdaBoost also gave a low accuracy score.
Predictions using text column- Text
We have cleaned the text column before doing the prediction in order to get rid of noisy data.
We have used TfidfVectorizer to calculate the TF-IDF values and capture the importance and weight of each word in the text.
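A compact version of that text pipeline: TfidfVectorizer feeding a linear SVM, one of the models compared (toy brand-vs-person data, so the fitted model is only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "50% off all orders today", "our new product launches monday",
    "had a great time with friends", "cannot wait for the weekend",
]
labels = ["brand", "brand", "human", "human"]   # toy labels

# TF-IDF weights each word by importance, then the SVM learns on those weights
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

preds = model.predict(["product orders today"])
print(list(preds))
```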

Accuracies:

Support vector machines (SVM) have the highest accuracy score, followed by SGD (stochastic gradient descent).
The overall accuracy increased compared to using just the link_color feature, but it still sits between 47% and 53% and has not improved significantly.
Predictions using text + description column

We concatenated the text and description columns and then used the same code as described above.
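The concatenation step itself (fillna guards against missing descriptions; the column names are taken from the dataset, the combined name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "text":        ["good morning world", "big sale today"],
    "description": ["coffee lover", None],
})

# Combine the tweet text and the profile description into one field
df["text_desc"] = (
    df["text"].fillna("") + " " + df["description"].fillna("")
).str.strip()

print(df["text_desc"].tolist())
```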

SVM again has the highest score, followed by Complement Naive Bayes, and the accuracy scores increased significantly when we combined the text and description columns. We have better prediction chances using both the user’s profile description and the text they are tweeting. No single prediction model performs well in all cases; however, overall, SVM has higher accuracy in predicting gender than the other models.
Note: the prediction graphs were generated using Tableau, based on the data from the prediction models.
By using boosting classifiers like the LGBM classifier, we get an accuracy of 56%.

By using the SVM classifier, we get an accuracy of 62.7%.

With PyCaret, and feature engineering (explained previously), our highest possible accuracy given our computation limits was 65.99%.
