Project Description:
Project Name: Fight Online Abuse Using Natural Language Processing
Team Members: Aimen Chaudhry and Kalyan Bandaru
Goals and Objectives: The goal of the project was to identify the type of toxicity in comments posted during online interactions. Given the rise of trolling and hatred on social media and elsewhere, this is a very real problem. The task is set up as a standard text classification problem in which one must predict the probability that a comment is toxic or threatening, and submissions are evaluated on these predicted probabilities. Completing the project as a team allowed for efficient brainstorming on how to improve model accuracy and how NLP can be made competent to combat online abuse, which can have a damaging impact. Identifying toxicity involves much more than detecting abusive words in the text, so the project had to be implemented in several steps, each handled appropriately.
For this project, in the first team meeting, we decided on a layout for the implementation with four main sections: data loading, data preprocessing, training and testing, and model evaluation. It was decided that Aimen Chaudhry would work on the first two sections and Kalyan Bandaru would complete the last two. The dataset used for this project, called “Toxic_Comment”, was taken from Kaggle. It contains a large number of Wikipedia comments labeled for toxic behavior. The types of toxicity are:
• toxic
• severe_toxic
• obscene
• threat
• insult
• identity_hate
Data Loading (Completed by Aimen Chaudhry)
In this section, we uploaded Toxic_Comment.zip and unzipped it to access the files it contained. Descriptions of the main files are provided below:
• train.csv – the training set, containing comments with their binary labels
• test.csv – the test set
Random samples of data were printed and plotted in this section to get a deeper understanding of the data we would be working with.
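The loading step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the column layout (`comment_text` plus the six label columns) follows the Kaggle dataset, but the tiny CSV here is synthetic, standing in for the real Toxic_Comment.zip.

```python
import io
import zipfile
import pandas as pd

# Label columns follow the Kaggle "Toxic Comment Classification Challenge" data.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# A tiny synthetic train.csv standing in for the real dataset.
csv_text = (
    "id,comment_text," + ",".join(LABELS) + "\n"
    '1,"Thanks for the edit!",0,0,0,0,0,0\n'
    '2,"You are a horrible person",1,0,0,0,1,0\n'
)

# Build an in-memory zip archive so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("train.csv", csv_text)

# Unzip and load, as done with Toxic_Comment.zip in the notebook.
with zipfile.ZipFile(buf) as zf:
    with zf.open("train.csv") as f:
        train = pd.read_csv(f)

print(train.sample(1, random_state=0))  # inspect a random sample
print(train[LABELS].sum())              # per-label positive counts
```

The same `read_csv` call applies to test.csv; only the label columns are absent there.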
Preprocessing (Completed by Aimen Chaudhry)
In the preprocessing section, we removed unnecessary elements from the data, such as stopwords, URLs, HTML tags, newline characters, numeric data, and punctuation marks, as shown in Figure 1. The data was also tokenized in this section before being passed along to the training and testing section.
Figure 1
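The cleaning steps above can be sketched with standard-library tools. The stopword set here is a small illustrative sample (the notebook may use a full list such as NLTK's), and the order of the steps is one reasonable choice, not necessarily the notebook's.

```python
import re
import string

# Small illustrative stopword set; a real run would use a full list (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

def clean_text(text: str) -> str:
    """Apply the cleaning steps described above to one comment."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"<.*?>", " ", text)                   # remove HTML tags
    text = text.replace("\n", " ")                       # remove newline characters
    text = re.sub(r"\d+", " ", text)                     # remove numeric data
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]  # drop stopwords
    return " ".join(tokens)

print(clean_text("Visit https://example.com <b>NOW</b>!\nThe 2 of us are done."))
# → "visit now us done"
```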
Training and Testing (Completed by Kalyan Bandaru)
After preprocessing, the cleaned data was saved as new training and test sets. We split the data into features and targets and converted the text into arrays of integers with the help of a tokenizer. A model was then created to predict the type of toxicity of each comment, as shown in Figure 2.
Figure 2
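The text-to-integers step can be illustrated with a minimal word-index tokenizer. This is a sketch of the idea only: the notebook uses a library tokenizer, so the exact indices, padding scheme, and `maxlen` here are illustrative assumptions.

```python
def fit_tokenizer(texts):
    """Assign each word a 1-based integer index in order of first appearance."""
    index = {}
    for text in texts:
        for word in text.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def texts_to_sequences(texts, index, maxlen=5):
    """Map words to their indices, drop unknown words, and pad with 0 to maxlen."""
    seqs = []
    for text in texts:
        seq = [index[w] for w in text.split() if w in index][:maxlen]
        seqs.append(seq + [0] * (maxlen - len(seq)))
    return seqs

train_texts = ["you are great", "you are awful"]
index = fit_tokenizer(train_texts)
X = texts_to_sequences(train_texts, index)
print(index)  # {'you': 1, 'are': 2, 'great': 3, 'awful': 4}
print(X)      # [[1, 2, 3, 0, 0], [1, 2, 4, 0, 0]]
```

The padded integer arrays are what the model consumes; the model architecture itself is shown in Figure 2.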
Model Evaluation (Completed by Kalyan Bandaru)
In this section, we evaluated the performance and accuracy of the model using plots and data frames. The following figure shows a snippet of this evaluation.
Figure 3
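The per-label accuracy computation can be sketched as follows. The true labels and predictions below are toy values standing in for the notebook's model output; in a real run they come from the trained model on the test split.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy ground truth and predictions; real values come from the trained model.
y_true = pd.DataFrame([[1, 0, 0, 0, 1, 0],
                       [0, 0, 0, 0, 0, 0],
                       [1, 1, 1, 0, 1, 0]], columns=LABELS)
y_pred = pd.DataFrame([[1, 0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0, 0],
                       [1, 1, 1, 0, 1, 0]], columns=LABELS)

# Per-label accuracy collected in a data frame, suitable for tabulating or plotting.
accuracy = (y_true == y_pred).mean()
print(accuracy)

# accuracy.plot(kind="bar")  # bar plot of per-label accuracy (requires matplotlib)
```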
References
Aimenchaudhry, “CSCE-5290-4290/online_abuse_comm.ipynb at Main · Aimenchaudhry/CSCE-5290-4290,” GitHub. [Online]. Available:
“Toxic Comment Classification Challenge,” Kaggle. [Online]. Available: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data. [Accessed:


