Explore open access research and scholarly works from STORE - University of Staffordshire Online Repository

Construction and Performance Analysis of a Groomed Polarity Lexicon Derived from Product Review Source Datasets

COLLEY, Derek and ASADUZZAMAN, Md (2021) Construction and Performance Analysis of a Groomed Polarity Lexicon Derived from Product Review Source Datasets. In: 11th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS 2021). IEEE. (In Press)

paper5.pdf - AUTHOR'S ACCEPTED Version (default)
Available under License Type All Rights Reserved.

Abstract or description

Using a large, publicly available dataset [1], we extract over 51 million product reviews. We split each review comment into words, associate each word with the review score, and store the resulting 3.7 billion word-score pairs in a relational database. We cleanse the data, grooming the dataset against a standard English dictionary, and create an aggregation model based on word count distributions across review scores. This yields a model dataset of words, each associated with an overall positive or negative polarity sentiment score based on star rating, which we correct and normalise across the set. To test the efficacy of the dataset for sentiment classification, we ingest a secondary, cross-domain public dataset containing freeform text and perform sentiment analysis against it. We then compare our model's performance against human classification by enlisting volunteers to rate the same data samples. We find that our model emulates human judgement reasonably well, reaching correct conclusions in 56% of cases, albeit with significant variance when classifying at a coarse grain. At the fine grain, we find that our model can track human judgement to within a 7% margin in some cases. We consider potential improvements to our method, further applications, and the limitations of the lexicon-based approach in cross-domain, big data environments.
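
The pipeline described in the abstract (split reviews into words, pair each word with the star rating, filter against a dictionary, aggregate counts per rating, then normalise into a polarity score) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the aggregation or correction formula, so the weighting below (relative frequency per star level, mapped linearly from 1..5 stars onto -1..+1) is an assumption, and the names `build_lexicon` and `score_text` are hypothetical. An in-memory dictionary also stands in for the relational database.

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[a-z]+")


def build_lexicon(reviews, dictionary):
    """Build a word -> polarity map from (text, stars) pairs, stars in 1..5.

    Assumed scheme (not taken from the paper): each word's count at a star
    level is divided by the total word count at that level, which corrects
    for some ratings being far more common than others. The weighted mean
    star rating is then rescaled from 1..5 to -1..+1.
    """
    counts = defaultdict(Counter)  # word -> {stars: occurrences}
    totals = Counter()             # stars -> total retained words

    for text, stars in reviews:
        for word in WORD_RE.findall(text.lower()):
            if word in dictionary:  # grooming step: keep dictionary words only
                counts[word][stars] += 1
                totals[stars] += 1

    lexicon = {}
    for word, dist in counts.items():
        rel = {s: n / totals[s] for s, n in dist.items()}
        weight = sum(rel.values())
        mean_star = sum(s * r for s, r in rel.items()) / weight
        lexicon[word] = (mean_star - 3) / 2
    return lexicon


def score_text(text, lexicon):
    """Average the polarity of the known words in text; 0.0 if none are known."""
    known = [w for w in WORD_RE.findall(text.lower()) if w in lexicon]
    if not known:
        return 0.0
    return sum(lexicon[w] for w in known) / len(known)
```

Under these assumptions, a word seen only in five-star reviews scores +1, a word seen only in one-star reviews scores -1, and a word spread evenly across ratings scores near 0. Calling `score_text` on freeform text from another domain then gives a single polarity value that could be compared with human ratings of the same sample.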

Item Type: Book Chapter, Section or Conference Proceeding
Faculty: School of Digital, Technologies and Arts > Computer Science, AI and Robotics
Depositing User: Derek COLLEY
Date Deposited: 12 Oct 2021 10:34
Last Modified: 01 Mar 2023 01:38
URI: https://eprints.staffs.ac.uk/id/eprint/7029