Lecture 1
Sentiment classification — the task of automatically deciding whether a review is positive or negative, based on the text of the review
Lexicon – a list of words with some associated information.
Suppose that a very large collection of documents describing the University of Cambridge has been collected. Your task is to build a classifier which assigns sentiment to each document, using the classes: positive, negative and neutral.
- Given the limited amount of annotated data, you decide to classify the documents using a standard sentiment lexicon. Explain how you would perform an experiment to do this.
- Note that the preliminary description is important - we are told there is a large collection of documents, but we’re not told they are all of one genre. Unlike the artificially balanced sentiment data which was used for the practical, it is extremely unlikely the collection will be balanced - we might expect that the majority of the documents will be neutral.
- Tokenize the documents (will probably also require ‘cleaning’ by removal of markup)
- Classify the documents by counting matches with words in the sentiment lexicon (explain assumptions about the sentiment lexicon and give details; a sketch of the whole pipeline follows this list)
- It is necessary to establish some sort of threshold for positive/negative decisions: development data should be used to do this.
- Treatment of ties should be discussed (there are several reasonable options here)
- Adjustment for sentiment strength should be discussed (there are several reasonable options here)
- Split annotated data 50:50 into development set and final evaluation set.
- Note that some form of normalization for document length is required, but students are not expected to know this
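A minimal sketch of the lexicon-based classifier described above, assuming a toy lexicon with ±1 polarity scores (the lexicon contents, function names, and threshold handling are all illustrative; a real experiment would load a standard lexicon, which might also carry strength values, and tune the threshold on the development split):

```python
import re

# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
# A real sentiment lexicon might also provide strength values, which
# could be used as weights instead of +/-1.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}

def tokenize(doc):
    """Strip markup ('cleaning'), lowercase, and split on non-letters."""
    doc = re.sub(r"<[^>]+>", " ", doc)  # crude markup removal
    return [t for t in re.split(r"[^a-z]+", doc.lower()) if t]

def classify(doc, threshold=0.0):
    """Sum lexicon matches, normalized by document length.

    `threshold` would be tuned on the development half of the
    annotated data; treating scores within +/- threshold as neutral
    is one reasonable way of handling ties.
    """
    tokens = tokenize(doc)
    if not tokens:
        return "neutral"
    score = sum(LEXICON.get(t, 0) for t in tokens) / len(tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(classify("<p>A good, even excellent, term at a poor college.</p>"))
```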
- How would you evaluate the results of the system you have described in your answer to part (b) using the annotated data? Give details of the evaluation metrics you would use.
- A decision has to be made as to how to use the annotated data for evaluation. The simplest method is to use majority class to decide on the gold standard. It is unlikely there will be many cases where the annotators have each chosen a different class, but such documents could be counted as neutral.
- One can evaluate using precision and recall for each class. Accuracy will not work well because we cannot expect the dataset to be balanced.
- The alternative is to evaluate the system using kappa in conjunction with the human-annotated data (both options are sketched below).
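A minimal sketch of both options, assuming the gold-standard and predicted labels are given as parallel lists over the three classes (the function names are illustrative):

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")

def precision_recall(gold, predicted):
    """Per-class precision and recall from parallel label lists."""
    scores = {}
    for c in CLASSES:
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        scores[c] = (tp / (tp + fp) if tp + fp else 0.0,   # precision
                     tp / (tp + fn) if tp + fn else 0.0)   # recall
    return scores

def kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two labelings,
    usable both for system vs. gold and for annotator vs. annotator."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[c] * cb[c] for c in CLASSES) / n ** 2
    return (observed - expected) / (1 - expected)
```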
- If the primary objective were to identify the documents with negative sentiment, how might your proposed evaluation change?
- In this case, the distinction between the positive and neutral classes is unimportant. Recall should possibly be prioritized over precision on the negative class (one way to implement such a weighting is sketched below).
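One common way to operationalize weighting recall over precision (an assumption here; the notes do not name a specific metric) is the F-beta measure with beta > 1, applied to the negative class:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```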
- It is suggested to you that the classes automatically assigned by the sentiment lexicon approach could be used to provide training data for a Naive Bayes classifier. Could such a classifier perform better than the sentiment lexicon classifier which provided the decisions it was trained on? Explain your answer
- The Naive Bayes classifier could improve over the sentiment lexicon classifier in the circumstance where the sentiment lexicon was missing one or more words that were good cues for sentiment in this document set, provided that such words co-occurred sufficiently often with words that were in the sentiment lexicon.
- To see this, consider a toy example where the sentiment lexicon contains only the words good and bad. The document collection contains 1000 documents with both good and world-leading, 500 documents with only world-leading, 100 documents with only bad and 400 documents with none of these words. The test data has similar proportions: documents in the first two sets are all annotated as positive, those in the third set as negative and those in the fourth as neutral. Under these circumstances, the sentiment lexicon approach will classify the documents in the second set incorrectly, but the trained Naive Bayes classifier would classify them as positive.
- This amounts to using the sentiment lexicon as seeds for a semi-supervised classifier. There are more sophisticated approaches to this than simply using Naive Bayes on the automatically annotated data, but these have not been discussed in the course, so it is not expected that the answer includes details of such an approach (a sketch of the basic setup follows this list).
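A minimal sketch of this self-training setup, assuming multinomial Naive Bayes with add-alpha smoothing trained on the lexicon-assigned labels (the function names and the smoothing choice are illustrative):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes trained on automatically annotated data.

    `docs` are token lists; `labels` are the classes assigned by the
    sentiment lexicon classifier, so no hand annotation is needed.
    """
    classes = sorted(set(labels))
    prior = Counter(labels)                      # documents per class
    word_counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    def predict(tokens):
        best, best_logp = None, -math.inf
        for c in classes:
            total = sum(word_counts[c].values())
            logp = math.log(prior[c] / len(labels))
            for w in tokens:
                if w in vocab:                   # skip unseen words
                    logp += math.log((word_counts[c][w] + alpha) /
                                     (total + alpha * len(vocab)))
            if logp > best_logp:
                best, best_logp = c, logp
        return best

    return predict

# Usage: predict = train_nb(train_docs, lexicon_labels)
```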
Type vs. Token
In natural language processing (NLP), a "type" refers to a distinct word or symbol in a language, while a "token" refers to a specific instance of that word or symbol in a text. For example, in the sentence “the cat in the hat,” the word “the” is a single type, but it appears as two different tokens in the sentence. This distinction is important because it allows for the calculation of various statistics in NLP, such as the type-token ratio.
- Any unique word is a type. Any given instance of a type is a token.
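A minimal sketch of counting types and tokens and computing the type-token ratio, assuming whitespace tokenization and no case normalization (just for illustration):

```python
def type_token_ratio(tokens):
    """Ratio of distinct word types to total word tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "the cat in the hat".split()
print(len(set(tokens)), len(tokens))  # 4 types, 5 tokens
print(type_token_ratio(tokens))       # 0.8
```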
Provide a sentence with exactly 4 types and 5 tokens. Explain any assumptions you make about the nature of tokens.
Any 5-word sentence that contains one word type twice, e.g. “the cat likes the dog”. In this example we assume that punctuation is removed and that all word tokens are lowercased. If this were not the case, we could have a sentence like “A long long winding road.” (1 mark for sentence, 1 mark for discussion of tokens)