I ran across this corpus of Congressional speech that may be useful to some in the group. Here is a brief description:
This data includes speeches as individual documents, together with:
- automatically-derived labels for whether the speaker supported or opposed the legislation discussed in the debate the speech appears in, allowing for experiments with this kind of sentiment analysis
- indications of which “debate” each speech comes from, allowing for consideration of conversational structure
- indications of by-name references between speakers, and the scores that our agreement/disagreement classifier(s) automatically assigned to such references, allowing for experiments on agreement classification if one assigns “true” labels from the support/oppose labels assigned to the pair of speakers in question
- the edge weights and other information we derived to create the graphs we used for our experiments upon this data, facilitating implementation of alternative graph-based methods upon the graphs we constructed
The third bullet seems like it would be of particular interest.
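As a rough sketch of the experiment that bullet describes, here is how one might derive "true" agreement labels from the support/oppose labels and check them against the classifier scores. Everything here is an assumption for illustration: the speaker IDs, the `stance` and `references` structures, and the convention that a score above a threshold of 0 means "agree" are all made up, and the corpus's actual file format would need to be parsed into something like this first.

```python
from typing import Dict, List, Tuple

# Hypothetical data: support/oppose label per speaker in one debate.
stance: Dict[str, str] = {
    "speaker_a": "support",
    "speaker_b": "oppose",
    "speaker_c": "support",
}

# Hypothetical by-name references: (referring speaker, referenced speaker, score).
# Assumption: a positive score means the classifier predicts agreement.
references: List[Tuple[str, str, float]] = [
    ("speaker_a", "speaker_c", 0.8),
    ("speaker_b", "speaker_a", -0.5),
    ("speaker_c", "speaker_b", 0.3),
]


def true_agreement(stance: Dict[str, str], src: str, dst: str) -> bool:
    """A pair 'agrees' if both speakers carry the same support/oppose label."""
    return stance[src] == stance[dst]


def accuracy(
    stance: Dict[str, str],
    references: List[Tuple[str, str, float]],
    threshold: float = 0.0,
) -> float:
    """Fraction of references where the thresholded score matches the derived label."""
    if not references:
        raise ValueError("no references to evaluate")
    correct = 0
    for src, dst, score in references:
        predicted_agree = score > threshold
        if predicted_agree == true_agreement(stance, src, dst):
            correct += 1
    return correct / len(references)


print(accuracy(stance, references))
```

On this toy data, the first two references match the derived labels and the third does not, so the accuracy is 2 out of 3.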
Unfortunately, we are not using this corpus in my data mining class. But if you want to know which words most strongly indicate an unfavorable movie review, I should have a classifier that can tell you by next week.