Congressional Speech Corpus (including references to other members of Congress)

By Jason J. Jones

I ran across this corpus of Congressional speech that may be useful to some in the group.  Here is a brief description:

This data includes speeches as individual documents, together with:

  • automatically-derived labels for whether the speaker supported or opposed the legislation discussed in the debate the speech appears in, allowing for experiments with this kind of sentiment analysis
  • indications of which “debate” each speech comes from, allowing for consideration of conversational structure
  • indications of by-name references between speakers, and the scores that our agreement/disagreement classifier(s) automatically assigned to such references, allowing for experiments on agreement classification if one assigns “true” labels from the support/oppose labels assigned to the pair of speakers in question
  • the edge weights and other information we derived to create the graphs we used for our experiments upon this data, facilitating implementation of alternative graph-based methods upon the graphs we constructed

The third bullet seems like it would be of particular interest.
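As a rough illustration of the idea in that third bullet, here is a minimal sketch of assigning "true" agreement labels to by-name references from the support/oppose labels of the two speakers involved. The corpus's actual file layout is not described here, so the function name, the `stance` dictionary, and the `(source, target)` reference pairs are all hypothetical stand-ins, and the sketch assumes both speakers are labeled within the same debate.

```python
def true_agreement_labels(stance, references):
    """Label each by-name reference as 'agree' or 'disagree'.

    stance:     dict mapping a speaker id to 'support' or 'oppose'
                (hypothetical format; assumed to cover one debate)
    references: iterable of (source_speaker, target_speaker) pairs
    """
    labeled = []
    for source, target in references:
        # Skip references where either speaker has no stance label.
        if source not in stance or target not in stance:
            continue
        same_side = stance[source] == stance[target]
        labeled.append((source, target, "agree" if same_side else "disagree"))
    return labeled


# Toy data, not taken from the corpus.
stance = {"A": "support", "B": "oppose", "C": "support"}
references = [("A", "B"), ("A", "C"), ("B", "D")]
print(true_agreement_labels(stance, references))
```

Labels derived this way could then be compared against the scores the corpus authors' agreement/disagreement classifier assigned to the same references.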

Unfortunately, we are not using this corpus in my data mining class.  But if you want to know which words most likely indicate an unfavorable movie review, I should have a classifier that can tell you by next week.

5 responses to “Congressional Speech Corpus (including references to other members of Congress)”

  1. This looks really cool! Who posted this?

  2. Oops, I forgot to sign it. It was me.

  3. I think Gary King has done work on that data.

  4. For those who can’t wait, the most diagnostic words when attempting to classify movie reviews as positive or negative:
    bad
    worst
    stupid
    great
    ?
    boring
    most
    !
    truman
    very
    world
    war
    godzilla
    best
    mulan
    family
    perfect
    excellent
    ridiculous
    wasted
    wonderful
    seagal
    titanic
    awful
    flynt
    shrek
    &nbsp
    mess
    political
    batman

    These are based on chi-squared values, and not a fancy classifier (which is still a work in progress). I removed some stop words. I left some fun ones in (e.g. godzilla) that are probably deeply dependent on the corpus.

    Well, maybe not godzilla. If you’re reviewing the ’98 film, that’s probably a reliable predictor of a negative review.

  5. This dataset looks really useful. We could produce a lot of work with it.
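For anyone curious how the chi-squared ranking mentioned in comment 4 might be computed, here is a minimal sketch. It is not the commenter's actual code. It assumes each review is a plain string labeled "pos" or "neg", tokenizes on whitespace only (the list above evidently treated "?" and "!" as separate tokens, which this does not), and scores each word from its 2x2 table of word present/absent against positive/negative review. The function name and the toy reviews are made up for illustration.

```python
from collections import Counter


def chi_squared_by_word(docs, labels):
    """Return {word: chi-squared statistic} from document-level presence."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = n - n_pos

    # Count how many positive / negative documents contain each word.
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        tokens = set(text.lower().split())
        (pos_df if y == "pos" else neg_df).update(tokens)

    scores = {}
    for word in set(pos_df) | set(neg_df):
        a = pos_df[word]   # positive reviews containing the word
        b = neg_df[word]   # negative reviews containing the word
        c = n_pos - a      # positive reviews without the word
        d = n_neg - b      # negative reviews without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy reviews, not drawn from any real corpus.
docs = ["great great film", "bad film", "great fun", "bad mess"]
labels = ["pos", "neg", "pos", "neg"]
scores = chi_squared_by_word(docs, labels)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

Note that the statistic alone does not say which class a word favors. Comparing the positive and negative document counts for a word gives the direction.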
