Print This Post Print This Post

Computer Science for Poets: N-Gram Language Models

June 2nd, 2010 by edde addad | Filed under -NP-Software, -NP-Theory/Critical

Hi everyone!  Over at Gnoetry Daily we've been doing n-gram computer-assisted poetry generation for a while, and I decided to write up a little tutorial introducing n-grams and how they can be used for poetry generation.  It's posted below; I'm trying to make it easy to understand, so let me know if there's anything that's not clear!

1. Unigrams

I'll use Shakespeare as an example; either you'll like it, or you'll like that I'm shredding it.  So think of Shakespeare's line:
“When I do count the clock that tells the time.”
It contains several words:
1. when
2. I
3. do
4. count
5. the
6. clock
7. that
8. tells
9. the
10. time
Computational linguists might call each word a unigram, as part of a language model.

I could generate my own poetry by rolling a 10-sided dice:
- if a '3' came up, I'd write 'do'.
- if a '9' came up I'd write 'the'.
- if a '1' came up I'd write 'when'.
and so on.  After rolling the dice a couple times I might get something like:

do the when clock the when when that

Which sounds pretty incoherent, but this is just the first step!

2. Bigrams from a single line

Think of that line again:
“When I do count the clock that tells the time.”
It contains several adjacent pairs of words:
1. when I
2. I do
3. do count
4. count the
5. the clock
6. clock that
7. that tells
8. tells the
9. the time
Computational linguists might call these bigrams.

If I found a 9-sided dice, I could generate my own poetry again.
- For example, if an '8' came up, I'd write 'tells the'
- The previous bigram ends with the word 'the'.  So for the next word I'd look at those bigrams that begin with 'the'.  There are two of these:
5. ('the clock')
9. ('the time').
So I flip a coin, and I decide on 5.
So far my poem is:

tells the clock

- Then for the next word I'd look at those bigrams that begin with 'clock'. Unfortunately there is only one, 'clock that,' so I need to choose that.
- And after that, there is only one bigram that begins with 'that' ('that tells') so I need to choose that!
- And after that there is only one bigram that begins with 'tells' ('tells the') so I need to choose that.
- Luckily there are again two bigrams that begin with 'the' so I flip a coin and pick 'the clock'.

So thus far my poem is:

tells the clock that tells the clock

which is bad, but at least makes a bit more sense than the unigram-generated poem.

But the sequence 'the clock that tells the' is lifted straight from the original line of poetry (a.k.a. the training data) because there weren't a whole lot of options available.  This is the danger of using a small data set!  So let's look at a bit more data.  At this point, using a computer programs helps.

3. Bigrams from Shakespeare's Sonnets

Let's take the entire set of Shakespeare's sonnets from Project Gutenberg, and write a program to read the text file and break each line of poetry into bigrams as above.  There are a couple decisions we need to make when tokenizing (splitting the text into words) such as whether we should split things like “self-substantial” into two words or not, but the programming itself is pretty straightforward.

It turns out there are 11817 unique bigrams in Shakespeare's sonnets which occur a total of 17534 times.  For example:
- “eyes are” is a unique bigram which occurs 2 times
- “fragrant rose” is a unique bigram which occurs 1 time
- “even to” is a unique bigram which occurs 3 times
and so on.

So let's generate a poem!  How do we know where to start?  We keep a separate list of those words that begin a line of poetry!  Here is a list of the tokens that start lines in the Sonnets, along with the number of times they occur:

I 37, a 17, above 1, accuse 1, admit 1, advantage 1, after 2, against 10, ah 5, alack 1, alas 2, all 10, although 5, am 1, among 1, an 1, and 242, angry 1, anon 1, another 1, applying 1, are 3, art 1, as 40, askance 1, at 5, attending 1, authorizing 1, awakes 1, ay 2, bare 1, be 9, bear 1, bearing 2, beated 1, beauteous 1, beauty 2, beauty's 1, because 2, before 3, beggar'd 1, being 4, beshrew 1, better 1, betwixt 1, beyond 1, blessed 1, book 1, borne 1, both 3, bound 1, breathed 1, bring 1, but 89, buy 1, by 14, call'd 1, calls 1, came 2, can 3, cannot 1, canst 1, cheered 1, chiding 1, clouds 1, come 2, comes 1, commanded 1, commit 1, compar'd 1, compare 1, consum'd 1, coral 1, could 1, counting 1, crawls 1, creating 1, creep 1, cries 1, crooked 1, crowning 1, cupid 1, darkening 1, dear 1, death's 1, delights 1, describe 1, deserves 1, desire 1, desiring 1, despite 1, devouring 1, die 2, disdains 1, dissuade 1, distill'd 1, divert 1, do 4, doing 1, dost 3, doth 6, doubting 1, drawn 1, drink 1, drugs 1, dulling 1, duty 1, each 2, eat 1, either 1, enjoy'd 1, entitled 1, ere 2, eternal 1, even 10, exceeded 1, excuse 1, excusing 1, fair 3, fairing 1, farewell 1, featur'd 1, feed'st 1, feeding 1, feeds 1, find 1, finding 2, flatter 1, for 73, forgot 1, from 14, full 2, gainst 1, gentle 1, gilding 2, give 5, giving 1, gor'd 1, grant 1, great 1, growing 1, grows 1, had 2, hang 1, haply 1, happy 1, harsh 1, hast 1, hate 1, hath 6, have 6, he 7, hearing 1, hence 1, her 4, herein 1, hers 1, hiding 1, him 2, his 3, holds 1, how 21, hung 1, if 35, in 36, incapable 1, incertainties 1, increasing 1, injurious 1, intend 1, is 10, is't 1, it 6, join 1, just 1, kill 2, kind 1, kissing 1, knowing 2, laid 1, lascivious 1, lean 1, leaving 1, leese 1, lest 6, let 10, lifts 1, like 8, lilies 1, lo 3, look 6, looking 2, lord 1, lose 1, love 4, love's 3, loving 1, mad 2, made 2, make 6, makes 2, making 7, mark 2, may 3, me 2, methinks 1, might 1, mine 10, more 5, most 2, much 1, music 1, my 30, myself 2, naming 1, nativity 1, nature's 1, nay 2, needs 1, neither 1, never 1, no 17, none 1, nor 24, not 10, nothing 1, now 9, o 47, o'er 1, o'ercharg'd 1, oaths 1, of 20, on 6, one 5, only 1, or 38, others 1, our 2, painting 2, past 3, perforce 1, pitiful 1, pity 3, plods 1, pluck 1, pointing 1, points 1, poor 1, possessing 1, potions 1, praising 1, presents

1, presume 1, prison 1, profitless 1, proving 1, receiving 1, resembling 2, reserve 2, return 2, revenge 1, richer 1, rise 1, robb'd 1, robbing 1, root 1, roses 2, rough 1, ruin 1, sap 1, savage 1, save 7, say 1, seeking 1, seems 1, self 1, serving 1, sets 2, shall 10, she 3, shifts 1, show 1, showing 1, simply 1, sin 1, since 15, sing 1, sings 1, sinks 1, sland'ring 1, so 47, some 7, sometime 2, speak 2, speaking 1, spend'st 1, spending 1, steal 1, stealing 2, still 2, stirr'd 1, straight 1, strikes 1, such 5, suffering 1, suns 1, supposed 1, suspect 1, swear 1, sweet 4, sweets 1, take 3, tan 1, tell 1, tempteth 1, ten 1, than 13, that 83, that's 1, the 78, thee 1, their 1, theirs 1, then 33, thence 1, there 2, therefore 8, these 5, they 11, thine 4, think 1, this 8, those 7, thou 25, though 10, three 3, through 1, thus 8, thy 29, till 6, time 1, time's 1, tir'd 1, tired 1, tis 3, to 78, to-morrow 2, too 1, towards 1, triumph 1, truth 1, two 1, under 2, unlearned 1, unless 4, unlook'd 2, unmoved 1, unthrifty 1, until 1, upon 7, use 1, uttering 1, vaunt 1, want 1, was 4, we 1, weary 1, weeds 1, weighs 1, were 4, were't 1, wh'r 1, what 14, what's 2, whate'er 1, when 57, whence 1, where 11, wherein 2, whereon 1, whereto 3, which 47, while 4, whilst 11, who 18, whoe'er 1, whoever 1, whom 1, whose 6, why 12, will 4, wilt 1, wishing 1, with 18, within 5, without 5, wooing 1, worthy 1, wound 1, wretched 1, yet 20, you 7, your 5, yourself 2

So in Shakespeare's sonnets, the token “I” begins a line of verse 37 times; the token “how” begins a line of verse 21 times; the token “why” begins a line of verse 12 times, and so on.

Now let's say we get a lottery machine, and wrote “I” on 37 of those little ping-pong balls, and wrote “how” on 21 of the ping-pong balls, and “why” on 12 of the ping-pong balls, and so on.  Then we draw out one of the ping-pong balls, look at the token on it, and put it back.  Let's say the ping-pong ball we drew out said: “that”.  So thus far we have a poem that goes:

that

So now we can have our program look at the tokens that follow “that” in our corpus.  Here is a list of the tokens that follow “that” in Shakespeare's Sonnets:
I 28,  a 1,  able 1,  affable 1,  all 1,  am 1,  art 1,  audit 1,  barren 1,  be 1,  bears 2, beauteous 1,  beauty 3,  before 1,  besiege 1,  best 1,  better 1,  blessed 1,  bond 1,  bosom 1, brightness 1,  by 1,  calls 1,  can 1,  cannot 1,  censures 1,  churl 1,  copy 1,  deep 1, did 2,  do 2,  doth 3,  due 3,  ease 1,  enfeebled 1,  eternal 1,  ever 1,  every 2,  eyes 1, face 1,  fair 1,  fears 1,  feeds 1,  fell 1,  fester 1,  fire 1,  flies 1,  flower 1,  followed 1, for 2,  fresh 1,  from 1,  full 1,  gainst 1,  gave 1,  give 1,  glory 1,  god 1,  ground 1, grows 1,  guides 1,  harvest 1,  hath 1,  have 2,  having 1,  he 3,  heals 1,  heart 1,  heaven's 1, heavy 1,  her 1,  heretic 1,  hidden 1,  him 1,  his 1,  honour 1,  idle 1,  in 7,  ink 1, is 9,  it 2,  keeps 2,  languish'd 1,  leads 1,  leaves 1,  level 1,  liberty 1,  life 1, like 1,  long 1,  looks 1,  loss 1,  love 3,  love's 1,  loves 1,  made 1,  makes 2,  man's 1, may 1,  men 1,  millions 1,  mine 3,  more 1,  muse 1,  music 1,  my 6,  myself 1,  nimble 1, nothing 1,  offence 1,  on 2,  one 1,  other 1,  our 1,  over-goes 1,  pay 1,  pen 1,  pine 1, plea 1,  poor 1,  pour'st 1,  purpose 1,  putt'st 1,  receive 1,  record 1,  repose 1,  riches 1, right 1,  said 2,  same 1,  says 1,  seals 1,  shall 1,  she 5,  sin 1,  smells 1,  so 2, sometimes 1,  sorrow 1,  still 1,  struck 1,  sun 1,  sweet 3,  taught 1,  tells 2,  the 4, thee 1,  then 1,  there 1,  thereby 1,  they 2,  this 2,  thou 18,  through 1,  thy 2,  time 7, to 5,  tongue 3,  touches 1,  travels 1,  unfair 1,  use 1,  ushers 1,  vex'd 1,  vow'd 1, we 2,  wear 1,  weight 1,  well 1,  were 1,  when 2,  which 19,  wild 1,  with 1,  word 1, writ 1,  writes 1,  you 9,  your 3, yourself 1

So in our corpus, “that I” occurs 28 times, “that music” occurs once, “that word” occurs once, and so on.  So let's say we get our lottery machine to pick one of these words, and let's say it picked “loss.”  So thus far we have a poem that goes:

that loss

So now we can have our program look at the words that follow “loss” in our corpus, and we use our lottery machine to pick one of them. Let's say it picks “in” so we have:

that loss in

So now we can look at the words that follow “in”, and pick one of them. And so on, until we might get a line like:

that loss in youth doth appear and straight

which doesn't make a lot of sense altogether.  But the first part does.  And since we're humans, we're in charge, so we can fix it!  First, let's cut out the last three words, leaving us with:

that loss in youth doth

Then let's re-generate the poem, starting with picking a word that follows “doth,” then a word that follows that, and so on.  This time around, we might get:

that loss in youth doth bear all things remov'd

which is better than before.

4. Conclusions

So to sum up:

  • n-gram language models are built with things like bigrams and unigrams. They are built on training data.  They collect information about what words are seen next to each other, and how frequently those sequences of words are seen in the data.
  • you can generate poetry by having a program select words from the n-gram language model.  You can use the frequency counts to make it more likely that a given word will be picked.
  • if the line generated by sampling from a bigram model is coherent, it's because each pair of words is coherent.  (i.e. the language model does not contain bigrams like 'when when')  However, this may not be enough to generate poetry completely algorithmically.  You may need to have a human author intervene.

There's a lot more to n-grams, of course.  Wikipedia is a good place to start reading on it.  Otherwise, check out a copy of Jurafsky and Martin's “Speech and Language Processing.”

Hope that was interesting!  later.

zp8497586rq
Be Sociable, Share!
tag_icon

You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

3 Responses to “Computer Science for Poets: N-Gram Language Models”.

  1. heliopod :

    this is gorgeous fun. really interesting way of exploring poetic language….thanks heaps for posting this.

  2. Matt :

    I am not hunting nor hounding, simply sniffing out everything and keep crossing your scent! Thanks this is a wonderful intro!

  3. [...] sees in the source text. I didn’t write Gnoetry myself, but I wrote a netpoetic post explaining Gnoetry’s basic algorithm as I understand it. LikeBe the first to like this [...]