Now that we have learned some basic Python, we are ready to process texts. First we will write a program that accesses the contents of a given text and then analyzes it as follows: (1) separates the text into sentences and words, (2) counts the number of words in the text, (3) counts the number of unique words in the text, (4) counts the number of sentences, and (5) computes the average sentence length in words.
If this sounds familiar, that is because it was your Assignment #2. The first questions you need to consider are: what is a sentence, and, subsequently, what is a word?
Carefully study the program below to see how these questions were answered:
# File: HW2.py
# Written by: Deepak Kumar
# Date: September 19, 2005
#
import re

print 'Running program HW2.py...'
# open the text file and read its contents
f = open('HuckleberryFinn.txt')
content = f.read()
# Remove '\n' since lines do not matter (more or less)
noLine = re.sub('(\n)+', ' ', content)
# Next, sentence boundaries occur at '.', '!', '?', and ':'
# replace them with a '\n' thus giving one sentence per line
sLine = re.sub('[.!:?]', '\n', noLine)
# replace commas, dashes ('-'), and single quotes with a space
# remove semi-colons and double quotes (colons are already gone)
sLine2 = re.sub("[,\-']", ' ', sLine)
sLine3 = re.sub("[,';:]", '', sLine2)
sLine3 = re.sub('"', '', sLine3)
# split the text into sentences, each sentence is on one line
sentences = sLine3.split('\n')
nSentences = len(sentences) # number of sentences in text
print 'The text has ' + str(nSentences) + ' sentences.'
# extract words from the text
words = []
for s in sentences: # for each sentence
    words = words + s.split(' ') # get the words
words = [w for w in words if len(w) > 0] # remove empty strings
nWords = len(words) # total words in text
aveSentenceLength = nWords/nSentences # average sentence length (integer division)
print 'The text has a total of ' + str(nWords) + ' words in it.'
print 'Average sentence length is ' + str(aveSentenceLength) + ' words/sentence.'
# find out unique words, and count occurrences while at it!
wDict = {}
for w in words:
    if w in wDict:
        wDict[w] += 1
    else:
        wDict[w] = 1
print 'There are a total of ' + str(len(wDict)) + ' words in this text'
print 'End of processing.'
Notice the use of regular expressions and Python dictionaries. A dictionary helps identify unique words and can simultaneously be used to count the number of occurrences of each word. It may help to review Python dictionaries at this point. Make sure you understand how they are used, and review the available commands that you can use on dictionaries. A good way to get a list of commands on anything in Python is to use the help command:
help(dict)
A sample output of the above program is presented below:
Running program HW2.py...
The text has 6469 sentences.
The text has a total of 116536 words in it.
Average sentence length is 18 words/sentence.
There are a total of 6915 words in this text
End of processing.
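The same steps can also be sketched in Python 3 (the listing above uses Python 2 syntax). This is only a sketch: the function name analyze and the short sample string are made up for illustration, and, unlike HW2.py, it skips "sentences" that contain no words and splits on any whitespace rather than on single spaces.

```python
import re

def analyze(content):
    """Return (n_sentences, n_words, average_length, counts) for a text."""
    # Collapse newlines, then put one sentence per line, as HW2.py does.
    no_line = re.sub(r'\n+', ' ', content)
    s_line = re.sub(r'[.!:?]', '\n', no_line)
    # Commas, dashes, and apostrophes become spaces; other marks are dropped.
    s_line = re.sub(r"[,\-']", ' ', s_line)
    s_line = re.sub(r'[;"]', '', s_line)
    # Assumption: unlike HW2.py, ignore lines that contain no words at all.
    sentences = [s.split() for s in s_line.split('\n') if s.split()]
    words = [w for s in sentences for w in s]
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1  # dict.get supplies 0 for a new word
    average = len(words) / len(sentences) if sentences else 0.0
    return len(sentences), len(words), average, counts

# A made-up sample text, not taken from HuckleberryFinn.txt.
sample = "Tom ran. Tom hid! Where is Tom? Huck looked, and looked."
n_sentences, n_words, average, counts = analyze(sample)
print('The text has', n_sentences, 'sentences.')
print('The text has a total of', n_words, 'words in it.')
print('Average sentence length is', average, 'words/sentence.')
print('There are', len(counts), 'unique words.')
```

Note that in Python 3 the / operator performs true division, so the average here is 2.75 rather than a truncated integer.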
Now that wDict has all the unique words and their counts, we can play with it some more. Let us print out the first 10 entries in the dictionary:
>>> for i in range(10):
        print wDict.items()[i]

('cussed', 8)
('nunnery', 1)
('foun', 2)
('yellow', 1)
('four', 44)
('woods', 65)
('spiders', 8)
('hanging', 21)
('woody', 2)
('spidery', 1)
>>>
The output above shows that the word 'cussed' occurred a total of 8 times. Similarly, the word 'four' appears 44 times. What is the most frequently used word in this text? Let us find out:
First, we will make a list of all the frequency counts, and find out the largest value in that list:
>>> frequencies = wDict.values()
>>> max(frequencies)
6125
That is, some word appears 6125 times. But we also need to know what that word is:
>>> frequencies.index(6125)
5839
>>> wDict.items()[5839]
('and', 6125)
>>>
That is, the word 'and' is the most frequently used word.
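The session above relies on Python 2, where values() and items() return lists that can be searched and indexed. In Python 3 they return views, which have no index method and cannot be subscripted. A more direct approach that works in either version is sketched below. The small dictionary word_counts is a hand-made stand-in for wDict, with made-up counts.

```python
# A small hand-made dictionary standing in for wDict (illustrative counts).
word_counts = {'and': 5, 'the': 4, 'river': 2, 'raft': 1}

# max() with a key function returns the word whose count is largest.
most_common_word = max(word_counts, key=word_counts.get)
print(most_common_word, word_counts[most_common_word])

# Sorting the (word, count) pairs by count gives a full ranking.
ranked = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

This avoids the intermediate step of looking up the position of the largest count and then indexing back into the dictionary.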
It is also worth pointing out that the unique word counting algorithm used above is case-sensitive. That is, the words 'The' and 'the' are considered different. To test this, try:
>>> wDict['the']
4512
>>> wDict['The']
265
Exercise: Fix the above program so that it is no longer case-sensitive. (Hint: To convert a string S to lower case, do S.lower())
Most text processing begins by breaking up the text into words and/or sentences. Although breaking up a text into words is one of the most fundamental operations, the definition of what makes up a word varies widely from a language processing perspective. It is also helpful to identify not just words but also punctuation marks, which serve as important linguistic markers in a text. The task of breaking up a text into these pieces (whether they are words or non-words) is called tokenization. A token is the smallest lexical item in a text, whether it is a word or a non-word. nltk_lite provides some convenient, yet flexible, tokenization facilities. You can read about them in the tutorial that accompanies nltk_lite.
Below, we will summarize the tokenization facilities. The following text will be used as a sample (taken from a Top Story at The Onion, theonion.com, September 25, 2005):
The season premiere of The Oprah Winfrey Show unleashed a surprise for
viewers Monday, when host Winfrey presented her studio audience with an
unexpected gift: eligible men.
"Everybody gets a man! Everybody gets a man!" said Winfrey,
almost drowned out by cries of disbelief as 276 men, one for every member
of the studio audience, filed onto the Oprah set.
Hoping to top last year's season-debut surprise, when members of the studio
audience received free cars, Winfrey watched elated as the men knelt before
their awestruck new mates and delivered gallant kisses and professions of
undying affection.
"Signed, sealed, delivered... they're yours!" Winfrey said.
Hand-picked by Winfrey and her staff, the men range in age from 29 to 63 and
were described by assistant producer Sally Heffernan-Ross as "great catches"
with semi-professional to professional careers and stable personalities.
First, let's read the text:
>>> text = open('Oprah.txt').read()
>>> text
'The season premiere of The Oprah Winfrey Show unleashed
a surprise for\nviewers Monday, when host Winfrey presented her studio audience
with an\nunexpected gift: eligible men. \n\n"Everybody gets a man! Everybody gets a man!" said
Winfrey,\nalmost drowned out by cries of disbelief as 276 men, one for every
member\nof the studio audience, filed onto the Oprah set. \n\nHoping to top
last year\'s season-debut surprise, when members of the studio\naudience received
free cars, Winfrey watched elated as the men knelt before\ntheir awestruck
new mates and delivered gallant kisses and professions of\nundying affection.\n\n"Signed,
sealed, delivered... they\'re yours!" Winfrey said.\n\nHand-picked by
Winfrey and her staff, the men range in age from 29 to 63 and\nwere described
by assistant producer Sally Heffernan-Ross as "great catches"\nwith
semi-professional to professional careers and stable personalities. \n\n'
>>>
In order to use the tokenization facilities of nltk_lite, you have to import:
from nltk_lite import tokenize
The tokenize package contains several facilities. One of them can be used to split a string of text (like the one above) into paragraphs (which can be identified by the presence of blank lines):
>>> paragraphs = list(tokenize.blankline(text))
>>> paragraphs
['The season premiere of The Oprah Winfrey Show unleashed
a surprise for\nviewers Monday, when host Winfrey presented her studio audience
with an\nunexpected gift: eligible men.', '"Everybody gets a man! Everybody gets a man!" said
Winfrey,\nalmost drowned out by cries of disbelief as 276 men, one for every
member\nof the studio audience, filed onto the Oprah set.', "Hoping to
top last year's season-debut surprise, when members of the studio\naudience
received free cars, Winfrey watched elated as the men knelt before\ntheir awestruck
new mates and delivered gallant kisses and professions of\nundying affection.",
'"Signed, sealed, delivered... they\'re yours!" Winfrey said.', 'Hand-picked
by Winfrey and her staff, the men range in age from 29 to 63 and\nwere described
by assistant producer Sally Heffernan-Ross as "great catches"\nwith
semi-professional to professional careers and stable personalities.']
>>> len(paragraphs)
5
That is, the text above has 5 paragraphs. You can also tokenize by whitespace (' ', TAB, or a newline):
>>> tokens = list(tokenize.whitespace(text))
>>> tokens
['The', 'season', 'premiere', 'of', 'The', 'Oprah', 'Winfrey',
'Show', 'unleashed', 'a', 'surprise', 'for', 'viewers', 'Monday,', 'when',
'host', 'Winfrey', 'presented', 'her', 'studio', 'audience', 'with', 'an',
'unexpected', 'gift:', 'eligible', 'men.', '"Everybody', 'gets', 'a', 'man!', 'Everybody', 'gets', 'a', 'man!"',
'said', 'Winfrey,', 'almost', 'drowned', 'out', 'by', 'cries', 'of', 'disbelief',
'as', '276', 'men,', 'one', 'for', 'every', 'member', 'of', 'the', 'studio',
'audience,', 'filed', 'onto', 'the', 'Oprah', 'set.', 'Hoping', 'to', 'top',
'last', "year's", 'season-debut', 'surprise,', 'when', 'members',
'of', 'the', 'studio', 'audience', 'received', 'free', 'cars,', 'Winfrey',
'watched', 'elated', 'as', 'the', 'men', 'knelt', 'before', 'their', 'awestruck',
'new', 'mates', 'and', 'delivered', 'gallant', 'kisses', 'and', 'professions',
'of', 'undying', 'affection.', '"Signed,', 'sealed,', 'delivered...', "they're",
'yours!"', 'Winfrey', 'said.', 'Hand-picked', 'by', 'Winfrey', 'and',
'her', 'staff,', 'the', 'men', 'range', 'in', 'age', 'from', '29', 'to', '63',
'and', 'were', 'described', 'by', 'assistant', 'producer', 'Sally', 'Heffernan-Ross',
'as', '"great', 'catches"', 'with', 'semi-professional', 'to', 'professional',
'careers', 'and', 'stable', 'personalities.']
>>> len(tokens)
138
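Notice that punctuation marks were retained in the tokens above, attached to the neighboring words. You can also tokenize by specifying a regular expression that describes what makes up a valid token (here, a run of word characters or a single punctuation character):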
>>> tokens = list(tokenize.regexp(text, '\w+|[^\w\s]'))
>>> tokens
['The', 'season', 'premiere', 'of', 'The', 'Oprah', 'Winfrey',
'Show', 'unleashed', 'a', 'surprise', 'for', 'viewers', 'Monday', ',', 'when',
'host', 'Winfrey', 'presented', 'her', 'studio', 'audience', 'with', 'an',
'unexpected', 'gift', ':', 'eligible', 'men', '.', '"', 'Everybody', 'gets', 'a', 'man', '!',
'Everybody', 'gets', 'a', 'man', '!', '"', 'said', 'Winfrey', ',', 'almost',
'drowned', 'out', 'by', 'cries', 'of', 'disbelief', 'as', '276', 'men', ',',
'one', 'for', 'every', 'member', 'of', 'the', 'studio', 'audience', ',', 'filed',
'onto', 'the', 'Oprah', 'set', '.', 'Hoping', 'to', 'top', 'last', 'year', "'",
's', 'season', '-', 'debut', 'surprise', ',', 'when', 'members', 'of', 'the',
'studio', 'audience', 'received', 'free', 'cars', ',', 'Winfrey', 'watched',
'elated', 'as', 'the', 'men', 'knelt', 'before', 'their', 'awestruck', 'new',
'mates', 'and', 'delivered', 'gallant', 'kisses', 'and', 'professions', 'of',
'undying', 'affection', '.', '"', 'Signed', ',', 'sealed', ',', 'delivered',
'.', '.', '.', 'they', "'", 're', 'yours', '!', '"', 'Winfrey',
'said', '.', 'Hand', '-', 'picked', 'by', 'Winfrey', 'and', 'her', 'staff',
',', 'the', 'men', 'range', 'in', 'age', 'from', '29', 'to', '63', 'and', 'were',
'described', 'by', 'assistant', 'producer', 'Sally', 'Heffernan', '-', 'Ross',
'as', '"', 'great', 'catches', '"', 'with', 'semi', '-', 'professional',
'to', 'professional', 'careers', 'and', 'stable', 'personalities', '.']
>>> len(tokens)
177
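The three kinds of tokenization shown so far (by blank lines, by whitespace, and by regular expression) can be approximated with Python's standard re module and string methods. The following is a sketch in Python 3 on a shortened, made-up sample. It imitates the outputs above and is not the nltk_lite implementation.

```python
import re

# A shortened sample in the spirit of the text above.
sample = ('Everybody gets a man!\n\n'
          '"Signed, sealed,\ndelivered... they\'re yours!" Winfrey said.\n')

# Paragraphs: split wherever a blank line occurs.
paragraphs = [p for p in re.split(r'\s*\n\s*\n\s*', sample.strip()) if p]

# Whitespace tokens: split() with no argument splits on spaces, tabs, newlines.
whitespace_tokens = sample.split()

# Regular-expression tokens: a run of word characters, or one punctuation mark.
regexp_tokens = re.findall(r'\w+|[^\w\s]', sample)

print(len(paragraphs), paragraphs)
print(len(whitespace_tokens), whitespace_tokens)
print(len(regexp_tokens), regexp_tokens)
```

As in the outputs above, the whitespace tokens keep punctuation attached ('man!'), while the regular expression separates each punctuation mark into its own token.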
You can also tokenize by separating sequences of alphanumeric characters from sequences of punctuation characters:
>>> tokens = list(tokenize.wordpunct(text))
>>> tokens
['The', 'season', 'premiere', 'of', 'The', 'Oprah', 'Winfrey',
'Show', 'unleashed', 'a', 'surprise', 'for', 'viewers', 'Monday', ',', 'when',
'host', 'Winfrey', 'presented', 'her', 'studio', 'audience', 'with', 'an',
'unexpected', 'gift', ':', 'eligible', 'men', '.', '"', 'Everybody', 'gets', 'a', 'man', '!',
'Everybody', 'gets', 'a', 'man', '!"', 'said', 'Winfrey', ',', 'almost',
'drowned', 'out', 'by', 'cries', 'of', 'disbelief', 'as', '276', 'men', ',',
'one', 'for', 'every', 'member', 'of', 'the', 'studio', 'audience', ',', 'filed',
'onto', 'the', 'Oprah', 'set', '.', 'Hoping', 'to', 'top', 'last', 'year', "'",
's', 'season', '-', 'debut', 'surprise', ',', 'when', 'members', 'of', 'the',
'studio', 'audience', 'received', 'free', 'cars', ',', 'Winfrey', 'watched',
'elated', 'as', 'the', 'men', 'knelt', 'before', 'their', 'awestruck', 'new',
'mates', 'and', 'delivered', 'gallant', 'kisses', 'and', 'professions', 'of',
'undying', 'affection', '.', '"', 'Signed', ',', 'sealed', ',', 'delivered',
'...', 'they', "'", 're', 'yours', '!"', 'Winfrey', 'said',
'.', 'Hand', '-', 'picked', 'by', 'Winfrey', 'and', 'her', 'staff', ',', 'the',
'men', 'range', 'in', 'age', 'from', '29', 'to', '63', 'and', 'were', 'described',
'by', 'assistant', 'producer', 'Sally', 'Heffernan', '-', 'Ross', 'as', '"',
'great', 'catches', '"', 'with', 'semi', '-', 'professional', 'to', 'professional',
'careers', 'and', 'stable', 'personalities', '.']
>>> len(tokens)
173
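The two token lists above differ only where punctuation marks are adjacent: '...' is one token here but three tokens ('.', '.', '.') in the regular-expression output, and '!"' is one token instead of two. With one '...' and two '!"' in the text, that accounts for the difference between 177 and 173 tokens. The sketch below reproduces the effect with Python's re module. The pattern with the trailing + is an assumption chosen to match the output shown, not something taken from the nltk_lite source.

```python
import re

phrase = '"Signed, sealed, delivered... they\'re yours!" Winfrey said.'

# One punctuation mark per token.
single_marks = re.findall(r'\w+|[^\w\s]', phrase)
# Adjacent punctuation marks stay together in one token.
grouped_marks = re.findall(r'\w+|[^\w\s]+', phrase)

print(len(single_marks), len(grouped_marks))

# Show only the multi-character punctuation tokens.
multi = [t for t in grouped_marks if len(t) > 1 and not t[0].isalnum()]
print(multi)
```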
While studying morphology, we learned the Porter algorithm for stemming words. You can stem words in a text as follows:
>>> tokens = list(tokenize.whitespace(text))
>>> porter = tokenize.PorterStemmer()
>>> stemmed = [porter.stem(token) for token in tokens]
>>> stemmed
['The', 'season', 'premier', 'of', 'The', 'Oprah', 'Winfrey',
'Show', 'unleash', 'a', 'surpris', 'for', 'viewer', 'Monday,', 'when', 'host',
'Winfrey', 'present', 'her', 'studio', 'audienc', 'with', 'an', 'unexpect',
'gift:', 'elig', 'men.', '"Everybodi', 'get', 'a', 'man!', 'Everybodi', 'get', 'a', 'man!"',
'said', 'Winfrey,', 'almost', 'drown', 'out', 'by', 'cri', 'of', 'disbelief',
'as', '276', 'men,', 'one', 'for', 'everi', 'member', 'of', 'the', 'studio',
'audience,', 'file', 'onto', 'the', 'Oprah', 'set.', 'Hope', 'to', 'top', 'last', "year'",
'season-debut', 'surprise,', 'when', 'member', 'of', 'the', 'studio', 'audienc',
'receiv', 'free', 'cars,', 'Winfrey', 'watch', 'elat', 'as', 'the', 'men',
'knelt', 'befor', 'their', 'awestruck', 'new', 'mate', 'and', 'deliv', 'gallant',
'kiss', 'and', 'profess', 'of', 'undi', 'affection.', '"Signed,', 'sealed,',
'delivered...', "they'r", 'yours!"', 'Winfrey', 'said.', 'Hand-pick',
'by', 'Winfrey', 'and', 'her', 'staff,', 'the', 'men', 'rang', 'in', 'age',
'from', '29', 'to', '63', 'and', 'were', 'describ', 'by', 'assist', 'produc',
'Salli', 'Heffernan-Ross', 'as', '"great', 'catches"', 'with', 'semi-profession',
'to', 'profession', 'career', 'and', 'stabl', 'personalities.']
>>> len(stemmed)
138
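In the stemmed list above, 'audience' becomes 'audienc' when it stands alone but remains 'audience,' when a comma is attached. Likewise, 'surprise,' is left unstemmed while 'surprise' becomes 'surpris'. One possible remedy is to remove punctuation from the ends of whitespace tokens before stemming. The following Python 3 sketch shows only that cleaning step. The helper name strip_punct is hypothetical, and no stemmer is called here.

```python
import string

# A few whitespace tokens taken from the output above.
tokens = ['studio', 'audience,', 'surprise,', '"Signed,',
          'delivered...', 'personalities.']

def strip_punct(token):
    """Remove punctuation from both ends of a token (hypothetical helper)."""
    return token.strip(string.punctuation)

cleaned = [strip_punct(t) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were only punctuation
print(cleaned)
```

Punctuation inside a token, as in "they're" or 'Hand-picked', is left untouched by this approach.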
This should give you sufficient background in the word-level and tokenization facilities we will need for text processing. In the end, it is likely that you will employ a combination of these techniques to satisfy the needs of the processing task at hand.
More to come...