How do you reduce an article or other selection of text to a concise list of its important keywords?
For instance, from the above sentence, I probably only need the following words:
reduce
article
selection
text
concise
list
important
keywords
That could easily enough be done, I suppose, with a list of words to remove or exclude from consideration. But another problem comes up with different forms of words:
selection, selected, selections, etc...
That's where "word stemming" comes into play. Each word can be reduced to a core set of characters. "Select" might be a good 'stem' for all the words listed above.
Martin Porter developed an algorithm to perform this kind of word stemming over 20 years ago. It's been tested and refined, and he's published many specific implementations of the algorithm in various languages.
http://www.tartarus.org/~martin/PorterStemmer/
Groovy beans.
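To make the two ideas so far concrete, here is a minimal Python sketch that drops excluded words and then stems whatever is left. The exclusion list, the suffix list, and the names STOP_WORDS, SUFFIXES, naive_stem and keywords are all made up for illustration. The suffix stripping is a toy stand-in, not Porter's algorithm, which applies a much more careful sequence of rules.

```python
import re

# Hypothetical exclusion list; a real one would be much longer.
STOP_WORDS = {
    "how", "do", "you", "an", "or", "other", "of", "to", "a", "its", "the",
}

# Toy suffix list, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text):
    """Lowercase, drop excluded words, stem the rest, and keep first-seen order."""
    seen = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        stem = naive_stem(word)
        if stem not in seen:
            seen.append(stem)
    return seen


sentence = ("How do you reduce an article or other selection of text "
            "to a concise list of its important keywords?")
print(keywords(sentence))
# ['reduce', 'article', 'select', 'text', 'concise', 'list', 'important', 'keyword']
```

With this sketch, "selection", "selected" and "selections" all come out as "select". It also over-strips plenty of ordinary words ("speed" becomes "spe"), which is exactly the kind of case a tested algorithm like Porter's is built to handle.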
But here's where I see a need for refinement: proper nouns. I don't want to see Microsoft stemmed to something else. Or IBM, or other company or people names. Maybe simply adjusting the process to skip capitalized words...? But then how do we account for words at the start of sentences? And what if Microsoft is at the start of a sentence?
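One possible way to handle the capitalization question, sketched in Python under some loud assumptions: treat a capital letter as evidence of a proper noun only when the word is not the first in its sentence, protect all-caps words as acronyms, and fall back to a small list of known names for the sentence-initial case. KNOWN_NAMES, is_protected and split_protected are hypothetical names, and the sentence splitting is deliberately crude.

```python
import re

# Hypothetical list of names to protect even at the start of a sentence.
KNOWN_NAMES = {"Microsoft", "IBM"}


def is_protected(word, starts_sentence):
    """Guess whether a word is a proper noun that should not be stemmed."""
    if word in KNOWN_NAMES:
        return True
    if len(word) > 1 and word.isupper():
        return True  # treat all-caps words as acronyms
    # A capital letter only counts as evidence away from the sentence start.
    return word[0].isupper() and not starts_sentence


def split_protected(text):
    """Return (protected, stemmable) word lists for the given text."""
    protected, stemmable = [], []
    # Crude sentence split: break on whitespace that follows . ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        for i, word in enumerate(words):
            if is_protected(word, starts_sentence=(i == 0)):
                protected.append(word)
            else:
                stemmable.append(word.lower())
    return protected, stemmable


sample = "Microsoft released updates. Selections were shipped by IBM and Martin."
print(split_protected(sample))
# (['Microsoft', 'IBM', 'Martin'],
#  ['released', 'updates', 'selections', 'were', 'shipped', 'by', 'and'])
```

The known-names list is what rescues "Microsoft" at the start of a sentence, so the approach is only as good as that list. A name it has never seen will still be sent to the stemmer when it opens a sentence, and a mid-sentence "I" is wrongly protected.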
I can see that I must think more on this issue. And maybe search more, to see who else has already solved it. I love the Internet!