MLUG: [MLUG - DISCUSSION] Re: [MLUG] AI - any suggestions?

moved this msg from the main list since this is getting way off-topic.

Michael wrote:
 
> The general successful methodology to processing human language can be
> broken down into steps. The first is to break your input stream down into
> usable tokens. The easiest way is usually to split a line by alpha and
> non-alpha characters. Tokens in human language are much fuzzier than in
> computer languages. Some may be spelled differently, broken into several
> tokens, or missing portions.
> 
> It is often wise to scan your initial token list and attempt to correct
> these errors. Misspelled words can often be corrected by comparing their
> Soundex codes against those of known words. In an environment where typos
> are likely it can also be useful to attempt to fix common typos. In some
> environments you might face deliberate misspellings such as those hacker
> wannabes use, and again these tend to follow a common encoding method,
> allowing for correction.
> 
> You can rejoin tokens that have been broken by looking for common reasons
> why they might have been broken. Hyphenated words such as 'eighty-five',
> shortened words such as '85th', and regular old typos such as 'inco rect'
> are common things to look for.
> 
> Most tokens that are missing portions are probably typos and will usually
> be fixed by fixing typos. Other reasons can be the use of slang and
> jargon. Usually you can map these to the correct words.

seems to me you're concentrating on things of little relevance to nlp
here. mapping tokens into dictionary words is a largely mechanical task.
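
to show how mechanical: a minimal python sketch of the steps quoted above
(split on alpha/non-alpha, then map unknown tokens onto dictionary words
by soundex code). the vocabulary and the helper names (tokenize, soundex,
build_index, correct) are made up for illustration, and taking the first
word with a matching code is a simplification, not a recommendation.

```python
import re

def tokenize(line):
    """Split a line into runs of letters and runs of non-letter, non-space characters."""
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", line)

# Letter-to-digit table for standard American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(word):
    """Return the first letter plus three digits, padded with zeros."""
    word = word.lower()
    if not word:
        return ""
    code = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]

def build_index(vocabulary):
    """Group known words by their Soundex code."""
    index = {}
    for word in sorted(vocabulary):
        index.setdefault(soundex(word), []).append(word)
    return index

def correct(token, vocabulary, index):
    """Return the token if known, else the first known word with the same code."""
    token = token.lower()
    if token in vocabulary:
        return token
    candidates = index.get(soundex(token), [])
    return candidates[0] if candidates else token

# Hypothetical toy vocabulary.
VOCAB = {"the", "word", "incorrect", "window"}
INDEX = build_index(VOCAB)
```

with this toy vocabulary, tokenize("the 85th wrod") gives
['the', '85', 'th', 'wrod'], and correct("wrod", VOCAB, INDEX) gives
'word'. note that the stray 'th' split off from '85th' shares a soundex
code with 'the', so broken tokens have to be rejoined before this step.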


> Once you have your token list cleaned up you'll likely want to look for
> known words of importance. This is typically done by passing the token
> list through a function that maps them to keys in the word database. You'll
> also probably want to look for common grammatical tokens that lend
> extra meaning to sentence structure. Quoted sections, commas, and various
> other forms of punctuation are important. This combined with word
> order lends quite a bit to understanding the intent of a sentence. Usually
> these elements will be tied to database keys also.

this is the area where i'd like to see much more explained before i can
accept your claim that "parsing nl is easy". if i am correct, you're
suggesting a phrase lookup kind of nlp system here (perhaps with some
lemmatization, i.e. reducing words to their stems). this is all
fine if you have a domain limited to people asking for directions to
known places on the map (see yahoo map engine). the moment you're gonna
try to map a larger domain in this manner your nlp system will start
printing "42" in response to every other question, because it'll be
something you have not predicted and put in the lookup table. not to
mention that once your table gets large, it'll take forever to maintain
and use. that's why the symbolic models were developed - to translate
spoken language into some reduced symbol set for the purpose of
extracting meaning from it. so efficiency in the long run is weakness
number one. 
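
for illustration, a minimal python sketch of a phrase lookup system of
this kind. the word keys, the two canned answers and the function names
are invented here, not taken from michael's description. anything that
was not put in the table falls through to the default.

```python
import re

# Hypothetical word database: surface words mapped to keys.
WORD_KEYS = {
    "directions": "ROUTE", "route": "ROUTE", "way": "ROUTE",
    "library": "LIBRARY", "station": "STATION",
}

# Hypothetical lookup table: a set of keys mapped to a canned response.
RESPONSES = {
    frozenset({"ROUTE", "LIBRARY"}): "the library is two blocks north.",
    frozenset({"ROUTE", "STATION"}): "the station is across the river.",
}

DEFAULT = "42"

def to_keys(text):
    """Map the known words of a sentence to keys; unknown words are dropped."""
    words = re.findall(r"[a-z]+", text.lower())
    return frozenset(WORD_KEYS[w] for w in words if w in WORD_KEYS)

def respond(text):
    """Return the canned response for the key set, or the default."""
    return RESPONSES.get(to_keys(text), DEFAULT)
```

respond("which way to the library?") hits the table. respond("is the
library open on sunday?") and respond("is the left car seat wet or
dry?") both come back as "42", because nobody predicted them.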

the system you describe also has one other basic weakness - it'll fail
miserably on simple tasks of the type:
- given the fact that I have a truck, a 4x4, a bicycle and a horse, how
many cars do i have?
or
- i left the passenger-side window down today, and it was raining. is
the left car seat wet or dry?
...or any other type of task where it is required that meaning be
extracted from nl information. 
i may be incorrect on my assumptions as to what you were attempting to
present here, but i'd very much like to see how you'd go about tackling
this sort of task with the system you described.
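
for contrast, a minimal python sketch of the first question. the is-a
table is a made-up assumption, including the decision that a truck and a
4x4 count as cars; a different taxonomy gives a different answer. the
point is only that the answer comes from relations between symbols,
which a plain word match never sees.

```python
# Hypothetical is-a relations: each symbol points to its parent category.
IS_A = {
    "truck": "car",
    "4x4": "car",
    "car": "vehicle",
    "bicycle": "vehicle",
    "horse": "animal",
}

def is_a(thing, category):
    """Follow the is-a chain upward and report whether it reaches category."""
    while thing is not None:
        if thing == category:
            return True
        thing = IS_A.get(thing)
    return False

def count_by_keyword(possessions, word):
    """What a word-matching system sees: literal occurrences only."""
    return sum(1 for p in possessions if p == word)

def count_by_category(possessions, category):
    """What a symbolic model can answer: members of the category."""
    return sum(1 for p in possessions if is_a(p, category))

POSSESSIONS = ["truck", "4x4", "bicycle", "horse"]
```

under these assumptions count_by_keyword(POSSESSIONS, "car") is 0 and
count_by_category(POSSESSIONS, "car") is 2.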

> 
> In some cases the grammar of a sentence might be unknown or seemingly
> inappropriate and words may be completely unknown. In such cases it is
> usually best to handle them as if they don't exist at all by means of
> defaults.

symbolic systems do that as well. unassigned symbols are called "free
variables" and the proportion of free variables is a measure of the
degree of an nlp system's success.
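
as a rough python sketch of that measure: the symbol table and the
function name are hypothetical, and counting per token is an assumption
about how the proportion would be taken.

```python
def free_variable_ratio(tokens, symbol_table):
    """Return the fraction of tokens that have no assigned symbol."""
    if not tokens:
        return 0.0
    free = [t for t in tokens if t.lower() not in symbol_table]
    return len(free) / len(tokens)

# Hypothetical symbol table mapping words to symbols.
SYMBOLS = {"the": "DET", "window": "WINDOW", "down": "OPEN"}
```

for the tokens ["the", "window", "was", "down"] only "was" is
unassigned, so the ratio is 0.25.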

 
> Having turned our token stream into a key stream we can pass the key
> stream on to the portion of our program that processes this stream. How
> this is done depends largely on the intent of the program. Using key
> matching it is reasonably easy to script match/response events. More
> true-AI systems can process key input in neural nets to find combinations
> by which they can make some sense. Either way the basic steps of decoding
> an English stream into language tokens into meaningful data keys is very
> similar.

once again, you're trivializing the most important part about natural
language processing. i remember that your original statement was about
the ease of "parsing," but you cannot parse a statement into a stream of
words and expect the expert system to make sense of it - how you parse
determines what kind of information you'll be able to extract. and your
argument here cannot be verified because you're trivializing the process
of extracting meaning (while devoting several paragraphs to discussion
on tokenizing).

/paul