
User:Mahde jade/Sandbox

From Wikipedia, the free encyclopedia

Introduction

Successor variety stemmers (Hafer and Weiss 1974) are based on work in structural linguistics which attempted to determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.[1] The stemming method based on this work uses letters in place of phonemes, and a body of text in place of phonemically transcribed utterances.

Hafer and Weiss formally defined the technique as follows:

Let $\alpha$ be a word of length $n$, and let $\alpha_i$ be the prefix of $\alpha$ of length $i$. Let $D$ be the corpus of words. $D_{\alpha_i}$ is defined as the subset of $D$ containing those terms whose first $i$ letters match $\alpha_i$ exactly. The successor variety of $\alpha_i$, denoted $S_{\alpha_i}$, is then defined as the number of distinct letters that occupy position $i + 1$ of the words in $D_{\alpha_i}$. A test word of length $n$ has $n$ successor varieties $S_{\alpha_1}, S_{\alpha_2}, \ldots, S_{\alpha_n}$.

In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider a body of text consisting of the following words, for example.

able, axle, accident, ape, about.

To determine the successor varieties for "apple," for example, the following process would be used. The first letter of apple is "a." "a" is followed in the text body by four characters: "b," "x," "c," and "p." Thus, the successor variety of "a" is four. The next successor variety for apple would be one, since only "e" follows "ap" in the text body, and so on. When this process is carried out using a large body of text (Hafer and Weiss report 2,000 terms to be a stable number), the successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached. At this point, the successor variety will sharply increase. This information is used to identify stems.
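The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not Hafer and Weiss's implementation: the function name `successors` is hypothetical, and the sketch assumes that only letters count as successors, so the end of a word is not counted.

```python
def successors(prefix, corpus):
    """Return the set of distinct letters that follow `prefix` in `corpus`.

    Assumption: the end of a word is not counted as a successor.
    """
    prefix = prefix.lower()
    followers = set()
    for word in corpus:
        word = word.lower()
        # Only words strictly longer than the prefix have a following letter.
        if word.startswith(prefix) and len(word) > len(prefix):
            followers.add(word[len(prefix)])
    return followers


corpus = ["able", "axle", "accident", "ape", "about"]
test_word = "apple"

# Print the successor variety of every prefix of the test word.
for i in range(1, len(test_word) + 1):
    prefix = test_word[:i]
    letters = successors(prefix, corpus)
    print(prefix, len(letters), sorted(letters))
```

With this five-word text body, "a" has four successors and "ap" has one, as in the walkthrough; longer prefixes of "apple" have none because no word in the text body begins with "app".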

Successor Variety approach


The approach determines word and morpheme boundaries based on the distribution of phonemes in a large body of utterances, with letters standing in for phonemes and a body of text for the utterances. The successor variety of a string is the number of different characters that follow it in words in some body of text. The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached.

Segmentation methods


1 Using the cutoff method, some cutoff value is selected for successor varieties and a boundary is identified whenever the cutoff value is reached. The problem with this method is how to select the cutoff value: if it is too small, incorrect cuts will be made; if it is too large, correct cuts will be missed.

2 With the peak and plateau method, a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it. This method removes the need for the cutoff value to be selected.

3 In the complete word method, a break is made after a segment if the segment is a complete word in the corpus.

4 The entropy method takes advantage of the distribution of successor variety letters. The method works as follows. Let $|D_{\alpha_i}|$ be the number of words in a text body beginning with the $i$-length sequence of letters $\alpha_i$. Let $|D_{\alpha_i j}|$ be the number of words in $D_{\alpha_i}$ with the successor $j$. The probability that a member of $D_{\alpha_i}$ has the successor $j$ is given by $|D_{\alpha_i j}| / |D_{\alpha_i}|$. The entropy of $D_{\alpha_i}$ is

$$H_{\alpha_i} = \sum_{j} -\frac{|D_{\alpha_i j}|}{|D_{\alpha_i}|} \log_2 \frac{|D_{\alpha_i j}|}{|D_{\alpha_i}|}$$

where the sum runs over the successor letters $j$. Using this equation, a set of entropy measures can be determined for a word. A set of entropy measures for predecessors can also be defined similarly. A cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.
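The entropy measure and its cutoff can be illustrated with a short Python sketch. The function names `successor_entropy` and `entropy_cutoff_breaks` are hypothetical, and two assumptions are made that the text does not settle: probabilities are normalised over the words that actually have a successor letter, and "reaching" the cutoff is read as the entropy being greater than or equal to it.

```python
import math
from collections import Counter


def successor_entropy(prefix, corpus):
    """Entropy (in bits) of the letters that follow `prefix` in `corpus`.

    Assumption: words equal to `prefix` have no successor letter and are
    left out of the normalisation.
    """
    counts = Counter(
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over successors j of p_j * log2(1 / p_j)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_cutoff_breaks(word, corpus, cutoff):
    """Positions i such that the entropy after word[:i] is >= cutoff.

    Assumption: a boundary is placed when the entropy meets or exceeds
    the cutoff; no boundary is placed after the whole word.
    """
    return [
        i for i in range(1, len(word))
        if successor_entropy(word[:i], corpus) >= cutoff
    ]


corpus = ["able", "axle", "accident", "ape", "about"]
print(successor_entropy("a", corpus))
print(successor_entropy("ap", corpus))
print(entropy_cutoff_breaks("apple", corpus, cutoff=1.0))
```

Unlike the plain successor variety, the entropy takes frequency into account: a prefix followed by one very common letter and several rare ones scores lower than a prefix whose successors are evenly spread.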


Successor Variety example

Test Word: READABLE

Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE.

Using the complete word segmentation method, the test word "READABLE" will be segmented into "READ" and "ABLE," since READ appears as a word in the corpus. The peak and plateau method would give the same result.

After a word has been segmented, the segment to be used as the stem must be selected. Hafer and Weiss used the following rule:

    if (first segment occurs in <= 12 words in corpus)
        first segment is stem
    else
        second segment is stem

The check on the number of occurrences is based on the observation that if a segment occurs in more than 12 words in the corpus, it is probably a prefix. The authors report that because of the infrequency of multiple prefixes in English, no segment beyond the second is ever selected as the stem. Using this rule in the example above, READ would be selected as the stem of READABLE.

In summary, the successor variety stemming process has three parts: (1) determine the successor varieties for a word, (2) use this information to segment the word using one of the methods above, and (3) select one of the segments as the stem.

The aim of Hafer and Weiss was to develop a stemmer that required little or no human processing. They point out that while affix removal stemmers work well, they require human preparation of suffix lists and removal rules. Their stemmer requires no such preparation. The retrieval effectiveness of the Hafer and Weiss stemmer is discussed below.

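The stem selection rule quoted above could be written in Python roughly as follows. The function name `select_stem` and the parameter `max_occurrences` are hypothetical. The sketch also assumes that "occurs in" means "is the beginning of", since the first segment is always word-initial; the source does not say whether occurrences elsewhere in a word should count.

```python
def select_stem(segments, corpus, max_occurrences=12):
    """Pick the stem from a segmented word using the 12-occurrence rule.

    Assumption: a segment "occurs in" a corpus word when that word
    starts with the segment.
    """
    first = segments[0]
    if len(segments) == 1:
        # Nothing to choose between when the word was not split.
        return first
    occurrences = sum(1 for word in corpus if word.startswith(first))
    # A first segment seen in many words is probably a prefix,
    # so the second segment is taken as the stem instead.
    return first if occurrences <= max_occurrences else segments[1]


corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
print(select_stem(["READ", "ABLE"], corpus))
```

In the example corpus READ begins four words, well under the threshold, so it is kept as the stem. Lowering `max_occurrences` below four would make the same call return ABLE.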
Prefix      Successor Variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   BLANK
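
The table and the two segmentations of READABLE can be reproduced with a short Python sketch. All function names here are hypothetical. The sketch counts only letters as successors, so it gives 0 for the full word READABLE where the table lists 1 for BLANK; the other rows match. The complete word check is applied to prefixes of the test word, which is one possible reading of the method.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def split_at(word, cuts):
    """Split `word` into segments at the given cut positions."""
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


def successor_varieties(word, corpus):
    """Successor variety of each prefix word[:1], ..., word[:n]."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Assumption: only letters count; end of word is not a successor.
        letters = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append(len(letters))
    return varieties


def peak_and_plateau(word, corpus):
    """Cut after a prefix whose variety exceeds both of its neighbours."""
    sv = successor_varieties(word, corpus)
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    return split_at(word, cuts)


def complete_word(word, corpus):
    """Cut after every proper prefix that is itself a corpus word."""
    vocabulary = set(corpus)
    cuts = [i for i in range(1, len(word)) if word[:i] in vocabulary]
    return split_at(word, cuts)


print(successor_varieties("READABLE", CORPUS))
print(peak_and_plateau("READABLE", CORPUS))
print(complete_word("READABLE", CORPUS))
```

Both methods cut after READ: its successor variety of 3 is higher than the 1 on either side, and READ is also a complete word in the corpus.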