May 07 2008

The text prediction engine

Andrea @ 10:39

Among its other features, Farfalla includes a text prediction engine. Text prediction is a key aid to users in many different fields, for example when typing on mobile phones.
Several text prediction algorithms have been proposed in the literature, and interesting projects have been developed recently. Since one of Farfalla's goals is to minimize the movements of the user's hands and fingers (and hence the number of keystrokes), an effective text prediction system is one of the main focuses of the software. At the same time, the system must remain as fast and easy to use as possible: prediction must not slow down typing or become too intrusive (for example, by suggesting too many words, which would confuse the user and slow her down).
For this reason a simple text prediction approach was chosen. The system suggests a single word that is a possible completion of the characters typed so far. The suggested completion is shown in gray and in a smaller size than the characters already typed. As reported in the previous section, the system draws its suggestions from two different archives. The first, called the standard archive, contains a standard dictionary with the frequency of each word in a given language (in this case Italian). This dictionary is supplemented with new words entered by the user: a second archive, called the personal archive, contains the words the user entered in previous sessions, each with its frequency and its recency, that is, the date and time of its last occurrence expressed as a UNIX timestamp.
Two archives are used for two main reasons. The standard archive is useful when the user first starts working with Farfalla, and it supplements the first words she enters. Furthermore, this archive acts (at least when Farfalla is used on the web) as a large shared dictionary: when one user enters a new word, all other users can take advantage of it. This could become a problem with typos and typing mistakes, because the database would fill up with junk words. To address this, the program is planned to include a filter that detects and periodically deletes user-entered words with both a low frequency and a low recency value.
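The planned filter could look roughly like the following Python sketch. It is only an illustration: the function name `prune_archive`, the two thresholds and the layout of the archive (a dictionary mapping each word to a `(frequency, last_used)` pair) are assumptions, not part of Farfalla.

```python
def prune_archive(archive, now, min_frequency=2, max_age_seconds=90 * 24 * 3600):
    """Return a copy of `archive` without rarely used, long-unused words.

    `archive` maps word -> (frequency, last_used), where `last_used` is a
    UNIX timestamp. Both thresholds are hypothetical defaults.
    A word is dropped only if it is BOTH infrequent and old.
    """
    kept = {}
    for word, (frequency, last_used) in archive.items():
        is_rare = frequency < min_frequency
        is_old = (now - last_used) > max_age_seconds
        if not (is_rare and is_old):
            kept[word] = (frequency, last_used)
    return kept
```

Requiring both conditions means that a word typed only once is not deleted while it is still recent, which gives it time to be used again.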
Although this process could occasionally delete correct words as well, these would be rare or uncommon words, so the prediction system should not be affected. The personal archive is useful because it contains the words entered by each individual user. Every user enters a different set of words, depending on her preferences, her level of education, her job and so on. This set therefore represents the words the user habitually writes, and they are very important to her because they form something like a lexical toolkit. While the user is typing, the system looks for a word to suggest first in the personal archive, then in the standard archive. The candidate words are those that have the partially typed text as a prefix.
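The lookup order described above can be sketched in a few lines of Python. The names `candidates` and `candidate_pool` are hypothetical, and the archives are assumed to be plain dictionaries keyed by word; a real implementation would more likely query a database.

```python
def candidates(prefix, archive):
    """Return the words in `archive` that start with `prefix`."""
    return [word for word in archive if word.startswith(prefix)]


def candidate_pool(prefix, personal, standard):
    """Look in the personal archive first, then in the standard one.

    Returns a pair (source, words), where source names the archive
    that supplied the candidates.
    """
    pool = candidates(prefix, personal)
    if pool:
        return "personal", pool
    return "standard", candidates(prefix, standard)
```

With a personal archive containing only "farfalla", the prefix "far" is served from the personal archive, while "ca" falls through to the standard dictionary.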
Next, the system selects the word with the highest score from the pool of candidate words. The score of each candidate is calculated in real time, as each character is typed.
The score is calculated as the percentage that the word's frequency represents of the sum of the frequencies of all candidate words. To this percentage a second value is added, computed in the same way for recency: the word's recency value as a percentage of the sum of the recency values of all candidate words. Computing this score requires expressing time values as integers, so the UNIX timestamp of each word's last insertion was adopted. If the user's input does not match any word in the personal archive, the suggestion algorithm switches to the standard archive, simply suggesting the most frequent word and adding the new word with its frequency set to 1.
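As a rough illustration of the scoring just described, here is a minimal Python sketch. All names (`score_candidates`, `suggest`, `record_word`) and the data layout are assumptions made for the example; it also uses fractions between 0 and 1 instead of percentages, which does not change the ranking.

```python
def score_candidates(entries):
    """Score each candidate as frequency share plus recency share.

    `entries` maps word -> (frequency, last_used), where `last_used`
    is a UNIX timestamp. Both shares are taken over the candidates only.
    """
    total_frequency = sum(freq for freq, _ in entries.values())
    total_recency = sum(last for _, last in entries.values())
    scores = {}
    for word, (freq, last) in entries.items():
        freq_share = freq / total_frequency if total_frequency else 0.0
        recency_share = last / total_recency if total_recency else 0.0
        scores[word] = freq_share + recency_share
    return scores


def suggest(prefix, personal, standard):
    """Suggest one completion for `prefix`, or None if nothing matches."""
    pool = {w: e for w, e in personal.items() if w.startswith(prefix)}
    if pool:
        scores = score_candidates(pool)
        return max(scores, key=scores.get)
    # Fallback: most frequent matching word in the standard archive,
    # assumed here to map word -> frequency.
    matches = {w: f for w, f in standard.items() if w.startswith(prefix)}
    return max(matches, key=matches.get) if matches else None


def record_word(word, personal, now):
    """Store a typed word: frequency 1 if new, otherwise incremented."""
    frequency, _ = personal.get(word, (0, now))
    personal[word] = (frequency + 1, now)
```

One thing the sketch makes visible: raw UNIX timestamps of recently used words differ very little relative to their size, so the recency shares of the candidates come out almost equal and frequency tends to decide the ranking.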
