Probabilistic Latent Semantic Analysis
by Sean Colledge

With more information being digitized and stored in digital libraries every day, and with communication networks reaching more users over greater distances, it is no surprise that a huge repository of textual data has become openly available to the public. The resulting challenge is how users can intelligently structure this mass of textual information and search it to retrieve only the documents relevant to a query, and do so in a feasible amount of time. Probabilistic Latent Semantic Analysis (PLSA), a popular and widely used automatic document indexing algorithm, has proven very successful in Machine Learning research. However, its results have not been measured against those obtained from users manually indexing documents, and therefore a relation between the two has not been shown.

Research Objectives
The goal of this project was the semi-automatic construction of a usage-sensitive hierarchical knowledge structure, using data obtained from PLSA combined with data obtained from a flat tag cloud structure. The results of running this automatic document indexing algorithm on a large collection of textual documents can be combined with the underlying structure of a user-created tag cloud and used to strengthen the associations between the terms of each document. It was also hoped that the results would show a correlation between the associations found using the PLSA data and those found using data from the Manual Tagging System, which uses tag cloud structures.

Background
The PLSA algorithm is an extension of another automatic document indexing algorithm, Latent Semantic Analysis (LSA). LSA has been researched and implemented for many years, with successful results in identifying synonyms, spelling variations, abbreviations and other valid ways of naming the same entity, and matching them back to their original counterparts. In 1999, however, T. Hofmann published the PLSA technique as an improvement on the original LSA algorithm. It was shown to be more accurate and to rest on a sound statistical foundation, since the model is fitted by maximising a log-likelihood function. It has since become the more popular automatic document indexing algorithm used in industry for query systems that index extremely large databases of information.
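
For reference, the model behind PLSA can be written compactly. The notation below (d for a document, w for a term, z for a latent topic, and n(d, w) for the number of times w occurs in d) follows the usual presentation of the aspect model and is an assumption here, since this summary does not fix any symbols.

```latex
\begin{align}
P(d, w) &= P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \\
\mathcal{L} &= \sum_{d} \sum_{w} n(d, w)\, \log P(d, w)
\end{align}
```

Fitting the model means choosing P(w | z) and P(z | d) to maximise the log-likelihood, typically with the Expectation-Maximisation algorithm. Every term enters the objective through its count n(d, w), so term frequency directly weights the fit.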

Design and Implementation
The PLSA algorithm was designed and implemented as a MATLAB script. Although it could be implemented in other programming languages, such as C++ or Java, MATLAB's built-in matrix operations allow the algorithm to be implemented efficiently in a reasonable number of lines of code.
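
As a rough illustration of what such a script has to compute, here is a minimal sketch of PLSA fitted by Expectation-Maximisation. It is written in plain Python rather than MATLAB and is not the project's script. The function names `plsa`, `normalise` and `log_likelihood`, their parameters, and the toy count matrix are all assumptions made for this example.

```python
import math
import random


def normalise(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)  # fall back to a uniform row
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d][w] is how often term w occurs in document d.

    Returns (p_w_z, p_z_d): p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    # Random positive starting values; every row is a probability distribution.
    p_w_z = [normalise([rng.random() + 0.01 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalise([rng.random() + 0.01 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        acc_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_terms):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(joint)
                if total == 0:
                    continue
                for z in range(n_topics):
                    # Expected number of occurrences of w in d owed to topic z.
                    expected = n_dw * joint[z] / total
                    acc_w_z[z][w] += expected
                    acc_z_d[d][z] += expected
        # M-step: renormalise the expected counts into probabilities.
        p_w_z = [normalise(row) for row in acc_w_z]
        p_z_d = [normalise(row) for row in acc_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum of n(d,w) * log P(w|d); the constant P(d) term is left out."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                ll += n_dw * math.log(p)
    return ll


# Made-up toy corpus: rows are documents, columns are term counts.
toy_counts = [[4, 3, 0, 0], [3, 5, 0, 1], [0, 0, 4, 4], [0, 1, 3, 5]]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
print(log_likelihood(toy_counts, p_w_z, p_z_d))
```

Each row of `p_z_d` gives one document's topic mixture, and each row of `p_w_z` gives one topic's distribution over terms. In MATLAB, or in any array library, the inner loops collapse into a handful of matrix operations, which is presumably the advantage referred to above.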

Testing
In order to test the algorithm, the following procedures were followed:
    • Questionnaires were handed out to Computer Science MSc students to obtain a human's perception of what the topics of a document should be.
    • Questionnaires were handed out to obtain data on how users linguistically relate words to each other.
    • PLSA data was compared with data obtained manually through flat tag cloud structures and user-entered data.

The data from these techniques was then analysed and evaluated to determine the accuracy of the PLSA algorithm and whether its output truly reflects how humans perceive the same data.

Results
Due to time constraints, too few MSc students could be interviewed to obtain enough data to give a true reflection of the accuracy of the algorithm. However, the results that were obtained showed that the PLSA algorithm relies too heavily on the frequency of the terms in each document (the number of times each term appears in a document), whereas users intuitively do not take frequency into account when determining the topics of a document. It was also found that using the differences between PLSA probabilities to show the linguistic relation between terms was unsuccessful: the technique did not reproduce the relations found in the user data from the questionnaires.
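
This summary does not state how the differences between PLSA probabilities were computed. Purely to illustrate the general idea, and not as the project's actual method, the sketch below scores two terms by the mean absolute difference of their P(w|z) values across topics. The function name `term_difference`, the term list, and the probability table are all hypothetical.

```python
def term_difference(p_w_z, term_a, term_b):
    """Mean absolute difference between two terms' P(w|z) across all topics."""
    diffs = [abs(topic[term_a] - topic[term_b]) for topic in p_w_z]
    return sum(diffs) / len(diffs)


# Hypothetical P(w|z) table: 2 topics (rows) by 4 terms (columns).
# Each row sums to 1. The values are invented for illustration only.
terms = ["matrix", "vector", "query", "index"]
p_w_z = [
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.4, 0.5],
]

same_topic = term_difference(p_w_z, 0, 1)   # "matrix" vs "vector"
cross_topic = term_difference(p_w_z, 0, 3)  # "matrix" vs "index"
print(terms[0], terms[1], same_topic)
print(terms[0], terms[3], cross_topic)
```

A smaller score is read as a closer relation. Note that a score of this kind depends on the absolute size of the probabilities, so a frequent and a rare term that belong to the same topic can still come out far apart.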

 
© 2007 UCT Computer Science Honours Project