Tuesday, January 8, 2019
Advanced Data Structure Project
CSCI4117 Advanced Data Structure Project Proposal
Yejia Tong/B00537881
2012.11.5

1. Title of Project

Succinct data structure in top-k documents retrieval

2. Objective of Research

The main purpose of this project is to discover how to efficiently find the k documents in which a given pattern occurs most frequently. While the problem has been discussed in many papers and solved in various ways, our research looks for novel algorithms and (succinct) data structures among recently published related material, and aims to find the one that dominates almost all of the space/time tradeoff.

3. Background/History of the Study

Before we begin designing such a succinct data structure, there are a number of fundamental works to review. Among the many ideas in classic information retrieval, two are central: the inverted index and term frequency. (Angelos, Giannis, Epimeneidis, Euripides, & Evangelos, 2005) The inverted index, also referred to as a postings file, is an index data structure storing a mapping from content, such as words, to its locations in a document or a set of documents. It is the most widely used data structure in the information retrieval domain, used on a large scale in search engines, for example. Term frequency is a measure of how often a term is found in a collection of documents. However, these ideas are efficient only under restrictive assumptions: the text must be easily tokenized into words, there must not be too many different words, and queries must be whole words or phrases. These assumptions cause considerable difficulty for document retrieval across various languages. On the other hand, one of the attractive properties of an inverted file is that it is easily compressible while still supporting fast queries. In practice, an inverted file occupies space close to that of a compressed document collection. (Niko & Veli, 2007)

In further development, researchers found efficient data structures such as suffix arrays and suffix trees (full-text indexes) that provide good space/time efficiency as an alternative to inverted files. Recently, several compressed full-text indexes have been proposed and shown to be effective in practice as well. A generalized suffix tree is a suffix tree for a set of strings. Given the set of strings D = S(1), S(2), ..., S(d) of total length n, it is a Patricia tree containing all n suffixes of the strings. It can be built in O(n) time and space, and can be used to find all k occurrences of a string P of length m in O(m + k) time. (Bieganski, 1994)

This brings us close to our original need, document retrieval. Matias et al. gave the first efficient solution to the Document Listing problem: with O(n) time preprocessing of a collection D of documents d(1), d(2), ..., d(k) of total length Sum d(i) = n, they could efficiently answer the document listing query on a pattern P of length m. (Matias, Muthukrishnan, Sahinalp, & Ziv, 1998) The algorithm uses a generalized suffix tree augmented with extra edges, making it a directed acyclic graph. However, it requires O(n log n) bits, which is significantly more than the collection size. Later on, Niko V. and Veli M. presented an alternative space-efficient variant of Muthukrishnan's structure that takes fewer bits while keeping optimal query time. (Niko & Veli, 2007) Based on this background, we finally come to our main topic: succinct data structures in top-k documents retrieval.

4. Approach to the Study

According to the background study above, the suffix tree is used to reduce space consumption. In the suffix tree document model, a document is considered as a string consisting of words, not characters. While constructing the suffix tree, each suffix of a document is compared to all suffixes that already exist in the tree in order to find the position at which to insert it.

Hon W. K., Shah R. and Wu S. B. introduced the first efficient solution for top-k document retrieval. (Hon, Shah, & Wu, 2009) In order to filter out the many noisy results in a large collection, the algorithm adds a minimum term frequency as one of the parameters for a highly relevant pattern P. (Hon, Shah, & Wu, 2009) Furthermore, they also developed the f-mine problem for high relevancy, in which only documents that have more than f occurrences of the pattern need to be retrieved. The notion of relevance here is only the term frequency.

In the later study, Efficient Index for Retrieving Top-k Most Frequent Documents, Hon W. K., Shah R. and Wu S. B. derived their solution from that of a related problem by Muthukrishnan (Matias, Muthukrishnan, Sahinalp, & Ziv, 1998), answering queries with bounded query time and space. The approach is based on a novel use of the suffix tree called the induced generalized suffix tree (IGST). (Hon, Shah, & Wu, 2009) The practicality of the proposed index is validated by experimental results.

5. Future Works

Since all the fundamental works are settled, our future analysis of succinct data structures in top-k documents retrieval will be mainly based on the most recent accomplishment by Gonzalo N. and Daniel V. (Gonzalo & Daniel, 2012), a new top-k algorithm dominating almost all of the space/time tradeoff.

6. References

Angelos, H., Giannis, V., Epimeneidis, V., Euripides, P. G., & Evangelos, M. (2005). Information Retrieval by Semantic Similarity. Dalhousie University, Faculty of Computer Science. Halifax.

Bieganski, P. (1994). Generalized suffix trees for biological sequence data: applications and implementation. Minnesota University, Dept. of Comput. Sci. Minneapolis.

Gonzalo, N., & Daniel, V. (2012). Space-Efficient Top-k Document Retrieval. Univ. of Chile, Dept. of Computer Science. Valdivia.

Hon, W. K., Shah, R., & Wu, S. B. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. Springer, Heidelberg.

Matias, Y., Muthukrishnan, S., Sahinalp, S. C., & Ziv, J. (1998). Augmenting suffix trees with applications. 6th Annual European Symposium on Algorithms (ESA 1998) (pp. 67-78). Springer-Verlag.

Niko, V., & Veli, M. (2007). Space-efficient Algorithms for Document Retrieval. University of Helsinki, Department of Computer Science. Finland.
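
For illustration only, the problem stated in Section 2 (report the k documents in which a pattern occurs most frequently) can be written as a brute-force baseline. The sketch below is not any of the indexes from the cited papers: it scans every document at query time, which is exactly the cost that suffix-tree-based indexes are meant to avoid. The function names `count_occurrences` and `top_k_documents` and the sample collection are hypothetical, and overlapping occurrences are assumed to count separately.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one character so overlapping matches are found too.
        pos = text.find(pattern, pos + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return up to k (document index, term frequency) pairs.

    Pairs are ordered by decreasing frequency; ties are broken by the
    smaller document index. Documents without the pattern are skipped.
    This is a naive scan over the whole collection (hypothetical helper).
    """
    hits = []
    for index, doc in enumerate(docs):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            hits.append((index, tf))
    # Select the k best hits: highest frequency first, then lowest index.
    return heapq.nsmallest(k, hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    collection = ["banana bandana", "ananas", "cabana", "apple"]
    print(top_k_documents(collection, "ana", 2))
```

A query costs time proportional to the total length of the collection, regardless of k. The structures surveyed above preprocess the collection once so that query time depends mainly on the pattern length and the number of reported documents instead.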