Effective construction of relative Lempel-Ziv dictionaries
Conference item
Liao, Kewen, Petri, Matthias, Moffat, Alistair and Wirth, Anthony. (2016). Effective construction of relative Lempel-Ziv dictionaries. WWW '16: 25th International World Wide Web Conference. Switzerland: Association for Computing Machinery. pp. 807 - 816 https://doi.org/10.1145/2872427.2883042
Authors | Liao, Kewen, Petri, Matthias, Moffat, Alistair and Wirth, Anthony |
---|---|
Abstract | Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%. |
Year | 2016 |
Publisher | Association for Computing Machinery |
Digital Object Identifier (DOI) | https://doi.org/10.1145/2872427.2883042 |
Scopus EID | 2-s2.0-84989913349 |
Publisher's version | File Access Level Controlled |
Book title | WWW '16: Proceedings of the 25th International Conference on World Wide Web |
Page range | 807 - 816 |
ISBN | 9781450341431 |
Research Group | Peter Faber Business School |
Place of publication | Switzerland |
https://acuresearchbank.acu.edu.au/item/8q49y/effective-construction-of-relative-lempel-ziv-dictionaries
Restricted files
Publisher's version
193
total views0
total downloads11
views this month0
downloads this month