Effective construction of relative Lempel-Ziv dictionaries

Conference item


Liao, Kewen, Petri, Matthias, Moffat, Alistair and Wirth, Anthony. (2016) Effective construction of relative Lempel-Ziv dictionaries. WWW '16: 25th International World Wide Web Conference. Switzerland: Association for Computing Machinery. pp. 807 - 816 https://doi.org/10.1145/2872427.2883042
AuthorsLiao, Kewen, Petri, Matthias, Moffat, Alistair and Wirth, Anthony
Abstract

Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.

Year2016
PublisherAssociation for Computing Machinery
Digital Object Identifier (DOI)https://doi.org/10.1145/2872427.2883042
Scopus EID2-s2.0-84989913349
Publisher's version
File Access Level
Controlled
Book titleWWW '16: Proceedings of the 25th International Conference on World Wide Web
Page range807 - 816
ISBN9781450341431
Research GroupPeter Faber Business School
Place of publicationSwitzerland
Permalink -

https://acuresearchbank.acu.edu.au/item/8q49y/effective-construction-of-relative-lempel-ziv-dictionaries

Restricted files

Publisher's version

  • 19
    total views
  • 0
    total downloads
  • 2
    views this month
  • 0
    downloads this month
These values are for the period from 19th October 2020, when this repository was created.

Export as

Related outputs

Temporal betweenness centrality in dynamic graphs
Tsalouchidou, Ioanna, Baeza-Yates, Ricardo, Bonchi, Francesco, Liao, Kewen and Sellis, Timos. (2020) Temporal betweenness centrality in dynamic graphs. International Journal of Data Science and Analytics. 9(3), pp. 257 - 272. https://doi.org/10.1007/s41060-019-00189-x
Contextual community search over large social networks
Chen, Lu, Liu, Chengfei, Liao, Kewen, Li, Jianxin and Zhou, Rui. (2019) Contextual community search over large social networks. 35th International Conference on Data Engineering (ICDE). Macao, China: IEEE Computer Society. pp. 88 - 99 https://doi.org/10.1109/ICDE.2019.00017
A fast algorithm for optimally finding partially disjoint shortest paths
Guo, Longkun, Deng, Yunyun, Liao, Kewen, He, Qiang, Sellis, Timos and Hu, Zheshan. (2018) A fast algorithm for optimally finding partially disjoint shortest paths. Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, July 13-19, 2018. Stockholm, Sweden: International Joint Conferences on Artificial Intelligence. pp. 1456 - 1462 https://doi.org/10.24963/ijcai.2018/202
A cost model for long-term compressed data retention
Liao, Kewen, Moffat, Alistair, Petri, Matthias and Wirth, Anthony. (2017) A cost model for long-term compressed data retention. 10th International Conference on Web Search and Data Mining. United States of America: Association for Computing Machinery. pp. 241 - 249 https://doi.org/10.1145/3018661.3018738
Edge labeling schemes for graph data
Goonetilleke, Oshini, Koutra, Danai, Sellis, Timos and Liao, Kewen. (2017) Edge labeling schemes for graph data. 29th International Conference on Scientific and Statistical Database Management. United States of America: Association for Computing Machinery. pp. 1 - 12 https://doi.org/10.1145/3085504.3085516
Effect of off-zenith observations on reducing the impact of precipitation on ground-based microwave radiometer measurement accuracy
Xu, Guirong, Stick Ware, Randolph, Zhang, Wengang, Feng, Guangliu, Liao, Kewen and Liu, Yibing. (2014) Effect of off-zenith observations on reducing the impact of precipitation on ground-based microwave radiometer measurement accuracy. Atmospheric Research. 140-141, pp. 85 - 94. https://doi.org/10.1016/j.atmosres.2014.01.021