# Review 1

## Overall evaluation 1: weak accept

This is a re-submission from the first round. I don't see any major changes, so my evaluation of the paper remains the same.

# Review 2

## Overall evaluation 2: accept

PAPER SYNOPSIS
--------------

This paper studies the subset sampling (SS) problem: given a set S of n records where each record t has an associated probability p(t), devise a data structure that allows one to (repeatedly) sample a subset X of S where each t in S is sampled into X independently with probability p(t). The paper makes the following contributions:

(1) In the Real RAM model, it considers the SS problem in the dynamic setting where tuples in S may be added and deleted, and may have their probability updated. A data structure is proposed that takes O(n) space and allows subset sampling with expected time complexity O(1 + mu_S), where mu_S is the expected sample size. It was shown in earlier work that this time complexity is optimal. Furthermore, the data structure can be dynamically maintained in O(1) amortized expected update, insert, and delete time. Technically, the data structure is obtained by reducing (twice) the input problem to a problem of size O(log log n) and maintaining a data structure for that reduced problem, using a precomputed lookup table technique. The size reduction is based on a binning argument, dividing the input set S into bins of bounded probabilities.

(2) In the external memory RAM model, it considers the SS problem in the static setting, where a data structure of size O(n/B) is proposed -- with B the EM block size -- and subset samples can be obtained in attractive amortized expected I/Os. Technically, this result is less deep than the previous one: it follows almost straightforwardly from combining a previous data structure for uniform set sampling with the binning idea from (1).
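To make the binning argument concrete for readers unfamiliar with it, here is a minimal sketch of the core idea: group records so that bin j holds probabilities in (2^-(j+1), 2^-j], sample candidates within a bin via geometric skips at the bin's upper bound, and accept each candidate by rejection. This is only an illustration of the technique, not the paper's full O(1 + mu_S) structure; all function names are illustrative.

```python
import math
import random

def build_bins(items):
    """Group (item, p) pairs so that bin j holds 2**-(j+1) < p <= 2**-j."""
    bins = {}
    for item, p in items:
        j = int(math.floor(-math.log2(p)))
        bins.setdefault(j, []).append((item, p))
    return bins

def subset_sample(bins):
    """Sample each item independently with its own probability p.

    Within bin j, every p is at most q = 2**-j, so we jump to the next
    candidate with a geometric skip at rate q (expected O(1) work per
    candidate), then keep the candidate with probability p/q.
    """
    out = []
    for j, group in bins.items():
        q = 2.0 ** (-j)
        i, n = 0, len(group)
        while i < n:
            if q < 1.0:
                u = 1.0 - random.random()   # u in (0, 1], avoids log(0)
                i += int(math.log(u) / math.log(1.0 - q))
            if i >= n:
                break
            item, p = group[i]
            if random.random() < p / q:     # rejection step
                out.append(item)
            i += 1
    return out
```

The expected work per bin is proportional to one plus the number of candidates, which is how the binning idea yields output-sensitive sampling time.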
(3) In the Real RAM model, an extension to (1) is considered where records in S additionally have a key, and where the task now is to return subset samples of records whose keys are in a specified range [a,b]. Here, a data structure of size O(n) is provided with O(log n + mu_{S intersect [a,b]}) expected sampling time and O(log n) expected amortized update, insert, and delete time. Technically, the results build on (1) and combine it with a treap data structure to obtain the range selection.

(4) In the Real RAM model, a variant of (1) is proposed where an element is associated with a weight instead of a probability, and each element should now be drawn with probability proportional to the relative weight of the element (relative to the total weight of all elements). In the dynamic setting, we can now change elements' weights, as well as insert and remove elements. In the dynamic setting the weighted problem version differs from (1) because changing a single weight may affect the relative probabilities of *all* elements in S. Technically, the data structure is again obtained by a repeated binning argument, dividing the input set S into bins of bounded weights.

RECOMMENDATION
--------------

This is a resubmission of a paper by the same name submitted in the first round. Compared to the first round, the paper has a new introduction that better motivates the importance of subset sampling and its variants in databases. In addition, the weighted variant of the SS problem (variant (4) above) and its solution is novel content that was not in the previous version.

All algorithms follow a similar strategy (reduce the problem to one of a smaller size with bounded probabilities, then iterate until the problem becomes small enough). While one may take this as a limitation in the depth of the contributions, I personally do not see this as a weakness, but find the solution strategy rather elegant. It is nice that it adapts to the different SS variants.
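The difficulty in the weighted variant (4) is easiest to see from a naive sampler: since p(t) = w(t)/W with W the total weight, updating one weight shifts the inclusion probability of every element. The sketch below is a deliberately naive O(n)-per-query illustration of these semantics, not the paper's data structure; the class and method names are invented for illustration.

```python
import random

class NaiveWSS:
    """Naive weighted subset sampler: item t is drawn independently with
    probability w(t) / W, where W is the current total weight.

    Because W appears in every inclusion probability, a single call to
    update() changes the probability of *all* stored items, which is why
    the dynamic weighted problem needs more than the plain SS machinery.
    """

    def __init__(self):
        self.weights = {}

    def update(self, item, w):
        """Insert an item, change its weight, or (with w == 0) remove it."""
        if w == 0:
            self.weights.pop(item, None)
        else:
            self.weights[item] = w

    def sample(self):
        """Draw a subset; O(n) here, unlike the paper's structure."""
        total = sum(self.weights.values())
        return [t for t, w in self.weights.items()
                if random.random() < w / total]
```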
Furthermore, the paper is well-written, making an effort to make both the results and the techniques used understandable. I therefore recommend acceptance.

COMMENTS AND TYPOS
------------------

- p 5, col 2. Statement of Problem 4. "...erandomly sample s element with ...": element -> elements.
- p 5, col 2, last line. "amortize I/O cost" -> amortized I/O cost.
- p 7, section 5.2. When discussing ranges, what is the "range" of the index j? Is R_j defined for every integer j? Furthermore, what does the capital letter N denote? As far as I can see, it was not introduced.

# Review 3

## Overall evaluation 0: borderline

This paper focuses on the problem of subset sampling and its many variants. The problem in a nutshell is as follows: we are given a set S of items and a probability value attached to each item; say p(i) is the probability of sampling item i. Now the task is to sample a subset such that every element i is chosen independently with probability p(i). And we would like the answers to different queries to be independent.

The paper presents results for various settings: in particular, SS (Subset Sampling), RSS (Range Subset Sampling), and WSS (Weighted Subset Sampling). The improvement for SS is allowing modification; for RSS, it is to reduce from "k" (i.e., knowing how many to sample in advance) to expectation; and there is new work on WSS. There is another contribution in the context of I/O, requiring external memory.

The support for modification follows from the data structure, so the key contributions are really in RSS and WSS. In the case of RSS, we can always keep track of the probability mass for a given range; this is just the dyadic range summation problem -- and from that we know the value of k with high probability, so the prior work's result can be used. The contribution in the case of WSS is indeed new, but the algorithm is mostly a reworking of the SS case; it's not very straightforward, so I don't mean to minimize the contribution.
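The dyadic range summation idea mentioned by the reviewer for RSS can be sketched with a standard Fenwick (binary indexed) tree, which maintains probability mass over dyadic ranges with O(log n) point updates and prefix sums. This is a generic illustration of the reviewer's observation, not code from the paper.

```python
class Fenwick:
    """Fenwick tree over dyadic ranges: point update and prefix sum in
    O(log n). Tracking the probability mass of a key range this way
    gives the expected sample size for the range, which is the quantity
    the reviewer suggests feeding to a prior static sampler."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)

    def add(self, i, delta):
        """Add delta to the mass at position i (1-based)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        """Total mass of positions 1..i."""
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def range_mass(self, a, b):
        """Total probability mass of keys in [a, b]."""
        return self.prefix(b) - self.prefix(a - 1)
```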
Overall, the paper is well written, but at the same time it falls short of clearly positioning itself. There is so much weight given to SS and RSS when the improvement over prior work is very little. The paper could have been better appreciated if it had just focused on WSS and I/O -- although admittedly, in such a case, it falls short of acceptance. Therefore, overall, my inclination is to recommend rejection.

Post Response: thanks for the response. I have revised the score to 0.