# Review 1

## Overall evaluation

0: borderline

Given a universe S of n items, where each item s is assigned a probability p(s), this paper studies the problem of sampling a subset X of S in which each s is included independently with probability p(s). Note that the probabilities need not form a distribution, so the expected size of X can be larger than 1. In what follows, we write mu = sum_{s in S} p(s) to denote the expected size of X. This problem, while basic, is fundamental and has been studied extensively in various settings.

The main contribution of the current work is to study the problem in the dynamic setting, where elements and their probabilities can be added to and removed from S. The authors also consider the non-dynamic setting in the external memory (EM) model, where entries of S must be read in blocks of size B, as well as a related range sampling problem, where each s in S is assigned a key in [0,1] and one can ask for samples from the subset of elements of S whose keys are contained in a range [a,b].

In the static setting, where no updates to the set are made and one would like to efficiently generate new samples after an initial (potentially more expensive) pre-processing phase, the problem has been extensively studied. In particular, Bringmann and Panagiotou (Algorithmica '17) studied the problem in numerous settings, such as when the input is given in sorted order of the probabilities and when it is not. They gave an algorithm with optimal space O(n) and optimal expected query time O(1 + mu) (the optimality of the query time is easy to see, since this is the expected size of the set the algorithm must output).

(1) The main contribution of the authors is to achieve the same optimal space O(n) and O(1 + mu) query time as the above, but in the dynamic setting. Moreover, their data structure handles updates (insertions and deletions) in constant time.

(2) For the EM model, they obtain an algorithm with O(log*(n)/B + (mu/B) log_{M/B}(n/B)) query time, where M > 2B is the size of the algorithm's internal (random access) memory.

(3) A dynamic range sampling algorithm with O(log n + mu_[a,b]) query time, where mu_[a,b] is the sum of the probabilities of the elements whose keys fall in the range [a,b].

One should note that, for all three results, there are no directly comparable prior works: for (1), the prior work of Bringmann and Panagiotou did not study the dynamic setting; for (2), the closest work is a paper of Hu, Qiao, and Tao that studied the incomparable "Independent Range Sampling" (IRS) problem, where one is again given [a,b] but the goal is to sample exactly k elements that lie in that range (whereas the range sampling problem considered in this paper samples each point in that range independently with its assigned probability); and for (3), again the most closely related algorithms studied the IRS problem and obtain a query complexity of O(log n + k) (note the similarity to the complexity in (3) above, where the expected sample size is mu_[a,b]).

Technically, the results for (1) are not exceptionally challenging. I was able to pause and work out an O(log(n) + mu) query time algorithm in a few minutes: simply break the points into log(|S|) level sets based on probability -- points with probability < 1/poly(|S|) will be sampled with very low probability, so the amortized cost of handling such points is O(1).
Then a binomial random variable can be used to sample a subset from each level set, and a rejection sampling step is made to compensate for the over-sampling caused by the level-set approximation of each probability (see the sketch at the end of this review). It turns out that this is more or less precisely what is done in this paper, except that they iterate the level-set construction recursively so that they need only consider a universe of size m = O(loglog(n)). At this point, one can simply create a look-up table that can be random-accessed to immediately obtain the sample, which saves the log(n) (or even loglog(n)) overhead in the query time. Of course, the look-up table has size m^m, which is exponential in m, but this is fine and at most O(n) after the iterated universe reduction. This is certainly a nice trick, and some care must be taken when constructing the look-up table and discretizing the probabilities; however, I still do not feel that the technical contribution of this section is particularly exceptional.

For the EM algorithm, the authors show how the same universe reduction based on level sets can be applied in the EM setting, except that they must iterate it log*_B(n) times in order to sufficiently reduce the universe size (given that blocks of size B must be read). The range sampling data structure is slightly more involved, since it overlays a treap data structure on top of their subset sampling procedure from (1). One should note, however, that the use of such data structures is quite common when solving range-query-type problems.

Summary: This paper studies the fundamental problem of subset sampling under various constraints. While each of the three results appears to be the first of its kind, it is not clear that there is strong immediate motivation for studying this problem in each of these models, perhaps with the exception of the dynamic setting. On the other hand, it is not clear that the data structure of Bringmann and Panagiotou could not have been modified without too much pain to obtain dynamic and/or EM algorithms for the problem. Overall, this strikes me as a paper that solves a fundamental problem in a variety of settings, none of which appears exceptionally well motivated, and a case is not really made for why such algorithms are needed in the literature. Given that there are no directly comparable bounds for any of the three models, offering additional motivation for studying these problems is of particular importance. From a technical and conceptual standpoint, the paper also does not offer exceptionally strong contributions in my opinion, although the techniques are not necessarily below the bar needed for acceptance at PODS. Due to these points, in combination with the lack of strong technical contributions, I am somewhat lukewarm about this paper.

## Questions to be addressed by the authors during rebuttal

One of the main weaknesses I find with this paper is that the techniques do not appear to be particularly novel, and the results feel somewhat niche (i.e., the EM and dynamic models, where the problem has not been studied before). Perhaps the authors could clarify which aspects of the techniques are unique to this work. Additionally, an argument (if there is a clean one) for why the extension to the dynamic/EM settings is a useful addition to the literature would also be appreciated.
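For reference, here is a minimal Python sketch of the level-set approach mentioned above. It is my own illustration rather than the paper's construction: all names are invented, the Real RAM assumption is mimicked by relying on a constant-time Binomial sampler (random.binomialvariate, Python 3.12+), and items with probability below 1/n^2 are simply lumped into one tail bucket.

```python
import math
import random

def build_level_sets(items, n):
    """Partition (value, p) pairs into level sets ("buckets") by probability.
    Bucket j has cap q_j = 2**-j and holds (roughly) the items with
    q_j/2 < p <= q_j; items with p <= 1/n^2 all go into a single tail bucket
    capped at 1/n^2, which keeps the number of buckets at O(log n)."""
    levels = 2 * max(1, math.ceil(math.log2(max(n, 2))))
    tail_cap = 1.0 / (n * n)
    buckets = [[2.0 ** -j, []] for j in range(levels)] + [[tail_cap, []]]
    for value, p in items:
        if p <= tail_cap:
            buckets[-1][1].append((value, p))
        else:
            j = min(levels - 1, int(-math.log2(p)))
            while 2.0 ** -j < p:        # guard against floating-point rounding
                j -= 1
            buckets[j][1].append((value, p))
    return buckets

def subset_sample(buckets):
    """Draw one subset sample in expected time proportional to the number of
    buckets plus the expected sample size (assuming O(1)-time Binomial draws):
    a Binomial(|bucket|, q) draw picks a uniform set of candidates per bucket,
    and a rejection step thins each candidate from q down to its true p."""
    out = []
    for q, members in buckets:
        if not members:
            continue
        k = random.binomialvariate(n=len(members), p=q)   # Python 3.12+
        for value, p in random.sample(members, k):
            if random.random() < p / q:                   # accept w.p. p/q
                out.append(value)
    return out

if __name__ == "__main__":
    n = 1000
    items = [(v, random.random() ** 3) for v in range(n)]
    buckets = build_level_sets(items, n)
    print(len(subset_sample(buckets)), "items sampled; expected about",
          round(sum(p for _, p in items), 1))
```

Correctness rests on the fact that drawing Binomial(m, q) and then a uniform subset of that size is distributionally the same as flipping an independent Bernoulli(q) coin per item, after which the p/q rejection step thins each candidate down to its true probability.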
# Review 2

## Overall evaluation

2: accept

PAPER SYNOPSIS
--------------

This paper studies the subset sampling (SS) problem: given a set S of n records, where each record t has an associated probability p(t), devise a data structure that allows one to (repeatedly) sample a subset X of S in which each t in S is sampled into X independently with probability p(t). The paper makes the following contributions:

(1) In the Real RAM model, it considers the SS problem in the dynamic setting, where tuples in S may be added and deleted, and may have their probability updated. A data structure is proposed that takes O(n) space and allows subset sampling with expected time complexity O(1 + mu_S), where mu_S is the expected sample size. It was shown in earlier work that this time complexity is optimal. Furthermore, the data structure can be dynamically maintained in O(1) amortized expected update, insert, and delete time. Technically, the data structure is obtained by reducing the input problem (twice) to a problem of size O(log log n) and maintaining a data structure for that reduced problem using a precomputed lookup table technique. The size reduction is based on a binning argument, dividing the input set S into bins of bounded probabilities.

(2) In the external memory (EM) model, it considers the SS problem in the static setting. A data structure of size O(n/B) is proposed -- with B the EM block size -- and subset samples can be obtained in an attractive amortized expected number of I/Os. Technically, this result is less deep than the previous one: it follows almost straightforwardly from combining a previous data structure for uniform set sampling with the binning idea from (1).

(3) In the Real RAM model, an extension of (1) is considered where records in S additionally have a key, and the task now is to return subset samples of the records whose keys lie in a specified range [a,b]. Here, a data structure of size O(n) is provided with O(log n + mu_{S intersect [a,b]}) expected sampling time and O(log n) expected amortized update, insert, and delete time. Technically, the results build on (1) and combine it with a treap data structure to obtain the range selection.

RECOMMENDATION
--------------

==Strong points:

+ I found the algorithms, in particular the one of (1) above, to be elegant and non-trivial.

+ There is renewed and increasing interest in sampling in the context of database systems. The paper's topic is hence important and timely.

+ The paper is well-written, making an effort to make both the results and the techniques used understandable.

==Weak points:

- The paper adopts the Real RAM model of computation, where arithmetic operations involving real numbers (+, *, /, ==, exp, log, ...) all take constant time. While this can be argued to be standard, and is certainly valid for the EM results where CPU computation is free, it does raise the question of whether and how results (1) and (3) above, which focus on main-memory computation, can be transferred to more realistic models of computation in which one can only operate on words of a fixed word size in constant time. In particular, drawing from the geometric distribution, which is a key operation in the paper (see the small illustration below), is not constant time in this fixed-width RAM model. I do not want to belittle the paper's contributions: the assumption of the RAM model is valid for a theory-oriented conference like PODS. Nevertheless, the paper would be stronger IMHO if it discussed the fixed-width RAM model instead.
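To make this concrete, here is the standard inversion-based way to draw a geometric variate in O(1) real-number operations. This is only an illustration of the kind of primitive meant here, not code from the paper, and the helper names are mine.

```python
import math
import random

def geometric(p):
    """Number of failures before the first success in independent
    Bernoulli(p) trials, drawn by inversion: floor(log U / log(1-p)).
    Constant time on a Real RAM, but the log, the division, and the floor
    on reals are exactly what a fixed-width word RAM does not give in O(1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    u = 1.0 - random.random()              # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p))

# One typical use in subset sampling (not necessarily exactly how the paper
# uses it): jump directly to the next sampled item in a block of items that
# all share the same (capped) probability q.
def next_hit(current_index, q):
    return current_index + 1 + geometric(q)
```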
COMMENTS AND TYPOS
------------------

- p2, l137. Because update(v,p) is equivalent to delete(v); insert(v,p), I initially wondered why you consider update operations at all. I later realized that you need it for the size reduction argument (Lemma 2.3), which hinges on the update time only. Please consider arguing the relevance of update(v,p) already here.

- p2, l152. What is w in [0,2^{w}-1]?

- p3, l327. As far as I can see, the second equality must be an inequality: E[t] = n\overline{p} \leq \sum_{v \in S} 2p(v).

- p4, l423. As far as I can see, the equality here is really an inequality, as you replace \tilde{p}(v) with |S|/|S|^2. (I realize this may be implied by the asymptotic notation, but I still think it is clearer if the inequality is included.)

- p7, l785. You write this bullet as two separate sentences, but it should be only one sentence. (If .... then ...)

- p8, l814. You claim that h is O(log n). Why is that?

# Review 3

## Overall evaluation

1: weak accept

This paper presents three improvements over prior work on the subset sampling (SS) problem.

(1) In SS, we are given n elements, each associated with a probability, and the goal is to build a data structure so that a Poisson sample can be drawn efficiently. Prior work in ICALP'12 solved the problem with O(n) space and sampling time O(1+u_s), where u_s = sum_i p(i) is the expected sample size. This is clearly optimal. This paper presents a dynamic version of the solution, and the update time is O(1), which is optimal. The technique used is size reduction plus a table-lookup solution once the size of the problem is small enough. The ideas are not ground-breaking (similar techniques have been used a lot in the area of data structures), but it is nice that the authors have worked out the details and achieved the optimal runtime.

(2) The next result extends the result above to the external memory (EM) model. Here, the table-lookup solution does not work. As a result, they do not achieve the ideal O(1) time, but something very close. Not ideal, but reasonable.

(3) The last result is the range version of the problem, where the query is a range and we only want to sample elements from that range. This has been studied in a few past SIGMOD/PODS papers, but there the sampling semantics is sampling without replacement. This paper solves the Poisson sampling problem.

Overall, I think the paper has made some nice, solid improvements to a fundamental problem that is relevant to database theory. The techniques are not ground-breaking, though.

One comment: The sampling semantics used is known as Poisson sampling. You may use this more standard term when discussing your results vis-a-vis other results, e.g., sampling without replacement; a small sketch contrasting the two semantics is included below.
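For clarity, here is a minimal sketch of the two semantics being contrasted (illustrative only, with made-up function names; the without-replacement variant is shown with uniform weights purely for illustration). Poisson sampling includes each element independently with its own probability, so the sample size is random; sampling without replacement returns a fixed number k of distinct elements.

```python
import random

def poisson_sample(items):
    """Poisson sampling (the semantics of this paper): each (value, p) is
    included independently with probability p; the sample size is random,
    with expectation sum of the p's."""
    return [v for v, p in items if random.random() < p]

def sample_without_replacement(items, k):
    """The IRS-style semantics of prior SIGMOD/PODS work: exactly k distinct
    elements are returned (here uniformly, for illustration)."""
    return random.sample([v for v, _ in items], k)

items = [("a", 0.9), ("b", 0.5), ("c", 0.1)]
print(poisson_sample(items))                 # size varies between runs
print(sample_without_replacement(items, 2))  # always exactly 2 elements
```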