Java代写 | COMP90056 Stream Computing and Applications

COMP90056 Stream Computing and Applications
Assignment B – updated, Version 2
Second (Spring) Semester 2019
Posted on LMS: Monday, 30 September 2019
Due: Monday, 21 October 2019 [07:30]
Important: Each student must submit their own code and report, written individually.
This Assignment contributes 15% towards your total mark for this subject. As a
reminder, there is a hurdle on the non-final-examination component for this subject.
Part 1: Theory question
This Part is worth 2 marks.
In class, we showed that, for all x, the estimated count, cˆ(x), of the frequency of x, fx,
in the Misra-Gries summary, satisfies fx − m/k ≤ cˆ(x) ≤ fx. In fact, we can show that
fx −
m − τ
≤ cˆ(x) ≤ fx , (1)
where τ is the sum of the counters of the tracked items.
In this question, you don’t need to prove inequality (1), you can assume it to be true.
Instead, we apply inequality (1) to show how to combine the outcomes of Misra-Gries
on two separate streams.
Suppose we run the Misra-Gries algorithm, with parameter k (looking for items
of relative frequency > 1/k), on two separate streams, σ1 and σ2. For each stream, we
obtain a candidate set of frequent elements, respectively, S1 and S2. For each element x
in S1, denote the counter by cˆ1(x), and for each y in S2 denote it by cˆ2(y).
From S1 and S2, we determine a candidate set for the combined stream, i.e., the
stream comprising σ1 followed by σ2, via the following method:
• For each z in S1 ∪ S2, let the counter cˆ(z) be

cˆ1(z), if z is in S1, but not S2;
cˆ2(z), if z is in S2, but not S1;
cˆ1(z) + cˆ2(z), if z is in both S1 and S2.
• If there are more than k − 1 items in S1 ∪ S2:
– Let a be the value of the k
th-largest counter.
– Subtract a from cˆ(z) for each z in S1 ∪ S2. Consequently at most k − 1 items
have positive counters.
• Return S, the items with positive counters.
Prove that for all z ∈ S returned by this method, inequality (1) holds, for the combined
stream, σ1 ◦ σ2 [Updated, v2: i.e., σ1 followed by σ2].
The University of Melbourne • COMP90056 2019s2 • Assignment B
Part 2: `0-sampling and sparse recovery
This Part is worth 13 marks.
So far, you have some experience implementing reservoir sampling, which is an
elegant technique to sample a family of k distinct item instances from a stream of items.
If you focus on sampling just one item (i.e., k = 1) then this approach is an example
of `1-sampling.
In this Assignment, we change a few aspects of the sampling scenario. First, we
assume a stream comprising a sequence of hitem, count-incrementi updates, where the
count increment might in fact be negative (in some sense, a decrement). Second, we
focus on `0-sampling; here, every item with non-zero frequency has an equal chance of
being sampled, and no item with frequency zero is sampled. That is, if T is the set of
items x with fx 6= 0 [Updated, v2: (i.e., the support of vector ~f)], then the probability of
sampling an item y should be:
1/|T| if y ∈ T ;
0 if y ∈/ T .
We sample uniformly from the items with non-zero frequency. For this Assignment, we
offer the Cormode-Firmani paper (as well as our own teaching materials) as a blueprint
for your implementations1
Given a universe U and a stream S = h(s1, 41), . . . ,(sm, 4m)i, with si ∈ U, the frequency of item x is defined as:
fx = ∑
. (2)
[Updated, v2: Part 2 of this Assignment] is broken down into four component tasks.
(a) `0-sampling in an insert-only stream Suppose items are only added in the input
stream, never deleted, that is, 4i > 0 for all i ∈ [m]. Your task is to implement, test, and
analyze an efficient `0-sampler for this context. Each item that occurs in the stream must
be equally likely to be selected, regardless of how many times it occurs in the stream.
Hint: A randomly chosen hash function will help! Completing this part correctly would
earn you at most 3 marks.
(b) 1-sparse recovery For this task, we seek to detect whether the stream has exactly
one item with non-zero frequency, i.e., when the set T described above has cardinality
one, |T| = 1. In this case, the 1-sparse exact recovery data structure should, with high
probability, return the item (in T) and its frequency. As shown in the slides, Ganguly’s
approach achieves this in a strict turnstile stream.
1G. Cormode, D. Firmani. A unifying framework for `0-sampling algorithms. Distributed and Parallel
Databases 32(3): 315-335 (2014).
The University of Melbourne • COMP90056 2019s2 • Assignment B
Your task is to implement, test, and analyze a 1-sparse exact recovery system that
reports that the input has:
• more than one item in T; or
• zero items in T; or
• a single item in T, reporting the item and its frequency.
Completing this part correctly for a strict turnstile stream (fx ≥ 0, for all x) would
earn you at most 2 marks; completing it correctly for a general stream would earn at
most 3 marks.
(c) s-sparse recovery Now suppose we wish to [Updated, v2: recover a frequency
vector with most s] items with non-zero frequency, i.e., |T| ≤ s. Following Ganguly’s
approach (also mentioned in slides), the aim is to construct a data structure that, with
high probability, reveals an s-sparse input. The sparsity parameter can be provided at
the start of the input stream.
Your task is to implement, test, and analyze an exact s-sparse recovery system that:
[Updated, v2:
• correctly reports that there are zero items in T; or
• returns the exact item-frequency pairs for vector ~f with 1 ≤ |T| ≤ s; or
• reports fail in all other cases;
where the fail case includes both s-sparse vectors not correctly retrieved and inputs
that are not s-sparse (even if they are correctly retrieved).] [Updated, v2: Here is a
(somewhat slow) method to confirm that the frequency vector ~f is in fact retrieved
successfully. First, subtract the recovered item-frequency pairs from the sparse-recovery
data structure; second, test whether this subtracted-from data structure is empty (i.e.,
would now represent fx = 0 for all x).]
Completing this part correctly for a strict turnstile stream (fx ≥ 0, for all x) would
earn you at most 2 marks; completing it correctly for a general stream would earn at
most 3 marks.
(d) `0-sampling in dynamic/general streams This task is the culmination of the
project components. It is to implement, test, and analyze an `0-sampler for a general
stream: allowing for arbitrary increments and decrements, and for negative frequencies.
We recommend combining the techniques from (b) and (c), with scheme for generating efficiently a nested family of sets, to produce an `0-sampler for general stream.
Completing this part correctly would earn you at most 3 marks.
(e) The thirteenth mark In the Cormode and Firmani article, which is the backbone
of this project, they describe in Section 2.5.3 that there are alternative solutions to
`0-sampling, which involve pairwise independence. The task is to implement, test,
and analyze a variant `0-sampler, which incorporates a different level of hash-functionfamily independence from part (d). Recommendation: Only attempt this if you feel
very confident that you have scored highly on all the other parts of this Assignment.
The University of Melbourne • COMP90056 2019s2 • Assignment B
Implementation [Updated, v2: Your implementation] can be in the programming
language of your choice. We will not be testing code [Updated, v2: nor allocating
marks directly for program quality]. If you prefer to use a language like C or C++, for
greater control over memory allocation, you are encouraged to do so.
Preparing the Report
Submit a multi-page report (around 1500 words) that provides a rich evaluation of
the `0-sampler and its primitives. It is your job to convince the reader of the effectiveness
of your `0-sampling data structure on general/turnstile streams.
To assist in presenting your results in a clear and organised manner, we recommend
breaking the report down into five sections: Introduction, Theory, Implementation,
Experimental Set-up (including data sets), Results and Discussion.
Introduction Establish the aims and purpose of the report. Provide some context and
include a use-case for the `0-sampler.
Theory Introduce the data structures that form the study. Offer some intuition behind
the recovery structures and their use in the `0-sampling algorithm. Compare, in a
theoretical sense, the complexity of any relevant sampling data structures. [Updated,
v2: You could do the latter in the form of a table:
`0 insert-only `0 turnstile `0 general baseline
Update time
Query time
Fail rate
] A good theory section will help your discussion. For example, your experiments
may ‘reflect’ what is stated in the complexity table.
Implementation Detail some of the key decisions you made in your implementation.
Which programming language did you use and why? Provide a justification for your
choice of hash function. We also highly recommend that you implement a baseline
solution for each of your data structures: a space-hungry, but simple-to-implement
Experimental Set-up Detail your data-collection process. Did you generate your own
data? If so, how and why? Did you use actual-world data-sets? State the names and
attributes of the data-sets used in the experiments. Describe the metrics and processes
used for measuring memory, time and accuracy.
Results and Discussion Present the results to support your discussion points, for
example, via plots and tables. Discuss the results. Do your results express any tradeoffs2
? Are your results consistent with what you stated in the theory section? [Updated,
2Hint: they should!
The University of Melbourne • COMP90056 2019s2 • Assignment B
v2: How do you test that the samples returned are close to uniform? What does close
to uniform mean?]
You may deviate from this template if you think a different structure suits your
arguments and comparisons. Regardless, your report must be in the following format:
A4 page size, minimum 11-point font, minimum 2cm margins, and must be a pdf file!
There are 13 marks available for Part 2 of this Assignment. The indicative marking
criteria for this Part are:
• What is the standard of your testing regimen, as described in the report? [2 marks]
– Have you tested your code at different scales (input sizes)?
– Do your datasets vary in the distribution of items?
– Do your datasets vary in the kinds of data they can handle?
– Are you testing corner/tricky cases for your code?
• How have you demonstrated good design choices in line with the goals of stream
computing? [2 marks]
– Are your data structures compact?
– Do they support fast update per stream input?
– Do they support fast response to queries?
• Is the theory well understood and presented clearly? [2 marks]
– Have you presented the ideas with consistent logic?
– Have you comprehensively covered all aspects of your algorithms and data
• Are the experimental results clearly presented and organized? [2 marks]
– Can the experiments be replicated?
– Have you used plots, graphics, tables appropriately? Don’t just repeat what’s
in text: amplify and summarize it.
• Does the discussion align the results with the goals and purpose of the report?
[2 marks]
– Is there a systematic critical engagement and analysis of the design decisions
and experimental results in line with the aims of streaming algorithms and
data structures?
– Were there relevant hypotheses and how were these reflected upon?
• Is the report well structured and written? [2 marks]
– Are there appropriate paragraph and section headings?
The University of Melbourne • COMP90056 2019s2 • Assignment B
– Have all aspects of the task been covered?
– Are arguments presented in a logical way and sequence?
• Does the report display some particular creativity or insight? [1 mark]
– Is independent thinking demonstrated?
– Is there an alternative approach to `0-sampling implemented and analyzed?
A submission with a reasonable quality report, that runs correctly on insert-only streams
may be expected to earn at most 7 marks, notwithstanding the marks for each criterion
We want you to demonstrate that you can implement [Updated, v2: the `0-sampler,
and test it] within a stream computing framework. This includes knowing how to
test the implementation in a way that reveals [Updated, v2: memory consumption,
running time, and closeness to uniformity.]
You should lodge your submission for Assignment A via the LMS. You must identify
yourself in each of your source files and the report. Poor-quality scans of solutions written
or printed on paper will not be accepted. There are scanning facilities on campus,
not to mention scanning apps for smartphones etc. Solutions generated directly on a
computer are of course acceptable. Submit three files:
• A part1.pdf file containing your answer to the Part-1 theory question.
• A report.pdf file comprising your report for Part 2.
• A file containing all your source files for Part 2, but not including
“standard” .jar files.
Do not include the testing files, as these might be large. REPEAT: DO NOT INCLUDE
TESTING FILES! It is very important, so that you can justify ownership of your work,
that you detail your contributions in comments in your code, and in your report.
Administrative issues
When is late? What do I do if I am late? The due date and time are printed on the
front of this document. The lateness policy is on the handout provided at the first
lecture. As a reminder, the late penalty for non-exam assessment is two marks per
day (or part thereof) overdue. Requests for extensions or adjustment must follow the
University policy (the Melbourne School of Engineering “owns” this subject), including
the requirement for appropriate evidence.
Late submissions should also be lodged via the LMS, but, as a courtesy, please
also email Tony Wirth when you submit late. If you make both on-time and late
submissions, please consult the subject coordinator as soon as possible to determine
which submission will be assessed.
The University of Melbourne • COMP90056 2019s2 • Assignment B
Individual work You are reminded that your submission for this Assignment is to be
your own individual work. Students are expected to be familiar with and to observe the
University’s Academic Integrity policy
For the purpose of ensuring academic integrity, every submission attempt by a student
may be inspected, regardless of the number of attempts made.
Students who allow other students access to their work run the risk of also being
penalized, even if they themselves are sole authors of the submission in question. By
submitting your work electronically, you are declaring that this is your own work.
Automated similarity checking software may be used to compare submissions.
You may re-use code provided by the teaching staff, and you may refer to resources
on the Web or in published or similar sources. Indeed, you are encouraged to read
beyond the standard teaching materials. However, all such sources must be cited fully
and, apart from code provided by the teaching staff, you must not copy code.
Finally Despite all these stern words, we are here to help! There is information about
getting help in this subject on the LMS pages. Frequently asked questions about the
Assignment will be answered in the LMS discussion group.
William Holland and Tony Wirth, with assistance from the COMP90056 team.
October 13, 2019


本网站支持淘宝 支付宝 微信支付  paypal等等交易。如果不放心可以用淘宝交易!

E-mail: [email protected]  微信:itcsdx