本次Java代写是流计算与应用

COMP90056 Stream Computing and Applications

Assignment B – updated, Version 2

Second (Spring) Semester 2019

Posted on LMS: Monday, 30 September 2019

Due: Monday, 21 October 2019 [07:30]

Important: Each student must submit their own code and report, written individually.

This Assignment contributes 15% towards your total mark for this subject. As a

reminder, there is a hurdle on the non-final-examination component for this subject.

Part 1: Theory question

This Part is worth 2 marks.

In class, we showed that, for all x, the estimated count, cˆ(x), of the frequency of x, fx,

in the Misra-Gries summary, satisfies fx − m/k ≤ cˆ(x) ≤ fx. In fact, we can show that

fx −

m − τ

k

≤ cˆ(x) ≤ fx , (1)

where τ is the sum of the counters of the tracked items.

In this question, you don’t need to prove inequality (1), you can assume it to be true.

Instead, we apply inequality (1) to show how to combine the outcomes of Misra-Gries

on two separate streams.

Suppose we run the Misra-Gries algorithm, with parameter k (looking for items

of relative frequency > 1/k), on two separate streams, σ1 and σ2. For each stream, we

obtain a candidate set of frequent elements, respectively, S1 and S2. For each element x

in S1, denote the counter by cˆ1(x), and for each y in S2 denote it by cˆ2(y).

From S1 and S2, we determine a candidate set for the combined stream, i.e., the

stream comprising σ1 followed by σ2, via the following method:

• For each z in S1 ∪ S2, let the counter cˆ(z) be

cˆ1(z), if z is in S1, but not S2;

cˆ2(z), if z is in S2, but not S1;

cˆ1(z) + cˆ2(z), if z is in both S1 and S2.

• If there are more than k − 1 items in S1 ∪ S2:

– Let a be the value of the k

th-largest counter.

– Subtract a from cˆ(z) for each z in S1 ∪ S2. Consequently at most k − 1 items

have positive counters.

• Return S, the items with positive counters.

Prove that for all z ∈ S returned by this method, inequality (1) holds, for the combined

stream, σ1 ◦ σ2 [Updated, v2: i.e., σ1 followed by σ2].

1

The University of Melbourne • COMP90056 2019s2 • Assignment B

Part 2: `0-sampling and sparse recovery

This Part is worth 13 marks.

So far, you have some experience implementing reservoir sampling, which is an

elegant technique to sample a family of k distinct item instances from a stream of items.

If you focus on sampling just one item (i.e., k = 1) then this approach is an example

of `1-sampling.

In this Assignment, we change a few aspects of the sampling scenario. First, we

assume a stream comprising a sequence of hitem, count-incrementi updates, where the

count increment might in fact be negative (in some sense, a decrement). Second, we

focus on `0-sampling; here, every item with non-zero frequency has an equal chance of

being sampled, and no item with frequency zero is sampled. That is, if T is the set of

items x with fx 6= 0 [Updated, v2: (i.e., the support of vector ~f)], then the probability of

sampling an item y should be:

(

1/|T| if y ∈ T ;

0 if y ∈/ T .

We sample uniformly from the items with non-zero frequency. For this Assignment, we

offer the Cormode-Firmani paper (as well as our own teaching materials) as a blueprint

for your implementations1

.

Details

Given a universe U and a stream S = h(s1, 41), . . . ,(sm, 4m)i, with si ∈ U, the frequency of item x is defined as:

fx = ∑

i:si=x

4i

. (2)

[Updated, v2: Part 2 of this Assignment] is broken down into four component tasks.

(a) `0-sampling in an insert-only stream Suppose items are only added in the input

stream, never deleted, that is, 4i > 0 for all i ∈ [m]. Your task is to implement, test, and

analyze an efficient `0-sampler for this context. Each item that occurs in the stream must

be equally likely to be selected, regardless of how many times it occurs in the stream.

Hint: A randomly chosen hash function will help! Completing this part correctly would

earn you at most 3 marks.

(b) 1-sparse recovery For this task, we seek to detect whether the stream has exactly

one item with non-zero frequency, i.e., when the set T described above has cardinality

one, |T| = 1. In this case, the 1-sparse exact recovery data structure should, with high

probability, return the item (in T) and its frequency. As shown in the slides, Ganguly’s

approach achieves this in a strict turnstile stream.

1G. Cormode, D. Firmani. A unifying framework for `0-sampling algorithms. Distributed and Parallel

Databases 32(3): 315-335 (2014).

2

The University of Melbourne • COMP90056 2019s2 • Assignment B

Your task is to implement, test, and analyze a 1-sparse exact recovery system that

reports that the input has:

• more than one item in T; or

• zero items in T; or

• a single item in T, reporting the item and its frequency.

Completing this part correctly for a strict turnstile stream (fx ≥ 0, for all x) would

earn you at most 2 marks; completing it correctly for a general stream would earn at

most 3 marks.

(c) s-sparse recovery Now suppose we wish to [Updated, v2: recover a frequency

vector with most s] items with non-zero frequency, i.e., |T| ≤ s. Following Ganguly’s

approach (also mentioned in slides), the aim is to construct a data structure that, with

high probability, reveals an s-sparse input. The sparsity parameter can be provided at

the start of the input stream.

Your task is to implement, test, and analyze an exact s-sparse recovery system that:

[Updated, v2:

• correctly reports that there are zero items in T; or

• returns the exact item-frequency pairs for vector ~f with 1 ≤ |T| ≤ s; or

• reports fail in all other cases;

where the fail case includes both s-sparse vectors not correctly retrieved and inputs

that are not s-sparse (even if they are correctly retrieved).] [Updated, v2: Here is a

(somewhat slow) method to confirm that the frequency vector ~f is in fact retrieved

successfully. First, subtract the recovered item-frequency pairs from the sparse-recovery

data structure; second, test whether this subtracted-from data structure is empty (i.e.,

would now represent fx = 0 for all x).]

Completing this part correctly for a strict turnstile stream (fx ≥ 0, for all x) would

earn you at most 2 marks; completing it correctly for a general stream would earn at

most 3 marks.

(d) `0-sampling in dynamic/general streams This task is the culmination of the

project components. It is to implement, test, and analyze an `0-sampler for a general

stream: allowing for arbitrary increments and decrements, and for negative frequencies.

We recommend combining the techniques from (b) and (c), with scheme for generating efficiently a nested family of sets, to produce an `0-sampler for general stream.

Completing this part correctly would earn you at most 3 marks.

(e) The thirteenth mark In the Cormode and Firmani article, which is the backbone

of this project, they describe in Section 2.5.3 that there are alternative solutions to

`0-sampling, which involve pairwise independence. The task is to implement, test,

and analyze a variant `0-sampler, which incorporates a different level of hash-functionfamily independence from part (d). Recommendation: Only attempt this if you feel

very confident that you have scored highly on all the other parts of this Assignment.

3

The University of Melbourne • COMP90056 2019s2 • Assignment B

Implementation [Updated, v2: Your implementation] can be in the programming

language of your choice. We will not be testing code [Updated, v2: nor allocating

marks directly for program quality]. If you prefer to use a language like C or C++, for

greater control over memory allocation, you are encouraged to do so.

Preparing the Report

Submit a multi-page report (around 1500 words) that provides a rich evaluation of

the `0-sampler and its primitives. It is your job to convince the reader of the effectiveness

of your `0-sampling data structure on general/turnstile streams.

To assist in presenting your results in a clear and organised manner, we recommend

breaking the report down into five sections: Introduction, Theory, Implementation,

Experimental Set-up (including data sets), Results and Discussion.

Introduction Establish the aims and purpose of the report. Provide some context and

include a use-case for the `0-sampler.

Theory Introduce the data structures that form the study. Offer some intuition behind

the recovery structures and their use in the `0-sampling algorithm. Compare, in a

theoretical sense, the complexity of any relevant sampling data structures. [Updated,

v2: You could do the latter in the form of a table:

`0 insert-only `0 turnstile `0 general baseline

Memory

Update time

Query time

Fail rate

] A good theory section will help your discussion. For example, your experiments

may ‘reflect’ what is stated in the complexity table.

Implementation Detail some of the key decisions you made in your implementation.

Which programming language did you use and why? Provide a justification for your

choice of hash function. We also highly recommend that you implement a baseline

solution for each of your data structures: a space-hungry, but simple-to-implement

solution.

Experimental Set-up Detail your data-collection process. Did you generate your own

data? If so, how and why? Did you use actual-world data-sets? State the names and

attributes of the data-sets used in the experiments. Describe the metrics and processes

used for measuring memory, time and accuracy.

Results and Discussion Present the results to support your discussion points, for

example, via plots and tables. Discuss the results. Do your results express any tradeoffs2

? Are your results consistent with what you stated in the theory section? [Updated,

2Hint: they should!

4

The University of Melbourne • COMP90056 2019s2 • Assignment B

v2: How do you test that the samples returned are close to uniform? What does close

to uniform mean?]

You may deviate from this template if you think a different structure suits your

arguments and comparisons. Regardless, your report must be in the following format:

A4 page size, minimum 11-point font, minimum 2cm margins, and must be a pdf file!

Criteria

There are 13 marks available for Part 2 of this Assignment. The indicative marking

criteria for this Part are:

• What is the standard of your testing regimen, as described in the report? [2 marks]

– Have you tested your code at different scales (input sizes)?

– Do your datasets vary in the distribution of items?

– Do your datasets vary in the kinds of data they can handle?

– Are you testing corner/tricky cases for your code?

• How have you demonstrated good design choices in line with the goals of stream

computing? [2 marks]

– Are your data structures compact?

– Do they support fast update per stream input?

– Do they support fast response to queries?

• Is the theory well understood and presented clearly? [2 marks]

– Have you presented the ideas with consistent logic?

– Have you comprehensively covered all aspects of your algorithms and data

structures?

• Are the experimental results clearly presented and organized? [2 marks]

– Can the experiments be replicated?

– Have you used plots, graphics, tables appropriately? Don’t just repeat what’s

in text: amplify and summarize it.

• Does the discussion align the results with the goals and purpose of the report?

[2 marks]

– Is there a systematic critical engagement and analysis of the design decisions

and experimental results in line with the aims of streaming algorithms and

data structures?

– Were there relevant hypotheses and how were these reflected upon?

• Is the report well structured and written? [2 marks]

– Are there appropriate paragraph and section headings?

5

The University of Melbourne • COMP90056 2019s2 • Assignment B

– Have all aspects of the task been covered?

– Are arguments presented in a logical way and sequence?

• Does the report display some particular creativity or insight? [1 mark]

– Is independent thinking demonstrated?

– Is there an alternative approach to `0-sampling implemented and analyzed?

A submission with a reasonable quality report, that runs correctly on insert-only streams

may be expected to earn at most 7 marks, notwithstanding the marks for each criterion

above.

Expectations

We want you to demonstrate that you can implement [Updated, v2: the `0-sampler,

and test it] within a stream computing framework. This includes knowing how to

test the implementation in a way that reveals [Updated, v2: memory consumption,

running time, and closeness to uniformity.]

Submissions

You should lodge your submission for Assignment A via the LMS. You must identify

yourself in each of your source files and the report. Poor-quality scans of solutions written

or printed on paper will not be accepted. There are scanning facilities on campus,

not to mention scanning apps for smartphones etc. Solutions generated directly on a

computer are of course acceptable. Submit three files:

• A part1.pdf file containing your answer to the Part-1 theory question.

• A report.pdf file comprising your report for Part 2.

• A sampler.zip file containing all your source files for Part 2, but not including

“standard” .jar files.

Do not include the testing files, as these might be large. REPEAT: DO NOT INCLUDE

TESTING FILES! It is very important, so that you can justify ownership of your work,

that you detail your contributions in comments in your code, and in your report.

Administrative issues

When is late? What do I do if I am late? The due date and time are printed on the

front of this document. The lateness policy is on the handout provided at the first

lecture. As a reminder, the late penalty for non-exam assessment is two marks per

day (or part thereof) overdue. Requests for extensions or adjustment must follow the

University policy (the Melbourne School of Engineering “owns” this subject), including

the requirement for appropriate evidence.

Late submissions should also be lodged via the LMS, but, as a courtesy, please

also email Tony Wirth when you submit late. If you make both on-time and late

submissions, please consult the subject coordinator as soon as possible to determine

which submission will be assessed.

6

The University of Melbourne • COMP90056 2019s2 • Assignment B

Individual work You are reminded that your submission for this Assignment is to be

your own individual work. Students are expected to be familiar with and to observe the

University’s Academic Integrity policy http://academicintegrity.unimelb.edu.au/.

For the purpose of ensuring academic integrity, every submission attempt by a student

may be inspected, regardless of the number of attempts made.

Students who allow other students access to their work run the risk of also being

penalized, even if they themselves are sole authors of the submission in question. By

submitting your work electronically, you are declaring that this is your own work.

Automated similarity checking software may be used to compare submissions.

You may re-use code provided by the teaching staff, and you may refer to resources

on the Web or in published or similar sources. Indeed, you are encouraged to read

beyond the standard teaching materials. However, all such sources must be cited fully

and, apart from code provided by the teaching staff, you must not copy code.

Finally Despite all these stern words, we are here to help! There is information about

getting help in this subject on the LMS pages. Frequently asked questions about the

Assignment will be answered in the LMS discussion group.

William Holland and Tony Wirth, with assistance from the COMP90056 team.

October 13, 2019

7

**程序代写代做C/C++/JAVA/安卓/PYTHON/留学生/PHP/APP开发/MATLAB**

本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！

**E-mail:** [email protected] **微信:**itcsdx

如果您使用手机请先保存二维码，微信识别。如果用电脑，直接掏出手机果断扫描。