BINF6111 – Assignment 1
Assignment 1 – Protein sequence evolution and alignment
Allocation: the assignment is worth 20% of your final assessment.
The aim of this assignment is for you to write software in C, Java or Python and run it on biological sequence data to generate outputs
which you will write up in a comprehensive report. Note: a key aim of this assignment is to be able to develop bioinformatics software
“from scratch”, so you are permitted to use bioinformatics-specific software libraries (such as BioJava or BioPython) in your code only
for low-level operations such as reading and writing files in this assignment !
The assignment will test your understanding of some key concepts of evolutionary modelling and sequence alignment, your ability to
translate these into working software, and will require you to demonstrate good written communication skills.
The assignment is structured to have three deliverable parts, as follows.
implement a protein sequence evolution simulator [35% of assignment mark]
Part 2 (alignment)
implement optimal pairwise alignment by dynamic programming [45% of assignment mark] and
Part 2 (evaluation and report)
run your software from Part 1 to simulate the evolution of a protein sequence over increasing evolutionary time, and run your
alignment software from to align the mutated sequences with the original, then write a report including a plot of percentage
identity against evolution time and discuss your results. [20% of assignment mark]
Each part builds on the previous part and you should complete each part before moving on to the next.
The marking will be based on having a running program for each part and writing a proper report. Marks will also be given for good
programming and writing style. Marks will be deducted for program bugs and other errors.
The standard warnings against plagiarism apply and will be enforced. In particular, you must not re-use or copy software from another
student, or that you found on the Internet, in a textbook, or elsewhere to implement either parts 1 or 2 of this assignment. Doing so will
simply deprive you of the essential learning experience that can only come from solving a programming problem and building working
code by yourself to implement your solution.
Penalties for late submission of assignment parts will be incurred at the rate of a reduction of 1 mark per day in the maximum possible
mark for a part, up to the number of marks for that part.
Description of the task:
You should download the file of sample amino acid sequences to help in development of your program. These sequences should be in
separate files in FASTA format (see below). You also need to download the amino acid mutation matrix (see below) and add it to your
code. This is an asymmetric 20×20 matrix. A matrix entry M_i,j in column i and row j denotes the probability that the column amino
acid will mutate to the row amino acid. For ease of handling, the probabilities are expressed as counts per 10000. So “56” means the
probability 0.0056 in this matrix. The columns and rows are labelled with a single letter amino acid code, as follows:
Code Abbreviation Amino acid
A Ala alanine
R Arg arginine
N Asn asparagine
D Asp aspartate
C Cys cysteine
Q Gln glutamine
E Glu glutamate
G Gly glycine
H His histidine
I Ile isoleucine
L Leu leucine
BINF6111 – Assignment 1 9/6/20, 9:32 am
http://www.cse.unsw.edu.au/~bi6111/spec11.html Page 2 of 3
K Lys lysine
M Met methionine
F Phe phenylalanine
P Pro proline
S Ser serine
T Thr threonine
W Trp tryptophan
Y Tyr tyrosine
V Val valine
N.B. this is a subset of the FASTA amino acid code letter set (it does not include B,U,Z,X,*,-).
Amino acid mutation matrix:
The evolutionary mutation model in this task is as follows. For each amino acid in the current sequence, its probability of mutating to
any other amino acid is given by the appropriate entry in the mutation matrix. Note that we apply an independence assumption, namely
that the mutation of any amino acid is a random event unaffected by any other amino acid in the sequence. The model is to be run for a
fixed number of iterations, each corresponding to a “generation”. The choice by “evolution” of a mutation in each amino acid is
modelled by indexing into the mutation matrix using a random number. The initial current sequence is given as input. At each step, the
current sequence has each of its amino acids “mutated” with a certain probability. (Note that this mutation process does not necessarily
cause the amino acid to change at every round of mutation. In fact, looking at the mutation matrix, it is likely that amino acids will
change only rarely.) This new, mutated sequence then becomes the current sequence for the next generation.
The basic steps in your program will be to read in the initial sequence and run the evolutionary model based on the mutation matrix for
500 generations. The original sequence, plus all the mutated sequences generated, one per iteration, should be saved in a single output
file in FASTA format. The generation number should go in the description line preceding each sequence, with your initial sequence
numbered 0, the first mutated sequence numbered 1 and the final mutated sequence numbered 500.
A single source file “evolve.c”, “evolve.java” or “evolve.py” which should be adequately commented.
Use the give system on CSE machines to submit your assignment. This can be accessed either by running give from the command-line
on a CSE machine, or from the web-interface.
At the command-line the submission command will look like:
$ give bi6111 ass1_1 evolve.c
$ give bi6111 ass1_1 evolve.java
BINF6111 – Assignment 1 9/6/20, 9:32 am
http://www.cse.unsw.edu.au/~bi6111/spec11.html Page 3 of 3
Last modified Mon 8 Jun 2020 14:43:57 AEST
$ give bi6111 ass1_1 evolve.py
NOTE: be sure to test that your program works correctly on CSE machines running Linux before submitting it – you may lose marks if
it does not work correctly on a CSE machine running Linux, e.g., one of the lab machines.
Inputs and Outputs:
s0 your input sequence file containing the initial sequence in FASTA format
s501 the output file containing the initial sequence plus all 500 mutated sequences in FASTA format
You may assume that the I/O comes from stdin and goes to stdout.
For example, the program may be executed as follows:
% evolve < s0 > s501
The only input and output routines you need are to read and write sequences in FASTA format. Design these well for part 1 and you can
re-use them in part 2.
Your program should report an error and halt if the input is not in FASTA format or the sequence contains something other than the
above 20 amino acid code letters.
Your program should report an error and halt if the input does not contain exactly one sequence in FASTA format.
You should submit part 1 by 23:59:59 on Sunday June 14, 2020.
FASTA Sequence Format Description [restricted version]:
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (“>”) symbol in the first column. It is recommended that all lines of text be
shorter than 80 characters in length. An example sequence in FASTA format is:
> description of the sequence
More than one sequence can be included in the same file:
> description of initial sequence
> description of another sequence
> description of yet another sequence
To be released.
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: [email protected] 微信:itcsdx