Technion - Israel Institute of Technology

02360703 - Object-oriented Programming

Spring 2003-2004

Frequently Asked Questions - Assignment 4 FAQ

Can you give examples of sequence alignment algorithms other than the two given in the assignment?

Where should I start...

The Smith-Waterman algorithm is very similar to Needleman-Wunsch, but finds a local alignment of two sequences. It also uses a matrix for this purpose.
It is also possible to use scoring matrices with these two algorithms: rather than assuming that every match/mismatch gets the same score, you can assign a different score to each pair of letters. This is particularly useful for proteins, where some amino acids are more likely than others to substitute for one another.
It is also possible to use a more elaborate gap-penalty scheme (usually called affine gap penalties), which gives a severe penalty for opening a series of gaps but only a moderate penalty for continuing an existing series.
Other algorithms perform multiple sequence alignment, namely the alignment of n>2 sequences; although the assignment deals with the alignment of two sequences, such algorithms are a natural generalization of ours.
There are plenty of algorithms that use methods other than scoring matrices; it is also possible to compare other properties of proteins and DNA, such as the 3D structure of proteins (in which case we would use a different alphabet), etc.

These are all beyond the scope of this assignment, but they are mentioned to give you an idea of possible natural extensions of your program.
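To make the scoring-matrix and gap-penalty ideas above concrete, here is a minimal sketch. The class name `ScoringScheme` and all of its members are hypothetical (they are not part of the assignment), and the affine formula used here, open + (length - 1) * extend, is only one common convention.

```cpp
#include <map>
#include <utility>

// Hypothetical scoring scheme for illustration only:
// per-pair substitution scores plus an affine gap penalty.
class ScoringScheme {
public:
    ScoringScheme(int match, int mismatch, int gap_open, int gap_extend)
        : match_(match), mismatch_(mismatch),
          gap_open_(gap_open), gap_extend_(gap_extend) {}

    // Override the score of one unordered pair of letters,
    // e.g. two amino acids that often substitute for each other.
    void set_pair(char a, char b, int score) {
        pairs_[key(a, b)] = score;
    }

    // Score of aligning letter a with letter b: a pair-specific score
    // if one was set, otherwise the default match/mismatch score.
    int substitution(char a, char b) const {
        auto it = pairs_.find(key(a, b));
        if (it != pairs_.end()) return it->second;
        return a == b ? match_ : mismatch_;
    }

    // Total score of a run of `length` consecutive gaps:
    // opening the run is expensive, each further gap is cheaper.
    int gap_run(int length) const {
        if (length <= 0) return 0;
        return gap_open_ + (length - 1) * gap_extend_;
    }

private:
    // Order the two letters so that (a, b) and (b, a) share one entry.
    static std::pair<char, char> key(char a, char b) {
        return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
    }

    int match_;
    int mismatch_;
    int gap_open_;
    int gap_extend_;
    std::map<std::pair<char, char>, int> pairs_;
};
```

With `ScoringScheme s(1, -1, -5, -1)`, a single gap scores -5 while a run of three gaps scores -7, so one long gap is preferred over three separate ones.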

What do you mean by "3D structure of proteins"?

The 3D structure of a protein is its structure in 3D space. This structure is extremely important for the protein's role in the cell. It is described by the protein's secondary and tertiary structures, which are the local and global structure of the protein in 3D space, and by its quaternary structure, namely the composition of several amino acid chains into one protein.
This whole (interesting) topic is beyond the scope of our assignment, of course; however, keep in mind that sequence alignment algorithms might be incorporated into the task of higher-level structure comparison.

What is the "expression process of genes"?
This is the process in which genes are translated into proteins in the cell. Read some more about this here

Is it possible that my input sequence will be "ACGTT", and after alignment it will come out as "AC-TT", meaning the gap has replaced one of the original letters?
No; all letters in the input sequence must also appear in the output sequence. Gaps may be added, but they may not replace any letter.
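One way to check this rule in your own tests is to remove the gaps from the output and compare with the input. This is only a sketch: the function names are made up, and it assumes the gap symbol is '-'.

```cpp
#include <string>

// Returns a copy of `aligned` with every gap symbol removed.
std::string strip_gaps(const std::string& aligned, char gap = '-') {
    std::string out;
    for (char c : aligned) {
        if (c != gap) out.push_back(c);
    }
    return out;
}

// True if the aligned sequence contains exactly the letters of the
// input, in the same order, with gaps only added between them.
bool preserves_input(const std::string& input,
                     const std::string& aligned,
                     char gap = '-') {
    return strip_gaps(aligned, gap) == input;
}
```

For the input "ACGTT", the output "AC-GTT" passes this check, while "AC-TT" does not.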

Regarding the Needleman-Wunsch algorithm, the last bullet says: "In case there is more than one possibility we will randomly choose one". Do you mean literally randomly, or arbitrarily (meaning I can choose the last one that changed it)?
Whatever you choose is fine, randomly or arbitrarily, as long as it is optimal.

Can I use a Matrix class I got off the net (for which I won't be responsible for the design)?
Well, if you think it fits your needs then this is fine with me. I mean: don't use it just because it is the easiest thing to do, but make sure it really fits your needs.

Can I assume that a score_schem is for comparing 2 strings at a time (even in "multiple sequence alignment" algorithms)?
Yes.

Is it a design rule to declare the function throw list in every function that throws? Because, as I understand from the tutorial, if another function/class throws an exception you don't expect, unexpected() is called, which by default calls terminate().
I find that not so smart.

As I said in the tutorial, you don't have to use the throw list. VC++ 6.0 does not support it anyway (if that is what you are working with). In principle, if you know what kinds of exceptions are expected to be thrown from a given function, then it could be smart to have a declaration list; you can also override the unexpected() function and report details about the unexpected exception, at least for debugging purposes. But, as I said, this is not a requirement for this assignment.
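If you want details about an exception for debugging without relying on throw lists, one option is to describe it at the point where it is caught. The sketch below does not use a throw list or unexpected(), since both were deprecated in C++11 and removed in C++17. The exception class `AlignmentError` and the helper `describe_current_exception` are hypothetical names invented for this example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical exception type, used only in this sketch.
class AlignmentError : public std::runtime_error {
public:
    explicit AlignmentError(const std::string& what)
        : std::runtime_error(what) {}
};

// Call this only from inside a catch block. It rethrows the exception
// currently being handled and turns it into a readable description.
// (Calling it with no active exception would end in std::terminate.)
std::string describe_current_exception() {
    try {
        throw;  // rethrow the exception that is being handled
    } catch (const AlignmentError& e) {
        return std::string("AlignmentError: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

A top-level `catch (...)` in main() can then log `describe_current_exception()` before exiting, which keeps the knowledge about exception types in one place.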

There are a few types of sequences in the exercise: RNA, DNA, protein, etc. Should I implement them all? (So far I've implemented DNA only.)
Implement, at least, DNA and protein. Although it looks like more work, I believe it will help you make your design more flexible :-)

Is it OK to assume that a sequence size won't exceed "long" bounds? (I'm using a long variable as an index for it.)
Well... the human genome is 3G bases long, which is on the same order of magnitude as the long bound... what do you think? Personally, I prefer to keep this kind of assumption as easy as possible to change later if it proves to be wrong. Think how you can do that.

Do you want all the classes in one .h file and one .cpp file (as in the previous exercise)?
Hell, no! :-) You are going to have a huge messy .h file... Put each class in its own file, with the exception of closely related classes, which may be put in the same file.

I understand that we can choose the format in which the strings are written in the file. But must the input file contain 2 strings, or many pairs of strings, or should we have many files with only one string in each? Is this considered a format too?
All of these are considered different formats, and all of them look reasonable. Choose one of them, but be prepared for changes or the addition of other formats.

Is it possible to write the program as a Windows application, using MFC?

Yes, BUT: the main issue in this assignment is your design and coding. If you want to practice your Windows coding skills, this is fine with me, but do not expect to get extra points for a nice GUI etc. So, if you are not experienced with Windows programming, you might want to invest some time in your design first (you should do this anyway...), and then see whether it is too much to code as a Windows application.

[BUG FIX]: In the Needleman-Wunsch algorithm description, the fourth bullet says:
For 1 < i < m+1, 1 < j < n+1, M[i, j] = max{M[i-1, j]+score(gap, s2(j)), M[i, j-1]+score(s1(i), gap), M[i-1, j-1]+score(s1(i), s2(j))}
Is this correct?
No, it should be:
For 0 < i < m+1, 0 < j < n+1, M[i, j] = max{M[i-1, j]+score(gap, s2(j)), M[i, j-1]+score(s1(i), gap), M[i-1, j-1]+score(s1(i), s2(j))}

[BUG FIX]: In the Needleman-Wunsch algorithm, shouldn't the fourth bullet be:
For 0 < i < m+1, 0 < j < n+1, M[i, j] = max{M[i-1, j]+score(gap, s1(j)), M[i, j-1]+score(s2(i), gap), M[i-1, j-1]+score(s1(i), s2(j))}
Yes.
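Here is a minimal sketch of the matrix fill over the corrected range 0 < i < m+1, 0 < j < n+1. It makes several assumptions that are not taken from the assignment: rows follow s1 and columns follow s2; the scoring scheme is the simplest possible one, with a single fixed score each for match, mismatch and gap (so the gap score does not depend on the letter it faces); and the first row and column accumulate gap scores, as in the textbook version of the algorithm. The names `SimpleScores` and `nw_fill` are made up for this example; check the boundary cells against the first bullets of the assignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Assumed simple scheme: fixed scores, independent of the letters.
struct SimpleScores {
    int match;
    int mismatch;
    int gap;
};

// Fills and returns the (m+1) x (n+1) matrix M, where m and n are the
// lengths of s1 and s2. M[m][n] is the optimal global alignment score.
std::vector<std::vector<int> > nw_fill(const std::string& s1,
                                       const std::string& s2,
                                       const SimpleScores& sc) {
    const std::size_t m = s1.size();
    const std::size_t n = s2.size();
    std::vector<std::vector<int> > M(m + 1, std::vector<int>(n + 1, 0));

    // Boundary cells (assumption): a prefix aligned against gaps only.
    for (std::size_t i = 1; i <= m; ++i) M[i][0] = M[i - 1][0] + sc.gap;
    for (std::size_t j = 1; j <= n; ++j) M[0][j] = M[0][j - 1] + sc.gap;

    // Inner cells: 0 < i < m+1, 0 < j < n+1.
    // s1(i) in the 1-based formula is s1[i - 1] here.
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int pair_score =
                (s1[i - 1] == s2[j - 1]) ? sc.match : sc.mismatch;
            const int diag = M[i - 1][j - 1] + pair_score;  // letter vs letter
            const int up   = M[i - 1][j] + sc.gap;          // gap step
            const int left = M[i][j - 1] + sc.gap;          // gap step
            M[i][j] = std::max(diag, std::max(up, left));
        }
    }
    return M;
}
```

Note that starting the loops at 1 is exactly what the corrected range says: with the original 1 < i, 1 < j, the first real row and column of the matrix would never be computed.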
