Technion - Israel Institute of Technology

02360523 - Introduction to Bioinformatics

Winter 2019-2020

Frequently Asked Questions - HW2


Q1 - Loading the data file
* Note that the files are “tab delimited” – columns are separated by a tab character (“\t”).
* A possible outcome after you load the data is:
rawcounts # 31181x1513 dataframe (with cell names as column names and gene names as index)
metadata # 1513x6 dataframe (with cell names as index)
* It's OK if you don't use the cell/gene names as index and have an additional column.
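
For illustration, here is a minimal sketch of loading a tab-delimited table in base R, assuming the first column holds the gene names. The toy file and the name `toy_path` are made up so the example runs on its own; they stand in for the real data files.

```r
# Write a tiny tab-delimited file so the example is self-contained.
# (Toy data only; the real count table is far larger.)
toy_path <- tempfile(fileext = ".tab")
writeLines(c("gene\tcell_A\tcell_B",
             "ENSMUSG0001\t5\t0",
             "ERCC-0002\t12\t7"),
           toy_path)

# sep = "\t" tells R that columns are separated by tabs.
# row.names = 1 uses the first column (gene names) as the row index.
# check.names = FALSE keeps the cell names exactly as written in the file.
rawcounts <- read.table(toy_path, header = TRUE, sep = "\t",
                        row.names = 1, check.names = FALSE)

dim(rawcounts)       # number of genes x number of cells
rownames(rawcounts)  # gene names
colnames(rawcounts)  # cell names
```

Dropping `row.names = 1` keeps the gene names as an ordinary first column instead, which is the "additional column" variant mentioned above.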

Q3 - Filtering the data

* Combine the three conditions with a logical AND operator. The remaining samples should satisfy all three conditions.
* You should be left with 510 cells. The section was updated accordingly.
* A possible outcome of this section is:
filtered_rawcounts # 31181x510 dataframe (with cell names as column names and gene names as index)
filtered_metadata # 510x6 dataframe (with cell names as index)
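
A minimal sketch of combining three conditions with AND and applying the result to both tables. The column names (`group`, `batch`, `n_reads`) and the thresholds are hypothetical placeholders, not the conditions from the assignment.

```r
# Toy metadata with cell names as row names.
# Columns and thresholds are hypothetical stand-ins for the real conditions.
metadata <- data.frame(
  group   = c("A", "A", "B", "A"),
  batch   = c(1, 2, 1, 1),
  n_reads = c(900, 1500, 2000, 1800),
  row.names = c("cell_1", "cell_2", "cell_3", "cell_4")
)
# Toy count table with cell names as column names and gene names as row names.
rawcounts <- data.frame(
  cell_1 = c(0, 3), cell_2 = c(5, 1), cell_3 = c(2, 2), cell_4 = c(7, 0),
  row.names = c("ENSMUSG0001", "ERCC-0002")
)

# Element-wise AND (&): a cell is kept only if all three conditions hold.
keep <- metadata$group == "A" & metadata$batch == 1 & metadata$n_reads > 1000

filtered_metadata <- metadata[keep, , drop = FALSE]
# Select the same cells, by name, from the columns of the count table.
filtered_rawcounts <- rawcounts[, rownames(filtered_metadata), drop = FALSE]

dim(filtered_metadata)
dim(filtered_rawcounts)
```

Selecting the count columns by cell name, rather than by the logical vector, keeps the two tables in sync even if their cell order differs.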

Q4b - quality control
What do we need to compare?
What are the synthetic genes?
Where do we get the known concentrations from?

The counts data table contains 31,089 rows of the form ENSMUSG#### - these are real genes whose expression was measured.
In addition, there are 92 rows of the form ERCC-#### - these are "synthetic genes" that were introduced to each sample in known concentrations for quality control.
The concentration for each such synthetic gene is given in the file "ERCC_conc.tab".
You need to:
1. Split the count table into two tables: one for the real genes and one for the synthetic genes.
2. For each sample you need to calculate the correlation between the 92 count values for these synthetic genes and their known concentrations.
A correlation is calculated on a set of (x,y) value pairs. In our case, for each sample, we have 92 such pairs, each of which represents an ERCC gene. The x value is the known concentration from the ERCC_conc file and the y value is the count of that gene in the tested sample.
If you give the function cor two parameters X and Y, where X is a vector of x values and Y is the vector of corresponding y values (the order must match), the result is the correlation between X and Y.
Note that the known concentrations file is not sorted by gene name. You need to sort the values by gene name before you calculate the correlation, so that their order matches the counts.
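
The two steps above can be sketched as follows on toy data. The column names `gene` and `conc` are assumptions and may differ from those in ERCC_conc.tab. The sketch aligns the concentrations to the counts by gene name with `match` instead of sorting both tables; either approach gives matching order.

```r
# Toy count table: 2 real genes and 3 synthetic (ERCC) genes, 2 samples.
counts <- data.frame(
  cell_1 = c(10, 0, 1, 10, 100),
  cell_2 = c(3, 8, 50, 2, 7),
  row.names = c("ENSMUSG0001", "ENSMUSG0002",
                "ERCC-0001", "ERCC-0002", "ERCC-0003")
)
# Toy concentrations, deliberately NOT in the same order as the counts.
# The column names here are assumptions for illustration.
ercc_conc <- data.frame(
  gene = c("ERCC-0003", "ERCC-0001", "ERCC-0002"),
  conc = c(200, 2, 20)
)

# Step 1: split the table by row-name prefix.
is_ercc    <- grepl("^ERCC-", rownames(counts))
ERCCcounts <- counts[is_ercc, , drop = FALSE]
genecounts <- counts[!is_ercc, , drop = FALSE]

# Align the concentrations to the row order of ERCCcounts by gene name.
conc <- ercc_conc$conc[match(rownames(ERCCcounts), ercc_conc$gene)]

# Step 2: one correlation per sample (column) against the concentrations.
ercc_cor <- sapply(ERCCcounts, function(y) cor(conc, y))
ercc_cor
```

The result is a named vector with one correlation value per sample, which can then be inspected to tell good samples from bad ones.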

We get an error message for the line library(DESEQ2) saying there is no package named "DESEQ2".
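You need to install DESeq2. Note that R package names are case-sensitive and the package is spelled DESeq2, so the line should read library(DESeq2). Open RStudio and run the following commands in the console window:
install.packages(c("knitr","ggplot2","readr","dplyr","BiocManager","NMF"))
BiocManager::install("DESeq2")
See the PDF with details in the HW2 section.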

We observe "weird" behaviour while re-knitting our Rmd file
RStudio keeps some of the knitting output in a cache. Try clearing the knit cache from the Knit menu before re-knitting.

Q4 - Filtering using QC
* You should be left with 307 samples after this section. It was updated.
* A possible outcome of this section is:
filtered_metadata # 307x6 dataframe (with cell names as index)
filtered_ERCCcounts # 92x307 dataframe (with cell names as column names and gene names as index)
filtered_genecounts # 31089x307 dataframe (with cell names as column names and gene names as index)
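
A minimal sketch of applying one QC selection to all three tables so they stay consistent. It assumes the QC score is a per-sample correlation and uses a hypothetical cutoff `qc_threshold`; neither the toy values nor the cutoff come from the assignment.

```r
# Toy per-sample QC scores, named by cell. Values are made up.
ercc_cor <- c(cell_1 = 0.95, cell_2 = 0.40, cell_3 = 0.88)
qc_threshold <- 0.8  # hypothetical cutoff, not the assignment's value

metadata   <- data.frame(group = c("A", "B", "A"),
                         row.names = names(ercc_cor))
ERCCcounts <- data.frame(cell_1 = c(1, 9), cell_2 = c(4, 4), cell_3 = c(2, 8),
                         row.names = c("ERCC-0001", "ERCC-0002"))
genecounts <- data.frame(cell_1 = c(0, 5), cell_2 = c(3, 3), cell_3 = c(6, 1),
                         row.names = c("ENSMUSG0001", "ENSMUSG0002"))

# Names of the samples that pass QC.
good_cells <- names(ercc_cor)[ercc_cor >= qc_threshold]

# Apply the same selection everywhere:
# rows of the metadata, columns of both count tables.
filtered_metadata   <- metadata[good_cells, , drop = FALSE]
filtered_ERCCcounts <- ERCCcounts[, good_cells, drop = FALSE]
filtered_genecounts <- genecounts[, good_cells, drop = FALSE]

dim(filtered_metadata)
dim(filtered_genecounts)
```

A quick sanity check is that the row names of the filtered metadata equal the column names of both filtered count tables.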

Q5 - Filtering lowly expressed genes
* You should be left with 9088 genes after this section. It was updated.
* A possible outcome of this section is:
filtered_genecounts # 9088x307 dataframe (with cell names as column names and gene names as index)
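
A minimal sketch of one way to drop lowly expressed genes: keep a gene only if it reaches a minimum count in a minimum number of cells. The criterion and the values of `min_count` and `min_cells` are hypothetical; use the rule stated in the assignment.

```r
# Toy gene count table: 3 genes x 3 cells.
genecounts <- data.frame(
  cell_1 = c(0, 12, 1), cell_2 = c(0, 30, 0), cell_3 = c(1, 25, 0),
  row.names = c("ENSMUSG0001", "ENSMUSG0002", "ENSMUSG0003")
)
min_count <- 10  # hypothetical threshold
min_cells <- 2   # hypothetical threshold

# For each gene (row), count the cells in which it reaches min_count reads.
n_expressing <- rowSums(genecounts >= min_count)

# Keep only the genes expressed in enough cells; all cells are retained.
filtered_genecounts <- genecounts[n_expressing >= min_cells, , drop = FALSE]
dim(filtered_genecounts)
```

Only rows are removed here, so the number of columns (cells) should be unchanged after this step.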

Part 2 - Q1 - GOrilla
* You only need to use the sorted gene names in GOrilla. Ignore all other values that you exported from R.
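
A minimal sketch of exporting only the gene names, in sorted order, one name per line. The ranking column `score` and the sort direction are assumptions for illustration; sort by whatever value the assignment specifies.

```r
# Toy results table with gene names as row names.
# The column name `score` is a hypothetical ranking value.
results <- data.frame(
  score = c(0.2, 0.001, 0.05),
  row.names = c("ENSMUSG0001", "ENSMUSG0002", "ENSMUSG0003")
)

# Order the gene names by the ranking value (smallest first here).
sorted_genes <- rownames(results)[order(results$score)]

# Write the names only, one per line, without any other columns.
out_path <- tempfile(fileext = ".txt")
writeLines(sorted_genes, out_path)
readLines(out_path)
```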

Q3c - How do we need to show an example of a good and a bad sample?
Plotting the read counts vs. the known concentrations using a scatter plot is enough.
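
A minimal sketch of such a pair of scatter plots in base R. All values are toy data, and the vector names `good_sample` and `bad_sample` are made up; in the report, these would be the ERCC counts of two samples you picked.

```r
# Toy known concentrations and ERCC read counts for two samples.
conc        <- c(2, 20, 200, 2000)
good_sample <- c(1, 12, 95, 1100)   # counts track the concentrations
bad_sample  <- c(40, 3, 0, 15)      # counts do not track the concentrations

# Two panels side by side: one scatter plot per sample.
par(mfrow = c(1, 2))
plot(conc, good_sample, main = "Good sample",
     xlab = "Known concentration", ylab = "Read count")
plot(conc, bad_sample, main = "Bad sample",
     xlab = "Known concentration", ylab = "Read count")

# The correlations that the plots illustrate.
cor(conc, good_sample)
cor(conc, bad_sample)
```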

Should we include the R commands in the report?
Yes. We need to see the code you used for each section. Use the 'echo = TRUE' parameter in the R chunk for that (it should be set that way by default if you use the template file).
