Where should the output of the calculation be written to?
The functions you write receive "int* vector" as their first argument. Use it as both input and output.
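A minimal sketch of what "use it as both input and output" means, assuming a hypothetical function name and a length parameter (the actual signatures come from the assignment header):

```c
#include <stddef.h>

/* Hypothetical example: the result overwrites the input vector in
 * place. The name "negate" and the parameter "n" are illustrative. */
void negate(int *vector, size_t n) {
    for (size_t i = 0; i < n; ++i)
        vector[i] = -vector[i];  /* output written back into the input */
}
```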
May we modify the Makefile? Speed is important, and the original Makefile compiles our code without any optimizations.
No. We will use our Makefile to compile and test your code, so that won't help.
If the server indeed has 8 cores, we should be able to get the highest speedup at 8 threads, but at 8 threads the performance is the worst! Can you suggest a reason why 4 and 16 threads are faster than 8?
In general it looks very much like a mixture of two effects: cache size and cache coherence traffic overhead. With 8 cores the cache is used in a suboptimal way, causing a lot of coherence traffic between the CPUs. Note that the server actually has two 4-core CPUs with fully coherent caches, so whenever you cross the boundary of a single CPU (anything more than 4 threads) you are going to hit much higher latency when accessing the data. 16 threads may improve this in two ways: first, the data is split into smaller blocks that may fit into higher levels of the cache; second, it may reduce the coherence traffic because each thread accesses more local data. It depends on the implementation A LOT. The difference between 8 and 16 should become negligible for very large data sets.
Can we use inline assembly in our code?
No. Use standard C (C99) + OpenMP please. |
Our code compiles and runs, but it only executes in one thread. We've set OMP_NUM_THREADS, but it was ignored.
You're probably trying to compile your code directly, without using the provided Makefile. Add the following flags to gcc: -fopenmp to add language support, and -lgomp to link against the OpenMP library. If this is not the case, make sure your #pragma commands are spelled correctly; if they are not, the compiler will silently ignore them.
Should we achieve the 3.2 speedup compared with our own implementation or with the supplied one? Which one is presented at the end of the run?
The speedup presented at the end of the run is relative to the supplied serial implementation. The bonus will be awarded to those who get the highest reported speedup. You should get a speedup of at least 3.2 compared with your own implementation running with 1 thread.
How do we know if a core is equivalent to a BCE or not?
A BCE is indeed a relative definition. For this question, consider slide #6 in the tutorial. Do the server cores resemble the BCE or the large core? |
Is it forbidden to use "#pragma omp parallel for" at all?
No, you are allowed to use "#pragma omp parallel for". What is forbidden is using it without specifying the scheduling option explicitly, thereby falling back on the default, which is implementation dependent.
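In other words, any explicit schedule clause is fine; a sketch (the function, array, and chunking choice are illustrative, not part of the assignment):

```c
#include <stddef.h>

/* A parallel loop with an EXPLICIT scheduling option, as required:
 * schedule(static) splits the iterations into equal contiguous chunks
 * instead of leaving the choice to the implementation default.
 * schedule(dynamic) or schedule(guided) would be equally acceptable. */
void scale(int *vector, size_t n, int factor) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        vector[i] *= factor;
}
```

Without -fopenmp the pragma is ignored and the loop simply runs serially, so the function behaves the same either way.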
Does the previous answer apply to "#pragma omp for" as well, and not only to "#pragma omp parallel for"?
Yes, it applies to all the ways of using the for construct and its scheduling option. |
Does the prohibition of using the default scheduling of the for construct apply to both sections?
Yes, to both the simple and the fast implementation.