In the simple implementation, can we assume that computing a bitwise AND and counting the number of 1 bits using x &= (x - 1) are O(1)?
The loop runs once per 1 bit. We don't know the exact number of 1 bits, but it is at most 32, so it is O(1). However, you shouldn't use O(...) notation here; you should write a precise formula so you can count the number of operations your program executes. You may assume that a bitwise AND and the bit counting are each 1 operation.
Are there specific criteria we're graded on (for example: "16M integers, 2-8 threads")?
Your fast implementation should be able to handle up to 16M integers and should be able to run with at least 128 threads (note that your memory allocations shouldn't depend on the number of threads).
May we modify the Makefile? Speed is important, and the original Makefile compiles our code without any optimizations.
No. We will use our Makefile to compile and test your code, so that won't help.
When running the fast parallel version, it somehow runs twice instead of once (it says "running parallel fast" twice). Why?
It does run twice, but we print the average of the two run times (so the time you see is not really the time you wait :)).
We have a compilation problem: "walsh-transform.o' is incompatible with i386:x86-64 output".
Use the updated version of the file walsh-transform.o; that should solve the problem.
Does our fast parallel implementation need to beat your serial implementation?
Your solution should scale well up to at least 8-16 threads, especially with large inputs (it may scale down for small inputs; why?).
Where should we write the output of the calculation?
To the matrix (allocated by us) that is passed as a parameter to your functions.
Can we use functions that weren't covered in class?
You can use only material covered in class.
If the server indeed has 8 cores, we should be able to get the highest speedup at 8 threads, but at 8 threads the performance is the worst! Can you suggest a reason for this (4 and 16 threads are faster than 8)?
In general it looks very much like a mixture of two effects: cache size and cache-coherence traffic overhead. With 8 cores the cache is used in a suboptimal way, causing a lot of coherence traffic between the CPUs. Note that the server actually has two 4-core CPUs with fully coherent caches, so whenever you cross the boundary of a single CPU (anything more than 4 threads) you will hit much higher latency when accessing the data. 16 threads may improve this in two ways: first, the data is split into smaller blocks and may fit into higher levels of the cache; second, it may reduce the coherence traffic because each thread accesses more local data. It depends on the implementation A LOT. The difference between 8 and 16 should become negligible for very large data sets.
After compiling, how can I run and test my code?
You should use the run.sh file attached with the hw:
>> chmod +rwx run.sh
>> sh run.sh [number of threads] [size of array] [1 for simple, 2 for fast]