ECE4893A/CS4803MPG – Homework #8

ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2009

Homework #8: The Cell

8A checkpoint due: Saturday, Dec. 5 at 11:59 PM (via T-square)

8B boss battle due: Friday, Dec. 11 at 2:20 PM (via T-square)


Late policy: The “8A checkpoint” will be graded out of 20 points;
the “8B boss battle” will be graded out of 100 points. We will
accept late submissions; however,
for every day that a submission is overdue (including weekend days),
we will subtract 10 points from the total for 8A, or 20 points from
the total for 8B. 8B will not be accepted past 2:20 PM on Sunday,
Dec. 13, since I need to turn in grades the following day.

(We understand that sometimes multiple assignments hit at once, or other
life events intervene, and hence you have to make some tough choices. We’d
rather let you turn something in
late, with some points off, than have a “no late assignments
accepted at all”
policy, since the former encourages you to still do the assignment
and learn something from it, while the latter just grinds down your soul.)


Note about the unusual due dates:
Many students I
have talked to have major projects due on Friday of “dead week.” I set the
“checkpoint” to be due the following day so people slammed with multiple
projects that are due Friday have some breathing room, but I strongly
recommend handling the checkpoint, which is meant to make sure you can compile
and run Cell programs, early. The 2:20 PM due date for the “boss battle”
on Friday of finals week is set to correspond to when our final would have
ended if we had actually had a final; this seems like a reasonable approach
since we’re not doing a final. Again, I recommend you try to get the
assignment finished early, but this gives you the flexibility to manage
your time however seems best to you.


Review:

  • Be sure to sign up for a
    CellBuzz account. Prof.
    Bader, who is in charge of the Cell cluster, has to manually approve these,
    so sign up as soon as possible.

  • Read over the
    CellBuzz
    User Guide
    carefully; be sure to note the section near the
    bottom on how to specifically log into
    blades running the 3.0 SDK. The system defaults
    to giving you blades running the 3.1 SDK, which may be problematic since
    the VMWare image with the simulator described in the points below runs
    the 3.0 SDK.

  • Remember cell-user.cc.gatech.edu isn’t a Cell machine, so your Cell
    code won’t run on it! It’s the job of cell-user to assign you an actual Cell
    blade.

  • You can download the VMWare
    Player and the Cell workshop VMWare image from
    here.
    You log in as root. The password is inn0vate.

  • To start the Cell simulator, go to the
    directory /opt/ibm/systemsim-cell/run/cell/linux and run
    ../run_gui. Clicking the “Go” button will start Cell Linux running
    in the simulator; it takes quite a while to start up.

  • To import executables (or whatever)
    from the main Linux image running in the VMWare Player into
    the Cell Linux image running in the simulator, type
    callthru source directoryinmainlinux/myfilename > myfilename.

  • After the import, you’ll typically need to use chmod
    a+x myfilename
    to make
    the file executable.

  • Be sure to set various options to the “fast” modes, or you will be
    waiting all eternity for the Cell Linux to boot and your program to run.

8A Checkoff:
Modify the
example in
/opt/cell_class/Hands-on-30/basicDMA/DMA_getbuf_libspe2-async
to replace the
strings “Good morning!” and “Guten Morgen!”
with your name and the name of
your
favorite
video game, respectively (for instance, “Aaron Lanterman” and “Metal
Gear Solid”.)
Submit two screenshots:
(1) a screenshot of your complete VMWare screen
showing you running the program in the simulator and (2) a screenshot of a
terminal window showing you running it on the
CellBuzz cluster.
This checkoff is
obviously intended to make sure you get
the mechanics of compiling and
running Cell code out of the way, and
that you’re not still trying to figure out how to compile and run code the day
before 8B is due. (If you
manage to complete 8B by the 8A deadline, you can directly turn in 8B early
and not bother with 8A.)


This may seem trivial, and it is; but if I don’t assign this as a separate
early turn-in, believe me, I will
be getting a dozen e-mails Thursday night saying “I can’t download the
image/I
can’t get VMWare Player to work/I can’t log in to CellBuzz/I forgot to sign
up for a CellBuzz account what should I do now/etc.”


Warning about compiling other examples:
The example used in 8A seems to compile fine, but
I think some of the examples in the cell_class directory may have been
made for an earlier version of the SDK, so you need to do some tweaking to
get them to work. In particular, IBM seems to have moved the locations of
needed files around just to annoy us.


8B Boss Battle: Quaternions are a computationally efficient
and conceptually intriguing
alternative to rotation matrices. To concatenate rotations represented
by quaternions, you just multiply the quaternions, just like you do
with rotation matrices. Before continuing, please read over the set
of supplemental slides on quaternions that we have prepared
(slides,
4-up), particularly the slide on
quaternion multiplication.


We will create a Cell program to do some computations with quaternions.
We will create two sets of quaternions, which we will call A and B, each
containing 1,000 quaternions, numbered 0 through 999.
You can “hard code” the number of quaternions as 1000,
i.e. declare arrays the size
you need, and not worry about “mallocing” variable amounts of space for
an arbitrary number of quaternions.


We will
initialize quaternions 0 through 399 of set A as


A(n) = {cos([2*pi*n/400]/2),sin([2*pi*n/400]/2)*(1,0,0)},


quaternions 400 to 699 of set A as


A(n) = {cos([2*pi*(n-400)/300]/2),sin([2*pi*(n-400)/300]/2)*(0,1,0)},


and
700 to 999 of set A as


A(n) = {cos([2*pi*(n-700)/300]/2),sin([2*pi*(n-700)/300]/2)*(0,0,1)}


We will initialize quaternion set B
as


B(n) = {cos([2*pi*n/1000]/2),sin([2*pi*n/1000]/2)*(sqrt(1/6),sqrt(1/3),sqrt(1/2))}


Initialize your quaternions on the PPE, and then send that data to one of
the SPEs.
On the SPE, iterate the computation
A(n) = A(n)*B(n), for all n=0…999 quaternions, a total of 20 times.
Here the equal sign represents assignment (not mathematical equality)
and the * sign represents the quaternion multiplication
as defined on slide 3 of
the supplemental slides. Do all computations in single-precision
floating point. Note that A will change
as a result of these iterations, but B will remain constant. This models a
situation in which A represents the orientations of objects,
and B represents
an amount by which they rotate over a given time interval. Organize
your computation such that the 20 iterations are the “outer loop” and
the computation on the 1,000 quaternion pairs is the “inner loop” – the idea
is that in a real game, some of the other SPEs may be doing other kinds
of computations at each iteration of the “outer loop.”
When your iterations are done, ship the data for the final state of the A
quaternions back to the PPE.
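As a point of reference for the loop structure, here is a scalar sketch of the iteration. It assumes the standard Hamilton product with the scalar part stored first; check it against slide 3 before relying on it, since the slides are the authoritative definition. The names quat_mul and rotate_all are hypothetical. This is deliberately old-fashioned scalar code, so it is useful for checking your SIMD results but is not itself what the assignment asks for.

```c
#include <assert.h>

#define NUM_QUATS 1000
#define NUM_ITERS 20

/* Assumed layout: scalar part first, then the vector part. */
typedef struct { float w, x, y, z; } quat;

/* Hamilton product p*q:
 * (pw, pv)(qw, qv) = (pw*qw - pv.qv, pw*qv + qw*pv + pv x qv). */
static quat quat_mul(quat p, quat q)
{
    quat r;
    r.w = p.w * q.w - p.x * q.x - p.y * q.y - p.z * q.z;
    r.x = p.w * q.x + p.x * q.w + p.y * q.z - p.z * q.y;
    r.y = p.w * q.y - p.x * q.z + p.y * q.w + p.z * q.x;
    r.z = p.w * q.z + p.x * q.y - p.y * q.x + p.z * q.w;
    return r;
}

/* Outer loop over the 20 iterations, inner loop over the 1,000 pairs.
 * A is updated in place; B is read-only and must not alias A. */
static void rotate_all(quat a[NUM_QUATS], const quat b[NUM_QUATS])
{
    int iter, n;

    assert(a != b);
    for (iter = 0; iter < NUM_ITERS; iter++)
        for (n = 0; n < NUM_QUATS; n++)
            a[n] = quat_mul(a[n], b[n]);
}
```

Quaternion multiplication does not commute, so the order A(n)*B(n) matters; swapping the operands gives a different result in general.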


Do the same calculation on the PPE.
If you are clever you can have
the PPE and the SPE calculating simultaneously; alternatively, you can
do the calculations sequentially, doing the PPE calculations first and the
SPE calculations second or vice-versa. (I’m not picky about this since
just getting the DMA working and getting a handle on the SIMD instructions
is difficult enough.)




When all the computations are done, first
have the PPE print out the final quaternions A(n)
for n=0, 600, and 999 that were computed on the PPE, and then have
the PPE print out the final quaternions A(n)
for n=0, 600, and 999 that were computed on the SPE. Note that these
results may differ slightly due to the SPEs handling floating point
numbers differently than the PPE.
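Because the two sets of results may not match bit for bit, it can help to compare them with a tolerance rather than with exact equality. The following is a minimal sketch; the names print_quat and max_abs_diff, the struct layout, and the output format are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed layout: scalar part first, then the vector part. */
typedef struct { float w, x, y, z; } quat;

static float abs_diff(float p, float q)
{
    float d = p - q;
    return d < 0.0f ? -d : d;
}

/* Print one quaternion, labeled with where it was computed ("PPE" or "SPE"). */
static void print_quat(const char *label, int n, quat q)
{
    printf("%s A(%d) = {%.7f, (%.7f, %.7f, %.7f)}\n",
           label, n, q.w, q.x, q.y, q.z);
}

/* Largest absolute component-wise difference between two result sets. */
static float max_abs_diff(const quat *ppe, const quat *spe, int count)
{
    float worst = 0.0f;
    int n;

    assert(count > 0);
    for (n = 0; n < count; n++) {
        float d[4];
        int c;
        d[0] = abs_diff(ppe[n].w, spe[n].w);
        d[1] = abs_diff(ppe[n].x, spe[n].x);
        d[2] = abs_diff(ppe[n].y, spe[n].y);
        d[3] = abs_diff(ppe[n].z, spe[n].z);
        for (c = 0; c < 4; c++)
            if (d[c] > worst)
                worst = d[c];
    }
    return worst;
}
```

A small maximum difference is what you would expect here, while a large one more likely points to a bug in one of the two implementations (or in the DMA transfer) than to rounding.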




Make extensive and clever use of SIMD intrinsics, particularly in your
quaternion multiplications,
both on the PPE and the SPE, to make
the Cell purr.

If you do not use SIMD intrinsics and strictly
write old-fashioned
scalar code, not only will your
Cell code not purr, but you will lose points since
getting the hang of using those intrinsics is one of the main points
of the assignment. Note that although the SIMD intrinsics on
PPE and SPE are similar, they are not the same. (You don’t need to use
SIMD intrinsics in the quaternion initialization phase. You can if you want,
but I won’t be picky about that since that code is only run once.)


In setting up your code, you will need to give careful thought to how
you want to organize your data. Mike Acton, Engine Director for
Insomniac Games, is fond of saying that “data is more
important than code,” and noting this is particularly true on the Cell.
You will need to think about whether
you want to adopt an “array-of-structures” (AOS) style or a
“structure-of-arrays” (SOA) style. An AOS style is natural on a GPU,
since you could put the four elements of the quaternions into the
(r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you
have “free”
swizzle operations that would make implementing quaternion multiplication
easy with that format. However, on the Cell, data permutation can be costly,
so it might be better to take an SOA approach and pack all instances of one
quaternion coordinate in one giant array, all instances
of a second coordinate in another array, etc., and have your quaternion
multiply essentially operate on four quaternion pairs at a time.
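To make the two layouts concrete, here is a sketch in portable C (using the GCC alignment attribute). The type and function names are hypothetical, the product formula assumes the standard Hamilton product with the scalar part first, and the inner four-lane loop is plain scalar code standing in for what a vector version would do with one SIMD operation per line. It is meant to show the data organization, not to be the SIMD implementation.

```c
#include <assert.h>

#define NUM_QUATS 1000

/* AOS: each quaternion's four components sit next to each other. */
typedef struct { float w, x, y, z; } quat_aos;

/* SOA: one array per component, each aligned to a 16-byte boundary so
 * that every group of four floats lines up with a 128-bit vector. */
typedef struct {
    float w[NUM_QUATS] __attribute__((aligned(16)));
    float x[NUM_QUATS] __attribute__((aligned(16)));
    float y[NUM_QUATS] __attribute__((aligned(16)));
    float z[NUM_QUATS] __attribute__((aligned(16)));
} quat_soa;

/* One-time conversion, e.g. after initializing in AOS form. */
static void aos_to_soa(const quat_aos *in, quat_soa *out)
{
    int n;
    for (n = 0; n < NUM_QUATS; n++) {
        out->w[n] = in[n].w;
        out->x[n] = in[n].x;
        out->y[n] = in[n].y;
        out->z[n] = in[n].z;
    }
}

/* a[i] = a[i]*b[i], handled four quaternion pairs at a time. In a SIMD
 * version the lane loop disappears: aw, ax, ... each become one vector
 * holding four consecutive floats, and no permutes are needed because
 * every operand is already a full vector of like components. */
static void soa_mul_inplace(quat_soa *a, const quat_soa *b)
{
    int n, lane;

    assert(NUM_QUATS % 4 == 0);
    for (n = 0; n < NUM_QUATS; n += 4) {
        for (lane = 0; lane < 4; lane++) {
            int i = n + lane;
            float aw = a->w[i], ax = a->x[i], ay = a->y[i], az = a->z[i];
            float bw = b->w[i], bx = b->x[i], by = b->y[i], bz = b->z[i];

            a->w[i] = aw * bw - ax * bx - ay * by - az * bz;
            a->x[i] = aw * bx + ax * bw + ay * bz - az * by;
            a->y[i] = aw * by - ax * bz + ay * bw + az * bx;
            a->z[i] = aw * bz + ax * by - ay * bx + az * bw;
        }
    }
}
```

With 1,000 single-precision quaternions, either layout occupies 16,000 bytes per set. Assuming the usual 16 KB ceiling on a single DMA transfer, that is small enough to move each set in one transfer, but check the transfer-size and alignment rules in the SDK documentation before counting on it.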


Although you will probably want to use the Cell simulator for your
initial development and debugging
(perhaps lowering the number of rotation iterations,
at first, for the sake of speed), your final run of the code should be
on the CellBuzz cluster.


Note we
are not
requiring you to use multiple SPEs or do any SPE-to-SPE communication.


For this part,
upload a ZIP file with your code, makefile, executable, and a screenshot
showing your code running on the CellBuzz
cluster, showing your results.


Deliverables: See the descriptions above.


Be sure to finish sufficiently in
advance of the deadline that you will
be able to work around any troubles
T-square gives you to successfully
submit before the deadline. If you have
trouble getting T-square to work,
please e-mail your compressed
file to lanterma@ece.gatech.edu,
with “MPG HW #8A” or “MPG HW #8B” and your full
name in the header line;
please only use
this e-mail submission as a last resort if T-square isn’t working.


Playstation 3 option: If you happen to
have Linux installed on your own Playstation 3,
you are welcome to try using it instead of the CellBuzz cluster. Please
contact me if you intend to take that route.


Discussion board: As an experiment, we
have set up a “HW #8” discussion board where
students can discuss the homework and in particular ask questions that
the professors and other students can help answer. (Try to avoid posting
significant chunks of code on the discussion board; those are probably best
directly e-mailed to the professors.)


Ground rules:
You are welcome to discuss
high-level implementation issues
with your fellow students,
but you should avoid actually looking at
another student’s
code as a whole, and
under no circumstances should you be copying
any portion of another student’s code.
However, asking another student to
focus on a few lines of your code to discuss
why you are getting a particular
kind of error is reasonable.
Basically, these “ground rules” are
intended to prevent a student from
“freeloading” off another student,
even accidentally, since they
won’t get the full yummy nutritional
educational goodness out of the assignment if they do.


Looking at
code from homeworks done in previous years is strictly prohibited.