ECE4893A/CS4803MPG – Homework #7

ECE4893A/CS4803MPG – Homework #7

ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2008

Homework #7: The Cell

7A checkpoint due: Friday, Dec. 5 (in class at the start of class)

7B boss battle due: Sunday, Dec. 7 at 5:00 PM (via T-square)

Late policy: The “7A checkpoint” will be graded out of 20 points;
the “7B boss battle” will be graded out of 100 points. We will
accept late submissions; however,
for every day that a submission is overdue (including weekend days),
we will subtract 10 points (for 7A) or 20 points from
the total (for 7B).
(We understand that sometimes multiple assignments hit at once, or other
life events intervene, and hence you have to make some tough choices. We’d
rather let you turn something in
late, with some points off, than have a “no late assignments
accepted at all”
policy, since the former encourages you to still do the assignment
and learn something from it, while the latter just grinds down your soul.)

Note about the unusual Sunday due date:
Having anything due during finals
week has proved to be wildly unpopular in previous semesters; I thought
it would be good to give people more time, and give them flexibility in
managing their time, but it seems that it
distracts people from studying for finals.
In particular,
having somethig due near the end of finals week was said to put people with
flight plans, family coming into town for graduation, etc. at a severe
disadvantage. On the other hand, many students I
have talked to have projects
due or exams on Friday of dead week, were begging to give them at
least the
weekend before finals
to work on it, so they
could spend this week working on their other projects.
The Sunday 5:00 date/time is a compromise between those two
positions; you an turn it in and then spend Sunday evening studying for your
Monday exams.
If you do not like the Sunday due date, simply pretend that it is
due the preceeding Friday.


  • You can download the VMWare
    Player and the Cell workshop VMWare image from
    You log in as root. The password is inn0vate.

  • To start the Cell simulator, go to the
    directory /opt/ibm/systemsim-cell/run/cell/linux and run
    ../run_gui. Clicking the “Go” button will start the Cell Linux running
    in the simulator takes a quite while to start up.

  • To import executables (or whatever)
    from the main Linux image running in the VMWare Player into
    the Cell Linux image running in the simulator, type
    callthru source directoryinmainlinux/myfilename > myfilename.

  • After the import, you’ll need to use chmod
    a+x myfilename
    to make
    the file executable.

  • Be sure to set various options to the “fast” modes, or you will be
    waiting all eternity for the Cell Linux to boot and your program to run.

7A Checkoff:
Modify the
example in
to replace the
strings “Good morning!” and “Guten Morgen!”
with your name and the name of
video game, respectively (for instance, “Aaron Lanterman” and “Metal
Gear Solid”.)
Hand in hard copies of two screenshots:
(1) a screenshots of your complete VMWare screen
showing you running the program in the simulator and (2) a screenshot of a
terminal window showing you run it on the
This is checkoff is
obviously intended to make sure you understand
the mechanics of compiling and
running cell code out of the way, and
that you’re not still trying to figure out how to compile and run code the day
before 7B is due. (If you
manage to complete 7B by the 7A deadline, you can directly turn in 7B early
and not bother with 7A.)

This may seem trivial, and it is; but if I don’t assign this as a separate
early turn in, believe me, I will
be getting a dozen e-mails Saturday night saying “I can’t download the
can’t get VMWare Player to work/I can’t log in to Cellbuzz/I forgot to sign
up for a Cellbuzz account what should I do now/etc.”

Warning about compiling other examples:
The example used in 7A seems to compile fine, but
I think some of the examples in the cell_class directory may have been
made for an earlier version of the SDK, so you need to do some tweaking to
get them to work. In particular, IBM seems to have moved the locations of
needed files around just to annoy us.

7B Boss Battle: Quaternions are a computationally efficient
and conceptually intriguing
alternative to rotation matrices. To concatenate rotations represented
by quaternions, you just multiply the quaternions, just like you do
with rotation matrices. Before continuing, please read over the set
of supplemental slides on quaternions that we have prepared
4-up), particularly the slide on
quaternion multiplication.

We will create a Cell program to do some computations with quaternions.
We will create two sets of quaternions, which we will call A and B, each
containing 1,000 quaternions, numbered 0 through 999.
You can “hard code” the number of quaternions as 1000,
i.e. declare arrays the size
you need, and not worry about “mallocing” variable amounts of space for
an arbitary numbers of quaternions.

We will
initialize quaternions 0 through 299 of set A as

A(n) = {cos([2*pi*n/300]/2),sin([2*pi*n/300]/2)*(1,0,0)},

quaternions 300 to 599 of set A as

A(n) = {cos([2*pi*(n-300)/300]/2),sin([2*pi*(n-300)/300]/2)*(0,1,0)},

600 to 999 of set A as

A(n) = {cos([2*pi*(n-600)/400]/2),sin([2*pi*(n-600)/400]/2)*(0,0,1)}

We will initialize quaternion set B

B(n) =

Initialize your quaternions on the PPE, and then send that data to one of
the SPEs.
On the SPE, iterate the computation
A(n) = A(n)*B(n), for all n=0…999 quaternions,
where the equal sign represents assignment (not mathematical equality)
and the * sign represents the quaternion multiplication
as defined on slide 3 of
the supplemental slides, 2,000 times. Note that A will change
as a result of these iterations, but B will remain constant. This models a
situation in which A represents the orientations of objects,
and B represents
an amount by which they rotate over a given time interval. Organize
your computation such that the 2,000 iterations is the “outer loop” and
the computation on the 1,000 quaternion pairs is the “inner loop” – the idea
is that in a real game, some of the other SPEs may be doing other kinds
of computations at each iteration of the “outer loop.”
your iterations is done, ship the data for the final state of the A
quaternions back to the PPE.

Then, on the PPE, compute a 4×4 matrix for each quaternion A(n)
that contains a rotation matrix in the upper 3×3 submatrix derived
from the quaternion representations (see Slide 7 of supplemental slides.)
Put 0s along the bottommost row and
rightmost row, except for a 1 in the lower right corner. The idea here
is that this is the representation that would be passed to the RSX (if we
had access to the RSX). Each row of the matrix should consist of a
128-bit word with the usual x,y,z,w ordering of four 32-bit bytes, since
that is how the data would be typically be presented
to the GPU. (As an aside, quaternion
elements are usually labeled w,x,y,z, so don’t let that confuse you;
they’re all just labels.)

When all the computations are done, have the PPE print out the quaternion
and associated rotation matrix for quaternions 0, 500, and 999.

Make extensive and clever use of SIMD intrinsics throughout your code,
both on the PPE and the SPE, to make
the Cell purr.

If you do not use SIMD intrinsics and strictly
write old-fashioned
scalar code, not only will your
Cell code not purr, but you will lose points since
getting a hang of using those instrinsics is one of the main points
of the assignment.

In setting up your code, you will need to give careful thought to how
you want to organize your data. Mike Acton, Engine Director for
Insomniac Games, is fond of saying that “data is more
important than code,” and noting this is particularly true on the Cell.
You will need to think about whether
want to adopt an “array-of-structures” (AOS) style or a
“structure-of-arrays” (SOA) style. An AOS style is natural on a GPU,
since you could put the four elements of the quaternions into the
(r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you
have “free”
swizzle operations that would make implementing quaternion multiplication
easy with that format. However, on the Cell, data permutation can be costly,
so it might be better to take an SOA approach and pack all instance of one
quaternion coordinate in one giant array, all instances
of a second coordinate in another array, etc., and have your quaternion
multiply essentially operate on four quaternion pairs at a time.

Although will you probably want to use the Cell simulator for you
initial development and debugging
(perhaps lowering the number of rotation iterations,
at first, for the sake of speed), your final run of the code should be
on the Cellbuzz cluster.
When you run your code for the screenshot, run it via
“time yourprogramnamehere” at the Linux prompt. This will present
some extremely course timing information. (The Cell SDK provides a wide
variety of sophisticated profiling tools, but we did not have time to
cover them this semester.)

Note we
are not
requiring you to use multiple SPEs or do any SPESPE communication.

For this part,
upload a ZIP file with your code, makefile, executable, and a screenshot
showing your code running on the CellBuzz
cluster and which shows the desired
and timing information.

Deliverables: See the descriptions above.
For both 7A and 7B, please tell us an approximate number of hours
you spent working it,
as well your thoughts on the homework, particularly
suggestions for improving it in the future. (If you don’t have any suggestions,
that’s OK, just tell us an approximate number of hours.)

Be sure to finish sufficiently in
advance of the deadline that you will
be able to work around any troubles
T-square gives you to successfully
submit before the deadline. If you have
trouble getting T-square to work,
please e-mail your compressed
file to,
with “MPG HW #7B” and your full
name in the header line;
please only
use this e-mail submission as a last resort if T-square isn’t working.