ECE4893A/CS4803MPG – Homework #5

ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2007

Homework #5: The Cell

5A checkpoint due: Friday, Nov. 7 at 23:59:59 (via T-square)

5B boss battle due: Friday, Nov. 14 at 17:00:00 (via T-square)

Late policy: The “5A checkpoint” will be graded out of 20 points;
the “5B boss battle” will be graded out of 100 points. We will
accept late submissions; however,
for every day that a submission is overdue (including weekend days),
we will subtract 10 points (for 4A) or 20 points from
the total (for 4B). The latest you can turn it in is Sunday,
November 16, at 5:00 PM; this is because we need to turn in grades by noon
on the following Monday.
(We understand that sometimes multiple assignments hit at once, or other
life events intervene, and hence you have to make some tough choices. We’d
rather let you turn something in
late, with some points off, than have a “no late assignments
accepted at all”
policy, since the former encourages you to still do the assignment
and learn something from it, while the latter just grinds down your soul.)

<!–[I’ve gotten some good questions on this homework; I’ve included some
clarifications and modifications in the writeup below in boldface.] –>

Review:
Remember the following tips from Session 34, which Seunghwa Kang presented:

You can download the VMWare Player and the Cell workshop VMWare image from
here.
You log in as root.
Note that the password in the README file
is incorrect; see the cited webpage for the real one.
To start the Cell simulator, go to the
directory opt/ibm/systemsim-cell/run/cell/linux and run
../run_gui. Clicking the “Go” button will start the Cell Linux running
in the simulator takes a quite while to start up.
To import executables (or whatever)
from the main Linux image running in the VMWare Player into
the Cell Linux image running in the simulator, type
callthru source directoryinmainlinux/myfilename > myfilename.
After the import, you’ll need to use chmod a+x myfilename to make
the file executable.

5A Checkoff:
Modify the example in /opt/cell_class/Hands-on/DMA/example1b to replace the
strings “Good morning!” and “Guten Morgen!” with your name and the name of your
favorite
video game, respectively. Upload two screenshots:
(1) a screenshots of your complete VMWare screen
showing you running the program in the simulator and (2) a screenshot of a
terminal window showing you run it on the
CellBuzz
cluster.
This is checkoff is
obviously intended to make sure you get the mechanics of compiling and
running cell code out of the way before the weekend after classes hits, and
that you’re not still trying to figure out how to compile and run code the day
before 5B is due. (If you
manage to complete 5B by the 5A deadline, you can directly turn in 5B early
and not bother with 5A. However, I
suspect few students will be able to get 5B working
by then.)

Warning about compiling the examples:
I think the examples in the cell_class directory may have been
made for an easlier version of the SDK, so you need to do some tweaking.

When you try compiling, you’ll get the complaint:

No rule to make target /opt/ibmcmp/xlc/8.1/include/spu_intrinsics.h

Digging around, I found there is a /opt/ibmcmp/xlc/8.2 directory, not an
/opt/ibmcmp/xlc/8.1 directory. I resolved this my changing the 8.1’s in
the *.d files (which seem to hold dependencies) to 8.2’s. (A more elegant
approach migh be to just rename the 8.2 directory to 8.1, but I haven’t tried it,
so I’m not sure if that breaks something else).

Once I fixed that, it then complained that it couldn’t find things like:

/opt/cell/toolchain-3.3/sysroot/usr/include/stdio.h

Well, it turns out /opt/cell/sysroot/blah blah blah exists,
so if you remove
the toolchain-3.3 part
(or perhaps put a symbolic link in the cell directory called
toolchain-3.3 that points to sysroot), it seems happy.

5B Boss Battle:
Here, we will implement a tiny part of a physics engine for
300 Spartan
warriors and 700 Theban volunteers
tumbling through free space after falling off a cliff. (Most people
only talk about the 300 Spartans, but they forget the 700 Thebans,
as well as the 1,000 or so slaves that also faught at the
Battle of Thermopylae, according to Wikipedia.)
On the way down,
they all caught a glimpse of the gaze of the Medusa, which turned them to
stone. Hence, we can treat them as rigid bodies.

We will treat the Spartans and Thebans as the same.
You can “hard code” the number of warriors as 1000,
i.e. declare arrays the size
you need, and not worry about “mallocing” variable amounts of space for
an arbitary numbers of warriors.

We will only worry about the orientations of the warriors, which will
be represented via quaternions. We will not worry
about their positions, and we will not worry about collisions between them.
(I couldn’t think of a good narrative hack to explain away why we’re not
worrying about collisions.)

Let’s number the warriors as 0 through 999. Referring to the number
of a particular warrior as n, let’s get some variety by initializing
the quaternion for warriors 0 through 299 as

q = {cos([2*pi*n/300]/2),sin([2*pi*n/300]/2)*(1,0,0)}

for warriors 300 to 599 as

q = {cos([2*pi*(n-300)/300]/2),sin([2*pi*(n-300)/300]/2)*(0,1,0)}

for warriors 600 to 999 as

q = {cos([2*pi*(n-600)/400]/2),sin([2*pi*(n-600)/400]/2)*(0,0,1)}

Each warrior is rotating with a different angular velocity
(omega0,omega1,omega2).
Let’s set the angular velocities as

omega0 = pi*(n/1000) radians per second

omega1 = 2*pi*(n/1000) radians per second

omega2 = 3*pi*(n/1000) radians per second

for each of the 1000 warriors.

We will keep the angular velocites constant through the simulation.
You should precompute the angular velocities and store them in
variables, and not recompute them on every iteration.
The idea is that in the “real game,” other forces would modify those
angular velocities (for instance, wind, impulse collisions, magic spells,
etc.)
as gameplay proceeded.

In your lecture notes on physics engines, you will find the differential
equation governing rotational dynamics using quaternion representations:

dq
— = [omega0 i + omega1 j + omega2 k] q / 2.
dt

The above multiplication is quaternion multiplication (see Slide 3
of the Quaternion supplement slides.)
Use a 2nd-order Runge-Kutta method (also known as the midpoint method)
to numerically implement this differential equation. See your notes
from the physics engine lectures or ask Wikipedia for details of the
method.

After each iteration,
renormalize the quaternions to have unit norm. This will
prevent errors from building up. (Typically you would only bother to
do this every-so-many iterations; here, for simplicity, we’ll do it every
iteration.)
Then, compute a 4×4 matrix for each of the warriors
that contains a rotation matrix in the upper 3×3 submatrix derived
for that warrior from the quaternion representations (see Slide 7 of
the Quaternion slide suppelment). Put 0s along the bottommost row and
rightmost row, except for a 1 in the lower right corner. The idea here
is that this is the representation that would be passed to the RSX (if we
had access to the RSX). Each row of the matrix should consist of a
128-bit word with the usual x,y,z,w ordering of four 32-bit bytes, since
that is how the data would be typically be presented
to the GPU. (As an aside, quaternion
elements are usually labeled w,x,y,z, so don’t let that confuse use;
they’re all just labels.)

Use a time stepsize of 1/60th of a second. Run the simulation for
10 seconds of “simulated time.” The code shouldn’t need very much
“read-world time”
to do these calculations.

Generate your initial quaternions and angular velocities on the PPU,
and then send that data to one of the SPUs. Do the simulation on the SPU.
At the end of the 10 seconds of “simulated time,”
send the final result of the 1000 quaternions
and the matrix representations back to the PPU. Have the PPU print out
the quaternion and matrix for “warrior 0,” “warrior 500,” and “warrior 999.”
(Note we
are not requiring you to use multiple SPEs or do any SPESPE communication.)

In setting up your code, you will need to give careful thought to how
you want to organize your data. Mike Acton, Engine Director for
Insomniac Games, is fond of saying that on the Cell, “data is more
important than code.” In particular, you will need to think aboue whether you
want to adopt an “array-of-structures” (AOS) style or a
“structure-of-arrays” (SOA) style. An AOS style is natural on a GPU,
since you could put the four elements of the quaternions into the
(r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you
have “free”
swizzle operations that would make implementing quaternion multiplication
easy with that format. However, on the Cell, data permutation can be costly,
so it might be better to take an SOA approach and pack all instance of one
quaternion coordinate in one giant array, all instances
of a second coordinate in another array, etc., and have your quaternion
multiply essentially operate on four quaternion pairs at a time.

Make extensive and clever use of SIMD intrinsics throughout your code to make
the Cell purr. If you do not use SIMD intrinsics and strictly
write old-fashioned
scalar code, not only will your
Cell code not purr, but you will lose points since
getting a hang of using those instrinsics is one of the main points
of the assignment. (Exception: I don’t care how efficient or not the
initialization on the PPU is. You should focus your SIMD efforts on the
our SPU code.)

When you run your code for the screenshot, run it via
“time yourprogramnamehere” at the Linux prompt. This will present
some extremely course timing information. (The Cell SDK provides a wide
variety of sophisticated profiling tools that I was planning to have you
use, but I decided that would be too much for the limited time we have
left this semester. Just wrapping your head around how to organize the
data and how to use the SIMD intrinsics will probably take up enough of
your time).

You will find the example code in the
/opt/cell_class/Hands-on/euler instructive.
You may feel free to use that code, or any other code you see in the
/opt/cell_class/Hands-on, as the basis for your work for 5B,
or make up something from
scratch.

For this part, upload a ZIP file with your code, executable, and a screenshot
showing your code running on the CellBuzz cluster and which shows the desired
output
and timing information. (You will probably want to do your debugging on
the simulator, perhaps using a smaller number of warriors.)

Deliverables: See the descriptions above for what to upload.
For both 5A and 5B, please tell us an approximate number of hours
you spent working it,
as well your thoughts on the homework, particularly
suggestions for improving it in the future. (If you don’t have any suggestions,
that’s OK, just tell us an approximate number of hours.)

Be sure to finish sufficiently in
advance of the deadline that you will
be able to work around any troubles
T-square gives you to successfully
submit before the deadline. If you have
trouble getting T-square to work,
please e-mail your compressed
file to lanterma@ece.gatech.edu,
with “MPG HW #5A” or “MPG HW #5B” and your full
name in the header line;
please only
use this e-mail submission as a last resort if T-square isn’t working.

When you submit your homework, please tell us an approximate number of hours
you spent working on it, as well your thoughts on the homework, particularly
suggestions for improving it in the future. (If you don’t have any suggestions,
that’s OK, just tell us an approximate number of hours.) This will help us
with future offerings of the class.