# ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

## Fall 2009

## Homework #8: The Cell

## 8A checkpoint due: Saturday, Dec. 5 at 11:59 PM (via T-square)

## 8B boss battle due: Friday, Dec. 11 at 2:20 PM (via T-square)

**Late policy:** The “8A checkpoint” will be graded out of 20 points;

the “8B boss battle” will be graded out of 100 points. We will

accept late submissions; however,

for every day that a submission is overdue (including weekend days),

we will subtract 10 points (for 8A) or 20 points from

the total (for 8B). **8B will not be accepted past 2:20 on Sunday,
Dec. 13, since I need to turn in grades the following day.**

(We understand that sometimes multiple assignments hit at once, or other

life events intervene, and hence you have to make some tough choices. We’d

rather let you turn something in

late, with some points off, than have a “no late assignments

accepted at all”

policy, since the former encourages you to still do the assignment

and learn something from it, while the latter just grinds down your soul.)

**Note about the unusual due dates**:

Many students I

have talked to have major projects due on Friday of “dead week.” I set the

“checkpoint” to be due the following today so people slammed with multiple

projects that are due Friday have some breathing room, but I strongly

recommend handling the checkpoint, which is meant to make sure you can compile

and run Cell programs, early. The 2:20 PM due date for the “boss battle”

on Friday of finals week is set to correspond to when our final would have

ended if we had actually had a final; this seems like a reasonable approach

since we’re not doing a final. Again, I recommend you try to get the

assignment finished early, but this gives you the flexibility to manage

your time however seems best to you.

**Review:**

- Be sure to sign up for a

CellBuzz account. Prof.

Bader, who is in charge of the Cell cluster, has to manually approve these,

so sign up as soon as possible. - Read over the

CellBuzz

User Guide carefully; be sure to note the section near the

bottom on how to specifically log into

blades running the 3.0 SDK. The system defaults

to giving you blades running the 3.1 SDK, which may be problematic since

the VMWare image with the simulator described in the points below runs

the 3.0 SDK. - Remember cell-user.cc.gatech.edu isn’t a Cell machine, so your Cell

code won’t run on it! It’s the job of cell-user to assign you an actual Cell

blade. - You can download the VMWare

Player and the Cell workshop VMWare image from

here.

You log in as root. The password is inn0vate. - To start the Cell simulator, go to the

directory`/opt/ibm/systemsim-cell/run/cell/linux`and run

`../run_gui`. Clicking the “Go” button will start the Cell Linux running

in the simulator takes a quite while to start up. - To import executables (or whatever)

from the main Linux image running in the VMWare Player into

the Cell Linux image running in the simulator, type

`callthru source directoryinmainlinux/myfilename > myfilename`. - After the import, you’ll typically need to use
`chmod`to make

a+x myfilename

the file executable. - Be sure to set various options to the “fast” modes, or you will be

waiting all eternity for the Cell Linux to boot and your program to run.

**8A Checkoff:**

Modify the

example in

`/opt/cell_class/Hands-on-30/basicDMA/DMA_getbuf_libspe2-async`

to replace the

strings “Good morning!” and “Guten Morgen!”

with your name and the name of

your

favorite

video game, respectively (for instance, “Aaron Lanterman” and “Metal

Gear Solid”.)

Submit two screenshots:

(1) a screenshot of your complete VMWare screen

showing you running the program in the simulator and (2) a screenshot of a

terminal window showing you run it on the

CellBuzz

cluster.

This is checkoff is

obviously intended to make sure you understand

the mechanics of compiling and

running Cell code out of the way, and

that you’re not still trying to figure out how to compile and run code the day

before 8B is due. (If you

manage to complete 8B by the 8A deadline, you can directly turn in 8B early

and not bother with 8A.)

This may seem trivial, and it is; but if I don’t assign this as a separate

early turn in, believe me, I will

be getting a dozen e-mails Thursday night saying “I can’t download the

image/I

can’t get VMWare Player to work/I can’t log in to Cellbuzz/I forgot to sign

up for a Cellbuzz account what should I do now/etc.”

**Warning about compiling other examples:**

The example used in 8A seems to compile fine, but

I think some of the examples in the cell_class directory may have been

made for an earlier version of the SDK, so you need to do some tweaking to

get them to work. In particular, IBM seems to have moved the locations of

needed files around just to annoy us.

**8B Boss Battle**: *Quaternions* are a computationally efficient

and conceptually intriguing

alternative to rotation matrices. To concatenate rotations represented

by quaternions, you just multiply the quaternions, just like you do

with rotation matrices. Before continuing, please read over the set

of supplemental slides on quaternions that we have prepared

(slides,

4-up), particularly the slide on

quaternion multiplication.

We will create a Cell program to do some computations with quaternions.

We will create two sets of quaternions, which we will call A and B, each

containing 1,000 quaternions, numbered 0 through 999.

You can “hard code” the number of quaternions as 1000,

i.e. declare arrays the size

you need, and not worry about “mallocing” variable amounts of space for

an arbitary numbers of quaternions.

We will

initialize quaternions 0 through 399 of set A as

A(n) = {cos([2*pi*n/400]/2),sin([2*pi*n/400]/2)*(1,0,0)},

quaternions 400 to 699 of set A as

A(n) = {cos([2*pi*(n-400)/300]/2),sin([2*pi*(n-400)/300]/2)*(0,1,0)},

and

700 to 999 of set A as

A(n) = {cos([2*pi*(n-700)/300]/2),sin([2*pi*(n-700)/300]/2)*(0,0,1)}

We will initialize quaternion set B

as

B(n) =

{cos([2*pi*n/1000]/2),sin([2*pi*n/1000]/2)*(sqrt(1/6),sqrt(1/3),sqrt(1/2)}

**Initialize your quaternions on the PPE, and then send that data to one of
the SPEs**.

On the SPE, iterate the computation

A(n) = A(n)*B(n), for all n=0…999 quaternions,

where the equal sign represents assignment (not mathematical equality)

and the * sign represents the

*quaternion multiplication*

as defined on slide 3 of

the supplemental slides, 20 times. Do all computations in single-precision

floating point. Note that A will change

as a result of these iterations, but B will remain constant. This models a

situation in which A represents the orientations of objects,

and B represents

an amount by which they rotate over a given time interval. Organize

your computation such that the 20 iterations is the “outer loop” and

the computation on the 1,000 quaternion pairs is the “inner loop” – the idea

is that in a real game, some of the other SPEs may be doing other kinds

of computations at each iteration of the “outer loop.”

**When your iterations are done, ship the data for the final state of the A**

quaternions back to the PPE.

quaternions back to the PPE.

**Do the same calculation on the PPE.**

If you are clever you can have

the PPE and the SPE calculating simultaneously; alternatively, you can

do the calculations sequentially, doing the PPE calculations first and the

SPE calculations second or vice-versa. (I’m not picky about this since

just getting the DMA working and getting a handle on the SIMD instructions

is difficult enough.)

When all the computations are done, first

have the PPE print out the final quaternions A(n)

for n=0, 600, and 999 **that were computued on the PPE**, and then have

the PPE print out the final quaternions A(n)

for n=0, 600, and 999 **that were computued on the SPE**. Note that these

results may differ slightly due to the SPEs handling floating point

numbers differently than the PPEs.

**Make extensive and clever use of SIMD intrinsics, particularly in your
quaternion multiplications,
both on the PPE and the SPE, to make
the Cell purr.**

If you do not use SIMD intrinsics and strictly

write old-fashioned

scalar code, not only will your

Cell code not purr, but you will lose points since

getting a hang of using those instrinsics is one of the main points

of the assignment. Note that although the SIMD intrinsics on

PPE and SPE are similar, they are not the same. (You don’t need to use

SIMD intrinsics in the quaternion initialization phase. You can if you want,

but I won’t be picky about that since that code is only run once.)

In setting up your code, you will need to give careful thought to how

you want to organize your data. Mike Acton, Engine Director for

Insomniac Games, is fond of saying that “data is more

important than code,” and noting this is particularly true on the Cell.

You will need to think about whether

want to adopt an “array-of-structures” (AOS) style or a

“structure-of-arrays” (SOA) style. An AOS style is natural on a GPU,

since you could put the four elements of the quaternions into the

(r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you

have “free”

swizzle operations that would make implementing quaternion multiplication

easy with that format. However, on the Cell, data permutation can be costly,

so it might be better to take an SOA approach and pack all instance of one

quaternion coordinate in one giant array, all instances

of a second coordinate in another array, etc., and have your quaternion

multiply essentially operate on four quaternion pairs at a time.

Although will you probably want to use the Cell simulator for you

initial development and debugging

(perhaps lowering the number of rotation iterations,

at first, for the sake of speed), your final run of the code should be

on the CellBuzz cluster.

Note we

are not

requiring you to use multiple SPEs or do any SPESPE communication.

For this part,

upload a ZIP file with your code, makefile, executable, and a screenshot

showing your code running on the CellBuzz

cluster, showing your results.

**Deliverables:** See the descriptions above.

Be sure to finish sufficiently in

advance of the deadline that you will

be able to work around any troubles

T-square gives you to successfully

submit before the deadline. If you have

trouble getting T-square to work,

please e-mail your compressed

file to lanterma@ece.gatech.edu,

with “MPG HW #8A” or “MPG HW #8B” and your full

name in the header line;

please only use

this e-mail submission as a last resort if T-square isn’t working.

**Playstation 3 option**: If you happen to

have Linux installed on your own Playstation 3,

you are welcome to try using it instead of the CellBuzz cluster. Please

contact me if you intent to take that route.

**Discussion board**: As an experiment, we

have set up a “HW #8” discussion board where

students can discuss the homework and in particular ask questions that

the professors and other students can help answer. (Try to avoid posting

significant chunks of code on the discussion board; those are probably best

directly e-mailed to the professors.)

**Ground rules:**

You are welcome to discuss

high-level implementation issues

with your fellow students,

but you should avoid actually looking at

one another student’s

code as whole, and

under no circumstances should you be copying

any portion of another student’s code.

However, asking another student to

focus on a few lines of your code to discuss

why you are getting a particular

kind of error is reasonable.

Basically, these “ground rules” are

intended to prevent a student from

“freeloading” off another student,

even accidentally, since they

won’t get the full yummy nutritional

educational goodness out of the assignment if they do.

**Looking at
code from homeworks done in previous years is strictly prohibited.**