# ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

## Fall 2008

## Homework #7: The Cell

## 7A checkpoint due: Friday, Dec. 5 (in class at the start of class)

## 7B boss battle due: Sunday, Dec. 7 at 5:00 PM (via T-square)

**Late policy:** The “7A checkpoint” will be graded out of 20 points;

the “7B boss battle” will be graded out of 100 points. We will

accept late submissions; however,

for every day that a submission is overdue (including weekend days),

we will subtract 10 points (for 7A) or 20 points from

the total (for 7B).

(We understand that sometimes multiple assignments hit at once, or other

life events intervene, and hence you have to make some tough choices. We’d

rather let you turn something in

late, with some points off, than have a “no late assignments

accepted at all”

policy, since the former encourages you to still do the assignment

and learn something from it, while the latter just grinds down your soul.)

**Note about the unusual Sunday due date**:

Having anything due during finals

week has proved to be wildly unpopular in previous semesters; I thought

it would be good to give people more time, and give them flexibility in

managing their time, but it seems that it

distracts people from studying for finals.

In particular,

having somethig due near the end of finals week was said to put people with

flight plans, family coming into town for graduation, etc. at a severe

disadvantage. On the other hand, many students I

have talked to have projects

due or exams on Friday of dead week, were begging to give them at

least the

weekend before finals

to work on it, so they

could spend this week working on their other projects.

The Sunday 5:00 date/time is a compromise between those two

positions; you an turn it in and then spend Sunday evening studying for your

Monday exams.

If you do not like the Sunday due date, simply pretend that it is

due the preceeding Friday.

**Review:**

- You can download the VMWare

Player and the Cell workshop VMWare image from

here.

You log in as root. The password is inn0vate. - To start the Cell simulator, go to the

directory`/opt/ibm/systemsim-cell/run/cell/linux`and run

`../run_gui`. Clicking the “Go” button will start the Cell Linux running

in the simulator takes a quite while to start up. - To import executables (or whatever)

from the main Linux image running in the VMWare Player into

the Cell Linux image running in the simulator, type

`callthru source directoryinmainlinux/myfilename > myfilename`. - After the import, you’ll need to use
`chmod`to make

a+x myfilename

the file executable. - Be sure to set various options to the “fast” modes, or you will be

waiting all eternity for the Cell Linux to boot and your program to run.

**7A Checkoff:**

Modify the

example in

`/opt/cell_class/Hands-on-30/basicDMA/DMA_getbuf_libspe2-async`

to replace the

strings “Good morning!” and “Guten Morgen!”

with your name and the name of

your

favorite

video game, respectively (for instance, “Aaron Lanterman” and “Metal

Gear Solid”.)

Hand in *hard copies* of two screenshots:

(1) a screenshots of your complete VMWare screen

showing you running the program in the simulator and (2) a screenshot of a

terminal window showing you run it on the

CellBuzz

cluster.

This is checkoff is

obviously intended to make sure you understand

the mechanics of compiling and

running cell code out of the way, and

that you’re not still trying to figure out how to compile and run code the day

before 7B is due. (If you

manage to complete 7B by the 7A deadline, you can directly turn in 7B early

and not bother with 7A.)

This may seem trivial, and it is; but if I don’t assign this as a separate

early turn in, believe me, I will

be getting a dozen e-mails Saturday night saying “I can’t download the

image/I

can’t get VMWare Player to work/I can’t log in to Cellbuzz/I forgot to sign

up for a Cellbuzz account what should I do now/etc.”

**Warning about compiling other examples:**

The example used in 7A seems to compile fine, but

I think some of the examples in the cell_class directory may have been

made for an earlier version of the SDK, so you need to do some tweaking to

get them to work. In particular, IBM seems to have moved the locations of

needed files around just to annoy us.

**7B Boss Battle**: *Quaternions* are a computationally efficient

and conceptually intriguing

alternative to rotation matrices. To concatenate rotations represented

by quaternions, you just multiply the quaternions, just like you do

with rotation matrices. Before continuing, please read over the set

of supplemental slides on quaternions that we have prepared

(slides,

4-up), particularly the slide on

quaternion multiplication.

We will create a Cell program to do some computations with quaternions.

We will create two sets of quaternions, which we will call A and B, each

containing 1,000 quaternions, numbered 0 through 999.

You can “hard code” the number of quaternions as 1000,

i.e. declare arrays the size

you need, and not worry about “mallocing” variable amounts of space for

an arbitary numbers of quaternions.

We will

initialize quaternions 0 through 299 of set A as

A(n) = {cos([2*pi*n/300]/2),sin([2*pi*n/300]/2)*(1,0,0)},

quaternions 300 to 599 of set A as

A(n) = {cos([2*pi*(n-300)/300]/2),sin([2*pi*(n-300)/300]/2)*(0,1,0)},

and

600 to 999 of set A as

A(n) = {cos([2*pi*(n-600)/400]/2),sin([2*pi*(n-600)/400]/2)*(0,0,1)}

We will initialize quaternion set B

as

B(n) =

{cos([2*pi*n/1000]/2),sin([2*pi*n/1000]/2)*(sqrt(1/6),sqrt(1/3),sqrt(1/2)}

Initialize your quaternions on the PPE, and then send that data to one of

the SPEs.

On the SPE, iterate the computation

A(n) = A(n)*B(n), for all n=0…999 quaternions,

where the equal sign represents assignment (not mathematical equality)

and the * sign represents the *quaternion multiplication*

as defined on slide 3 of

the supplemental slides, 2,000 times. Note that A will change

as a result of these iterations, but B will remain constant. This models a

situation in which A represents the orientations of objects,

and B represents

an amount by which they rotate over a given time interval. Organize

your computation such that the 2,000 iterations is the “outer loop” and

the computation on the 1,000 quaternion pairs is the “inner loop” – the idea

is that in a real game, some of the other SPEs may be doing other kinds

of computations at each iteration of the “outer loop.”

When

your iterations is done, ship the data for the final state of the A

quaternions back to the PPE.

Then, on the PPE, compute a 4×4 matrix for each quaternion A(n)

that contains a rotation matrix in the upper 3×3 submatrix derived

from the quaternion representations (see Slide 7 of supplemental slides.)

Put 0s along the bottommost row and

rightmost row, except for a 1 in the lower right corner. The idea here

is that this is the representation that would be passed to the RSX (if we

had access to the RSX). Each row of the matrix should consist of a

128-bit word with the usual x,y,z,w ordering of four 32-bit bytes, since

that is how the data would be typically be presented

to the GPU. (As an aside, quaternion

elements are usually labeled w,x,y,z, so don’t let that confuse you;

they’re all just labels.)

When all the computations are done, have the PPE print out the quaternion

and associated rotation matrix for quaternions 0, 500, and 999.

**Make extensive and clever use of SIMD intrinsics throughout your code,
both on the PPE and the SPE, to make
the Cell purr.**

If you do not use SIMD intrinsics and strictly

write old-fashioned

scalar code, not only will your

Cell code not purr, but you will lose points since

getting a hang of using those instrinsics is one of the main points

of the assignment.

In setting up your code, you will need to give careful thought to how

you want to organize your data. Mike Acton, Engine Director for

Insomniac Games, is fond of saying that “data is more

important than code,” and noting this is particularly true on the Cell.

You will need to think about whether

want to adopt an “array-of-structures” (AOS) style or a

“structure-of-arrays” (SOA) style. An AOS style is natural on a GPU,

since you could put the four elements of the quaternions into the

(r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you

have “free”

swizzle operations that would make implementing quaternion multiplication

easy with that format. However, on the Cell, data permutation can be costly,

so it might be better to take an SOA approach and pack all instance of one

quaternion coordinate in one giant array, all instances

of a second coordinate in another array, etc., and have your quaternion

multiply essentially operate on four quaternion pairs at a time.

Although will you probably want to use the Cell simulator for you

initial development and debugging

(perhaps lowering the number of rotation iterations,

at first, for the sake of speed), your final run of the code should be

on the Cellbuzz cluster.

When you run your code for the screenshot, run it via

“time yourprogramnamehere” at the Linux prompt. This will present

some extremely course timing information. (The Cell SDK provides a wide

variety of sophisticated profiling tools, but we did not have time to

cover them this semester.)

Note we

are not

requiring you to use multiple SPEs or do any SPESPE communication.

For this part,

upload a ZIP file with your code, makefile, executable, and a screenshot

showing your code running on the CellBuzz

cluster and which shows the desired

output

and timing information.

**Deliverables:** See the descriptions above.

For both 7A and 7B, please tell us an approximate number of hours

you spent working it,

as well your thoughts on the homework, particularly

suggestions for improving it in the future. (If you don’t have any suggestions,

that’s OK, just tell us an approximate number of hours.)

Be sure to finish sufficiently in

advance of the deadline that you will

be able to work around any troubles

T-square gives you to successfully

submit before the deadline. If you have

trouble getting T-square to work,

please e-mail your compressed

file to lanterma@ece.gatech.edu,

with “MPG HW #7B” and your full

name in the header line;

please only

use this e-mail submission as a last resort if T-square isn’t working.