I posted this to the mailing list a while back, but while I'm letting the 3D renderer design settle, I've moved on to video to fill the time. This is a copy/paste of the original post. Note that this post describes the general idea but that the actual implementation is already beginning to differ in some of the fine details.
This video controller design is intended to be both insanely simple and incredibly flexible. We do this by trading off logic gates for memory cells. The video controller uses an embedded memory block that contains what amounts to a simple program.
There are two major things that the video controller must do: request blocks of memory and control video output signals.
My proposed design is a processor which executes one instruction per pixel clock cycle. Each instruction has explicit control over video timing signals and can optionally cause some other control- or data-oriented event to happen. Those events include:
- Calling a subroutine (in a loop)
- Returning from a subroutine
- Unconditional branch
- Requesting pixels from the framebuffer
- Sending pixels (hard-coded or from queue)
- Asserting interrupt signal
To see where I'm going with this, let's consider a high-level pseudo-code example for a progressive-scan 640x480 mode with positive-assertion sync signals. We'll get into implementation later.
vertical_start:
# 16 scanlines of vertical sync
+vsync, +hsync, call vertical_sync 16 times
# first 15 lines of vertical back porch
+hsync, call vertical_back_porch 15 times
# request first active scanline data (one line early)
+hsync, set video address to 0
+hsync, request 640 pixels
# remaining cycles of last vbp line
+hsync, call vertical_back_porch_short 1 time
# first 479 lines of active video
+hsync, call vertical_active_video 479 times
# last line of active video, omitting the data request
+hsync, nop for two cycles
+hsync, call vertical_active_video_short 1 time
# first 15 lines of vertical front porch
+hsync, call vertical_front_porch 15 times
# last line, minus the time it takes to loop this routine
+hsync, call vertical_front_porch_short 1 time
# loop
jump to vertical_start
vertical_sync:
# horizontal sync
+vsync, +hsync, nop for 15 cycles
# horizontal back porch, active period, and front porch, minus 2
+vsync, nop for 670 cycles
# return
+vsync, return
vertical_back_porch:
vertical_front_porch:
# horizontal sync
+hsync, nop for 2 cycles
vertical_back_porch_short:
+hsync, nop for 13 cycles
# horizontal back porch, active period, and front porch, minus 2
nop for 670 cycles
return
vertical_front_porch_short:
# hsync
+hsync, nop for 15 cycles
# hbp, h-active, hfp, minus 2
nop for 670 cycles
return
vertical_active_video:
# request data for following scanline
+hsync, increment video address by 640
+hsync, request 640 pixels
vertical_active_video_short:
# rest of hsync
+hsync, nop for 13 cycles
# horizontal back porch
nop for 16 cycles
# active video
+active, send pixels for 640 cycles
# horizontal front porch
nop for 14 cycles
return
Phew! That took a while to write. If this seems overly complicated to you, keep in mind that for 99% of all video modes, pre-written code will be available which computes this for you based on the above template. Some of it may seem a bit unintuitive, like having to subtract off cycles here and there to make up for branches, returns, etc. I can't even be sure I got all my numbers right (not important to check right now). Also, there may be some modes with degenerate cycle counts that become too difficult to do.
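As a rough check of the cycle bookkeeping in the template above, here is a minimal Python sketch that adds up the cycles each kind of scanline consumes. It assumes that every instruction costs exactly one pixel clock per iteration, so that a call and its matching return add 2 cycles (which is how I read the "minus 2" comments). The names `CALL_COST`, `RETURN_COST`, `line_cycles` and `LINES` are made up for this illustration.

```python
# Assumption (not stated outright in the post): every instruction costs one
# pixel clock per iteration, so a call and its matching return add 2 cycles.
CALL_COST = 1
RETURN_COST = 1


def line_cycles(caller_cycles, body_cycles):
    """Cycles for one scanline: caller-side work, the call, the body, the return."""
    return caller_cycles + CALL_COST + sum(body_cycles) + RETURN_COST


# Cycle counts copied from the pseudo-code template.
# The first argument is the number of cycles the caller spends on that line
# before the call (for example the two address/request instructions).
LINES = {
    "vertical_sync": line_cycles(0, [15, 670]),
    "vertical_back_porch": line_cycles(0, [2, 13, 670]),
    "vertical_back_porch_short": line_cycles(2, [13, 670]),
    "vertical_active_video": line_cycles(0, [1, 1, 13, 16, 640, 14]),
    "vertical_active_video_short": line_cycles(2, [13, 16, 640, 14]),
}

for name, cycles in LINES.items():
    print(f"{name}: {cycles} cycles")
```

Under these assumptions every one of these lines comes out to the same length, which is the property the template needs. By the same count, vertical_front_porch_short as written (15 + 670, plus call and return) followed by the closing jump would run one cycle longer, so it probably needs one fewer nop cycle, but that depends on the cost model assumed here.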
One possible alteration to the design would be to make concurrently-running horizontal and vertical engines so that the numbers come out to more expected values. Indeed, I think I may do that for my next draft, but this is just to incite discussion about the basic idea.
What I'm intending here is to make it possible to do any kind of video mode you want, including interlaced with proper 1/2 first and last scanline, serration pulses, etc. Even DPVL packets can be encoded with the packet header inline in the program rather than requiring that we put it in the framebuffer or the cursor.
Here's a sketch of the instruction set:
Bits [31:28] are connected to external sync signals. The rest are defined based on the instruction.
Bit  Meaning
---  -------------------
31   Horizontal sync
30   Vertical sync
29   Active video
28   other
There are 8 basic instructions:
Command  Meaning
-------  -----------------
0        nop
1        call subroutine
2        jump/return
3        interrupt/hard-code pixel
4        fetch address
5        increment address
6        fetch count
7        dequeue and send pixel
Here is how the bit fields work out for the instructions:
NOP [31:28 signals][27:25==0][12:0 iteration count]
control only signals for N cycles
CALL [31:28 signals][27:25==1][22:9 count][8:0 program address]
repeatedly call subroutine N times
JUMP [31:28 signals][27:25==2][24==0][8:0 program address]
RETURN [31:28 signals][27:25==2][24==1]
INT [31:28 signals][27:25==3][24==0]
PIXEL [31:28 signals][27:25==3][24==1][23:0 hard-coded pixel]
ADDR [31:28 signals][27:25==4][24:0 address]
address is shifted left by 3 to make it 28 bits
INC [31:28 signals][27:25==5][13:0 address increment]
no shift applied to increment
FETCH [31:28 signals][27:25==6][13:0 fetch count]
SEND [31:28 signals][27:25==7][13:0 send count]
Actually, these can be rearranged a bit better, considering which commands don't have a data field, etc.
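To make the bit fields listed above concrete, here is a minimal Python sketch of an encoder that packs each instruction into a 32-bit word. It follows the draft layout exactly as given. The function and constant names are invented for this example, the count fields are assumed to hold the literal value N (not N-1), and the unit of the address is left open because the post does not specify it.

```python
# Signal bits [31:28], as listed in the signal table.
SIG_HSYNC = 1 << 31
SIG_VSYNC = 1 << 30
SIG_ACTIVE = 1 << 29
SIG_OTHER = 1 << 28

# Command numbers for bits [27:25].
OP_NOP, OP_CALL, OP_JUMP_RET, OP_INT_PIXEL, OP_ADDR, OP_INC, OP_FETCH, OP_SEND = range(8)


def _field(value, width, name):
    """Range-check a value against its field width (hypothetical helper)."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")
    return value


def _word(signals, opcode, payload=0):
    return (signals & 0xF0000000) | (opcode << 25) | payload


def nop(signals, count):
    return _word(signals, OP_NOP, _field(count, 13, "count"))


def call(signals, count, target):
    payload = (_field(count, 14, "count") << 9) | _field(target, 9, "target")
    return _word(signals, OP_CALL, payload)


def jump(signals, target):
    return _word(signals, OP_JUMP_RET, _field(target, 9, "target"))


def ret(signals):
    return _word(signals, OP_JUMP_RET, 1 << 24)


def interrupt(signals):
    return _word(signals, OP_INT_PIXEL)


def pixel(signals, rgb):
    return _word(signals, OP_INT_PIXEL, (1 << 24) | _field(rgb, 24, "rgb"))


def addr(signals, full_address):
    # The stored field is the 28-bit address with its low 3 bits dropped.
    if full_address % 8:
        raise ValueError("address must be a multiple of 8")
    return _word(signals, OP_ADDR, _field(full_address >> 3, 25, "address"))


def inc(signals, amount):
    return _word(signals, OP_INC, _field(amount, 14, "increment"))


def fetch(signals, count):
    return _word(signals, OP_FETCH, _field(count, 14, "fetch count"))


def send(signals, count):
    return _word(signals, OP_SEND, _field(count, 14, "send count"))


# Example: "+active, send pixels for 640 cycles"
print(hex(send(SIG_ACTIVE, 640)))
```

A generator for a given video mode would emit a list of such words into the controller's program memory. If the fields get rearranged as suggested above, only the shifts and widths in these helpers would change.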
neat
in a single word?
NEAT
thinking about Cell....
Well, that's it. I've read your interview through Slashdot. Great work, Timothy.
I've known about this since last year but never imagined, until today, that it had moved so fast.
I work in the cinema post-production industry and I have always hated the lack of support for screen display at more than 8 bits per color channel. I know my CRTs support more than that, but the video cards don't, no matter what they advertise.
Reading your comments about some limitations on 3D hardware support, I came to this question: have you planned to include (sometime in the future) a CELL processor to help your GPU with such tasks?
I firmly believe it would be a kick-ass product, because it would also be very useful for other tasks such as accelerating video compression, audio, and many non-video-related tasks. So this board could sell like french fries!!!
Wish you the best.
Thank you for your work, and thanks to your company.
Well, something like that.
So far, it looks like the CELL processor is a lot of hype. Not that it wouldn't be fast when VERY carefully programmed, but it's just another way of doing parallel processing. I could be wrong, though.
For graphics, you need a specialized processor that's fast at doing vector processing. I'm always going to have my eye on that, and I am thinking about future designs having that for geometry processing. But it turns out that vertex/geometry processing isn't a huge burden for the GPU and sometimes, it's faster to do it in the CPU than to have a specialized geometry engine in the GPU. That's something to be considered when the time comes.
That's right Vector processing
Vector processing is exactly what a Cell processor is best at (at least that is what it says, as far as I understand its patent). That is why I thought about it.
It will run at 4.6 GHz and will have 9 cores inside it. So I guess, given the high number of units they are expecting to produce, it will not be too expensive.....