The only application I can currently dream up where base+index might be useful is if INDA points to a block of registers allocated per task in COG RAM, and you want it to choose the right block using the task ID as either the base or the offset. Do INDA and register remapping already work together, or does INDA always dereference to 0-511 as absolute addresses, without the remapping? It may already have the capability without needing the base+index approach.
That's the only variable case that I can come up with, too. It would be coded like this:
        TASKID  reg             ' reg = current task ID
        SHL     reg,#4          ' times 16 registers per task
        LODINDA reg             ' point INDA at that task's block
That would figure out where a task's set of 16 registers is actually located.
INDA/INDB override register remapping, so you need the actual address. Register remapping only affects hard-coded register addresses.
Anyway, all my use cases just involve shifting a value left by some static number of bits. Maybe, instead of an index+base, we should do base<<shift using a 'D,S/#' instruction. That might be more useful.
If INDA/INDB get used a lot for random indirect accesses with task separation, then the base+index version could be useful; but if they are used to set up the base once and then mostly increment/decrement from there, the extra per-task-ID setup overhead is probably not such a big deal. I'm guessing the latter approach is okay if you want to reserve some more instruction slots for putting to better use later.
Let's take the simple approach for now, using LODINDx reg, and see where that leads us. We are all eager to try hubexec mode, and actual use will hopefully show up any deficiencies.
BTW, we don't seem to have a SETINDx D/#, so wouldn't that be a better mnemonic?
No, because SETINDA # already exists, and it executes in stage 2 of the pipeline, so that its effect is immediate. These new instructions are just to get register contents into the INDA/INDB pointers, and registers are only available in stage 4, so these instructions don't take effect until the 3rd instruction after the LODINDA/B.
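To make the timing concrete (hypothetical register names, but the latency is as just described):

        LODINDA reg             ' reg is read in pipeline stage 4
        add     x,#1            ' 1st instruction after: INDA still has its old value
        add     x,#1            ' 2nd instruction after: still the old value
        mov     data,inda       ' 3rd instruction after: first one to see the new INDA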
Could AUGD #someval be used to add the lower 9 bits to the SETINDx D/#, instead of extending the upper bits as AUGD normally does? e.g.

        AUGD    #base           ' only uses the lower 9 bits
        SETINDx reg             ' the lower 9 bits of AUGD are added to the value in reg and stored in INDx

Maybe we should have a pseudo-name for AUGD here, called BASED or similar? Could this work simply to get a 9-bit cog address offset?
Does this mean we could just use SETINDA # first to set the base (or index), and have an instruction called ADDINDA D to add in a register which could be the index (or base)? So you don't create LODINDA but ADDINDA instead. Does that work better?
Having the ability to add positive or negative values may also allow INDA/INDB to be rolled back or forward after incrementing/decrementing, which may be useful in cases where you search, reach the end, and have to backtrack by one or some other known value.
SETINDA/B works in stage 2 of the pipeline, so that it takes immediate effect. These usages are supported:
SETINDA #value
SETINDA ++value
SETINDA --value
The new instructions are just for loading INDA/B with register values, not constants, as SETINDA/B/S already takes care of constant cases:
LODINDA D
I think loading D is the most important, but it might be good to set it up so that we load D, plus a constant, or by some constant left-shift. If the 2nd operand is #0 in either of those cases, it would effectively just load D.
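Hypothetically (none of this syntax is final), the two candidate forms would look like:

        LODINDA base,#index     ' add form:   INDA = base + index
        LODINDA taskreg,#4      ' shift form: INDA = taskreg << 4 (the task-ID case in one instruction)
        LODINDA reg,#0          ' in either form, #0 just loads the register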
I just realized something that could be very useful when the INDA/INDB registers are common to the COG rather than separated per task ID.
With a common INDA/INDB, we can have both hardware tasking and a simple small FIFO inside COG RAM which automatically wraps according to the configured FIXINDx boundaries. I see this as potentially useful in a PASM driver for real-time streaming of I/O pin data to/from hub memory. One of the COG's hardware tasks can be the data producer (reading directly from I/O pins, for example) and another hardware task the consumer (acting as the hub-access task, writing to hub RAM in this case). The producer task writes data into the COG FIFO using INDA, and the consumer task reads it back via INDB before writing it to hub RAM.
With very careful coding that avoids any pipeline stalls, and a 1:8 or 1:16 timeslot allocation applied to the hub-access task, I think the hub-access task can, once started, remain synchronized to the COG's hub window every time it runs, while the I/O pin task does something asynchronous to the hub window yet still gets its data sent off to hub memory when required. This only really benefits cases where the I/O pin sampling task can already compensate for the known timeslots lost to the hub-access task in its pin-sampling period, but cannot tolerate any unknown hub-window jitter or latency relative to its first randomly arriving bit, which would skew its input or output timing from that point on.
I'm actually thinking of applications like USB data packet reception, assuming we manage to get that very useful GETXP instruction we all want. The producer task could capture bits off the wire at the incoming rate, accumulate them into bytes, and feed them into the INDA/INDB FIFO, always with a single-cycle register write. The consumer task reads this data from the FIFO when it sees the INDA/INDB pointers differ, and writes it to hub RAM a little later than it was produced, but again with a single-cycle write, because it is aligned to the hub window. We would have to code things carefully to always maintain the consumer's alignment to a hub window, avoiding all multi-cycle instructions that would add jitter to the task scheduling. I am assuming that kind of coding is possible using REP loops and delayed JMPs, etc., but I don't know if it always is. It is a big assumption, and perhaps it is impossible to write the required code like that.
We could also use stack RAM for the FIFO between tasks, instead of INDA/INDB and COG RAM, but that stack RAM might itself become very useful for other purposes, such as a 256-entry CRC lookup table if we don't get extra CRC support in hardware. Also, if you use stack RAM, I think you have to burn all 256 entries for the FIFO pointer wrapping to work.
@Chip, for implementing a FIFO in COG RAM, it would be very good to be able to quickly compare the INDA and INDB pointer values and set the zero flag if they match. That way the reader can see, in a single operation, whether there is more data left in the FIFO. I can't find how you get the actual pointer values of INDA and INDB in the version of the P2 document I was reading, so do we also need a "CMPINDS"-type instruction for doing this comparison and setting the Z flag accordingly? If not, how else can such an INDA/INDB pointer comparison be done today?
Roger.
Oh WOW.
It's taken me nearly 7 days to get caught up on just this one thread! And now it seems that I've forgotten my original question. Go figure. I can honestly say I've gotten a good class in processor design and theory just by reading the suggestions and the resulting discussion either proving or disproving each one's feasibility or "use-ability". What a great learning experience. Now I guess I'll go about catching up elsewhere and see if I can remember my original question.
KK
This is a mind bender because INDA/INDB incrementing/decrementing takes place two instructions before execution occurs. What if you just had a register that the producer increments and the consumer decrements? If it's 0, there's nothing to pull out. Also, keep in mind that INDA/INDB-using instructions cannot be conditional, so if you don't want them to execute, you need to branch around them.
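For what it's worth, a sketch of that counter scheme (hypothetical register names, assuming post-increment operand forms along the lines of inda++/indb++, and ignoring the early-increment pipeline quirk just mentioned; the exact syntax may differ):

' producer task: push one item into the FIFO
push        mov     inda++,data     ' store item, advance write pointer (wraps per FIXINDA)
            add     count,#1        ' publish it to the consumer

' consumer task: pop one item, if any
pop         cmp     count,#0  wz    ' anything available?
    if_z    jmp     #pop            ' no: loop (the indb read can't be conditional, so branch around it)
            mov     data,indb++     ' fetch item, advance read pointer
            sub     count,#1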
I was thinking of this whole issue, too, this morning. The only hub read instruction that takes one clock is RDWIDE, since it doesn't return a value to a D register. I've thought for a while that it would be good to somehow decouple timing of hub accesses from instructions, but trying to do so creates a wall of problems.
Looks good to me. However, since PNut can easily distinguish between "#value", "++value", "--value" and a valid D (register number or symbolic name), can I suggest that it simply be called:
SETINDA D
SETINDB D
I don't think there is any need for another mnemonic, and this would fit the syntax of the other ops. Using a different bit encoding is irrelevant in this case, and it would increase readability.
For writes, it is easily done with a one-long buffer, so WRxxxx actually writes to a holding buffer, which is written out at the next available hub cycle. This would turn the first WRxxxx in a sequence into a single-cycle op. (For P3, we could do "write gathering", "delayed writes" and all sorts of other nice tricks.)
For reads... too big a can of worms... actually a 55lb barrel of worms...
RDxxxC is decoupled most of the time. On P3 more dcache lines will help greatly.
On P2 "green" hub "recycling" (ie using unclaimed slots, previously suggested as "hungry mode" is now renamed "recycling mode") will help IMMENSELY.
I am concerned that by using the same mnemonic, people will not be reminded that a D operand will only take effect on the 3rd instruction after, while all other forms take effect instantly.
In the CORDIC block, it was necessary to make a special arithmetic shifter that could produce a result fast enough to feed the adder in time, so that the CORDIC block could operate as fast as the rest of the cog. This was done by pre-decoding the mux selector bits that fed the shifter, and then storing them in flipflops. Not only were those signals going to be ready at the start of the next cycle, when they would be needed, but I even made multiple copies of them, so that for every 8 bits that needed mux'ing, I had a unique flipflop Q output to drive their mux selector inputs. This made things way faster because it reduced the load that each signal would have to drive, enabling them to toggle their loads faster.
There are two things that take time in silicon logic: computation through gates (AND/OR/ADD/MUL) and buffering (or amplification) of signals, so that they can, in turn, drive many other inputs (fan-out), or even just long wires en route to other inputs. Buffering is accomplished by simply running a signal through a series of inverters of increasing size, and for optimal speed, an increase factor of e (~2.7, the base of the natural log) can be used. Practically, buffering inverter chains just use a factor of 3, because it's an integer close to e.
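To put rough numbers on that (my arithmetic, not from the post): with a 3x taper, a load 81 times the source gate's drive takes a four-inverter chain sized 1x, 3x, 9x, 27x, so every stage sees the same 3x fan-out. Tripling the load again costs only one more inverter delay, since the stage count grows as log base 3 of the load.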
Anyway, the ALU result mux's in the Prop2 are huge, with the largest (slowest) having upwards of 64 inputs, each 34 bits wide (Z override, Z, C, 32-bit result). For some reason, I never thought to apply the idea of increasing parallelism in these mux's selector signals, while these mux's have long formed the bulk of critical paths in the design. It's very inexpensive to do this, too, since even the biggest mux just uses 6 unique signals to select 1 of 64 results. So, at the cost of a mere 48 new flipflops, I increased the mux driver signals by a factor of 4, so that instead of each signal routing 34 bits, four signals will route ~9 bits each. This should increase the Fmax of the design. It's compiling right now. I'm anxious to see how the timing improves on the FPGA, as this will translate directly into an improvement in the speed of the synthesized logic in the final chip.
In doing all that synthesis work before our last tape-out, I realized that speeding up even non-critical-path circuits makes the overall design faster because it allows the design compiler to spread them out more (while still meeting timing requirements), so that critical-path circuits can be grouped more closely together, reducing their need for signal-buffering and long wiring.
I'm about half-way understanding this. How easily can you show a before/after bit of verilog to demonstrate this?
Thanks for sharing that, Chip; it is amazing how many things you have going on and can keep it all in place.
Chris Wardell
Well, it's about all I've been doing for a while now, so it has become easier to think about.
You guys have opened a huge new dimension on this effort. My own brain is a lot more relaxed now because I'm not the only one thinking about all the details, anymore. This has made things a lot less stressful and more fun.
Chip,
Does any of this allow you to reduce the number of clocks needed for the 64bit Mul/Div stuff, or is that a different issue (I suspect so)?
On a separate note, what do you think about having the Spin2 compiler generate PASM2 code directly (using hubexec and/or cogexec modes as needed), instead of doing an interpreter? It seems very doable, and probably simpler overall than what we (Jeff, you, and I) have talked about before. I think it would make supporting all the features (like tasks, etc.) easier and cleaner, too. It just seems like a natural thing to do.
The multiplier and divider are already radix-4 designs, so they process 2 bits of multiplicand or quotient at a time. To make them any faster would make them a lot bigger.
I agree about generated PASM instead of interpreted code; at least where speed is important, that would be best. Byte codes are very memory-efficient, but much slower.
Here is how a mux might be coded with a single set of mux selector signals:
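(The Verilog listings didn't survive into this copy of the thread; what follows is a minimal reconstruction of the idea, scaled down to a 32-bit result with a 6-bit selector. All names and widths are my assumptions, SystemVerilog-flavored for brevity, not Chip's actual code.)

// clk, sel_next, and the 64 candidate results come from elsewhere in the cog
wire [31:0] source [0:63];

// one registered copy of the 6-bit selector drives all 32 result bits
reg  [5:0]  sel;
reg  [31:0] result;

always @(posedge clk)
    sel <= sel_next;                    // pre-decoded selector, stored in flipflops

always @*
    result = source[sel];              // 64-to-1, 32-bit-wide result mux

Here it is with 4x the selectors:

// four duplicate copies of the selector flops, all loaded with the same value,
// but each one driving only an 8-bit group of the result (same declarations as above)
reg [5:0] sel0, sel1, sel2, sel3;

always @(posedge clk)
    {sel0, sel1, sel2, sel3} <= {4{sel_next}};

always @* begin
    result[7:0]   = source[sel0][7:0];
    result[15:8]  = source[sel1][15:8];
    result[23:16] = source[sel2][23:16];
    result[31:24] = source[sel3][31:24];
end

The mux itself is unchanged; the gain is that each selector copy now drives a quarter of the load, so its edges toggle faster.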
If I'm understanding correctly, this approach doesn't decrease the size of the mux circuitry (each of the 32 outputs is still being driven by 6 inputs). Rather, by splitting the outputs into 4 groups (using duplicate selector flops for each group to drive 1/4th the number of outputs), each output is being selected by a signal that is ~4x stronger, meaning the propagation of the select signals are effectively faster (i.e. the select signal edge rises/falls more quickly). Does that sound correct?
(Edit: Also, I just realized that this reduces the fanout of each of the select signals, which I am guessing contributes to the increased flexibility in placement. I guess this could be taken to the extreme of 32 groups, one for each output. I wonder how you figure out the correct balance.)
You got it.
I imagine much of the size win of "byte code" could be achieved by making a kernel of functions that the main code calls (the opposite of inlining small stuff). Thinking more about it, we could quite easily have one of those functions be a "byte code" interpreter, in the form of treating the next X bytes as jump-table indices. So the Spin2 code would just be PASM2 along with this byte-based jump-table function, which is essentially the byte-code interpreter; but thinking about it this way lends itself to making that table custom for the program being compiled, based on what it uses (which circles back around to the things we talked about way back when). Ultimately, it always generates PASM2 code; it just has some helpers to allow it to be small.
I could see the GCC stuff doing the same thing for its CMM model (or just when you optimize for code size vs speed).
Roy, this sounds similar to something I tried on P1 called trimspin, where the interpreter only supported the bytecodes that a program used. I basically unrolled the ROM version of the interpreter so that each bytecode had its own piece of optimized code, and I used a jump table to jump to the code. I'm currently working on this for P2, but instead of using a subset of the bytecodes, all of the bytecodes will be supported. The code for the most frequently used bytecodes will reside in cog memory, and the rest of the interpreter executes from hub RAM.
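The dispatch at the core of such an interpreter can be tiny. A sketch (all names hypothetical, cog-exec, and assuming the register-indirect JMP works as it did on P1, jumping to the address held in the register):

loop        rdbyte  op,pc           ' fetch the next bytecode from hub RAM
            add     pc,#1
            add     op,#table       ' op = cog address of this bytecode's table entry
            jmp     op              ' jump into the table...

table       jmp     #h_op00         ' ...which immediately bounces to the handler;
            jmp     #h_op01         ' one entry per bytecode, and each handler
            jmp     #h_op02         ' ends with 'jmp #loop'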
I can only barely get a grip on what you are describing, but I imagine it's akin to building a barrel shifter to shift bits in a word: rather than shifting them along through a chain of flops, one does this: http://en.wikipedia.org/wiki/Barrel_shifter. Or perhaps it's nothing like that. :)
In one fell swoop you may have gained more speed, with no added complexity for the user, than all the proposed little tweaks for hungry COG HUB slots and other warts that might get added to the design in the name of one percent gain in some special cases.
I love it.
I wonder how much faster the FPGAs, and more importantly, the real silicon will run because of this?
Can't wait for your results...
Well, it turned out to be less than glorious, after all. Quartus was eliminating duplicate registers. I just switched that feature back 'off'. It's recompiling now. I don't know how that switch had been turned on, because I know that when I was designing the CORDIC circuit, I was seeing the benefit of duplicate registers. We'll see what happens with this compile.
Or skip the hardware multipliers and such and do it "long hand", as on the P1, thus allowing your other threads to run free.
Anyone know how much slower a 32-bit multiply is when done manually rather than using MUL32?
That leads to the conclusion that, for ease of use, maximum flexibility, etc., one would not use the math hardware at all.
I can imagine a large multi-threaded (preemptively or otherwise) program running from HUB, as built by GCC for example. Performance will already be sucky enough that the hardware math offers almost no advantage. Might as well not bother with it.
Because the big multiplier and divider are removed on the DE0-Nano, I had to find an alternative.
To my surprise, the mult32 solution I found is even faster than the big hardware multiplier! And it should be multitasking-safe.
The following subroutine multiplies two 32-bit factors in registers A and B and returns a 32-bit result in RES. The trick is to split the 32-bit numbers into 16-bit halves and use the MUL instruction, which can do 16x16 in 2 clock cycles. With 3 MULs and a bit of shifting and adding, we get a 32x32-bit result:
mult32                          ' a * b -> res (low 32 bits)
        getword res,a,#0        ' 1   res = low word of a
        getword tmp,b,#0        ' 1   tmp = low word of b
        getword a,a,#1          ' 1   a   = high word of a
        getword b,b,#1          ' 1   b   = high word of b
        mul     a,tmp           ' 2   a   = aHI * bLO
        mul     b,res           ' 2   b   = bHI * aLO
        mul     res,tmp         ' 2   res = aLO * bLO
        shl     a,#16           ' 1   align first cross product
        retd                    ' 1   delayed return: the next 3 instructions still execute
        shl     b,#16           ' 1   align second cross product
        add     res,a           ' 1
        add     res,b           ' 1   = 15 cycles
This seems to work also with signed numbers in A and B.
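Why it works, including for signed values (just arithmetic): with a = aH*2^16 + aL and b = bH*2^16 + bL,

    a*b = aH*bH*2^32 + (aH*bL + bH*aL)*2^16 + aL*bL

and a 32-bit register keeps only the result modulo 2^32, so the aH*bH term vanishes entirely. Two's-complement and unsigned multiplication agree in the low 32 bits, which is why signed inputs come out right.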
If you do the same subroutine with the big hardware multiplier and a task-lock, it takes 7 cycles more.
Andy