MOVS and MOVD are another thing that gets complicated with cog byte addresses. Maybe there's really no issue. I'll just wait to see what Chip comes up with rather than speculating about possible problems.
Yes, I'm optimistic that Chip will come up with a good solution that will "solve" the problem for PropGCC as well as Spin. :-)
The complication happens when you do something like MOV temp, #100 versus MOV temp, #label. If the compiler automatically shifts the value of label right by 2 bits, what does it do with the constant 100? How would the compiler know whether 100 is actually a cog byte address that I also want shifted down by 2 bits?
Yes, this is the kind of trouble that must be overcome.
Right now, I think the only strange things you must code, as someone pointed out, are:
MOVD inst,#reg/4 'reg is an address label for a register whose two address LSBs equal %00
MOV $1F2*4,#value '$1F2 is the 9-bit register address, but it must be multiplied by 4 to give it two disposable LSBs equal to %00.
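Those two idioms can be modelled with a small sketch. This is hypothetical Python, not any real assembler; the symbol table and the 9-bit field handling are invented for illustration:

```python
# Hypothetical sketch: cog register labels carry byte addresses
# (multiples of 4), while bare constants pass through untouched.
# Only the symbol table can tell the two cases apart.

def encode_immediate(value, is_label):
    """Encode a 9-bit immediate field for a cog instruction."""
    if is_label:
        # A register label is a cog byte address with two %00 LSBs;
        # shifting right by 2 recovers the register number (#reg/4).
        assert value % 4 == 0, "register labels are long-aligned"
        return (value >> 2) & 0x1FF
    # A bare constant like #100 is taken literally - the assembler
    # has no way to know it was "meant" as a byte address.
    return value & 0x1FF

symbols = {"temp": 0x40}    # invented label at cog byte address $40

print(encode_immediate(symbols["temp"], is_label=True))   # register 16
print(encode_immediate(100, is_label=False))              # constant 100
```

Going the other way, a raw register number like $1F2 has to be multiplied by 4 before it can stand where a byte address is expected.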
Just in case I've lost track of the big picture:
Is an implementation of a MAC still planned? If so, how many cycles will it need?
Is there a feature list like the one for the "old P2"?
I did not think there were "cog byte addresses"; all COG access is on a 32-bit basis, isn't it?
OK, maybe cog byte address is not the right term, but the new scheme looks like a byte address. Instead of referring to the cog locations as 0, 1 and 2 they will now be 0, 4 and 8. So instead of the range being 0 to 511 it is now 0 to 2044 with the two LSBs being zero.
Got a ways to go yet. There was talk about having a single hub-based engine for those bulky features. We'll see a new FPGA image before that sort of feature gets considered, methinks.
Next list should come with the FPGA image. That will keep the feature suggestions rooted in real experimenting.
The MAC in the "old P2" was a fast 20x20 bit multiplier with a 64bit adder.
This might be one of those features that is not suitable as a hub-based resource, as speed is important.
So 16 of these circuits would be required, as well as support instructions.
A big chunk of silicon perhaps?
Couldn't the assembler just do the /4 for the MOVD instruction?
Maybe I don't get the issue... Seems like you could just switch everything to byte addresses and have the assembler remove the two LS zero bits...
Yep, they are needed - otherwise C performance would suffer greatly, as would every access to bytes/words in any language.
Many RISC architectures tried to get rid of byte/word access, but today - all of them have byte/word. The performance hit of having to use read/shift/mask/modify/write is just too great.
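For anyone who hasn't counted the cost, here is roughly what a single byte store looks like when memory only supports long access. A Python sketch of the read/shift/mask/modify/write sequence, with an invented toy memory (a list of 32-bit longs, little-endian byte order assumed):

```python
# Sketch of the read/shift/mask/modify/write sequence a byte store
# costs on a machine with only 32-bit (long) memory access.
# 'hub' is a toy long-addressed memory; addresses are byte addresses.

def write_byte(hub, byte_addr, value):
    long_index = byte_addr >> 2          # which long holds the byte (read)
    shift = (byte_addr & 3) * 8          # bit position within the long (shift)
    mask = 0xFF << shift                 # isolate the old byte (mask)
    word = (hub[long_index] & ~mask) | ((value & 0xFF) << shift)  # modify
    hub[long_index] = word & 0xFFFFFFFF  # write back (write)

hub = [0x11223344]
write_byte(hub, 1, 0xAB)
print(hex(hub[0]))   # 0x1122ab44
```

Five steps and a temporary register, versus a single WRBYTE when the hardware supports byte access directly.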
I would think Spin performance would also suffer greatly or at least code density would go way down.
"as would every access to bytes/words in any language"
It would also slow down graphics, array access, basically everything that was not a pure long read/write. Terrible idea not to support bytes/words.
I saw that but was actually thinking of the VM itself which fetches bytes I believe. I thought your comment referred to data references. In any case, we are agreed!
Functionally, it's not really any different to nibble addressing. Nibbles aren't supported so they aren't used. If bytes weren't supported then the correct thing to do is not use them. Instead, scale all data up to minimum addressable word size.
However, byte sized data is common, and byte addressability is expected by average coder dudes, so it's a good idea to make it directly addressable.
The performance hit of having to use read/shift/mask/modify/write is just too great.
Then don't. Use the data in its natural state. It takes just as long to transmit 8 or 16 bits of data over a 32 bit bus as it does 32 bits so not much of a hit there. All the constants, including TRUE and FALSE, are already defined as longs.
The tables ( Anti-Log and Sine ) are defined as words so a table of the same size but using longs would only have half the resolution. I'm not seeing that as a huge problem. Most people interpolate those values now, don't they?
While the VM might fetch bytes I'm betting it takes four bytes to make up a complete instruction. I'm talking out of my hat on that last one but I figure there would have to be some serious magic involved to get an instruction, destination address, source address and condition flags out of 8 bits. Passing data 8 bits at a time across a 32 bit bus doesn't seem very efficient to me. It is a 32 bit bus... right?
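As an aside, the interpolation mentioned above is cheap in any language. A hedged Python sketch with an invented 256-entry, word-sized sine table (the real tables' size and scaling differ):

```python
import math

# Illustration only: a coarse word-sized sine table plus linear
# interpolation recovers most of the resolution lost to the small
# table. The 256-entry table and 16-bit phase are invented here.

table = [int(32767 * math.sin(2 * math.pi * i / 256)) for i in range(256)]

def sin_interp(phase):                    # phase in [0, 65536)
    idx = phase >> 8                      # top 8 bits pick the entry
    frac = phase & 0xFF                   # low 8 bits blend to the next one
    a = table[idx]
    b = table[(idx + 1) & 0xFF]
    return a + ((b - a) * frac >> 8)      # linear blend between entries

print(sin_interp(16384))   # peak at a quarter cycle -> 32767
```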
While the VM might fetch bytes I'm betting it takes four bytes to make up a complete instruction.
The Spin VM uses byte codes not 32 bit instructions. That is one reason it achieves very high code density compared with the PropGCC LMM memory model. Switching it to using 32 bit instructions would probably drastically reduce code density although maybe that would be okay on the P2 which has far more hub memory.
Stack-based addressing is essentially a zero-operand instruction. The runtime part of Spin shares some similarities with my Tachyon Forth, which uses bytecodes. Have a look at this snippet from a Spin file.
Addr : 3509: 48 : Variable Operation Global Offset - 2 Read
Addr : 350A: 36 : Constant 2 $00000001
Addr : 350B: E3 : Math Op <<
Addr : 350C: 64 : Variable Operation Local Offset - 1 Read
Addr : 350D: 36 : Constant 2 $00000001
Addr : 350E: E8 : Math Op &
Addr : 350F: EC : Math Op +
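Those byte codes map naturally onto a tiny stack machine. A Python sketch of how the snippet evaluates (the opcode numbers follow the listing above, but the VM itself is a deliberate simplification, not the real Spin interpreter):

```python
# Toy stack machine modelling the bytecode snippet above: push a
# global, push 1, shift left, push a local, push 1, AND, then ADD.

def run(code, globals_, locals_):
    stack = []
    for op, arg in code:
        if op == 0x48:                       # read a global variable
            stack.append(globals_[arg])
        elif op == 0x64:                     # read a local variable
            stack.append(locals_[arg])
        elif op == 0x36:                     # push a constant
            stack.append(arg)
        elif op == 0xE3:                     # math op <<
            b, a = stack.pop(), stack.pop(); stack.append(a << b)
        elif op == 0xE8:                     # math op &
            b, a = stack.pop(), stack.pop(); stack.append(a & b)
        elif op == 0xEC:                     # math op +
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
    return stack.pop()

snippet = [(0x48, 2), (0x36, 1), (0xE3, None),
           (0x64, 1), (0x36, 1), (0xE8, None), (0xEC, None)]

# computes (global[2] << 1) + (local[1] & 1)
print(run(snippet, {2: 3}, {1: 5}))   # -> 7
```

Seven bytes of code for an expression that would take several 32-bit instructions in LMM form, which is where the code-density win comes from.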
- all string operations would be exceedingly slow
- all byte/word opcode size VMs would be exceedingly slow
- any byte/word buffer access would be very slow
The above is why basically all 32 bit RISC designs now have byte/word access. The performance hit is too large.
It is not feasible to just use longs and throw away 3/4 of all the memory.
Also, it seems we've gone all the way to the other extreme where we can fetch word and long values from unaligned hub addresses.
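For reference, an unaligned fetch just gathers the right bytes regardless of alignment. A little-endian Python sketch over a byte-addressable toy memory:

```python
# Sketch of unaligned word/long fetches from byte-addressable hub
# memory, little-endian: the bytes are gathered wherever they fall.

def rdword(mem, addr):
    return mem[addr] | (mem[addr + 1] << 8)

def rdlong(mem, addr):
    return int.from_bytes(mem[addr:addr + 4], "little")

mem = bytes([0x11, 0x22, 0x33, 0x44, 0x55])
print(hex(rdword(mem, 1)))   # 0x3322 - starts at an odd address
print(hex(rdlong(mem, 1)))   # 0x55443322 - spans two longs
```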
I don't think string operations form a significant portion of any program written for the prop.
Don't use a VM. You have to use a PC to generate the byte codes anyway, so why not just finish the job on the PC? The Spin interpreter, as nice and clever as it is, seems to be the thing backing the new processor design into a corner.
Byte / word access would be slow IF your data could only be represented by a byte or word. Longs might be a waste in a few cases but the benefits would exceed the costs in most cases.
One of my many character flaws is not being too sentimental about existing things :-).
Sandy
I suppose if there was a strong argument that eliminating byte and word would make the core substantially smaller or simpler and hopefully also faster then it would be worth considering. There really aren't that many byte and word operations so I don't think removing them would help much. In fact, in P1 I think the only byte and word operations are rdword, rdbyte, wrword, and wrbyte. Probably wouldn't save that much logic removing those as long as hub memory has byte enables on writes.
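The byte-enable idea can be sketched like this (a hypothetical Python model; real hub RAM would do this in hardware with per-lane write enables, so no read-modify-write cycle is visible to the cog):

```python
# Model of a hub write port with per-byte enables: a WRBYTE or WRWORD
# becomes one long write with only some byte lanes enabled.

def hub_write(hub, long_index, data, byte_enables):
    word = hub[long_index]
    for lane in range(4):                    # four byte lanes per long
        if byte_enables & (1 << lane):       # update only enabled lanes
            mask = 0xFF << (lane * 8)
            word = (word & ~mask) | (data & mask)
    hub[long_index] = word & 0xFFFFFFFF

hub = [0x11223344]
hub_write(hub, 0, 0x0000AB00, 0b0010)   # a WRBYTE into lane 1
print(hex(hub[0]))                      # 0x1122ab44
```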
Last night (4am) I chased down the final bug that was inhibiting the fast-write setup for hub streaming. Fast-read had already been working for a few days, but now the whole system seems complete.
The next thing to do is to adapt the hub streaming to both hub exec and the NCO-driven pin/DAC I/O. Those NCO modes are going to be fun, because those are features which never existed on the prior Prop2.
The streaming is already set up so that you can create a loop in hub memory that automatically wraps for reading or writing. The only other thing we need is a way to redirect the start/size on a block loop. That would enable page-flipping, so to speak, for high-speed analog output, so that you can be writing one buffer while playing another - at 200MHz on the real chip, and hopefully 160MHz on the FPGA.
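The page-flipping described above amounts to classic double buffering. A Python sketch with invented buffer names and sizes (the pointer swap stands in for redirecting the start/size at the wrap):

```python
# Double-buffering sketch: while the streamer "plays" one buffer,
# the cog fills the other; at the wrap the start pointer is
# redirected so the freshly written buffer plays next.

BUF_SIZE = 4
buffers = [[0] * BUF_SIZE, [0] * BUF_SIZE]
play, fill = 0, 1

def stream_one_period(samples):
    global play, fill
    buffers[fill][:] = samples          # cog writes the idle buffer
    play, fill = fill, play             # redirect start at the wrap
    return list(buffers[play])          # streamer plays the fresh buffer

print(stream_one_period([1, 2, 3, 4]))   # -> [1, 2, 3, 4]
print(stream_one_period([5, 6, 7, 8]))   # -> [5, 6, 7, 8]
```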
Sounds great - does that also work in either direction (e.g. for things like camera capture, as well as analog/video out)?
Yes, see my #141 and Chip's #143
Thanks for your continued patience, Everyone.
Nice title update David.