Here is the interrupt-related code from the new ROM booter. I got everything to work in one cog using interrupts. At first, I was using one cog for the booter and another cog just for auto baud detection, with LUT sharing as conduit.
I figured this could be done in one cog, but I needed to make MORE edge events, and add STATE events, so that you don't get chicken-and-egg problems with smart pin event detection and AKPIN response.
Here is the related part of the booter code. Note that interrupt 1 responds to state changes on the RX pin (via smart pin 0), looking for a space ($20), while interrupt 2 handles RX data reception (via smart pin 63). Interrupt 1 actually forces interrupt 2, in case interrupt 2 didn't fire in time, when a space ($20) is detected via interrupt 1. This was a problem at higher baud rates. Now it's rock solid:
CON
        rx_pin   = 63           'pins
        tx_pin   = 62
        spi_cs   = 61
        spi_ck   = 60
        spi_di   = 59
        spi_do   = 58
        rx_msr   = 0

        lut_buff = $000         'serial receive buffer
        lut_btop = $07F         'serial receive buffer top

        chr_to   = 0            'mode bits
        did_spi  = 1
        key_on   = 2
DAT             org
'
'
' Enable autobaud and serial receive interrupts
'
                wrpin   msta,#rx_msr            'measure states on rx_pin via rx_msr
                setse1  #%110<<6+rx_msr         'event on rx_msr high
                dirh    #rx_msr                 'enable measurement
                mov     ijmp1,#autobaud         'set interrupt vector
                setint1 #4                      'enable interrupt

                wrpin   #%00_11111_0,#rx_pin    'set rx pin for asynchronous receive
                setse2  #%110<<6+rx_pin         'set se2 to trigger on rx_pin high
                mov     ijmp2,#receive          'set int2 jump vector
                setint2 #5                      'set receiver ISR to trigger on se2 (rx_pin high)
(main program here)
'
'
' Get rx byte
'
get_rx          pollct1                 wz      'if timeout, error
        if_nz   jmp     #command_err

                cmp     head,tail       wz      'loop until byte is received
        if_z    jmp     #get_rx

                testb   mode,#chr_to    wz      'clear timeout?
        if_nz   call    #clear_timeout

                rdlut   x,tail                  'get byte from lut
                incmod  tail,#lut_btop          'update tail
                ret
'
'
' Clear timeout
'
clear_timeout   getct   x
                addct1  x,timeout_per
                ret
'
'
' Send string
'
tx_string       waitx   ##30_000_000/100        'wait 10ms
                wrpin   #%01_11110_0,#tx_pin    'configure tx pin
                wxpin   baud,#tx_pin            'set baud
                dirh    #tx_pin                 'enable tx pin
                mov     x,#3                    'initialize byte counter

tx_loop         incmod  x,#3            wc      'if initial or 4th byte,
tx_ptr  if_c    mov     y,0                     '..get 4 bytes (start address set by caller)
        if_c    add     tx_ptr,#1               '..point to next 4 bytes

                test    y,#$FF          wz      'if not end of string,
        if_nz   wypin   y,#tx_pin               '..send byte
.wait   if_nz   testin  #tx_pin         wc      '..wait for buffer empty
if_nc_and_nz    jmp     #.wait
        if_nz   akpin   #tx_pin                 '..acknowledge pin
        if_nz   shr     y,#8                    '..ready next byte
        if_nz   jmp     #tx_loop                '..loop for next byte

.busy           rdpin   x,#tx_pin       wc      'end of string,
        if_c    jmp     #.busy                  '..wait for tx to finish
                dirl    #tx_pin                 '..disable tx pin
                wrpin   #0,#tx_pin              '..unconfigure tx pin
                ret
'
'
' Autobaud ISR
'
autobaud        akpin   #rx_msr                 'acknowledge rx state change
                rdpin   buf2,#rx_msr    wc      'get sample, measure ($20 -> 10000001001 -> ..1, 6x 0, 1x 1, 2x 0, 1..)
                clrb    buf2,#31                'clear msb in case 1 sample
        if_c    jmp     #.scroll                'if 1 sample, just scroll

                mov     limh,buf0               '0 sample,
                shr     limh,#4                 '..make window from 1st 0 (6x if $20)
                neg     liml,limh
                add     limh,buf0
                add     liml,buf0

                mov     comp,buf1               '0 sample,
                mul     comp,#6                 '..normalize last 1 (1x if $20) to 6x
                cmpr    comp,limh       wc      '..check if last 1 within window
        if_nc   cmp     comp,liml       wc

        if_nc   mov     comp,buf2               '0 sample and last 1 within window,
        if_nc   mul     comp,#3                 '..normalize last 0 (2x if $20) to 6x
        if_nc   cmpr    comp,limh       wc      '..check if last 0 within window
        if_nc   cmp     comp,liml       wc

        if_c    jmp     #.scroll                'if not $20, just scroll

                add     buf0,buf2               '$20 (space),
                shl     buf0,#16-3              '..compute bit period from 6x 0 and 2x 0
                or      buf0,#7                 '..set 8 bits
                wxpin   buf0,#rx_pin            '..set rx pin baud
                dirl    #rx_pin                 '..reset rx pin
                dirh    #rx_pin                 '..(re)enable rx pin to (re)register frame
                mov     baud,buf0               '..save baud for transmit
                mov     rxbyte,#$120            '..signal receiver ISR to ignore pin, enter space
                trgint2                         '..trigger serial receiver ISR in case it wasn't, already (<50k baud)

.scroll         mov     buf0,buf1               'scroll sample buffer
                mov     buf1,buf2

                reti1                           'if $20 (space), serial receiver ISR executes next
'
'
' Serial receiver ISR
'
receive         clrb    rxbyte,#8       wc      'triggered by autobaud? if so, rxbyte = $20 (space)
        if_nc   akpin   #rx_pin                 'triggered by receive, acknowledge rx byte
        if_nc   rdpin   rxbyte,#rx_pin          'triggered by receive, get rx byte
                wrlut   rxbyte,head             'write byte to circular buffer in lut
                incmod  head,#lut_btop          'increment buffer head
                reti2
'
'
' Constants / initialized variables
'
timeout_per     long    30_000_000/10           'initial 100ms timeout
msta            long    %0111<<28+%00_10000_0   'read states on lower pin (pin 63 in case of pin 0)
mode            long    0                       'serial mode
head            long    0                       'serial buffer head
tail            long    0                       'serial buffer tail
'
'
' Uninitialized variables
'
i               res     1                       'universal
x               res     1
y               res     1
z               res     1

rxbyte          res     1                       'ISR serial receive

buf0            res     1                       'ISR autobaud
buf1            res     1
buf2            res     1
limh            res     1
liml            res     1
comp            res     1
baud            res     1
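For anyone following the autobaud ISR above, here is a small Python model of its window arithmetic (a sketch of the math only, not the flag-level PASM; the function name and plain-integer interface are mine):

```python
def detect_space(buf0, buf1, buf2):
    """Model of the autobaud window check in the ISR above.

    buf0, buf1, buf2 are the last three state durations (in clocks) from
    the smart pin in state-measurement mode. For a $20 frame they are
    6, 1 and 2 bit periods (low, high, low). Returns the bit period if
    the pattern matches a space, else None.
    """
    limh = buf0 + (buf0 >> 4)               # window = buf0 +/- buf0/16
    liml = buf0 - (buf0 >> 4)
    if not (liml <= buf1 * 6 <= limh):      # normalize the 1x high run to 6x
        return None
    if not (liml <= buf2 * 3 <= limh):      # normalize the 2x low run to 6x
        return None
    return (buf0 + buf2) >> 3               # 6x + 2x = 8 bit periods, /8

# A $20 at a bit period of 1000 clocks: runs of 6000, 1000, 2000 clocks.
assert detect_space(6000, 1000, 2000) == 1000
# A mismatched run pattern is rejected.
assert detect_space(3000, 1000, 2000) is None
```

The ISR then packs this for WXPIN in one step: `shl buf0,#16-3` turns the sum of 8 bit periods into (period << 16), and `or buf0,#7` sets the bit count to 8 (bits minus 1).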
The early "maybe interrupts make sense" discussion boiled down to what we experienced using the "hot" edition. Dedicating one cog to just polling was inefficient, just like tying up all but one cog on hub service was.
It was either a tasker or interrupts.
The hot chip's tasker solidified the idea of the COG as the atomic unit, not the task, or in our case, the ISR.
As long as hub access and events do not impact other cogs, we've still got what we like in P1. People can grab objects and run with them pretty easily.
As I recall, that discussion ended quietly, Chip saying, "let us not talk about this." It was the right call.
I may be slow and out of date and I may have missed a point. But...
On the P2 can my code running on its cog(s) modulate the execution rate of your code running on its cog(s) as we both hammer on HUB access through the "egg beater" ?
On the P2, cogs have dedicated hub access slots, but the slot time also depends on the 4 LSB of the long address. Hub accesses from a cog will not interfere with the timing of other cogs.
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
Right?
Not really. The bottom 4 bits (EDIT: actually bits 5..2) that a cog has access to go up once per clock, so that a cog can read the next long of hub RAM each clock - this is the whole beauty of the egg beater: every cog at once can read sequential longs, one per clock. There's a FIFO to smooth out the accesses, since a cog can't use a long every clock (except for the streamer, which can). The FIFO is big enough so that, once it synchronizes, it can never underflow when reading or overflow when writing. The FIFO can be used for manual access, hubexec, or the streamer, but only for one of these things at a time.
However, if you want random rather than sequential access, or if you're already using the FIFO for something else (e.g. hubexec), you might be better off aligning your data based on cog ID. But you don't need to actually check your cog ID to do this - the first access's timing may be off, but it will stall the cog so that the rest are timed perfectly (supposing you wrote your code properly). Put a dummy access before any time-critical accesses if you can't afford for the first one to be off.
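That self-aligning behavior can be sketched with a toy Python model of the rotating slot (my own sketch; the exact (t - cog) % 16 convention is an assumption, but the conclusion holds for any fixed rotation):

```python
def wait_for_slice(cog, slice_, t):
    """Clocks a cog waits from clock t until its rotating hub window
    lands on the given RAM slice. Egg-beater model: at clock t,
    cog c can access slice (t - c) % 16."""
    return (slice_ + cog - t) % 16

# A first access to slice 5 from cog 3 may wait anywhere from 0..15 clocks:
first_wait = wait_for_slice(3, 5, 100)
assert 0 <= first_wait <= 15

# But once one access (real or dummy) has synchronized the loop, every
# later access to the same slice waits zero clocks, provided the loop
# period is a multiple of 16 - no need to know the cog id at all:
t = 100 + first_wait                    # first access granted here
for _ in range(8):
    t += 16                             # loop body takes exactly 16 clocks
    assert wait_for_slice(3, 5, t) == 0
```

The cog ID only shifts where the first stall lands; after that, the loop stays locked to the slot rotation by construction.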
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
That sure seems like a high price to pay for maximum throughput. It divides up the hub memory in a weird (though regular) way, potentially wasting 15/16ths of it (unless you also used more code at other times or in other cogs to use those skipped over portions). I know programmers go to extremes at times for maximum throughput, but the boost from such an access scheme would seem to come at the expense of programming sanity, for lack of a better term. Of course, the P2 shines best when it's doing sequential access. But when it comes to random access, I'd guess that it's generally best to just live with the lower throughput rate and not divvy up memory in such a tricky way. But yeah, for the "maximum random r/w access" that you mentioned, I believe that such random access usage would be the fastest, but others can comment more confidently. Update: Okay, another just did comment with regards to the synching up part and not needing to worry about the exact cog number.
On the P2 can my code running on its cog(s) modulate the execution rate of your code running on its cog(s) as we both hammer on HUB access through the "egg beater" ?
Such influence among cogs is *electrically impossible* due to the chip's design, wherein each cog can only access one particular slice of memory at any one time. So, as Dave Hein said, no interference can occur. Now if two or more cogs were exchanging messages or otherwise using results calculated in another cog, then, of course, they could affect each other through the expected ways, but that's not what you were considering. For what you mentioned, it's refreshing to know that a cog will be totally done with reading or writing a particular long/word/byte in a slice of memory before the "trailing" cog gets access to that slice (and the same long/word/byte).
So, if I want maximum random read/write access speed to HUB I would arrange for all my accesses to have the same 4 LSB of the LONG address. Those 4 bits being dependent on my COG ID.
Right?
I think this is a yes and no / it depends type case.
There are burst HUB operations, but if you really want 'random' that implies no control at all over the address.
If you can accept some LSB control (no longer quite random), then yes, careful sync of the LSBs to the slot index can avoid a wait for the next go-around.
I don't think the COG ID is relevant after the first access, so if you carefully interleave opcodes (N cycles) and hub accesses (Adr+N), you could craft higher bandwidths.
Given the high bandwidth already there, and the burst ops, actual need for this case would be quite rare, but it can be constructed.
I still can't work out if your code can modulate the speed of my code though....
Not via Hub-Slots.
The HUB already is allocating 1/16 time to every COG, so it is hard-coded jitter free. (from a COG-COG interaction viewpoint, if they want, every other COG might use the slot available to it)
Only if it somehow could allocate N/16, could there be a jitter effect.
Addit:
The HUB-Slot rotate effect does mean there is a preferred INC or DEC direction.
(I forget which way Chip has the interaction working.)
Even in a HLL, you might get slightly faster data flows in small buffers, with a sparse-array design.
The hub RAM is divided into 16 banks of memory. Bits 2 through 4 of the hub RAM address are used to select the bank. The hub slots for bank 0 are allocated as 0, 1, 2, ... ,14, 15. The hub slots for bank 1 are allocated as 1, 2, 3, ... , 15, 0. The hub slots for the rest of the banks are shifted in the same manner. This allows for a max transfer rate of 1 long/cycle. So if all of the cogs were using their FIFOs at the same time you could get 16*4*160MHz = 10.24 Giga-bytes/second transferred to/from the hub RAM.
Reading sequential longs in a tight loop is a bit different. Since instructions take 2 cycles you would not be able to read longs at full speed; instead the speed would be 1 long/17 cycles. Also, it will be difficult to design a loop that reads hub RAM with deterministic timing like the P1 unless the data addresses are deterministic as well. Hopefully, the higher speed of the P2 will help to compensate for the lack of determinism.
EDIT: I meant bits 2 through 5 instead of bits 2 through 4. Four bits from the hub address are used to select the RAM bank.
As someone pointed out, if two cogs were to communicate through addresses whose bits 5..2 were static, timing would become deterministic, since the hub slice (physical RAM instance) would remain constant, coming around on every 16th clock to each cog.
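A toy model of this allocation (my own Python sketch, using the bits 5..2 slice select from the EDIT above) reproduces the 1 long/17 clocks figure for a plain sequential-read loop:

```python
def slice_of(addr):
    """Hub RAM slice holding a byte address: bits 5..2."""
    return (addr >> 2) & 15

def grant(cog, addr, not_before):
    """Earliest clock >= not_before at which the cog's rotating window
    reaches the slice holding addr (cog c sees slice (t - c) % 16)."""
    t = not_before
    while (t - cog) % 16 != slice_of(addr):
        t += 1
    return t

# A plain rdlong loop over sequential longs: each request can be issued
# at the earliest 2 clocks after the previous grant (instructions take
# at least 2 clocks), so the window has just rotated past the next slice
# and every long lands 17 clocks after the last.
cog  = 0
addr = 0x400
t    = grant(cog, addr, 0)
deltas = []
for _ in range(8):
    addr += 4
    t2 = grant(cog, addr, t + 2)
    deltas.append(t2 - t)
    t = t2
assert deltas == [17] * 8
```

The same model shows the streamer/FIFO peak: with a grant available every clock, 16 cogs moving 4 bytes per clock at 160 MHz gives the 10.24 GB/s aggregate mentioned above.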
It all kind of, sort of makes sense. Sometimes. I have to study the egg beater "magic roundabout" diagram some more.
I still can't work out if your code can modulate the speed of my code though....
That does not happen.
There are 16 cogs. There are 16 banks of hub RAM, addressed by the lower nibble of the long address.
Each cog gets exclusive access to one bank. Every clock, that bank increments, modulo style, which ensures a given COG will get access to a given bank within a given time.
All COGS get HUB access all the time, and it's uniform.
No, heater. One cog cannot interfere with any other cog's hub access!
There are actually 16 possible cog accesses to hub in every clock. Each cog's access is skewed by one long for the same clock.
This permits a cog to transfer a long on every clock pulse.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!! If you are reading successive bytes, beginning on a long boundary, you will get byte 0, byte 1 will be 16 clocks later, byte 2 another 16 clocks, byte 3 another 16 clocks, and then byte 4 will actually be 16+1=17 clocks, followed by the next bytes 5, 6 & 7 each 16 clocks, then byte 8 at 17 clocks with the next 3 bytes at 16 clocks, and so on.
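The 16,16,16,17 byte pattern falls straight out of a toy model of the rotating slot (my own Python sketch; the minimum 2-clock re-issue gap is an assumption, and any gap from 2 to 16 clocks gives the same pattern):

```python
def slice_of(addr):
    return (addr >> 2) & 15             # hub slice = address bits 5..2

def grant(cog, addr, not_before):
    # Egg-beater model: at clock t, cog c can access slice (t - c) % 16.
    t = not_before
    while (t - cog) % 16 != slice_of(addr):
        t += 1
    return t

# Read 16 successive bytes from a long boundary. Bytes within one long
# live in the same slice, so they come 16 clocks apart; stepping into
# the next long means the window rotated past that slice one clock ago,
# so it costs 17 - exactly the pattern described above.
cog = 2
t = grant(cog, 0, 0)
deltas = []
for a in range(1, 17):
    t2 = grant(cog, a, t + 2)           # next rdbyte issues >= 2 clocks later
    deltas.append(t2 - t)
    t = t2
assert deltas == [16, 16, 16, 17] * 4
```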
No, heater. One cog cannot interfere with any other cog's hub access!
There are actually 16 possible cog accesses to hub in every clock. Each cog's access is skewed by one long for the same clock.
This permits a cog to transfer a long on every clock pulse.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!! If you are reading successive bytes, beginning on a long boundary, you will get byte 0, byte 1 will be 16 clocks later, byte 2 another 16 clocks, byte 3 another 16 clocks, and then byte 4 will actually be 16+1=17 clocks, followed by the next bytes 5, 6 & 7 each 16 clocks, then byte 8 at 17 clocks with the next 3 bytes at 16 clocks, and so on.
So when you want it (the reading of BYTEs) fast, you use the FIFO,
or read longs at least and do the shift/mask manually, which is still faster ...
Code snippets for doing this (FIFO / streamer ...) could go into a document to help beginners.
But when using normal instructions to access sequential longs, each successive long will be 17 clocks apart!!!
I think that is 17 or 15, depending on the INC/DEC relative to Slot-Spin.
Given INC is the more common code style, should the Slot-Spin be tuned to give the better access number for INC ?
( I think that means Slot decrements) -
Has that been done on P2 ?
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
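The increment-vs-decrement point can be checked with a toy model (mine; the %16 conventions are assumptions, the direction of rotation is what matters):

```python
# Why the slot must INCREMENT for forward FIFO streaming: model the slice
# a cog can access at clock t under both rotation directions.
def stream_clocks(n_longs, visible):
    """Clocks to read n sequential longs (each in the next slice up),
    given visible(t) -> slice accessible at clock t."""
    t, got = 0, 0
    while got < n_longs:
        if visible(t) == got % 16:      # long #g lives in slice g % 16
            got += 1
        t += 1
    return t

cog = 0
inc = lambda t: (t - cog) % 16          # actual P2: visible slot increments
dec = lambda t: (cog - t) % 16          # hypothetical decrementing slot

# Incrementing: the window chases the data forward - one long per clock.
assert stream_clocks(32, inc) == 32
# Decrementing: each forward-sequential long waits ~15 clocks.
assert stream_clocks(32, dec) == 466
```

With a decrementing slot the situation simply reverses: backward-sequential reads would stream at one long per clock instead, which is Chip's point below.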
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
Right, FIFO access would have to run backward through hub RAM, instead of forward.
It would be every 17 clocks. The slot visible to a cog increments every cycle, so that the FIFO can do forward sequential access; if it decremented instead, the FIFO wouldn't be able to provide the one long per clock sequential access that it does provide.
Yes, I forgot about the need to also support the FIFO.
Now where is my DE0-Nano ...