Wow! You guys should watch that video of Steve Teig talking. Very interesting stuff, I thought. Thanks for posting that, Kye.
I think he's probably onto what the future of computing will be. He points out that we are mired in excruciating proceduralism, whereas much of what we need to make a program do does not require it, though we are forced to specify it due to the current computing paradigm.
His premise is that if a computational goal can be defined somewhat loosely, a compiler gains the freedom to arrange all the sub-computations in parallel, and to implement sequentiality only where causal boundaries exist. Of course, there needs to be special hardware to run such code efficiently, which he has already built. He sees place-and-route tools as the compiler for his reconfigurable fabric, which iterates through configuration states at GHz rates. The point of his fabric is to keep the data bound to the computational elements, so that the "straw" that exists today between the CPU and the memory goes away - nullifying access time, which is the bane of current systems (caches, speculative execution, etc., exist just to overcome the memory bottleneck). In his fabric, there is a field of computational elements with transparent latches for data and some cycling mechanism to reconfigure the interconnections among elements in round-robin fashion. The simpler the sub-computations, the more of today's CPUs you would effectively have running in parallel, though the granularity is much finer than what makes up a CPU now. He's determined that math operators on the order of 4 or 8 bits are adequate, as 32-bit or 64-bit computations are rare in practice, and could be built from lesser elements with some sequentiality.
He gave a great analogy about cooking a box of macaroni and cheese. Today, you would iterate through each piece of macaroni as it goes into the pot, then execute the rest of the directions in some exact order. In his way, all the macaroni gets thrown into the pot, then the remaining ingredients are mixed together in another bowl, in no particular order, as it doesn't matter, and at the end, the macaroni is stirred into the other ingredients. Which program would be easier to express? He cited Haskell as a language that compiles such code for current CPUs. Can anyone state how many times, and which language, he cited as NOT up to the task of expressing this new kind of code, though the professional community blithely supposes that it should be used?
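The procedural-versus-declarative contrast he draws can be sketched in a few lines of ordinary code (a toy Python illustration of the idea, not anything from the talk): the procedural loop pins down every step, while the declarative reduction only says how values combine, leaving grouping and order to the implementation.

```python
from functools import reduce

ingredients = [3, 5, 2, 7]  # arbitrary "mix these together" quantities

# Procedural: the order of every addition is spelled out explicitly.
total_proc = 0
for item in ingredients:
    total_proc += item

# Declarative: we only say "combine with +"; because + is associative,
# a compiler (or parallel runtime) is free to group the additions any
# way it likes - e.g. pairwise, in parallel.
total_decl = reduce(lambda a, b: a + b, ingredients)

assert total_proc == total_decl == 17
```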
...
I've watched my FPGA development tools inflate like Mr. Creosote, growing from less than a gig to over 3 gigs. I'm just waiting for them to issue a release that uses an entire HDD. I say that as a half joke, because software shows no signs of slimming, just growing like the Blob.
That time is now. We use Allen-Bradley PLC tools at work, and they come on a 320G external HD - no CD, no DVD, just a HD.
His premise is that if a computational goal can be defined somewhat loosely, freedom can be had by a compiler to arrange all the sub-computations in parallel, and only implement sequentiality where causal boundaries exist.
Nice aspiration, but the real world intrudes.
On a chip like the Prop (unlike much larger alternatives), there are no "causal boundaries" that can be ignored or put off for later; you will need to communicate to your compiler just WHAT you want to compile into a COG at PASM level, and what can run as LMM.
MOST of the code will run only quasi-compiled, with the speed penalty that incurs, but SOME small portions can 'drop into' a COG as native ASM-level code.
It is managing that slice/dice that will matter in a Prop tool flow, and the small COG code size demands nimble tools.
That is why I postulated building instructions into the Propeller II to handle bytes within 32-bit longs - to simplify that computation.
I made an instruction that lets you move bytes and words between longs, with static/incrementing/decrementing field pointers and rotate-left, but there's no simple mechanism for addressing cog RAM as a sequence of bytes. It would take 2-3 more instructions.
It's not necessary to address them in sequence IF you have an indexed instruction that can move a byte from one long to another.
MOV_B source,Index (3 downto 0) else (0 to 3) .. that already gives a sequence of 4 bytes. Now I only need to address the next LONG in the source field to have SEQUENCED addressing of BYTES.
I have been looking at all versions of the PDF on Prop2 since it was released... I have a number of interesting ideas, but I need to test them on a P2 (or FPGA emulation) before I will know which will work best (and which will not work). The P2 has a lot of additional instructions to think outside the box with.
(I should become active on the forum again in 1-2 weeks; been too busy to participate - sorry.)
I guess it's a bit late to be thinking about all this for the Prop I.
Although I think I did read that Prop II has had some consideration put into speeding up the LMM loop at least.
Now all we need is for Bill Henning to look at the new PII instructions and find cunning ways to use them that no one has thought of yet, like he did when inventing the LMM technique.
Hmm, language wars. Well, I started with Basic, and over the last 20 years have been following Microsoft as they gradually changed Basic into QBasic and VB and VB.net, and each step of the way I resisted the changes and ended up liking them. And then I realised they were turning Basic into C. So I guess that means that deep down, I like C.
Having said that, I think Spin might be even better. Take pointers in C for instance, which are represented with *. To my Basic eye, * means multiply. Sometimes you get * all by itself in C. I still don't quite know what that means, and google searches for "*" don't help. But Spin uses @, which to me makes a lot more sense. The location of a variable is "at" some memory location.
I'd love to see Spin being used more on a PC. Compile and run your Spin program on a PC and not even use the propeller. It would simplify development. One could think of visual spin, with text boxes, and check boxes etc. I've got these working on the propeller so why not on a PC too?
What do I want on a Prop II? Well, reading Cluso's thread, I think it comes down to memory. I've got a Z80 chip sitting on my desk here. Ancient technology, yet it has 32 times the memory of a propeller cog. But technically, it has no memory at all, as memory on a Z80 is external. This has me thinking - imagine what you could do if you had pasm able to quickly access external ram? Allocate 24 pins for external ram as a simplistic model - 8 for data and 16 for address, and you can read 64k instead of 2k. Add in some instructions to quickly read and write memory (eg rd low, read memory, rd high) and you don't have to worry about trying to squeeze more ram into the propeller chip. LMM took the propeller from a 2k micro to a 32k micro, but with external ram, you can go to a 512k micro with a $4 sram chip.
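As a back-of-the-envelope model of that 24-pin scheme (8 pins for data, 16 for address, giving a 64K byte space; the class and strobe comments are my own illustration, not a real Propeller interface):

```python
# Toy model of the proposed 24-pin external-RAM hookup:
# 16 address pins select one of 65536 locations, 8 data pins carry the byte.
class ExternalRam:
    def __init__(self):
        self.mem = bytearray(65536)  # 64K external, vs 2K longs inside a cog

    def write(self, addr16, data8):
        # drive address + data pins, pulse a hypothetical /WR strobe
        self.mem[addr16 & 0xFFFF] = data8 & 0xFF

    def read(self, addr16):
        # drive address pins, sample data pins on a hypothetical /RD strobe
        return self.mem[addr16 & 0xFFFF]

ram = ExternalRam()
ram.write(0x1234, 0xAB)
assert ram.read(0x1234) == 0xAB
```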
Maybe it is just a matter of having more pins?
One can think of all sorts of cool things with a Prop I and twice the number of pins, because that would give you enough for a decent external ram system and still lots of general purpose I/O pins.
I'm very much looking forward to the Prop II of course. The sooner the better!
How current is spinsim in implementing the Propeller 2 instruction set? Does it simulate the SDRAM interface? It would be great to be able to start Propeller 2 development before the actual silicon is available and your simulator seems like our best shot at that!
... In his way, all the macaroni gets thrown into the pot, then the remaining ingredients are mixed together in another bowl, in no particular order, as it doesn't matter, and at the end, the macaroni is stirred into the other ingredients. What program would be easier to express?
Ladder Logic does this to a limited extent. But it's an exact interpreted language so doesn't allow for any sort of reorganising. On the plus side you do end up with a rather rigorous program.
I've occasionally mulled over how well it'd meld with multicore CPUs. One factor is there just isn't any need for multi-threading the upper logic in typical factory use. Power consumption is low already and the execution speed plenty good enough even on ten year old processors. The logic just isn't that complex and a lot of the tight loop control is running from coprocessors in the remote I/O or expansion rack.
spinsim implements the basic P2 instruction set, but it doesn't support the peripheral devices, such as counters and the additional I/O ports. It also doesn't do the multi-cycle math operations, basically because there isn't enough detail in the spec at this point.
As the spec gets updated I plan on updating spinsim to match it. I could easily add support for the SDRAM interface if there were more detail in the spec on how it works. However, even just the basic I/O port operation is still not fully defined in the spec, so I currently only support Port A in the same way that P1 accesses it.
Thanks for the update! I guess we just have to hope that Parallax will post a more detailed description now that the hardware design is complete.
I'm sorry that I don't have any documentation for much of the new stuff in the Prop II yet. I'm racing to get the ROM code done now, which is probably working, already, but I need to clean it up before testing it. It looks like we will have a separately-executable SHA-256 program, which I will post a little later here, and hopefully, a similar AES-128 program in ROM, for general use outside the ROM and user boot loaders.
Watch that Steve Teig video posted by Kye. I really think that what he's talking about is the future. It's hard for us to discuss it, though, because we are so boxed in by our current thinking that we are very likely to build incomplete mental constructs out of stuff we already know, in response to what he's saying, and miss the point. The idea of the logic carrying the data is the reciprocal of what we've got today, and totally takes a hammer to the low-level bottlenecks we suffer with.
Maybe I'm just easily influenced. I wasn't swayed too much by all his Newtonian and Einsteinian talk, though it was all interesting observation, and maybe put forth to tickle investors' minds. And what he cited as cache maintenance being the downfall of multiprocessing is not real to me, because we know multiple processors are best used to perform unique tasks, not gang up on one - maybe it's the common linear thinking that dictates to many people that multiple processors MUST work in alliance on a single goal. Anyway, we know that's not so.
What he brings up, though, is this notion of spreading the memory through the logic, and in his implementation, he switches the configuration 8 or 16 times as the data floats upward in time. I am so bound by linear thinking that I can't grasp yet what all could be done, or how you might express it, because it is a total paradigm shift. I'm going to be thinking about it, though.
Chip, do a quick Google search for "Map Reduce" and have a look at the Wikipedia article. This is the current method that companies are using to harness multi-processing capacity. It's an extremely simple notion, it has some interesting implications (yes, I meant implications).
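For anyone who hasn't looked it up yet, the whole MapReduce notion fits in a few lines: independent "map" workers each produce a partial result, and an associative "reduce" merges them in any order (a toy Python word-count sketch, the canonical example):

```python
from functools import reduce
from collections import Counter

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each "worker" independently turns its document into word counts.
mapped = [Counter(doc.split()) for doc in docs]

# Reduce: partial results are merged; because the merge is associative,
# it can happen in any order, on any number of processors.
total = reduce(lambda a, b: a + b, mapped)

assert total["the"] == 3 and total["fox"] == 2
```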
I think there are uses for both types of multiple-processor usage. All of them doing different things is obviously useful, and typically much easier to grasp and code for. However, all of them working on one task has its uses, especially for reducing the time needed to complete it. Reducing a multi-minute task to seconds, or a multi-hour/day task to minutes, are both very compelling things. These cases are where you run into the cache maintenance issue. One prime example is modern GPUs, where they get massive parallelization (1536 and 2048 cores in the newest ones) but also share a lot of the same data.
One way to better utilize multiprocessing is to take more care in designing your data layout and makeup. If you factor in multiprocessing from the start, you tend to avoid doing things with the data that break things.
Here is the SHA-256 code that pedward was instrumental in making happen. This will be in ROM for general use:
'****************
'* *
'* SHA-256 *
'* *
'****************
org
sha_256 setf #%0_1000_1111 'configure movf for endian swap
'
'
' Init hash
'
sha_init reps #8,#1 'copy hash_init[0..7] into hash[0..7]
setinds hash,hash_init
mov indb++,inda++
'
'
' Command loop
'
sha_command rdlong x,ptra 'wait for long pointer in bits[16..2] and command in bits[1..0]
tjz x,#sha_command
setptrb x 'get long pointer into ptrb
and x,#1 '1 = hash block (pointer to 16 longs)
tjz x,#sha_read '2 = read hash (pointer to 8 longs)
'
'
' Hash block
'
cachex 'invalidate cache for fresh rdlongc
reps #16,@:load 'load 16 longs at pointer into w[0..15]
setinda w
rdlongc x,ptrb++ 'do endian swap on each long
movf inda,x
movf inda,x
movf inda,x
:load movf inda++,x
' Extend w[0..15] into w[16..63] to generate schedule
reps #48,@:sch 'i = 16..63
setinds w+16,w+16-15+7 'indb = @w[i], inda = @w[i-15+7]
setinda --7 's0 = (w[i-15] -> 7) ^ (w[i-15] -> 18) ^ (w[i-15] >> 3)
mov indb,inda--
mov x,indb
rol x,#18-7
xor x,indb
ror x,#18
shr indb,#3
xor indb,x
add indb,inda 'w[i] = s0 + w[i-16]
setinda ++14 's1 = (w[i-2] -> 17) ^ (w[i-2] -> 19) ^ (w[i-2] >> 10)
mov x,inda
mov y,x
rol y,#19-17
xor y,x
ror y,#19
shr x,#10
xor x,y
add indb,x 'w[i] = s0 + w[i-16] + s1
setinda --5 'w[i] = s0 + w[i-16] + s1 + w[i-7]
:sch add indb++,inda
' Load variables from hash
reps #8,#1 'copy hash[0..7] into a..h
setinds a,hash
mov indb++,inda++
' Do 64 iterations on variables
reps #64,@:itr 'i = 0..63
setinds k+0,w+0 'indb = @k[i], inda = @w[i]
mov x,g 'ch = (e & f) ^ (!e & g)
xor x,f
and x,e
xor x,g
mov y,e 's1 = (e -> 6) ^ (e -> 11) ^ (e -> 25)
rol y,#11-6
xor y,e
rol y,#25-11
xor y,e
ror y,#25
add x,y 't1 = ch + s1
add x,indb++ 't1 = ch + s1 + k[i]
add x,inda++ 't1 = ch + s1 + k[i] + w[i]
add x,h 't1 = ch + s1 + k[i] + w[i] + h
mov y,c 'maj = (a & b) ^ (b & c) ^ (c & a)
and y,b
or y,a
mov h,c
or h,b
and y,h
mov h,a 's0 = (a -> 2) ^ (a -> 13) ^ (a -> 22)
rol h,#13-2
xor h,a
rol h,#22-13
xor h,a
ror h,#22
add y,h 't2 = maj + s0
mov h,g 'h = g
mov g,f 'g = f
mov f,e 'f = e
mov e,d 'e = d
mov d,c 'd = c
mov c,b 'c = b
mov b,a 'b = a
add e,x 'e = e + t1
mov a,x 'a = t1 + t2
:itr add a,y
' Add variables back into hash
reps #8,#1 'add a..h into hash[0..7]
setinds hash,a
add indb++,inda++
wrlong zero,ptra 'clear command to signal done
jmp #sha_command 'get next command
'
'
' Read hash
'
sha_read reps #8,#1 'store hash[0..7] at pointer
setinda hash
wrlong inda++,ptrb++
wrlong zero,ptra 'clear command to signal done
jmp #sha_init 'init hash, get next command
'
'
' Defined data
'
hash_init long $6A09E667, $BB67AE85, $3C6EF372, $A54FF53A, $510E527F, $9B05688C, $1F83D9AB, $5BE0CD19 'fractionals of square roots of primes 2..19
k long $428A2F98, $71374491, $B5C0FBCF, $E9B5DBA5, $3956C25B, $59F111F1, $923F82A4, $AB1C5ED5 'fractionals of cube roots of primes 2..311
long $D807AA98, $12835B01, $243185BE, $550C7DC3, $72BE5D74, $80DEB1FE, $9BDC06A7, $C19BF174
long $E49B69C1, $EFBE4786, $0FC19DC6, $240CA1CC, $2DE92C6F, $4A7484AA, $5CB0A9DC, $76F988DA
long $983E5152, $A831C66D, $B00327C8, $BF597FC7, $C6E00BF3, $D5A79147, $06CA6351, $14292967
long $27B70A85, $2E1B2138, $4D2C6DFC, $53380D13, $650A7354, $766A0ABB, $81C2C92E, $92722C85
long $A2BFE8A1, $A81A664B, $C24B8B70, $C76C51A3, $D192E819, $D6990624, $F40E3585, $106AA070
long $19A4C116, $1E376C08, $2748774C, $34B0BCB5, $391C0CB3, $4ED8AA4A, $5B9CCA4F, $682E6FF3
long $748F82EE, $78A5636F, $84C87814, $8CC70208, $90BEFFFA, $A4506CEB, $BEF9A3F7, $C67178F2
zero long 0
'
'
' Undefined data
'
hash res 8
w res 64
a res 1
b res 1
c res 1
d res 1
e res 1
f res 1
g res 1
h res 1
x res 1
y res 1
I saw MarkT's SHA-256 on Obex where he initialized the hash at the start and after each hash read. I just changed this implementation to work like that and it saved a few longs, as well as some code on the calling side.
It turns out that there are no flag writes or flag tests in the whole program. Also, all looping was done with the REPS instruction, which saves an instruction over using DJNZ and takes no time for the loop's return-to-top.
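For readers tracing through the listing, the schedule-extension and round formulas in the PASM comments are the standard SHA-256 definitions. A plain-Python rendering of the same math - a reference sketch for cross-checking, not the ROM code, reusing the same constants as the tables above - can be verified against a library implementation:

```python
import hashlib

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

# Same constant tables as the PASM listing (fractional parts of roots of primes).
H0 = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
K = [0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5, 0x3956C25B, 0x59F111F1, 0x923F82A4, 0xAB1C5ED5,
     0xD807AA98, 0x12835B01, 0x243185BE, 0x550C7DC3, 0x72BE5D74, 0x80DEB1FE, 0x9BDC06A7, 0xC19BF174,
     0xE49B69C1, 0xEFBE4786, 0x0FC19DC6, 0x240CA1CC, 0x2DE92C6F, 0x4A7484AA, 0x5CB0A9DC, 0x76F988DA,
     0x983E5152, 0xA831C66D, 0xB00327C8, 0xBF597FC7, 0xC6E00BF3, 0xD5A79147, 0x06CA6351, 0x14292967,
     0x27B70A85, 0x2E1B2138, 0x4D2C6DFC, 0x53380D13, 0x650A7354, 0x766A0ABB, 0x81C2C92E, 0x92722C85,
     0xA2BFE8A1, 0xA81A664B, 0xC24B8B70, 0xC76C51A3, 0xD192E819, 0xD6990624, 0xF40E3585, 0x106AA070,
     0x19A4C116, 0x1E376C08, 0x2748774C, 0x34B0BCB5, 0x391C0CB3, 0x4ED8AA4A, 0x5B9CCA4F, 0x682E6FF3,
     0x748F82EE, 0x78A5636F, 0x84C87814, 0x8CC70208, 0x90BEFFFA, 0xA4506CEB, 0xBEF9A3F7, 0xC67178F2]

def sha256(msg: bytes) -> bytes:
    # Pad: append 0x80, zeros, then the 64-bit big-endian bit length.
    ml = len(msg) * 8
    msg = msg + b'\x80' + b'\x00' * ((55 - len(msg)) % 64) + ml.to_bytes(8, 'big')
    h = H0[:]
    for blk in range(0, len(msg), 64):
        w = [int.from_bytes(msg[blk + 4*i:blk + 4*i + 4], 'big') for i in range(16)]
        for i in range(16, 64):  # extend w[0..15] into w[16..63], as in the PASM
            s0 = rotr(w[i-15], 7) ^ rotr(w[i-15], 18) ^ (w[i-15] >> 3)
            s1 = rotr(w[i-2], 17) ^ rotr(w[i-2], 19) ^ (w[i-2] >> 10)
            w.append((w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFF)
        a, b, c, d, e, f, g, hh = h
        for i in range(64):      # 64 iterations on the working variables
            s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
            ch = (e & f) ^ (~e & g)
            t1 = (hh + s1 + ch + K[i] + w[i]) & 0xFFFFFFFF
            s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t2 = (s0 + maj) & 0xFFFFFFFF
            hh, g, f, e, d, c, b, a = \
                g, f, e, (d + t1) & 0xFFFFFFFF, c, b, a, (t1 + t2) & 0xFFFFFFFF
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, [a, b, c, d, e, f, g, hh])]
    return b''.join(x.to_bytes(4, 'big') for x in h)

assert sha256(b'abc') == hashlib.sha256(b'abc').digest()
```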
I've also thought about the best way to add external storage to a Prop (I or II), and one approach I'd like to try is to dedicate a cog to transferring "objects", data, PASM routines, etc. to and from EEPROM, SRAM, SD cards, etc. A request gets passed to that cog by putting the needed info into hub RAM someplace and then using a lock as a semaphore to indicate the request. When the transfer is complete, the lock is cleared to let the asking routine "know" that the request has been completed.
I've always had a thought on how to do it if I needed to, but that need has never actually come up. Back in the day, reading and writing to RAM on a sub-MHz 8-bit proc couldn't have been that much faster than SPI at Prop speeds!
I can see using a few cogs on a plc to help with the timers and other functions. Don't want the ladder to be split among cogs though. While the plc runs fast enough to make it appear to be all happening in parallel I very frequently depended on the fact that execution was serial from "top to bottom" of the ladder.
Using the PLC model, sometimes you would be forced to split across COGS, as you may run out of room in just one.
Telling users they can only have ONE COG is too restrictive, ( they have paid for 8!) and yet they will have little idea where a COG boundary is, so the tools need to manage that.
There, you would just need some language tag, to indicate where the tools could or could not split.
If you used the PLC edge commands to 'pace' anything that had to be sequential, then that would (should?) split safely ?
Do you have some explicit examples where sequential was important ?
Chip, the SHA-256 code looks really nice! But now you're taunting us with yet another new instruction that's not in the latest published P2 spec. In addition to the new reps and setinds instructions, there's also a movf. Please, post an updated P2 spec. Pretty please. Please, please, please.
Granted, it's been a long time since I've worked with PLCs, but most of that time I spent with Omron, and I would write "subroutines" that were called by having up to a large amount of logic run or not run by way of a set/reset "relay". Oftentimes, after the "subroutine", the last rung would reset the "calling relay" so that the next scan would pass it up. It was very important that the logic would only "execute" the exact number of times required, based on top-to-bottom scans. If the logic truly ran in parallel, a lot of my PLC programs wouldn't have worked properly!
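That reset-the-calling-relay idiom depends entirely on the serial scan. A toy Python model of a scan evaluated strictly top to bottom (names are illustrative, not real PLC syntax) shows the "subroutine" firing exactly once, no matter how many scans follow:

```python
# Toy ladder scan: rungs are evaluated strictly top to bottom.
def scan(state):
    # Rung 1: while call_relay is set, "execute the subroutine".
    if state["call_relay"]:
        state["count"] += 1
    # Last rung: the subroutine resets its own calling relay,
    # so the next scan passes it up.
    if state["call_relay"]:
        state["call_relay"] = False
    return state

s = {"call_relay": True, "count": 0}
for _ in range(3):
    scan(s)
assert s["count"] == 1  # ran exactly once, thanks to serial scan order
```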
Dave, I would post one, but I don't have time to make one right now. You can probably infer the REPS/SETINDx instructions, but here's SETF/MOVF:
SETF D/# - set up field mover
%w_xxdd_yyss
w: 0=byte, 1=word
xx: destination field control, 00=static, 01=rotate left by 8/16 bits, 10=increment, 11=decrement
dd: initial destination field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1
yy: source field control, 0x=static, 10=increment, 11=decrement
ss: initial source field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1
MOVF D,S - moves a byte/word from S into a byte/word in D, leaving other bits unchanged (except in the case of xx=01, in which bits rotate by 8 or 16)
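Based on that description, here is a toy software model of the byte-mode field mover - my reading of the encoding, with word mode and the xx=01 rotate variant left out - exercised with the %0_1000_1111 endian-swap configuration used at the top of the SHA-256 listing:

```python
# Toy model of SETF/MOVF in byte mode (an interpretation of the spec above,
# not Parallax's implementation; rotate mode and word mode are not modeled).
class FieldMover:
    def __init__(self, config):          # config = %w_xxdd_yyss
        self.dctl = (config >> 6) & 3    # xx: 00=static, 10=increment, 11=decrement
        self.dfld = (config >> 4) & 3    # dd: initial destination byte field
        self.sctl = (config >> 2) & 3    # yy: 0x=static, 10=increment, 11=decrement
        self.sfld = config & 3           # ss: initial source byte field

    @staticmethod
    def _step(fld, ctl):
        if ctl == 0b10:
            return (fld + 1) & 3
        if ctl == 0b11:
            return (fld - 1) & 3
        return fld

    def movf(self, d, s):
        byte = (s >> (8 * self.sfld)) & 0xFF                 # pick source byte field
        shift = 8 * self.dfld
        d = (d & ~(0xFF << shift) & 0xFFFFFFFF) | (byte << shift)  # insert, others unchanged
        self.dfld = self._step(self.dfld, self.dctl)
        self.sfld = self._step(self.sfld, self.sctl)
        return d

# The endian-swap setup from the SHA-256 code: dest byte0 incrementing,
# source byte3 decrementing -> four MOVFs reverse the byte order.
fm = FieldMover(0b0_1000_1111)
d = 0
for _ in range(4):
    d = fm.movf(d, 0x11223344)
assert d == 0x44332211
```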
After authenticating the signed user boot loader (from serial or SPI flash) using HMAC/SHA-256, the loader gets handed the keys and executes. The plan has been that AES-128 would then be employed by the user boot loader to decipher the off-chip main program as it loads, using the keys. What if instead of using AES-128, we use the already-in-ROM SHA-256 program as a one time pad generator (XOR against incoming data to decipher), where we initialize it with the keys, then keep rehashing the hash to generate the pad, and maybe also employ some form of block chaining?
I think this would be sound, but does anyone else have an opinion?
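For discussion's sake, a sketch of that rehash-the-hash keystream in Python, using hashlib as a stand-in for the ROM routine - a real design would also want a nonce or block chaining folded in, and this is emphatically not vetted crypto:

```python
import hashlib

def keystream(key: bytes, nblocks: int):
    """Pad generator per the proposal above: hash the keys, then keep
    rehashing the hash to get successive 32-byte pad blocks (sketch only)."""
    state = hashlib.sha256(key).digest()
    for _ in range(nblocks):
        yield state
        state = hashlib.sha256(state).digest()  # rehash the hash

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # XOR the data against the pad; the same call deciphers what it enciphered.
    out = bytearray()
    for i, block in enumerate(keystream(key, (len(data) + 31) // 32)):
        chunk = data[i * 32:(i + 1) * 32]
        out += bytes(a ^ b for a, b in zip(chunk, block))
    return bytes(out)

key = b'secret-loader-key'            # hypothetical stand-in for the fused keys
msg = b'off-chip main program image...'
ct = xor_cipher(key, msg)
assert ct != msg and xor_cipher(key, ct) == msg
```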
Since we are not going to be able to boot from SD, what will be the cheapest SPI flash usable (generic part number, lowest size)?
I am starting to use 24LC64 SOT23-5 $0.33/100 on some prop1 projects that have microSD - it just contains a minimal SD FAT bootloader. I would have loved to ditch the flash on P2.
Wow, how much of that 320GB does it use ?! - and does it out perform the Prop when running ?
I appreciate all the info you posted to this thread. Looking forward to "playing" with Prop2.
How current is spinsim in implementing the Propeller 2 instruction set? Does it simulate the SDRAM interface? It would be great to be able to start Propeller 2 development before the actual silicon is available and your simulator seems like our best shot at that!
Thanks,
David
Ladder Logic does this to a limited extent. But it's an exactly interpreted language, so it doesn't allow for any sort of reorganising. On the plus side, you do end up with a rather rigorous program.
I've occasionally mulled over how well it'd meld with multicore CPUs. One factor is there just isn't any need for multi-threading the upper logic in typical factory use. Power consumption is low already and the execution speed plenty good enough even on ten year old processors. The logic just isn't that complex and a lot of the tight loop control is running from coprocessors in the remote I/O or expansion rack.
Now for PLC wars ... Omron all the way!
As the spec gets updated I plan on updating spinsim to match it. I could easily add support for the SDRAM interface if there were more detail in the spec on how it works. However, even just the basic I/O port operation is still not fully defined in the spec, so I currently only support Port A in the same way that P1 accesses it.
- with a quote that rang a familiar "beyond von neuman" bell: "Computer scientist Reiner Hartenstein describes reconfigurable computing in terms of an anti machine that, according to him, represents a fundamental paradigm shift away from the more conventional von Neumann machine."
Which then pointed me to - http://deimos.eos.uoguelph.ca/sareibi/TEACHING_dr/ENG6530_RCS_html_dr/outline_W2010/docs/LECTURE_dr/TOPIC13_dr/PAPERS_dr/ADecadeofRCS.pdf
Haven't read it myself.
I'm sorry that I don't have any documentation for much of the new stuff in the Prop II yet. I'm racing to get the ROM code done now, which is probably already working, but I need to clean it up before testing it. It looks like we will have a separately-executable SHA-256 program, which I will post a little later here, and hopefully a similar AES-128 program in ROM, for general use outside the ROM and user boot loaders.
Watch that Steve Teig video posted by Kye. I really think that what he's talking about is the future. It's hard for us to discuss it, though, because we are so boxed in by our current thinking that we are very likely to build incomplete mental constructs out of stuff we already know, in response to what he's saying, and miss the point. The idea of the logic carrying the data is the reciprocal of what we've got today, and totally takes a hammer to the low-level bottlenecks we suffer with.
Maybe I'm just easily influenced. I wasn't swayed too much by all his Newtonian and Einsteinian talk, though it was all interesting observation, and maybe put forth to tickle investors' minds. And his citing of cache maintenance as the downfall of multiprocessing doesn't ring true to me, because we know multiple processors are best used to perform unique tasks, not to gang up on one - maybe it's the common linear thinking that dictates to many people that multiple processors MUST work in alliance on a single goal. Anyway, we know that's not so.
What he brings up, though, is this notion of spreading the memory through the logic, and in his implementation, he switches the configuration 8 or 16 times as the data floats upward in time. I am so bound by linear thinking that I can't grasp yet what all could be done, or how you might express it, because it is a total paradigm shift. I'm going to be thinking about it, though.
One way to better utilize multiprocessing is to take more care in designing your data layout and makeup. If you factor in multiprocessing from the start, you tend to avoid doing things with the data that break things.
I saw MarkT's SHA-256 on Obex where he initialized the hash at the start and after each hash read. I just changed this implementation to work like that and it saved a few longs, as well as some code on the calling side.
It turns out that there are no flag writes or flag tests in the whole program. Also, all looping was done with the REPS instruction, which saves an instruction over using DJNZ and takes no time for the loop's return-to-top.
I've also thought about the best way to add external storage to a prop (I or II), and one approach I'd like to try is to dedicate a cog to transferring "objects", data, pasm routines, etc. to and from eeprom, sram, sdcards, etc. A request gets passed to that cog by putting the needed info into hub ram someplace and then using a lock as a semaphore to indicate the request. When the transfer is complete, the lock is cleared to let the asking routine "know" that the request has been completed.
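That handshake could be sketched like this - a single-threaded Python simulation only, where "hub" stands in for shared hub RAM, "lock" for a Propeller hardware lock, and all field names are made up for illustration:

```python
# Sketch of the hub-RAM mailbox + lock handshake described above.
# The caller fills in a request block and sets the lock; the dedicated
# storage cog services the request and clears the lock when done.

hub = {"cmd": None, "addr": 0, "buf": None}   # mailbox in "hub RAM"
lock = {"busy": False}                         # stand-in for a hardware lock

def request_transfer(cmd, addr, buf):
    """Caller side: fill in the request, then set the lock/semaphore."""
    hub.update(cmd=cmd, addr=addr, buf=buf)
    lock["busy"] = True

def storage_cog_poll(storage):
    """Storage-cog side: when the lock is set, service one request,
    then clear the lock so the asking routine knows it completed."""
    if not lock["busy"]:
        return
    if hub["cmd"] == "read":
        hub["buf"][:] = storage[hub["addr"]:hub["addr"] + len(hub["buf"])]
    elif hub["cmd"] == "write":
        storage[hub["addr"]:hub["addr"] + len(hub["buf"])] = hub["buf"]
    lock["busy"] = False
```

On real hardware the storage cog would loop on the lock and drive the actual SPI/SD/eeprom transfer, but the request/complete protocol is the same shape.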
Always had a thought on how to do it if I needed to, but that need has never actually come up. Back in the day, reading and writing to RAM on a sub-MHz 8-bit proc couldn't have been that much faster than SPI at prop speeds!
Using the PLC model, sometimes you would be forced to split across COGS, as you may run out of room in just one.
Telling users they can only have ONE COG is too restrictive (they have paid for 8!), and yet they will have little idea where a COG boundary is, so the tools need to manage that.
There, you would just need some language tag to indicate where the tools could or could not split.
If you used the PLC edge commands to 'pace' anything that had to be sequential, then that would (should?) split safely?
Do you have some explicit examples where sequential was important?
Granted, it's been a long time since I've worked with PLCs, but I spent most of that time with Omron, and I would write "subroutines" that were called by having a large block of logic run or not run by way of a set/reset "relay". Often, after the "subroutine", the last rung would reset the "calling relay" so that the next scan would pass it up. It was very important that the logic would only "execute" the exact number of times required, based on top-to-bottom scans. If the logic truly ran in parallel, a lot of my PLC programs wouldn't have worked properly!
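The pattern above could be sketched like this - a toy Python model of one PLC scan, not real ladder logic, with made-up relay names:

```python
# Toy model of the ladder-logic pattern described above: a block of
# rungs acts as a "subroutine", gated by a set/reset "calling relay".
# The last rung of the block resets the relay, so the next top-to-bottom
# scan skips the block - it executes exactly once per trigger.

relays = {"call_sub": False, "count": 0}

def scan():
    """One top-to-bottom scan of the ladder."""
    if relays["call_sub"]:          # gating contact: run the block only if set
        relays["count"] += 1        # ... the subroutine's rungs ...
        relays["call_sub"] = False  # last rung: reset the calling relay
```

If the rungs truly ran in parallel instead of in scan order, the reset on the last rung couldn't guarantee a single execution, which is the point being made.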
Dave, I would post one, but I don't have time to make one right now. You can probably infer the REPS/SETINDx instructions, but here's SETF/MOVF:
SETF D/# - set up field mover
%w_xxdd_yyss
w: 0=byte, 1=word
xx: destination field control, 00=static, 01=rotate left by 8/16 bits, 10=increment, 11=decrement
dd: initial destination field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1
yy: source field control, 0x=static, 10=increment, 11=decrement
ss: initial source field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1
MOVF D,S - moves a byte/word from S into a byte/word in D, leaving other bits unchanged (except in the case of xx=01, in which bits rotate by 8 or 16)
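A rough Python model of the byte case may make the field mover clearer. This is only an interpretation of the spec above (the function names and the simplified auto-increment are mine, not Parallax's):

```python
# Rough model of MOVF's byte mode: copy the selected byte of S into the
# selected byte of D, leaving D's other bits unchanged. SETF's
# source-increment mode is modeled by stepping the field index per move.

def movf_byte(d, s, dfield, sfield):
    """Return D with byte 'dfield' replaced by byte 'sfield' of S."""
    src = (s >> (8 * sfield)) & 0xFF
    mask = 0xFF << (8 * dfield)
    return (d & ~mask & 0xFFFFFFFF) | (src << (8 * dfield))

def unpack_bytes(s):
    """SETF with yy=increment: stream out all four bytes of S in order."""
    return [movf_byte(0, s, 0, i) for i in range(4)]
```

The word mode and the xx=01 rotate case would follow the same shape with 16-bit fields and a rotate instead of a masked insert.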
After authenticating the signed user boot loader (from serial or SPI flash) using HMAC/SHA-256, the loader gets handed the keys and executes. The plan has been that AES-128 would then be employed by the user boot loader to decipher the off-chip main program as it loads, using the keys. What if, instead of using AES-128, we used the already-in-ROM SHA-256 program as a one-time-pad generator (XOR against incoming data to decipher), where we initialize it with the keys, then keep rehashing the hash to generate the pad, and maybe also employ some form of block chaining?
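The proposed pad generator could be sketched like this in Python. This is only the concept from the post (it resembles a hash-based stream cipher), not a vetted cipher design, and the function names are made up:

```python
# Sketch of the SHA-256-as-keystream idea: seed the hash with the key,
# then keep rehashing the previous hash to extend the pad, XORing the
# pad against the incoming data to decipher it.

import hashlib

def sha256_pad_stream(key, nbytes):
    """Generate nbytes of keystream by repeatedly rehashing the hash."""
    out = b""
    block = hashlib.sha256(key).digest()
    while len(out) < nbytes:
        out += block
        block = hashlib.sha256(block).digest()   # rehash the previous hash
    return out[:nbytes]

def xor_decipher(key, data):
    """Decipher (or encipher - XOR is symmetric) data against the pad."""
    pad = sha256_pad_stream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, pad))
```

Mixing the ciphertext back into the hash state (the block-chaining idea mentioned) would stop the keystream from being reusable across loads with the same keys.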
I think this would be sound, but does anyone else have an opinion?
Thanks.
I think that, to satisfy all people, you need to support both possibilities.
The MOVF instruction is very nice, thanks.
Since we are not going to be able to boot from SD, what will be the cheapest usable SPI flash (generic part number, lowest size)?
I am starting to use a 24LC64 (SOT23-5, $0.33/100) on some prop1 projects that have microSD - it just contains a minimal SD FAT bootloader. I would have loved to ditch the flash on P2.