High Performance Needed
I need to process 10 to 20 million bytes per second. Needless to say that the Propeller is not powerful enough to perform this.
I have a true random numbers generator that outputs bytes and I need to rapidly count them to see which bytes occur more often out of 1 trillion bytes.
If you needed to count 20 million bytes per second all the way up to 1 trillion bytes, what hardware would you use ?
Thanks
I have a true random numbers generator that outputs bytes and I need to rapidly count them to see which bytes occur more often out of 1 trillion bytes.
If you needed to count 20 million bytes per second all the way up to 1 trillion bytes, what hardware would you use ?
Thanks

Comments
In a cryptography project a few years ago I plopped several propellers on a perfboard, and with surprisingly little hassle and time broke up a very big problem very nicely.
Anyway, you'll probably have to provide more info for anyone to make specific recommendations.
If the random numbers are a byte, that's 0-255 or 0-FF so I'd setup a data table in RAM with 256 slots. You'll need a pointer to keep track of your current read in the buffer.
MYTABLE[read value]=+1
increment buffer pointer one byte
get next number
When you are all done, you'll have a data table with 256 numbers in it. Each position in the table corresponds to the number and it's number of occurrence. That would be a thousand times faster than IF - THEN statements.
disclaimer: it's been a couple months since I've touched my prop unfortunately, so the "code" above is not code. You'll have to work that out. It'll be slightly more complicated than that since you'll have to deal with overrunning and overwriting your data table, but you get the idea.
The snapshot could also be triggered based on elapsed time since there's a 32-bit system-wide clock counter with a 12.5ns resolution. If it's important to never drop any samples, two cogs could alternate the counting of samples with the copying of the counts to the hub memory buffer and the counts could be combined there.
How about wait for Prop II ?
First you need to nail down some corners,
10^12/2^32 = 232.830
10^12/20^6 = 15625s 10^12/20^6/60/60 = 4.340 hours
The first number shows 32 bit counters could be marginal, on big skews, and 10^12 needs careful control.
You could consider 40 bits, and more likely 48 bits ?
The code is then a repeat of
INC(CounterArray[ByteValue])
and
CounterArray is 256 x 48 bit Memory.
A PC can just stream 20Mbytes/s over high speed USB, so a FTDI High speed + PC is one pathway.
Or, you could use a small FPGA/CPLD - they can easily increment memory at 20 MHz, and you need just 1536 bytes of memory.
Streaming rates of 50MHz-100MHz+ should be possible here.
["Needless to say that the Propeller is not powerful enough to perform this."]
Well, certainly not easily, at a student level.... but...
At 20 MBytes/sec, data is arriving at exactly a Prop 1's nominal opcode rate per cog.
- but lets imagine we can phase-lock 4 identical coded cogs, each sampling at 200ns, but 50ns apart.
Here is a rough once-over of how this might work :
You now have spread your total over 4 arrays of numbers, adding at 1/4 the rate, and have 4 opcodes to play with...
Candidates would be
MOVD D,S Insert S[8..0] into D[17..9]
This allows the Port pins, to self-modify a Destination byte ready for a fixed S=1 add
100000 001i 1111 ddddddddd sssssssss ADD D,S Add S into D
and a small fix is needed to avoid the very-ends of COG RAM address range
and finally
DJNZ D,S Dec D, jump if not zero to S (no jump = 8 clocks, jump = 4 clocks)
On average, each 32 bit counter bucket will be 22.7% full at 10^12, but to reach 10^12/2 in each cog, we need 58.207 * 2*32
- so that's another problem....
Choices are to unroll to 59 or 60, or work on less than 10^12, and collect later.
MOVD AddressOfADDOpCode,ByteFromPort
ADD AddressOfADDOpCode,OffsetAroundCode
ADD Dummy_Address,1 ; INC the actual array, note Dummy_Address has been changed each time.
NOP / DJNZ ; unroll as needed
We have 512 x 32 bit registers, but the top 16 are special, and code starts at bottom, so 512-16-60*4 = 256 (!!)
- but that does not allow cog-sync ; unroll of 59, allows 4 opcodes to sync - Possible ?
Of course, this is just a quick scribble, but it looks like 4 cogs could _just_ cope, and manage 10^12/4 loop as well.
I make each cell with an ideal value of 976562500, over the (almost) 10^12 samples.
A Prop II would gain in MHz and also the new Loop opcode saves 25%
Notice this counting code, will return a prefect random score card, for a binary counter - so perhaps a distance check would catch that - but that has a time-penalty.
Oops, I'm indeed off by an order of magnitude. Sorry.
With truly random distribution a trillion bytes distributed over 8 cogs would be 125 billion per cog, and distributed over 256 longs that would be a count of 488,281,250 per long. Even if the distribution is not perfectly random each long can count up to 4,294,967,295 which is 8 times what the average count should be for a perfectly random distribution.
You have 8 cogs
1. The data is eight bits parallel.
2. The data can be synchronized to a 10 MHz clock provided by the Propeller.
The trick here is to input the data on P16..9, which is the lower eight bits of a PASM instruction's 9-bit destination field. Then setup dira and outa, so that an instruction executed from ina adds one to the bin selected by the 8-bit input and sets the carry flag on overflow. The next register after ina is inb, a normal register on the P8X32A, and it will contain a jump-on-no-carry instruction back to ina. If a bin overflows, the jump will not be taken, and the instruction in outa will be executed. But that just adds an extra one to bin zero, which can be subtracted from the results. Then the instruction in outb gets executed: a jump to the report section of the code, which writes all of the bin contents to the hub.
Here's the skeletal code:
CON _clkmode = xtal1 + pll16x _xinfreq = 5_000_000 VAR long histo[257] PUB start cognew(@acquire, @histo) repeat until histo[256] 'Wait for "done" flag. 'Start serial I/O here, and output the results. DAT acquire jmp #setup long 0[255] setup mov 0,1 wz 'Clear first instruction and set Z flag. mov dira,dira0 mov inb,inb0 mov outa,outa0 mov outb,outb0 mov frqa,frqa0 mov phsa,phsa0 mov ctra,ctra0 jmp #ina report 'Write the accumulated data back to the hub, then set "done" flag in histo[256]. dira0 long %111111_1111_1111_100000000_111111111 outa0 if_z add 0-0,#1 wc 'Could be if_nc_or_z if CLK is high, but it doesn't matter, since Z is set anyway. 'Same as long %100000_0111_101*_0********_000000001 '  ├─DATA─┤ Bits 16..9 ' └CLK output Bit 18 inb0 if_nc jmp #ina outb0 jmp #report frqa0 long $2000_0000 phsa0 long 0 'Set to correct value so data is stable during ina instruction fetch. ctra0 long %00100 << 26 | 18 'Output a 10 MHz clock on P18. {{ When configured, this is what SFRs will look like, in order: ina if_z add 0-0,#1 wc inb if_nc jmp #ina outa if_z add 0,#1 wc 'Accumulates one extra count. outb jmp #report }}The clock signal comes out on P18, which is one of the condition code bits of the add instruction. The condition for this instruction is chosen such that the state of this bit does not matter. phsa must be initialized so that the data is stable when the add instruction in ina is fetched.
Anyway, I haven't tried this, but I believe the principle is sound. It's not. Se post #19.
-Phil
Nifty. ( I guess this Prop is doing nothing else anyway )
I think that means it loops until any one counter saturates (overflows), and then all others will be fractions from that ?
Not quite the original spec, but a quite nice way to normalise results.
How does that help in this situation?
You can spread the load, by interleaving samples over multiple COGS ?
It means more housekeeping, and you need to sum the totals from each cog before you are done, but it does buy
some cycles.
Why do they 'need to be be processed by a single cog' ?
All the OP wants to do, is sum the bytes as a histogram, for the nominal 1 trillion bytes.
So they do not care if that is 4 x 5 MByte streams summed, or one 20MB stream.
-Phil
{{ 20 MHz data analyzer. Parallel data is input on P7..P0, synchronized to 20 MHz clock on CLK_PIN. Four cogs take turns accumulating data in four 20 MHz bursts each, as determined by the quadrature clocks on I_PIN and Q_PIN:  CLK_PIN (20 MHz)  I_PIN (1.25 MHz)  Q_PIN (1.25 MHz) ┌───────┬───────┬───────┬───────┐ │ READ │ Accum │ Loop & wait...│ histo2 ├───────┼───────┼───────┼───────┤ │wait...│ READ │ Accum │ Loop &│ histo3 ├───────┼───────┼───────┼───────┤ │ Loop & wait...│ READ │ Accum │ histo1 ├───────┼───────┼───────┼───────┤ │ Accum │ Loop & wait...│ READ │ histo0 └───────┴───────┴───────┴───────┘ }} CON _clkmode = xtal1 + pll16x _xinfreq = 5_000_000 Q_PIN = 9 I_PIN = Q_PIN + 1 CLK_PIN = 11 VAR long histo0[257], histo1[257], histo2[257], histo3[257] PUB start dira[CLK_PIN]~~ 'Drive clock pins low to start. dira[I_PIN]~~ dira[Q_PIN]~~ cognew(@acquire, @histo0) cognew(@acquire, @histo1) cognew(@acquire, @histo2) cognew(@acquire, @histo3) cognew(@quadrature, 0) repeat until histo0[256] and histo1[256] and histo2[256] and histo3[256] 'Wait for all "done" flags. 'Code to output accumulated data. DAT ''-------[ 5 MHz Quadrature Output ]------------------------------------------- org 0 quadrature mov dira,:dira0 mov frqa,:frqx0 mov frqb,:frqx0 mov phsa,:phsa0 'ctra will lead ctrb by 90°. mov ctra,:ctra0 mov ctrb,:ctrb0 jmp $ 'Keep cog alive. :dira0 long 1 << I_PIN | 1 << Q_PIN :frqx0 long $0200_0000 '1.25 MHz. :phsa0 long $4000_0000 - $0800_0000 'Quadrature, less count in one instruction time. :ctra0 long %00100 << 26 | I_PIN 'ctra outputs on I_PIN. :ctrb0 long %00100 << 26 | Q_PIN 'ctrb outputs on Q_PIN. ''-------[ Data Acquisition and Accumulation ]--------------------------------- org 0 acquire jmp #:start long 255[0] 'Data bins, including address 0. :start mov 0,#0 'Clear bin 0 of jmp instruction. mov :quad,par 'Each successive hub array is offset by %100 in last three bits. shr :quad,#2 'Isolate them. and :quad,#3 wz 'Cog with histo0 drives the data clock. shl :quad,#Q_PIN 'Shift match mask into position. if_z mov dira,:dira0 'If histo0 set up to output data clock. if_z mov frqa,:frqa0 if_z mov phsa,:phsa0 waitpeq :mask,:mask 'Wait for quadrature cog to start. if_z mov ctra,:ctra0 'Start data clock, sync'd with quadrature. :loop waitpeq :quad,:mask 'Wait for our turn. movd :add0,ina 'Read data at 20 MHz burst. movd :add1,ina movd :add2,ina movd :add3,ina :add0 add 0-0,#1 wc 'Accumulate data that was read. :add1 if_nc add 0-0,#1 wc 'For shorter sampling epochs, something other than #1 can be used. :add2 if_nc add 0-0,#1 wc :add3 if_nc add 0-0,#1 wc if_nc jmp #:loop 'Loop back if no overflows. 'Code here to write counts back to hub and flag to histox[256]. :dira0 long 1 << CLK_PIN :frqa0 long $4000_0000 '20 MHz. :phsa0 long 0 'Adjust so that data is stable relative to waitpeq at :loop. :ctra0 long %00100 << 26 | CLK_PIN 'Data clock outputs on P11. :mask long 1 << I_PIN | 1 << Q_PIN :quad res 1I dunno, though. Rather than engaging in the discussion, the OP seems to have left the building. This may not have been worth the effort I put into it.
-Phil
Wow! A slick trick like that is never wasted. It's now in the forum archive and I, for one, have also squirreled it away for my own delectation and benefit.
Yup. A multiplicative congruential generator in an LPC1768 might do it. A hardware LFSR would surely do it. ...Just the first two thoughts that come to mind.
Maybe - notice that a simple 20Mhz counter will pass this 'Random Test', at least for a novice
It may be a Homework type question, but these are actually very good for defining just what is possible and where the limits are.
So the effort is certainly worthwhile, as it is now 'on record'.
Like my #6 this refined version also uses 4 cogs, but unlike #6 you do burst-sync.
If I read it right, this code also does not count a specific total number of bytes, but exits on the first overflow [full histogram] (which could get interesting with 4 cogs making different exit decisions - thus breaking the 'same counts' rule ? )
The outline in #6 relied on phased-starts of 4 COGS, which I presume is easy to do ?
If we skip the offset add, by the cleaner JMP replace you do, that is 3 opcodes including a total-sample counter.
So that could fit in 3 COGS and meet the data rate budget ?
Which raises a new question, of could this spread over 6 cogs, and sync to a half opcode, for a 40MHz effective rate ?
That would need 2 CLK granularity on first opcode launch in each COG, which I think is legal ?
The ideal would be to have ONE copy that is launched 4 times with a param tweak to set the go-time.
What 'real' random numbers are you referring to? The ones mentioned by the OP? Since he never clarified a thing, how do you know they were 'real'? And if they were truly 'real' why would he be interested in a histogram?
'assume a parallel data buss on pins P0-P7 'synchronised to a clock provided by the propeller org 0 start jmp #start2 hist long -1 long -1 'repeat the above line 255 times start2 mov start, hist 'reset first bin of histogram to -1 (should be at cog address 256 or higher) 'setup a counter for a 10MHz clock :loop movd :loop2, ina :loop2 djnz 0-0, #:loop 'one of the bins overflowed 'start the cleanup codeLawson
BTW,
will repeat the long for you.
-Phil