High Performance Needed
WaffenGeist
Posts: 5
I need to process 10 to 20 million bytes per second. Needless to say, the Propeller is not powerful enough for this.
I have a true random number generator that outputs bytes, and I need to count them rapidly to see which byte values occur most often out of 1 trillion bytes.
If you needed to count 20 million bytes per second all the way up to 1 trillion bytes, what hardware would you use ?
Thanks
Comments
In a cryptography project a few years ago I plopped several propellers on a perfboard, and with surprisingly little hassle and time broke up a very big problem very nicely.
Anyway, you'll probably have to provide more info for anyone to make specific recommendations.
If the random numbers are a byte, that's 0-255 (0-FF hex), so I'd set up a data table in RAM with 256 slots. You'll need a pointer to keep track of your current read position in the buffer.
MYTABLE[read value] += 1
increment buffer pointer one byte
get next number
When you are all done, you'll have a data table with 256 numbers in it. Each position in the table corresponds to a byte value and its number of occurrences. That would be a thousand times faster than IF - THEN statements.
disclaimer: it's been a couple months since I've touched my prop unfortunately, so the "code" above is not code. You'll have to work that out. It'll be slightly more complicated than that since you'll have to deal with overrunning and overwriting your data table, but you get the idea.
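As a minimal sketch of the table approach in Python (MYTABLE and the buffer handling are illustrative; real Prop code would differ, and the buffer-overrun handling mentioned above is omitted):

```python
# Minimal model of the 256-slot histogram table described above.
# The name MYTABLE is illustrative; iterating a stream stands in
# for the buffer pointer and "get next number" steps.
def histogram(byte_stream):
    table = [0] * 256          # one slot per possible byte value
    for b in byte_stream:
        table[b] += 1          # MYTABLE[read value] += 1
    return table

counts = histogram(bytes([3, 3, 255, 0, 3]))
# counts[3] == 3, counts[255] == 1, counts[0] == 1
```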
The snapshot could also be triggered based on elapsed time since there's a 32-bit system-wide clock counter with a 12.5ns resolution. If it's important to never drop any samples, two cogs could alternate the counting of samples with the copying of the counts to the hub memory buffer and the counts could be combined there.
How about wait for Prop II ? - or is the assignment due in, before then ?
First you need to nail down some corners:
10^12/2^32 = 232.830
10^12/(20×10^6) = 50,000s 10^12/(20×10^6)/60/60 = 13.9 hours
The first number shows 32-bit counters could be marginal on big skews, and 10^12 needs careful control.
You could consider 40 bits, or more likely 48 bits?
The code is then a repeat of
INC(CounterArray[ByteValue])
and
CounterArray is 256 x 48 bit Memory.
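A quick arithmetic check of those counter-width margins (Python, using the figures above):

```python
# Margins for 32-bit vs 48-bit counters over 10^12 samples in 256 bins.
total = 10**12
n_bins = 256
avg_per_bin = total // n_bins       # expected count per bin, uniform source
print(avg_per_bin)                  # 3,906,250,000 -- about 91% of 2^32
print(total / 2**32)                # ~232.8 overflows if one value dominated
print(2**48 / total)                # ~281x headroom with 48-bit counters
```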
A PC can just stream 20Mbytes/s over high speed USB, so a FTDI High speed + PC is one pathway.
Or, you could use a small FPGA/CPLD - they can easily increment memory at 20 MHz, and you need just 1536 bytes of memory.
Streaming rates of 50MHz-100MHz+ should be possible here.
["Needless to say that the Propeller is not powerful enough to perform this."]
Well, certainly not easily, at a student level.... but...
At 20 Mbytes/sec, data is arriving at exactly a Prop 1's nominal opcode rate per cog (20 MIPS at 80 MHz).
- but let's imagine we can phase-lock 4 identically coded cogs, each sampling every 200ns, but 50ns apart.
Here is a rough once-over of how this might work :
You now have spread your total over 4 arrays of numbers, adding at 1/4 the rate, and have 4 opcodes to play with...
Candidates would be
MOVD D,S Insert S[8..0] into D[17..9]
This allows the Port pins, to self-modify a Destination byte ready for a fixed S=1 add
ADD D,S Add S into D (encoding: 100000 001i 1111 ddddddddd sssssssss)
and a small fix is needed to avoid the very-ends of COG RAM address range
and finally
DJNZ D,S Dec D, jump if not zero to S (no jump = 8 clocks, jump = 4 clocks)
On average, each 32-bit counter bucket will be 22.7% full at 10^12, but to reach 10^12/4 samples in each cog, we need 58.207 × 2^32 loop iterations
- so that's another problem....
Choices are to unroll to 59 or 60, or work on less than 10^12, and collect later.
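Checking the unroll arithmetic (Python, assuming 4 cogs share the 10^12 samples and a 32-bit DJNZ loop counter):

```python
# Each cog counts 10^12 / 4 samples, but a 32-bit DJNZ counter only
# covers 2**32 iterations, so the loop body must be unrolled.
per_cog = 10**12 // 4
print(per_cog / 2**32)             # ~58.2 -> unroll to 59 or 60 copies
unroll = 60
iterations = per_cog // unroll     # now fits in a 32-bit counter
print(iterations < 2**32)          # True
```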
MOVD AddressOfADDOpCode,ByteFromPort
ADD AddressOfADDOpCode,OffsetAroundCode
ADD Dummy_Address,1 ; INC the actual array, note Dummy_Address has been changed each time.
NOP / DJNZ ; unroll as needed
We have 512 x 32-bit registers, but the top 16 are special, and code starts at the bottom, so 512 - 16 - 60*4 = 256 (!!)
- but that does not allow cog-sync ; unroll of 59, allows 4 opcodes to sync - Possible ?
Of course, this is just a quick scribble, but it looks like 4 cogs could _just_ cope, and manage 10^12/4 loop as well.
I make each cell's ideal value 976,562,500, over the (almost) 10^12 samples.
A Prop II would gain in MHz and also the new Loop opcode saves 25%
Notice this counting code will return a perfect random scorecard for a binary counter - so perhaps a distance check would catch that - but that has a time penalty.
Oops, I'm indeed off by an order of magnitude. Sorry.
With a truly random distribution, a trillion bytes spread over 8 cogs would be 125 billion per cog, and distributed over 256 longs that would be a count of 488,281,250 per long. Even if the distribution is not perfectly random, each long can count up to 4,294,967,295, which is nearly nine times what the average count should be for a perfectly random distribution.
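Checking those figures (Python):

```python
# Verify the 8-cog headroom figures quoted above.
total = 10**12
per_cog = total // 8               # 125,000,000,000 samples per cog
per_long = per_cog // 256          # 488,281,250 expected per bin
print(per_cog, per_long)
print((2**32 - 1) / per_long)      # ~8.8x headroom over the ideal average
```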
You have 8 cogs
1. The data is eight bits parallel.
2. The data can be synchronized to a 10 MHz clock provided by the Propeller.
The trick here is to input the data on P16..9, which is the lower eight bits of a PASM instruction's 9-bit destination field. Then set up dira and outa so that an instruction executed from ina adds one to the bin selected by the 8-bit input and sets the carry flag on overflow. The next register after ina is inb, a normal register on the P8X32A, and it will contain a jump-on-no-carry instruction back to ina. If a bin overflows, the jump will not be taken, and the instruction in outa will be executed. But that just adds an extra one to bin zero, which can be subtracted from the results. Then the instruction in outb gets executed: a jump to the report section of the code, which writes all of the bin contents to the hub.
Here's the skeletal code:
The clock signal comes out on P18, which is one of the condition code bits of the add instruction. The condition for this instruction is chosen such that the state of this bit does not matter. phsa must be initialized so that the data is stable when the add instruction in ina is fetched.
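The control flow described above can be modeled behaviorally in Python (a sketch of the described scheme only, not the actual PASM; names and the tiny limit in the example are illustrative):

```python
# Behavioral model of the count-until-overflow scheme: count bytes into
# 256 bins until any bin hits the 32-bit limit, then report. Bin zero
# gets one spurious extra increment on exit (the outa instruction),
# which is subtracted from the results afterwards.
def count_until_overflow(byte_stream, limit=2**32):
    bins = [0] * 256
    overflowed = False
    for b in byte_stream:
        bins[b] += 1
        if bins[b] >= limit:       # carry set: the jump in inb not taken
            bins[0] += 1           # spurious add from executing outa
            overflowed = True
            break                  # outb: jump to the report section
    if overflowed:
        bins[0] -= 1               # the extra one is subtracted
    return bins

bins = count_until_overflow([7, 7, 7, 1], limit=3)
# bins[7] == 3; the loop exits before counting the trailing 1
```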
Anyway, I haven't tried this, but I believe the principle is sound. It's not. See post #19.
-Phil
Nifty. ( I guess this Prop is doing nothing else anyway )
I think that means it loops until any one counter saturates (overflows), and then all the others will be fractions of that?
Not quite the original spec, but a quite nice way to normalise results.
How does that help in this situation?
You can spread the load, by interleaving samples over multiple COGS ?
It means more housekeeping, and you need to sum the totals from each cog before you are done, but it does buy
some cycles.
Why do they 'need to be processed by a single cog'?
All the OP wants to do, is sum the bytes as a histogram, for the nominal 1 trillion bytes.
So they do not care if that is 4 × 5 Mbyte/s streams summed, or one 20 Mbyte/s stream.
-Phil
I dunno, though. Rather than engaging in the discussion, the OP seems to have left the building. This may not have been worth the effort I put into it.
-Phil
Wow! A slick trick like that is never wasted. It's now in the forum archive and I, for one, have also squirreled it away for my own delectation and benefit.
Yup. A multiplicative congruential generator in an LPC1768 might do it. A hardware LFSR would surely do it. ...Just the first two thoughts that come to mind.
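For illustration, a software model of a 32-bit Galois LFSR (a hardware version shifts once per clock; the taps 32, 22, 2, 1 are one commonly cited maximal-length polynomial, and the seed and byte count here are arbitrary):

```python
# Software model of a 32-bit Galois LFSR with taps 32,22,2,1
# (polynomial x^32 + x^22 + x^2 + x + 1, Galois mask 0x80200003).
# Eight right-shifts per output byte; hardware would do this in logic.
def lfsr_bytes(seed=0xACE1, count=4):
    state = seed & 0xFFFFFFFF
    out = []
    for _ in range(count):
        for _ in range(8):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= 0x80200003
        out.append(state & 0xFF)
    return out

print(lfsr_bytes())   # deterministic, so NOT truly random -- just fast
```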
Maybe - notice that a simple 20 MHz counter will pass this 'Random Test', at least for a novice
It may be a Homework type question, but these are actually very good for defining just what is possible and where the limits are.
So the effort is certainly worthwhile, as it is now 'on record'.
Like my #6 this refined version also uses 4 cogs, but unlike #6 you do burst-sync.
If I read it right, this code also does not count a specific total number of bytes, but exits on the first overflow [full histogram] (which could get interesting with 4 cogs making different exit decisions - thus breaking the 'same counts' rule ? )
The outline in #6 relied on phased-starts of 4 COGS, which I presume is easy to do ?
If we skip the offset add, by the cleaner JMP replace you do, that is 3 opcodes including a total-sample counter.
So that could fit in 3 COGS and meet the data rate budget ?
Which raises a new question, of could this spread over 6 cogs, and sync to a half opcode, for a 40MHz effective rate ?
That would need 2 CLK granularity on first opcode launch in each COG, which I think is legal ?
The ideal would be to have ONE copy that is launched 4 times with a param tweak to set the go-time.
What 'real' random numbers are you referring to? The ones mentioned by the OP? Since he never clarified a thing, how do you know they were 'real'? And if they were truly 'real' why would he be interested in a histogram?
Lawson
BTW,
will repeat the long for you.
-Phil