High Performance Needed
WaffenGeist
Posts: 5
I need to process 10 to 20 million bytes per second. Needless to say, the Propeller is not powerful enough for this.
I have a true random number generator that outputs bytes, and I need to count them rapidly to see which byte values occur most often out of 1 trillion bytes.
If you needed to count 20 million bytes per second all the way up to 1 trillion bytes, what hardware would you use ?
Thanks
Comments
In a cryptography project a few years ago I plopped several propellers on a perfboard, and with surprisingly little hassle and time broke up a very big problem very nicely.
Anyway, you'll probably have to provide more info for anyone to make specific recommendations.
If the random numbers are a byte, that's 0-255 (0-FF hex), so I'd set up a data table in RAM with 256 slots. You'll need a pointer to keep track of your current read position in the buffer.
MYTABLE[read value] += 1
increment buffer pointer one byte
get next number
When you are all done, you'll have a data table with 256 numbers in it. Each position in the table corresponds to a byte value and its number of occurrences. That would be a thousand times faster than IF - THEN statements.
disclaimer: it's been a couple months since I've touched my prop unfortunately, so the "code" above is not code. You'll have to work that out. It'll be slightly more complicated than that since you'll have to deal with overrunning and overwriting your data table, but you get the idea.
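As a minimal sketch of the table approach in Python (MYTABLE and the buffer handling are illustrative; real Prop code would differ, and the buffer-overrun handling mentioned above is omitted):

```python
# Minimal model of the 256-slot histogram table described above.
# The name MYTABLE is illustrative; iterating a stream stands in
# for the buffer pointer and "get next number" steps.
def histogram(byte_stream):
    table = [0] * 256          # one slot per possible byte value
    for b in byte_stream:
        table[b] += 1          # MYTABLE[read value] += 1
    return table

counts = histogram(bytes([3, 3, 255, 0, 3]))
# counts[3] == 3, counts[255] == 1, counts[0] == 1
```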
The snapshot could also be triggered based on elapsed time since there's a 32-bit system-wide clock counter with a 12.5ns resolution. If it's important to never drop any samples, two cogs could alternate the counting of samples with the copying of the counts to the hub memory buffer and the counts could be combined there.
How about wait for Prop II ? - or is the assignment due in, before then ?
First you need to nail down some corners:
10^12/2^32 = 232.830
10^12/(20×10^6) = 50,000s 10^12/(20×10^6)/60/60 = 13.9 hours
The first number shows 32-bit counters could be marginal on big skews, and 10^12 needs careful control.
You could consider 40 bits, or more likely 48 bits?
The code is then a repeat of
INC(CounterArray[ByteValue])
and
CounterArray is 256 x 48 bit Memory.
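A quick arithmetic check of those counter-width margins (Python, using the figures above):

```python
# Margins for 32-bit vs 48-bit counters over 10^12 samples in 256 bins.
total = 10**12
n_bins = 256
avg_per_bin = total // n_bins       # expected count per bin, uniform source
print(avg_per_bin)                  # 3,906,250,000 -- about 91% of 2^32
print(total / 2**32)                # ~232.8 overflows if one value dominated
print(2**48 / total)                # ~281x headroom with 48-bit counters
```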
A PC can just stream 20Mbytes/s over high speed USB, so a FTDI High speed + PC is one pathway.
Or, you could use a small FPGA/CPLD - they can easily increment memory at 20 MHz, and you need just 1536 bytes of memory.
Streaming rates of 50MHz-100MHz+ should be possible here.
["Needless to say that the Propeller is not powerful enough to perform this."]
Well, certainly not easily, at a student level.... but...
At 20 Mbytes/sec, data is arriving at exactly a Prop 1's nominal opcode rate per cog (20 MIPS at 80 MHz).
- but let's imagine we can phase-lock 4 identically coded cogs, each sampling every 200ns, but 50ns apart.
Here is a rough once-over of how this might work :
You now have spread your total over 4 arrays of numbers, adding at 1/4 the rate, and have 4 opcodes to play with...
Candidates would be
MOVD D,S Insert S[8..0] into D[17..9]
This allows the Port pins, to self-modify a Destination byte ready for a fixed S=1 add
ADD D,S Add S into D (encoding: 100000 001i 1111 ddddddddd sssssssss)
and a small fix is needed to avoid the very-ends of COG RAM address range
and finally
DJNZ D,S Dec D, jump if not zero to S (no jump = 8 clocks, jump = 4 clocks)
On average, each 32-bit counter bucket will be 22.7% full at 10^12, but to reach 10^12/4 samples in each cog, we need 58.207 × 2^32 loop iterations
- so that's another problem....
Choices are to unroll to 59 or 60, or work on less than 10^12, and collect later.
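Checking the unroll arithmetic (Python, assuming 4 cogs share the 10^12 samples and a 32-bit DJNZ loop counter):

```python
# Each cog counts 10^12 / 4 samples, but a 32-bit DJNZ counter only
# covers 2**32 iterations, so the loop body must be unrolled.
per_cog = 10**12 // 4
print(per_cog / 2**32)             # ~58.2 -> unroll to 59 or 60 copies
unroll = 60
iterations = per_cog // unroll     # now fits in a 32-bit counter
print(iterations < 2**32)          # True
```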
MOVD AddressOfADDOpCode,ByteFromPort
ADD AddressOfADDOpCode,OffsetAroundCode
ADD Dummy_Address,1 ; INC the actual array, note Dummy_Address has been changed each time.
NOP / DJNZ ; unroll as needed
We have 512 x 32-bit registers, but the top 16 are special, and code starts at the bottom, so 512 - 16 - 60*4 = 256 (!!)
- but that does not allow cog-sync ; unroll of 59, allows 4 opcodes to sync - Possible ?
Of course, this is just a quick scribble, but it looks like 4 cogs could _just_ cope, and manage 10^12/4 loop as well.
I make each cell's ideal value 976,562,500, over the (almost) 10^12 samples.
A Prop II would gain in MHz and also the new Loop opcode saves 25%
Notice this counting code will return a perfect random scorecard for a binary counter - so perhaps a distance check would catch that - but that has a time penalty.
Oops, I'm indeed off by an order of magnitude. Sorry.
With a truly random distribution, a trillion bytes spread over 8 cogs would be 125 billion per cog, and distributed over 256 longs that would be a count of 488,281,250 per long. Even if the distribution is not perfectly random, each long can count up to 4,294,967,295, which is nearly nine times what the average count should be for a perfectly random distribution.
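Checking those figures (Python):

```python
# Verify the 8-cog headroom figures quoted above.
total = 10**12
per_cog = total // 8               # 125,000,000,000 samples per cog
per_long = per_cog // 256          # 488,281,250 expected per bin
print(per_cog, per_long)
print((2**32 - 1) / per_long)      # ~8.8x headroom over the ideal average
```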
You have 8 cogs
1. The data is eight bits parallel.
2. The data can be synchronized to a 10 MHz clock provided by the Propeller.
The trick here is to input the data on P16..9, which is the lower eight bits of a PASM instruction's 9-bit destination field. Then set up dira and outa so that an instruction executed from ina adds one to the bin selected by the 8-bit input and sets the carry flag on overflow. The next register after ina is inb, a normal register on the P8X32A, and it will contain a jump-on-no-carry instruction back to ina. If a bin overflows, the jump will not be taken, and the instruction in outa will be executed. But that just adds an extra one to bin zero, which can be subtracted from the results. Then the instruction in outb gets executed: a jump to the report section of the code, which writes all of the bin contents to the hub.
Here's the skeletal code:
The clock signal comes out on P18, which is one of the condition code bits of the add instruction. The condition for this instruction is chosen such that the state of this bit does not matter. phsa must be initialized so that the data is stable when the add instruction in ina is fetched.
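The control flow described above can be modeled behaviorally in Python (a sketch of the described scheme only, not the actual PASM; names and the tiny limit in the example are illustrative):

```python
# Behavioral model of the count-until-overflow scheme: count bytes into
# 256 bins until any bin hits the 32-bit limit, then report. Bin zero
# gets one spurious extra increment on exit (the outa instruction),
# which is subtracted from the results afterwards.
def count_until_overflow(byte_stream, limit=2**32):
    bins = [0] * 256
    overflowed = False
    for b in byte_stream:
        bins[b] += 1
        if bins[b] >= limit:       # carry set: the jump in inb not taken
            bins[0] += 1           # spurious add from executing outa
            overflowed = True
            break                  # outb: jump to the report section
    if overflowed:
        bins[0] -= 1               # the extra one is subtracted
    return bins

bins = count_until_overflow([7, 7, 7, 1], limit=3)
# bins[7] == 3; the loop exits before counting the trailing 1
```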
Anyway, I haven't tried this, but I believe the principle is sound. It's not. See post #19.
-Phil
Nifty. ( I guess this Prop is doing nothing else anyway )
I think that means it loops until any one counter saturates (overflows), and then all the others will be fractions of that?
Not quite the original spec, but a quite nice way to normalise results.
How does that help in this situation?
You can spread the load, by interleaving samples over multiple COGS ?
It means more housekeeping, and you need to sum the totals from each cog before you are done, but it does buy
some cycles.
Why do they 'need to be processed by a single cog'?
All the OP wants to do, is sum the bytes as a histogram, for the nominal 1 trillion bytes.
So they do not care if that is 4 × 5 Mbyte/s streams summed, or one 20 Mbyte/s stream.
-Phil
I dunno, though. Rather than engaging in the discussion, the OP seems to have left the building. This may not have been worth the effort I put into it.
-Phil
Wow! A slick trick like that is never wasted. It's now in the forum archive and I, for one, have also squirreled it away for my own delectation and benefit.
Yup. A multiplicative congruential generator in an LPC1768 might do it. A hardware LFSR would surely do it. ...Just the first two thoughts that come to mind.
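For illustration, a software model of a 32-bit Galois LFSR (a hardware version shifts once per clock; the taps 32, 22, 2, 1 are one commonly cited maximal-length polynomial, and the seed and byte count here are arbitrary):

```python
# Software model of a 32-bit Galois LFSR with taps 32,22,2,1
# (polynomial x^32 + x^22 + x^2 + x + 1, Galois mask 0x80200003).
# Eight right-shifts per output byte; hardware would do this in logic.
def lfsr_bytes(seed=0xACE1, count=4):
    state = seed & 0xFFFFFFFF
    out = []
    for _ in range(count):
        for _ in range(8):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= 0x80200003
        out.append(state & 0xFF)
    return out

print(lfsr_bytes())   # deterministic, so NOT truly random -- just fast
```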
Maybe - notice that a simple 20 MHz counter will pass this 'Random Test', at least for a novice
It may be a Homework type question, but these are actually very good for defining just what is possible and where the limits are.
So the effort is certainly worthwhile, as it is now 'on record'.
Like my #6 this refined version also uses 4 cogs, but unlike #6 you do burst-sync.
If I read it right, this code also does not count a specific total number of bytes, but exits on the first overflow [full histogram] (which could get interesting with 4 cogs making different exit decisions - thus breaking the 'same counts' rule ? )
The outline in #6 relied on phased-starts of 4 COGS, which I presume is easy to do ?
If we skip the offset add, by the cleaner JMP replace you do, that is 3 opcodes including a total-sample counter.
So that could fit in 3 COGS and meet the data rate budget ?
Which raises a new question, of could this spread over 6 cogs, and sync to a half opcode, for a 40MHz effective rate ?
That would need 2 CLK granularity on first opcode launch in each COG, which I think is legal ?
The ideal would be to have ONE copy that is launched 4 times with a param tweak to set the go-time.
What 'real' random numbers are you referring to? The ones mentioned by the OP? Since he never clarified a thing, how do you know they were 'real'? And if they were truly 'real' why would he be interested in a histogram?
Lawson
BTW,
will repeat the long for you.
-Phil