Propeller II update - BLOG

evanh · 2014-01-10 18:04

Cluso99 wrote: »

I thought it would be nice...
Source the Clock from an external pin.

Source FRQx from an external pin (ie mux FRQx output with an external pin)

With my mods (in red), and using the internal pins, it would be possible to chain CTRA APin to CTRB BPin to achieve your result above.

Cluso,
Putting the BPin into the adder like that doesn't work as you basically have a datatype mismatch. At any rate the equivalent is achieved just by placing a value of 0x01 in FRQx. So, it's good in original form.

The clock gating I'll leave for Chip to comment on but I suspect it's got stability problems.

EDIT: With regard to the "chaining", what's needed is a 32 bit connection from one counter to the other, not just a signal bit.

evanh · 2014-01-10 18:06

jmg wrote: »

That drawing is not quite the same as the Verilog code someone linked to before?
IIRC the verilog seemed to do 3 adds and 3 subtracts and a shift, on every CLK in.

Yep, that's the difference between second and third order filters. I'm only proposing an addition to provide second-order filtering. If Chip wants to put three counters in each Cog then fast third-order would also be viable.

PS: The subtracts are done per decimation interval, not per bit-stream clock. That's a big load off the CPU if it only has to deal with the slower decimation rate. The decimation rate is equivalent to traditional word sized sample rate. Eg: 8 bit phone calls sampled at 8kHz made for roughly 64kbit/sec data-streams. The sample rate in this case is usually referred to as 8kHz as that was the word sample rate, rather than the bit rate of the resulting binary stream.
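
To make that split concrete, here is a minimal Python sketch of a second-order sinc decimator. It is a software model under stated assumptions, not the AD7401 datasheet code or any actual P2 counter hardware: the function name `sinc2_decimate`, the 32-bit wrap-around mask, and the choice to let the second accumulator see the freshly updated first accumulator are all illustrative. The two additions run on every bit-stream clock; the two subtractions run only once per decimation interval.

```python
def sinc2_decimate(bits, ratio):
    """Second-order sinc decimator sketch (hypothetical helper).

    bits  : iterable of 0/1 samples from the bit-stream
    ratio : decimation ratio (bit-stream clocks per output word)
    """
    MASK = 0xFFFFFFFF            # assumption: 32-bit wrap-around accumulators
    acc1 = acc2 = 0              # integrators, clocked per bit-stream bit
    prev1 = prev2 = 0            # differentiator delay registers
    out = []
    for n, bit in enumerate(bits, start=1):
        # Fast part: two adds per bit-stream clock.
        acc1 = (acc1 + bit) & MASK
        acc2 = (acc2 + acc1) & MASK
        if n % ratio == 0:
            # Slow part: two subtracts per decimation interval.
            d1 = (acc2 - prev1) & MASK
            prev1 = acc2
            d2 = (d1 - prev2) & MASK
            prev2 = d1
            out.append(d2)
    return out


if __name__ == "__main__":
    # An all-ones stream settles at ratio**2 after the start-up transient.
    print(sinc2_decimate([1] * 12, 4))
```

With an all-ones input and a ratio of 4, the output settles at 16 (the ratio squared) once the first word's start-up transient has passed, so full scale for a second-order filter grows as the square of the decimation ratio.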

Cluso99 · 2014-01-10 18:29

evanh wrote: »

Cluso,
Putting the BPin into the adder like that doesn't work as you basically have a datatype mismatch. At any rate the equivalent is achieved just by placing a value of 0x01 in FRQx. So, it's good in original form.

The clock gating I'll leave for Chip to comment on but I suspect it's got stability problems.

EDIT: With regard to the "chaining", what's needed is a 32 bit connection from one counter to the other, not just a signal bit.

OK, I had not realised you needed to pass the whole 32 bits of PHSA.

I have been trying to do different things with the counters and have found using the internal clock to be a restriction, so I was trying to find a generic way to take the output of one counter and feed it into another as the clock. This could be done (if we could use external clocking) by using an external pin on P1, or an internal/external pin on P2.

The more generic, the more uses.

evanh · 2014-01-10 18:38

There is some anticipation for a schematic of the P2 counters ... hint, hint, Chip :P

jmg · 2014-01-10 18:58

Cluso99 wrote: »

OK, I had not realised you needed to pass the whole 32 bits of PHSA.

I have been trying to do different things with the counters and have found using the internal clock to be a restriction, so I was trying to find a generic way to take the output of one counter and feed it into another as the clock. This could be done (if we could use external clocking) by using an external pin on P1, or an internal/external pin on P2.

The more generic, the more uses.

Certainly counters need to be more flexible than P1
P2 talk has mentioned new Quadrature modes in CTRs, so that indicates an external clock path, which we hope also works in non-quad.

Capture is another blind spot in P1, and ideally P2 will allow Atomic control of Pin-capture of TWO counters, where one is fsys clocked, and one is external fu clocked.

In the classic use I have in mind, one pin edge generates two captures, and SW enable/disable of that dual capture is a single atomic 1 clock action.

In practice, that may mean control bits in one register, or it may mean simple alias of some control bits between counters blocks.
The important detail is to avoid two lines of code, as that gives aperture issues.

evanh wrote:

There is some anticipation for a schematic of the P2 counters ... hint, hint, Chip

Yes, counters info is 'still coming', but the fetch issues Chip is working on have to take priority for now.

Ariba · 2014-01-10 19:23

evanh wrote: »

Yep, that's the difference between second and third order filters. I'm only proposing an addition to provide second-order filtering. If Chip wants to put three counters in each Cog then fast third-order would also be viable.

PS: The subtracts are done per decimation interval, not per bit-stream clock. That's a big load off the CPU if it only has to deal with the slower decimation rate. The decimation rate is equivalent to traditional word sized sample rate. Eg: 8 bit phone calls sampled at 8kHz made for roughly 64kbit/sec data-streams. The sample rate in this case is usually referred to as 8kHz as that was the word sample rate, rather than the bit rate of the resulting binary stream.

Can you not just filter the first order result in PHSx with one or two additional IIR Filter stages realized with the MAC instructions? This can be done at a much lower sample rate than the input bit stream rate.
The datasheet of the AD7401A recommends using an FPGA or a DSP, and a Prop2 cog is definitely a very good DSP.

Andy

potatohead · 2014-01-10 19:36

Some may run screaming from the idea of even thinking about supporting existing code, but history tends to favour those who consider their customer investments, and see existing code as a resource, not a liability.

I'll leave it to Chip to declare "simple"; again, this chip has a lot of interactions that many chips do not. However, my more basic point is that I really don't want the expectation of hardware support for P1 compatibility on the table at all.

evanh · 2014-01-10 19:41

Ariba wrote: »

Can you not just filter the first order result in PHSx with one or two additional IIR Filter stages realized with the MAC instructions? This can be done at a much lower sample rate than the input bit stream rate.
The datasheet of the AD7401A recommends using an FPGA or a DSP, and a Prop2 cog is definitely a very good DSP.

Heh, this is where my lack of understanding of why it works kicks in. I don't know if there is a way to substitute the multi-stage accumulation (integration) with a slower rate but, presumably, more complex alternative. I doubt it though. As it stands, in the AD7401's example, all integration stages have to be clocked per bitstream bit. I suspect this half of the hardware is expected to be built into a DSP based design as well.

A Cog can do it all in software, but at a performance cost. It's the same deal as SERDES I guess.

evanh · 2014-01-10 20:05

For those that haven't read the old thread, here's an excellent equivalent schematic I found of the example third-order filter on pages 16 and 17 of the AD7401 datasheet:
[Attached image: schematic of the third-order filter; attachment not available]

EDIT: Dang it, I can't seem to get the piccy's to display in the forum thread. Oh well, it's only a click away.

Bob Lawrence (VE1RLL) · 2014-01-10 20:55

@evanh
Here you go but it don't make any difference really.

I did try to enlarge it as well.

evanh · 2014-01-10 22:25

Here's an edited version with four pixel wide lines representing multi-bit data paths as opposed to single pixel wide single bit control lines.

evanh · 2014-01-11 00:33

Doing a bit of Googling I've found this - http://electronicdesign.com/analog/build-sincsupksup-decimators

It seems to have parts of the webpage missing for me ... at any rate the author points out this is a FIR design. That threw me a little as I'd figured the feedback in the accumulators counted for IIR. I guess integration obviously doesn't count on that front.

The good part is, at the bottom of the article, there is a step by step example of numerical states that a second-order two sample decimation could produce after eight input (bit-stream) clocks.

It also shows the filter shape of 1 to 4 orders, with second-order being triangular and first-order being box. That would suggest that third-order is quadratic and fourth-order is cubic.
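
Those shapes can be checked numerically. The sketch below assumes that each extra order amounts to one more pass through a box (moving-sum) filter whose length equals the decimation ratio; the helper names `convolve` and `sinc_impulse_response` are made up for illustration and are not taken from the article. Because the resulting impulse response has a finite number of taps, it also shows why the structure counts as FIR despite the accumulators.

```python
def convolve(a, b):
    """Plain discrete convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def sinc_impulse_response(order, ratio):
    """Impulse response of an order-N sinc filter (hypothetical helper).

    Built by convolving a length-`ratio` box with itself `order` times.
    """
    box = [1] * ratio
    h = [1]
    for _ in range(order):
        h = convolve(h, box)
    return h


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, sinc_impulse_response(order, 4))
```

For a ratio of 4 this gives a box for first order, the triangle 1 2 3 4 3 2 1 for second order, and a smoother piecewise-quadratic hump for third order. The taps always sum to the ratio raised to the order.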

The bad part is there is no comparison with other filter types, but I guess given it's a FIR filter then it behaves like any other FIR of the same shape.

cgracey · 2014-01-11 01:57

It took me half the day to locate the bug in the hub execution circuitry, where I had implemented a single icache line. It turned out I was getting the hub read address from one pipeline stage too early. These inter-pipeline-stage bugs are always a bear to resolve because the failure modes are quite sporadic and seem to make no sense. I was spending 99% of my time looking in the wrong section of code. I don't know how, but some impetus to go look where the problem actually was seemed to form out of dense white noise in my head. When I got to the other section, the problem almost presented itself. Those kinds of bugs always scare me when I'm chasing them, because they make everything seem flakey and untrustworthy, and make me suppose there's more where that's coming from. Things seem back on solid ground now, though, and I'm compiling a version with the whole 4 cache lines. I haven't implemented LRU yet, but I'm using the two task id bits as the icache line chooser, in case there's a cache miss. This will work well with 4 hub tasks, which is what I've been testing with.

evanh · 2014-01-11 02:23

cgracey wrote: »

... Those kinds of bugs always scare me when I'm chasing them, because they make everything seem flakey and untrustworthy, and make me suppose there's more where that's coming from. ...

Ya, it's like a proof checker is needed to verify each edit along the way. The more layers and buffers, the more mental resources it consumes to make any inter-operative changes. I guess that's why so much is modularised wrappers, aka bloat, these days.

Heater. · 2014-01-11 02:25

What is Spin?

Dave Hein,

For the most part Spin is a high level language that could be implemented on any processor.

I don't see it that way.

We have the nice structured high level language we all know and love as Spin. Spin could no doubt be compiled to run on any machine. I see no reason it could not be compiled to native x86 or ARM instructions.

BUT: That pesky PASM code we put in DAT sections is also Spin. It's defined in the same manual. It's written into the same source files. It's built with the same compiler. Many objects rely on that PASM being there.

PASM is Spin. Spin is PASM.

As such Spin/PASM is totally non-portable. Unless you want to write a Prop emulator to run on your target machine.

That includes P1 to P2 portability. It just isn't. No one will want a P1 emulator on a P2 to run those PASM parts. Makes no sense.

Heater. · 2014-01-11 02:27

Seairth,

Now, about Python...

... but I don't think it would be appropriate for the P2.

"Python" begins with a "P". As such it belongs in the set of languages one should never use. Along with Perl, Pascal, PHP, and Prolog:)

However, if you must I'm sure it's not unreasonable to port Micro Python to the P2 http://micropython.org/

Me, I want Javascript: http://www.espruino.com/

cgracey · 2014-01-11 02:34

evanh wrote: »

Ya, it's like a proof checker is needed to verify each edit along the way. The more layers and buffers, the more mental resources it consumes to make any inter-operative changes. I guess that's why so much is modularised wrappers, aka bloat, these days.

I've often fantasized about some magic proof checker that would signal as soon as you got things right. By the time I 'sign off' on code I've written, I understand all that it does, but in getting there, sometimes I'll write code that feels right, but there's not a logical proof established in my head yet. It would be good to know if you got it right before YOU KNOW you got it right. It would save lots of time. The problem, of course, is that you are making something new that has no defined parameters. By the time computers could help you in such a way, they probably wouldn't need you to program them, anymore.

cgracey · 2014-01-11 02:36

Okay. The 4 cache-line hub exec seems to work great - 4 hub exec tasks are each running 4..8 times faster now. I just need to add the LRU next.

Cluso99 · 2014-01-11 02:48

cgracey wrote: »

Okay. The 4 cache-line hub exec seems to work great - 4 hub exec tasks are each running 4..8 times faster now. I just need to add the LRU next.

WTG Chip. Cannot wait to try this out - it's going to be awesome!

Heater. · 2014-01-11 03:32

There is some serious awesomeness going on here.

Sadly a formal proof of the correctness of that awesomeness is not available. I think Turing pointed out that it was impossible.

evanh · 2014-01-11 05:05

cgracey wrote: »

The problem, of course, is that you are making something new that has no defined parameters. By the time computers could help you in such a way, they probably wouldn't need you to program them, anymore.

Heh, Watson isn't there yet but IBM is certainly exploring the possibilities ... It's kind of scary and adventurous even contemplating if there might be a true self-aware reasoning AI in the not too distant future. Neuromancer might be truer than Mr Gibson gives himself credit for ... or maybe The Evitable Conflict ... or maybe Terminator ...

cgracey · 2014-01-11 05:06

The LRU turned out to be very simple and needed only 12 lines of Verilog, including declarations and begin/end lines. It's compiling now. I expect it to work, because there's not much to it. All I had to do was make four 5-bit counters to track icache-line usage. Each counter can be cleared on cache hit or reload, incremented w/saturation on cache miss, or left alone if an instruction from the hub is not being fetched. After that, I just needed to make a two-stage magnitude comparator for determining which of the four counters has the highest value, in order to know which icache line is the one to be reloaded in the event of a cache miss.

evanh · 2014-01-11 05:09

Love it when a plan comes together.

David Betz · 2014-01-11 05:26

cgracey wrote: »

The LRU turned out to be very simple and needed only 12 lines of Verilog, including declarations and begin/end lines. It's compiling now. I expect it to work, because there's not much to it. All I had to do was make four 5-bit counters to track icache-line usage. Each counter can be cleared on cache hit or reload, incremented w/saturation on cache miss, or left alone if an instruction from the hub is not being fetched. After that, I just needed to make a two-stage magnitude comparator for determining which of the four counters has the highest value, in order to know which icache line is the one to be reloaded in the event of a cache miss.

This is all sounding very encouraging! Congratulations!

cgracey · 2014-01-11 05:53

It works!

It spreads the cache lines evenly among four tasks, and also allows one task to cache 32 instructions (all four lines).

Next, I've got to add the hub-stack CALLA/CALLB/RETA/RETB instruction guts to complete hub execution.

David Betz · 2014-01-11 05:54

cgracey wrote: »

It works!

It spreads the cache lines evenly among four tasks, and also allows one task to cache 32 instructions (all four lines).

Next, I've got to add the hub-stack CALLA/CALLB/RETA/RETB instruction guts to complete hub execution.

Fantastic!!

Bill Henning · 2014-01-11 05:55

Excellent!

cgracey wrote: »

The LRU turned out to be very simple and needed only 12 lines of Verilog, including declarations and begin/end lines. It's compiling now. I expect it to work, because there's not much to it. All I had to do was make four 5-bit counters to track icache-line usage. Each counter can be cleared on cache hit or reload, incremented w/saturation on cache miss, or left alone if an instruction from the hub is not being fetched. After that, I just needed to make a two-stage magnitude comparator for determining which of the four counters has the highest value, in order to know which icache line is the one to be reloaded in the event of a cache miss.

Bill Henning · 2014-01-11 06:03

Does not a cache miss cause a cache line to be loaded?

If so, does that not mean that the counter would never be incremented?

cgracey wrote: »

The LRU turned out to be very simple and needed only 12 lines of Verilog, including declarations and begin/end lines. It's compiling now. I expect it to work, because there's not much to it. All I had to do was make four 5-bit counters to track icache-line usage. Each counter can be cleared on cache hit or reload, incremented w/saturation on cache miss, or left alone if an instruction from the hub is not being fetched. After that, I just needed to make a two-stage magnitude comparator for determining which of the four counters has the highest value, in order to know which icache line is the one to be reloaded in the event of a cache miss.

cgracey · 2014-01-11 06:16

Bill Henning wrote: »

Does not a cache miss cause a cache line to be loaded?

If so, does that not mean that the counter would never be incremented?

A cache miss causes a line to be reloaded. When a line is read (cache hit) or loaded and read (cache miss), its counter is reset to 0. The counter increments on every cache miss that wasn't its own, and saturates at $1F.
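
For anyone who wants to play with the scheme, here is a small Python model of it. It is only a behavioural sketch, not the 12 lines of Verilog: the class name `IcacheLRU`, the tag list, and the tie-break rule (lowest-numbered line wins when counters are equal) are assumptions made for illustration. The rules taken from the description are that a counter resets to 0 on a hit or a reload, increments on every miss that wasn't its own, saturates at $1F, and that the line with the highest counter is the one reloaded on a miss.

```python
class IcacheLRU:
    """Behavioural model of a four-line LRU using saturating age counters."""

    LINES = 4
    MAX_AGE = 0x1F               # 5-bit counters saturate at $1F

    def __init__(self):
        self.age = [0] * self.LINES
        self.tags = [None] * self.LINES   # which hub line each slot holds

    def fetch(self, line_addr):
        """Return (slot, hit) for an instruction fetch from hub line_addr."""
        if line_addr in self.tags:
            slot = self.tags.index(line_addr)
            self.age[slot] = 0            # cache hit: reset this counter only
            return slot, True
        # Cache miss: reload the slot with the highest counter.
        # Assumption: ties go to the lowest-numbered slot.
        slot = max(range(self.LINES), key=lambda i: self.age[i])
        for i in range(self.LINES):
            if i != slot:                 # a miss that wasn't its own
                self.age[i] = min(self.age[i] + 1, self.MAX_AGE)
        self.tags[slot] = line_addr
        self.age[slot] = 0                # reloaded line starts fresh
        return slot, False


if __name__ == "__main__":
    lru = IcacheLRU()
    for addr in ["A", "B", "C", "D", "A", "E"]:
        print(addr, lru.fetch(addr), lru.age)
```

In that run, touching A again before fetching E means B is the line that gets replaced, which is the least-recently-used behaviour you would expect.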

Bill Henning · 2014-01-11 06:24

cgracey wrote: »

A cache miss causes a line to be reloaded. When a line is read (cache hit) or loaded and read (cache miss), its counter is reset to 0.

Thanks, I got it now.

cgracey wrote: »

The counter increments on every cache miss that wasn't its own, and saturates at $1F.

Very elegant, the "miss that wasn't its own" made it click for me
