Ruminations while awaiting an FPGA image (was "Hello...... Anyone out there?")

Cluso99 · 2014-07-27 19:33

Chip,
Great news. Thanks for the update.

David,
Nice title update.

cgracey · 2014-07-27 21:14

jmg wrote: »

Sounds great - does that also work in either direction (eg for things like camera capture, as well as Analog/Video out ?)

It will work for analog and pin output, and pin input. We'd have to have some flash ADC to realize analog input at those speeds.

Phil Pilgrim (PhiPi) · 2014-07-27 21:52

I see the title of this thread got changed to, "Ruminations while awaiting an FPGA image." My grandmother always used to say, "A watched pot never boils." Perhaps if we avert our gaze ...

-Phil

JRetSapDoog · 2014-07-28 00:43

cgracey wrote: »

It will work for analog and pin output, and pin input. We'd have to have some flash ADC to realize analog input at those speeds.

Any difference between saying "pin input" and "digital input"? I read that thrice. Guess flash/video ADC's are waaay out of the question, or are they?

ctwardell · 2014-07-28 03:56

Ruminations...hmmm, isn't that what cows do?

Moo.

C.W.

Peter Jakacki · 2014-07-28 04:30

ctwardell wrote: »

Ruminations...hmmm, isn't that what cows do?

Moo.

C.W.

Yes, and if you don't have the stomach for it then you shouldn't be here

ozpropdev · 2014-07-28 05:15

Moo!
On that note, cows have 4 cores stomachs!

ctwardell · 2014-07-28 05:20

ozpropdev wrote: »

On that note, cows have 4 cores stomachs!

We're going to need to milk that!

C.W.

Bill Henning · 2014-07-28 05:50

Sounds great!

Thanks for the update.

cgracey wrote: »

Last night (4am) I chased down the final bug that was inhibiting the fast-write setup for hub streaming. Fast-read had already been working for a few days, but now the whole system seems complete.

The next thing to do is to adapt the hub streaming to both hub exec and the NCO-driven pin/DAC I/O. Those NCO modes are going to be fun, because those are features which never existed on the prior Prop2.

The streaming is already set up so that you can create a loop in hub memory that automatically wraps for reading or writing. The only other thing we need is a way to redirect the start/size on a block loop. That would enable page-flipping, so to speak, for high-speed analog output, so that you can be writing one buffer while playing another - at 200MHz on the real chip, and hopefully 160MHz on the FPGA.

Thanks for your continued patience, Everyone.
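
To make the page-flipping idea in Chip's update concrete, here is a minimal Python sketch of a block loop that wraps automatically and can have its start/size redirected at the next wrap. It is only a toy model: `LoopStream`, `redirect` and `next_sample` are hypothetical names, and nothing here reflects the actual streamer hardware or its registers.

```python
class LoopStream:
    """Toy model of a streamer reading a block of hub memory in a wrapping loop."""

    def __init__(self, hub, start, size):
        self.hub = hub          # list standing in for hub memory
        self.start = start      # first address of the current block
        self.size = size        # number of samples in the current block
        self.offset = 0         # position inside the current block
        self.pending = None     # (start, size) to switch to at the next wrap

    def redirect(self, start, size):
        """Request a new block; it takes effect when the current block wraps."""
        self.pending = (start, size)

    def next_sample(self):
        value = self.hub[self.start + self.offset]
        self.offset += 1
        if self.offset == self.size:        # end of block: wrap around
            self.offset = 0
            if self.pending is not None:    # page flip, if one was requested
                self.start, self.size = self.pending
                self.pending = None
        return value


# Buffer A occupies addresses 0..3, buffer B occupies addresses 4..7.
hub = [100, 101, 102, 103, 200, 201, 202, 203]
stream = LoopStream(hub, start=0, size=4)

played = [stream.next_sample() for _ in range(2)]   # start playing A
stream.redirect(start=4, size=4)                    # B is ready: flip at the wrap
played += [stream.next_sample() for _ in range(6)]  # rest of A, then all of B
print(played)
```

In this model the writer can fill buffer B while buffer A is still playing, and the switch happens only on a block boundary, so no sample is skipped or repeated.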

Bill Henning · 2014-07-29 14:28

<humour mode: ON>
Yesterday I bought a large can of compressed air - you know those computer cleaning sprays - for my DE0-Nano's & DE2-115 in anticipation of ChipMas!
<humour mode: OFF>

As difficult as it is to wait, I think all of us will find the wait worthwhile.

16 cogs and all that bandwidth with hubexec has me drooling.

evanh · 2014-07-29 15:16

cgracey wrote: »

The next thing to do is to adapt the hub streaming to both hub exec and the NCO-driven pin/DAC I/O. Those NCO modes are going to be fun, because those are features which never existed on the prior Prop2.

There is real potential for slimming down the Cogs even more. With buffering moving to the Hub like that, CogRAM is being freed up. Can 256 addresses work? Or, with HubExec, and presumably an instruction cache, is there a possibility for single-ported CogRAM?

(Yet another suggestion)

jmg · 2014-07-29 15:23

evanh wrote: »

Can 256 addresses work? Or, with HubExec, and presumably an instruction cache, is there a possibility for single-ported CogRAM?

Single ported CogRAM would slow things down too much.

256 is interesting, but that's likely to break backward compatible operation with P1 ? - depends on how HubExec works.

P2 really needs to swallow ANY P1 project to be a clear superset; if some apps fail to fit, that's a disappointed customer.
8 @ 512 and 8 @ 256 could be a possible split, but I think COGS are pin-mapped now, so all need to be equivalent to keep pins equivalent.

Bill Henning · 2014-07-29 15:26

+1

keep all 16 with 512... remember everyone was asking "no more changes" and "keep all cogs the same"

jmg wrote: »

Single ported CogRAM would slow things down too much.

256 is interesting, but that's likely to break backward compatible operation with P1 ? - depends on how HubExec works.

P2 really needs to swallow ANY P1 project to be a clear superset; if some apps fail to fit, that's a disappointed customer.
8 @ 512 and 8 @ 256 could be a possible split, but I think COGS are pin-mapped now, so all need to be equivalent to keep pins equivalent.

evanh · 2014-07-30 02:26

Well, I'm looking towards a 32 core device. Drop-in software compatibility is of little concern. Keeping dual-porting with HubExec could be improved by having three source operands and one destination. Then fitting the instruction would be problematic, hence the need to reduce register set numbers.

Bill, I am talking about verbatim Cogs.

The one thing I would agree with is it's far too late to be discussing such changes.

ozpropdev · 2014-07-30 03:05

evanh wrote: »

Keeping dual-porting with HubExec could be improved by having three source operands and one destination. Then fitting the instruction would be problematic, hence the need to reduce register set numbers.

That may be closer than you think with the ALTDS instruction proposed.
It can handle two source and one destination operand.

Heater. · 2014-07-30 03:47

evanh,

...I'm looking towards a 32 core device...

You might be, but I think you'd be disappointed if it arrived. That halves the bandwidth to HUB again. It would run like a sloth. And you have just thrown away half the COG RAM so now you have no place to put fast code like an FFT for example. Disaster!

One could mitigate the reduced HUB bandwidth by adding more ports, widening buses etc. I'm not sure this scales well.
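
The halving argument can be put into numbers with a short Python sketch. It assumes the simplest possible model, in which the hub serves one cog per clock in strict round-robin order, so each cog gets one access slot every N clocks. That model is an assumption made for illustration only; the function name is hypothetical and the real hub scheme may share bandwidth differently.

```python
def hub_slots_per_second(clock_hz, n_cogs):
    """Hub access slots per second for one cog under simple round-robin sharing."""
    return clock_hz // n_cogs


CLOCK_HZ = 200_000_000  # the 200MHz figure quoted earlier in the thread

for n_cogs in (8, 16, 32):
    slots = hub_slots_per_second(CLOCK_HZ, n_cogs)
    print(n_cogs, "cogs:", slots, "hub slots per second per cog")
```

Under this model every doubling of the cog count halves each cog's share, which is the effect being described; extra ports or wider buses would change the numbers.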

History, from the Intel 8080, through 8085, 8086, 186, 286, 386, 486, Pentium, whatever we call it today, and other examples, has shown us that backward compatibility is a very valuable thing. I for one would be very upset to find there was not an easy way to move my projects to the II. With half the COG space it's useless to me. I suspect a lot of useful Propeller objects would become useless. Great.

Yep, it's well past time to be making suggestions.

JRetSapDoog · 2014-07-30 08:40

Heater. wrote: »

With half the HUB space it's useless to me.

Heater, although more cogs would likely entail a reduction in HUB space, I guess you meant to write "half the COG" space, based on the line of reasoning you made. But whichever the case, you've definitely spelled out one of the advantages of keeping things at 2KB (512 longs, 9 address bits).

I don't mind people considering more cogs (even Chip did at one point), but I'd likely lose faith if we went back to 8. And as far as rules for posts here, I'd say that there are no rules other than mutual respect. But I definitely understand that decisions have to be made and then effort exerted to realize such decisions. Chip seems to be well along that path. I also understand that a window of opportunity exists in terms of available time.

Why I really replied was that I'm interested in the subject of indirect addressing (which ties in with address bits and the instruction set). I'm unclear if that's in the works for the new chip. Having to use self-modifying code seems quite ancient or inelegant for a modern chip (as others have said elsewhere). What would be required to have some kind of indirection, such as for table access? Could we give up the conditional bits to facilitate something? Could an indirect accessing mode make assumptions to "encode" an offset? For example, if the destination address was on a long boundary, then the offset could be some value (from conditional bits?) times 4, or something similar? It would seem that if some kind of assumption were made (yes, a limitation were imposed) that something could be done within one instructional cycle. Or maybe a setup instruction could be used. Perhaps something is already in the works. I think I recall talk about something but have forgotten. Does anyone know? And what are the limiting factors (instruction bits for operands, instruction cycle timing, pipelining (though that's partly gone, I think))? What's involved so that we can think creatively about this. Or maybe self-modifying code isn't so inelegant as it seems. Maybe a macro could hide any inelegance somehow.

Heater. · 2014-07-30 09:34

JRetSapDoog,

Sorry yes, I meant "With half the COG space it's useless to me". Well spotted. Post has been edited.

As for the forum rules and behavior, I like to think we all have mutual respect. Even if sometimes the debating gets a bit rough. Such is the nature of human interaction.

JRetSapDoog · 2014-07-30 10:53

Thanks, Heater. Yes, I agree: we have a high degree of mutual respect, here (sorry if it sounded like I thought otherwise). This has to be one of the top forums anywhere! The exception--if there is one--might be the ranting and raving we fell into over hub access mechanisms and/or slot-sharing, but, thankfully, those "dark days" are behind us (assuming that the "egg beater" hub access scheme doesn't explode like the first wormhole ship did in the movie Contact). So why bring it up again (mutual respect, that is)? No reason. Just waxing on. Anyway, I always enjoy reading your posts (and those of other regulars/active contributors). Guess I sometimes pay too much attention, and that's why I noticed the "mis-speak." Most would have just left it alone, trusting that others would know what you meant. But I'm perhaps too pedantic for that and wanted some leverage for the indirect addressing tangent.

JRetSapDoog · 2014-07-30 12:38

JRetSapDoog wrote: »

What would be required to have some kind of indirection, such as for table access? Could we give up the conditional bits to facilitate something? Could an indirect accessing mode make assumptions to "encode" an offset? For example, if the destination address was on a long boundary, then the offset could be some value (from conditional bits?) times 4, or something similar? It would seem that if some kind of assumption were made (yes, a limitation were imposed) that something could be done within one instructional cycle. Or maybe a setup instruction could be used.

Hmm...thinking some more, if we could conditionally give up the four-bit condition field and if the ZCI bits could be conditionally sacrificed (I think the current R will be gone), then that could free up 7 bits as an offset. So, there could be a move with offset instruction from a base register plus the offset to a destination register, as in movo dest, base, 7-bit-offset (movi may be taken). That would allow for long access within a table that could span a quarter of cog memory, wherein the offset is encoded in the 7 freed-up bits. That's likely not the first time that has been considered/suggested.

Yeah, that would mean adding an exception mode of sorts for the flag and condition bits. Anything that's seen as an exception or as breaking orthogonality is frowned on here, I know, and usually for good reason, as simplicity is a major win for the Prop. But some tradeoffs can be worth it. Here, my only argument in favor of such an exception is that it seems like there should be a way around self-modifying code (SMC) for table-like access, whether in this way or another. One wouldn't have to use the mode and could keep the standard usage of all those bits if willing to use SMC.

The thinking above is that the instruction opcode itself would determine whether to use the flag and condition bits as an offset. However, I suppose that, if a "magic bit" stored somewhere else in a cog indicated the usage of those bits, then such a magic bit could be set/cleared with a prior (and likely post) instruction for a block of code, such that a more general offset could be applied to other instructions, such as math instructions (I'm assuming one of the 7 (6 on the P1) opcode bits can't be sacrificed for that), but that's getting more complicated (setting/clearing) and I think table access would likely be the most important usage. So, I'm not currently advocating for that, but it could trigger an idea (in someone's mind) that is worthy of consideration.
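
The bit budget behind the proposed movo instruction can be checked with a small Python sketch. The field layout used here (7 opcode bits, then the 7 freed condition/flag bits as an offset, then 9-bit destination and base fields) is purely an assumption for illustration, as are the function names and the reading of "base" as the register address where the table starts. No such instruction exists in any published design.

```python
OFFSET_BITS = 7      # 4 condition bits + 3 flag bits (Z, C, I), as counted in the post
COG_LONGS = 512      # 9-bit register addresses


def encode_movo(opcode, dest, base, offset):
    """Pack a hypothetical 'movo dest, base, offset' into one 32-bit word."""
    assert 0 <= opcode < 128 and 0 <= dest < COG_LONGS and 0 <= base < COG_LONGS
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (opcode << 25) | (offset << 18) | (dest << 9) | base


def decode_movo(word):
    """Return (opcode, dest, base, offset) from a packed word."""
    return (word >> 25) & 0x7F, (word >> 9) & 0x1FF, word & 0x1FF, (word >> 18) & 0x7F


def execute_movo(cog, word):
    """Copy the long at register (base + offset) into register dest."""
    _, dest, base, offset = decode_movo(word)
    cog[dest] = cog[(base + offset) % COG_LONGS]


cog = [0] * COG_LONGS
cog[0x100 + 5] = 42                                  # entry 5 of a table at $100
word = encode_movo(opcode=0x55, dest=0x010, base=0x100, offset=5)
execute_movo(cog, word)
print(hex(word), cog[0x010], (1 << OFFSET_BITS), COG_LONGS // 4)
```

The last two printed values show why the post says the table could span a quarter of cog memory: 7 offset bits reach 128 longs, and 512 / 4 is also 128.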

Bill Henning · 2014-07-30 13:19

INDA / INDB is about all the indirection we are going to get in the upcoming chip.

A generic scheme (use any register as an indirect register) requires WAY too much change at this point, and I for one do not wish to sacrifice the condition code bits, or half the register space.

Maybe in the next chip we can go to four IND registers.

JRetSapDoog · 2014-07-30 13:31

Or, if the above scheme is not desirable in the grand scheme of things, perhaps the offset could be stored in a fixed (general purpose) register, either a cog register (fixed or user-specified with an instruction, but that begs the question) or a special-purpose register that doesn't have an address (in the sense of 0..511). Hmm...I kind of like the idea of having a special all-purpose register (kind of like a PC counter) somewhere for just such usage, such a register being brought into play by instruction opcodes that need to use it (i.e., implemented in silicon). The caveat for using this for table access (w/o SMC) is that it would entail a second operation to set the special register, but it would avoid self-modifying code (at the expense of speed and code space). But the offset generally needs to be changed anyway by an add instruction. In addition to a way of writing the special register (a mov instruction of some kind) there could be an add instruction used to change it. Well, I believe that Chip said that there's now more room for additional instructions (within the bits dedicated to the instruction set).

Edit/Update: Thanks, Bill. I composed this "Or..." post as you were posting. I haven't at all closely followed the INDA/INDB thing (though I recall seeing it) or the status thereof (and I'm a bit mixed up on the new versus old design for the new chip). Maybe what I just posted overlaps with INDA/INDB. Perhaps someone could summarize the details and/or update me/us on the status (yes, perhaps I could search/Google it, too). --Jim

Bill Henning · 2014-07-30 14:22

INDA/INDB are indirection registers, and store the address of the register whose content is to be used. Basically, they are what you asked for

There will also be a prefix instruction, INDS (?) that will allow auto increment/decrement of INDA/INDB and possibly other tweaks - we will know when the docs pop up

evanh · 2014-07-30 14:33

Heater. wrote: »

... That halves the bandwidth to HUB again. It would run like a sloth. ...

One could mitigate the reduced HUB bandwidth by adding more ports, widening buses etc. I'm not sure this scales well.

Bandwidth is no problem. More ports is straightforward enough. I was personally very surprised how easy, and seemingly cheap, it was for Chip to put in the crosspoint switch. Hub burst duration becomes increasingly longer though.

History, from the Intel 8080, through 8085, 8086, 186, 286, 386, 486, Pentium, whatever we call it today, and other examples, has shown us that backward compatibility is a very valuable thing. ...

Not relevant here. General computing is a different world. We ditched compatibility with Prop1 a long time back.

evanh · 2014-07-30 14:38

When I brought up reducing number of general registers I was thinking of moving all code to separate Cog memory, namely an instruction space/cache, ie: Harvard.

HubExec becomes the norm in this scenario.

JRetSapDoog · 2014-07-30 14:56

Bill Henning wrote: »

INDA/INDB are indirection registers, and store the address of the register whose content is to be used. Basically, they are what you asked for

Thanks, Bill. Good to know!

Ariba · 2014-07-30 16:04

I don't think INDA and INDB exist anymore in the current design. Indirect addressing is now done with the ALTDS instruction, which lets you alter the D or S field of the following instruction, and maybe allows some modification of the registers that provide the value for the following D or S field. Something like:

    ALTS mypointer++     'use the content of mypointer for the following S field, increment mypointer
    MOV  tmp,0-0         'indirect access of source

Andy
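
For readers who want to see the effect of that two-instruction pair without self-modifying code, here is a small Python model of it. It follows only the description in Andy's post (the pointer's content is substituted into the next instruction's S field, then the pointer is incremented). The real ALTDS behaviour was not yet documented at this point, so the function name and the register addresses are assumptions.

```python
COG_LONGS = 512


def alts_then_mov(cog, pointer_addr, dest_addr):
    """Model 'ALTS pointer++' followed by 'MOV dest,0-0' on a list of cog registers."""
    source_addr = cog[pointer_addr] & 0x1FF   # ALTS: pointer content becomes the S field
    cog[pointer_addr] += 1                    # '++': post-increment the pointer register
    cog[dest_addr] = cog[source_addr]         # MOV executes with the substituted S field


MYPOINTER, TMP, TABLE = 0x010, 0x011, 0x040   # hypothetical register addresses

cog = [0] * COG_LONGS
cog[TABLE:TABLE + 3] = [11, 22, 33]           # a three-entry table in cog RAM
cog[MYPOINTER] = TABLE                        # pointer starts at the first entry

values = []
for _ in range(3):
    alts_then_mov(cog, MYPOINTER, TMP)
    values.append(cog[TMP])
print(values, hex(cog[MYPOINTER]))
```

Each pass reads the next table entry into tmp, and the MOV instruction stored in memory is never rewritten, which is the difference from the self-modifying approach discussed above.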

Heater. · 2014-07-30 17:06

evanh,

Bandwidth is not a problem.

I'm no expert but you saying that does not convince me. If it were so straightforward, how come my quad core Intel does not have four buses out to RAM? How come Amdahl's law exists? How come the world speaks so much of the "von Neumann bottleneck"? How come XMOS don't do this?

Not relevant here. General computing is a different world.

Is it?

Note that I have used and seen used the first 7 generations of the Intel family in real-time embedded systems (Well OK not the 8080 personally).

We ditched compatibility with Prop1 a long time back.

Did we?

The current Prop II design may not be a drop in replacement for the Prop 1 software wise. Code will need "porting". The last time I expressed concerns about code compatibility I was assured that Propeller 1 PASM could be tweaked to run on a Propeller II without a total rewrite and re-architecting. This is very important in terms of reusing much existing code, the Propeller is particularly in need of this as it has no hardware peripherals and requires that code in order to be useful to people out of the box.

Bill Henning · 2014-07-30 18:19

I think you are right Andy!

Discussion of ALTS:

http://forums.parallax.com/showthread.php/156242-Question-about-ALTDS-implementation-in-new-chip

Ariba wrote: »

I don't think INDA and INDB exist anymore in the current design. Indirect addressing is now done with the ALTDS instruction, which lets you alter the D or S field of the following instruction, and maybe allows some modification of the registers that provide the value for the following D or S field. Something like:

    ALTS mypointer++     'use the content of mypointer for the following S field, increment mypointer
    MOV  tmp,0-0         'indirect access of source

Andy

ozpropdev · 2014-07-30 18:23

Last I heard there were no INDA/B indirect registers anymore.

	-- addressable registers
	--
	--	addr		read		write		name
	--	----------------------------------------------------
	--
	--	000-1F7		RAM		RAM
	--
	--	1F8		PTRA		RAM+PTRA	PTRA
	--	1F9		PTRB		RAM+PTRB	PTRB
	--	1FA		INA		RAM		INA
	--	1FB		INB		RAM		INB
	--	1FC		RAM		RAM+OUTA	OUTA
	--	1FD		RAM		RAM+OUTB	OUTB
	--	1FE		RAM		RAM+DIRA	DIRA
	--	1FF		RAM		RAM+DIRB	DIRB

Added to these registers are the six pointers at $1F2 to $1F7.
See the ALTDS instruction as an indirect replacement.

Edit: Thanks Bill for the link.
