Fast Bytecode Interpreter

Cluso99 · 2017-04-28 08:57

cgracey wrote: »

Remember that RDLUT and WRLUT use S for their addressing. That S is indirect if it is a register. It is not available before the instruction, like D and S addresses are as they go through the pipeline. S becomes available at the start of the instruction, so the LUT read may be issued then, causing data to return late in the next clock - too late to be mux'd to the result. So, the data is captured and then flows through the result mux on the 3rd clock. That's why RDLUT takes three clocks. WRLUT only takes two, because there is nothing to wait for. It's done in the 1st clock.

It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, its instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.

I guess I am missing why RDLUT isn't simply a MOV or Multiple MOV with SETQ2, where the D and/or S could be COG or LUT.

When I added LUT to the P1V, I merely added extra COG RAM, and then used an extra bit to address Extended/Upper COG (ie LUT) addresses.

In the P2, one of the ports of the LUT might be used by the streamer, and hence I follow that in that case, an extra clock is required to access the single port. But if the streamer is not running (accessing the LUT), then the second port on the LUT could/should be available (via a simple mux).

cgracey · 2017-04-28 09:19

Cluso99 wrote: »

cgracey wrote: »

Remember that RDLUT and WRLUT use S for their addressing. That S is indirect if it is a register. It is not available before the instruction, like D and S addresses are as they go through the pipeline. S becomes available at the start of the instruction, so the LUT read may be issued then, causing data to return late in the next clock - too late to be mux'd to the result. So, the data is captured and then flows through the result mux on the 3rd clock. That's why RDLUT takes three clocks. WRLUT only takes two, because there is nothing to wait for. It's done in the 1st clock.

It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, its instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.

I guess I am missing why RDLUT isn't simply a MOV or Multiple MOV with SETQ2, where the D and/or S could be COG or LUT.

When I added LUT to the P1V, I merely added extra COG RAM, and then used an extra bit to address Extended/Upper COG (ie LUT) addresses.

In the P2, one of the ports of the LUT might be used by the streamer, and hence I follow that in that case, an extra clock is required to access the single port. But if the streamer is not running (accessing the LUT), then the second port on the LUT could/should be available (via a simple mux).

I suppose it could be redone to work like register space, but there would need to be some way to select it, as opposed to lower cog RAM. It hurts my head to think about it. There may be some timing reason(s) why I made it work the way that it does. I really don't remember, but it fetches code just like the register RAM. I say, just use it for code and fast lookups. It's great for that. And making it randomly-accessible via D and S would necessitate some kind of context control. It just seems like a headache.

Cluso99 · 2017-04-28 10:12

cgracey wrote: »

Cluso99 wrote: »

cgracey wrote: »

Remember that RDLUT and WRLUT use S for their addressing. That S is indirect if it is a register. It is not available before the instruction, like D and S addresses are as they go through the pipeline. S becomes available at the start of the instruction, so the LUT read may be issued then, causing data to return late in the next clock - too late to be mux'd to the result. So, the data is captured and then flows through the result mux on the 3rd clock. That's why RDLUT takes three clocks. WRLUT only takes two, because there is nothing to wait for. It's done in the 1st clock.

It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, its instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.

I guess I am missing why RDLUT isn't simply a MOV or Multiple MOV with SETQ2, where the D and/or S could be COG or LUT.

When I added LUT to the P1V, I merely added extra COG RAM, and then used an extra bit to address Extended/Upper COG (ie LUT) addresses.

In the P2, one of the ports of the LUT might be used by the streamer, and hence I follow that in that case, an extra clock is required to access the single port. But if the streamer is not running (accessing the LUT), then the second port on the LUT could/should be available (via a simple mux).

I suppose it could be redone to work like register space, but there would need to be some way to select it, as opposed to lower cog RAM. It hurts my head to think about it. There may be some timing reason(s) why I made it work the way that it does. I really don't remember, but it fetches code just like the register RAM. I say, just use it for code and fast lookups. It's great for that. And making it randomly-accessible via D and S would necessitate some kind of context control. It just seems like a headache.

Ok. That's fine Chip. I just thought it might be easy when looking at it from a different perspective.

JasonDorie · 2017-04-28 16:10

Rayman wrote: »

I think that C will evaluate all the conditions before deciding.
But, Spin1 quits evaluating at the first sign of being false.

The trivial fix for this is to use bitwise operators instead of logical ones.

if( A() && B() )
... will not evaluate B() if A() returns false.

if( A() & B() )
... will evaluate both statements then bitwise and the result. It doesn't short circuit.

This is a common trick to avoid branch misprediction stalls when the functions are inlined, short, and random-ish.
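
To make the difference concrete, here is a minimal C sketch. The functions A() and B() and the two call counters are hypothetical, added only so the number of calls can be observed; nothing here comes from a real library.

```c
#include <stdbool.h>

/* Hypothetical call counters so the effect of each operator is observable. */
static int calls_a = 0;
static int calls_b = 0;

static bool A(void) { calls_a++; return false; }  /* always reports failure */
static bool B(void) { calls_b++; return true; }

/* Logical AND: B() is skipped because A() already returned false. */
static bool both_logical(void) { return A() && B(); }

/* Bitwise AND: both operands are always evaluated
   (the order in which they are evaluated is unspecified in C). */
static bool both_bitwise(void) { return A() & B(); }
```

Calling both_logical() once bumps only calls_a, while both_bitwise() bumps both counters. Both return false here because A() returns false.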

Rayman · 2017-04-28 16:29

Ok, so I had it backwards. C short circuits but Spin does not.

Jason just said it does though, so need to check on this...

Wikipedia says that in C and C++ (https://en.wikipedia.org/wiki/Short-circuit_evaluation) these operators short circuit:

&&, ||, and ?:

JasonDorie · 2017-04-28 17:21

That's what I meant - The logical operators do short circuit (&&, ||) but the bitwise operators do not (&, |)

Spin1 doesn't short circuit anything. C/C++, C#, Java, Python, etc. do short circuit with logical operators.

evanh · 2017-05-02 07:16

I probably should keep my head down here but ... performing assignment operations inside of a conditional check is not what I'd call good coding practice. It's downright obfuscation, imho!

Heater. · 2017-05-02 21:10

I agree evanh, I don't like to see that either.

Some programmers feel the need to show off their mojo by doing that kind of unnecessary thing.

potatohead · 2017-05-02 21:37

Let them.

Doing that is definitely not mainstream, but for some, using expressions can pack a lot of code into a small package. It's how they think.

It's sometimes a lot like the flow seen in assembly language. Makes sense to me, given how SPIN and PASM work like one thing. I always thought that was a part of why the expressions are so robust.

Not my preference. Looks like line noise. But, I've seen enough different thinkers to see why it makes sense to them.

David Betz · 2017-05-02 21:41

evanh wrote: »

I probably should keep my head down here but ... performing assignment operations inside of a conditional check is not what I'd call good coding practice. It's downright obfuscation, imho!

I don't think that is the most common use of short-circuit boolean evaluation. It's more stuff like this:

if (fp != NULL && fread(buf, 1, sizeof(buf), fp) > 0) {
    /* do something with the data read from the file */
}

Here you don't want to invoke fread if fp is NULL.

Heater. · 2017-05-02 21:53

That is exactly the kind of thing I don't like to see.

Roy Eltham · 2017-05-02 22:17

My more common use of short circuits in C/C++ is this:

if (ptr != nullptr && ptr->IsValid())
{
   // do something with ptr
}

or

if (ptr != nullptr && (ptr->initialized == true))
{
   // do something with ptr
}

Rayman · 2017-05-02 22:26

The way I see it is in IF statements with function calls that return Boolean...

I seem to remember translating some code from C++ to Spin and it came up.
You wouldn't want it to call the second function if the first one isn't what you need...

It was many years ago when I had that issue though...

Rayman · 2017-05-02 22:31

Hey, Google found the thread I was thinking about! Amazing...

http://forums.parallax.com/discussion/107274/evaluation-order-in-spin

evanh · 2017-05-02 22:43

Oh, ha! I was thinking bugginess problems, not a feature to be used.

I wouldn't use short circuiting because it's too easy to misunderstand or, worse, misinterpret when refactoring/porting.

jmg · 2017-05-02 22:49

David Betz wrote: »
evanh wrote: »

I probably should keep my head down here but ... performing assignment operations inside of a conditional check is not what I'd call good coding practice. It's downright obfuscation, imho!

I don't think that is the most common use of short-circuit boolean evaluation. It's more stuff like this:
if (fp != NULL && fread(buf, 1, sizeof(buf), fp) > 0) {
    /* do something with the data read from the file */
}
Here you don't want to invoke fread if fp is NULL.

Yes, that is a very common use of short circuiting, and reads fine to me.
Short circuiting is required on systems that would generate a GPF on a bad pointer read, less vital on a MCU.

JasonDorie · 2017-05-02 22:50

Heater. wrote: »

That is exactly the kind of thing I don't like to see.

The alternate is overly verbose:

if (fp != NULL )
{
  if( fread(buf, 1, sizeof(buf), fp) > 0) {
      /* do something with the data read from the file */
  }
}

The original form, and Roy's example, are commonplace in C/C++, and far more legible (to me) than, say, PASM's use of wr or if_c on statements. I'm not saying those shouldn't exist - I understand their utility, they just trip me up sometimes. C's operator and expression syntax is similar - it's powerful, and can be abused, but used properly results in efficient and reasonably legible code.

JasonDorie · 2017-05-02 22:54

I bet Heater would hate C#'s way of doing things: exceptions:

  try {
    file.read( buffer, bytesToRead );
    return true;
  }

  catch( System.IO.IOException e )
  {
    return false;
  }

That's a contrived example, but it actually works quite well when you have a LOT of statements in a row that all depend on some prior success. This way you don't have to litter your code with error checks - you just write the straight-line code assuming it'll all work, and the exceptions only trigger when it doesn't, so it's more verbose, but can actually be more legible.

tonyp12 · 2017-05-02 22:59

One way to use short circuiting: if you want to count down to zero and run a one-shot routine exactly once, but don't want the counter to roll under.

if (counter && !--counter) {
  /* do something if it was at 1 and reached zero */
}
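
Wrapped in a function, the same expression can be sketched as follows. The helper name tick_one_shot is hypothetical, and an unsigned counter is assumed.

```c
#include <stdbool.h>

/* Returns true exactly once: on the call that takes *counter from 1 to 0.
   A counter that is already 0 is left untouched, because the right-hand
   side of && is never evaluated, so it cannot wrap below zero. */
static bool tick_one_shot(unsigned *counter)
{
    return *counter && !--*counter;
}
```

Starting from 2, the first call returns false and leaves 1, the second returns true and leaves 0, and every later call returns false with the counter still at 0.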

Heater. · 2017-05-03 00:13

@jmg,

Short circuiting is required on systems that would generate a GPF on a bad pointer read...

Certainly checking for a valid pointer may be required. That does not mean using short circuit evaluation and cramming everything into a single line is.

...less vital on a MCU.

Hmm...I guess if you want your system to work predictably, failing nicely, then such checks are equally vital on MCUs as well.

@JasonDorie

The alternate is overly verbose:

I don't see it as "overly verbose." I see it as verbose as it needs to be to make for clearly readable code that expresses the intent of the programmer in the simplest way.

I guess we will never agree on that difference though.

I bet Heater would hate C#'s way of doing things: exceptions:

Dead right. I hate exceptions. Mostly because they get used incorrectly by so many programmers.

I don't believe that is the C# way as such. Exceptions are supposed to be there to handle, well, exceptional circumstances. Not to handle all kinds of situations that are normal, for example a user trying to open a non-existent file.

Throwing all the errors you have not thought about into an exception handler is a recipe for memory and other resource leaks. It also obfuscates your intended program flow.

JasonDorie · 2017-05-03 00:41

"Throwing all the errors you have not thought about into an exception handler is a recipe for memory and other resource leaks. It also obfuscates your intended program flow."

I agree somewhat, but I've seen code that looks like this:

  mDeviceInterface = GlobalGetDeviceInterface();
  if( mDeviceInterface == nullptr ) return Err_NoDevice;

  mObjectHandle = mDeviceInterface->GetObjectHandle();
  if( mObjectHandle == nullptr) {
    mDeviceInterface->Release();
    return Err_NoHandle;
  }

  bool Supported = mObjectHandle->QueryFeature(ShaderLevel_3_0);
  if( Supported == false ) {
    mObjectHandle->Release();
    mDeviceInterface->Release();
    return Err_FeaturesUnsupported;
  }

...and so on. I'd argue that's actually worse. It's clear, but it's very tedious to code that way, and it's really easy to miss releasing some resource, so using exceptions (or goto) actually makes sense.
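
For the goto variant, a minimal C sketch might look like the following. Everything in it is a stand-in: acquire(), release(), the live_resources counter and init_device() are hypothetical helpers, not the device API from the snippet above, and the error codes only reuse its names. For simplicity the sketch also releases everything on the success path.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error codes, reusing the names from the snippet above. */
enum { Ok = 0, Err_NoDevice, Err_NoHandle, Err_FeaturesUnsupported };

/* Counts resources currently held, so a leak is easy to spot. */
static int live_resources = 0;

/* Stand-in for acquiring a device or handle; fails on request. */
static void *acquire(bool succeed)
{
    void *p = succeed ? malloc(1) : NULL;
    if (p) live_resources++;
    return p;
}

/* Safe to call with NULL, which keeps the cleanup path unconditional. */
static void release(void *p)
{
    if (p) { free(p); live_resources--; }
}

static int init_device(bool have_device, bool have_handle, bool supported)
{
    int   err    = Ok;
    void *device = NULL;
    void *handle = NULL;

    device = acquire(have_device);
    if (device == NULL) { err = Err_NoDevice; goto cleanup; }

    handle = acquire(have_handle);
    if (handle == NULL) { err = Err_NoHandle; goto cleanup; }

    if (!supported) { err = Err_FeaturesUnsupported; goto cleanup; }

    /* ... use device and handle here ... */

cleanup:
    /* One exit path releases everything in reverse order of acquisition. */
    release(handle);
    release(device);
    return err;
}
```

Each failure sets an error code and jumps to the single cleanup label, so a newly added resource needs only one extra release() line instead of one per error branch.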

I think a lot of language dislike comes down to stylistic choices. I'm "fluent" in C/C++ and C#, but I occasionally see code written in those languages that looks completely foreign because the coder is all over the place with style, indenting, poor naming, dense expressions, and so on. It works, but it's hard to read and maintain.

Anyway - getting off topic for this thread, and I didn't mean to hijack.

Bill Henning · 2017-05-11 17:25

So code in the LUT runs at full speed?

Sounds more than good enough to me, data can stay in the cog to be accessed at full speed.

cgracey wrote: »

Remember that RDLUT and WRLUT use S for their addressing. That S is indirect if it is a register. It is not available before the instruction, like D and S addresses are as they go through the pipeline. S becomes available at the start of the instruction, so the LUT read may be issued then, causing data to return late in the next clock - too late to be mux'd to the result. So, the data is captured and then flows through the result mux on the 3rd clock. That's why RDLUT takes three clocks. WRLUT only takes two, because there is nothing to wait for. It's done in the 1st clock.

It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, its instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.

Seairth · 2017-05-11 19:39

Bill Henning wrote: »

So code in the LUT runs at full speed?

Sounds more than good enough to me, data can stay in the cog to be accessed at full speed.

Yes, if the PC is between $200 and $3FF, the cog is fetching instructions (at full speed) from LUT instead of COG. Like you suggest, you could execute from the LUT and use the COG ram purely as data registers. Combine that with shared LUT mode, where the paired cog could dynamically swap out executable code, and you end up with some really interesting execution options!
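
As a rough illustration of that address split, here is a small C sketch. The type and function names are hypothetical. The $200..$3FF LUT range is as stated above, the hub boundary at $400 is taken from elsewhere in this thread, and the $000..$1FF range for cog RAM is assumed to be whatever sits below the LUT.

```c
#include <stdint.h>

/* Where the cog fetches its next instruction from, by PC value. */
typedef enum { FETCH_COG, FETCH_LUT, FETCH_HUB } fetch_source;

static fetch_source fetch_source_for_pc(uint32_t pc)
{
    if (pc < 0x200) return FETCH_COG;  /* $000..$1FF: cog register RAM (assumed) */
    if (pc < 0x400) return FETCH_LUT;  /* $200..$3FF: LUT RAM, full speed */
    return FETCH_HUB;                  /* $400 and up: hub execution */
}
```

So a PC of $200 or $3FF is classified as a LUT fetch, and $400 is the first address that falls through to the hub.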

jmg · 2017-05-11 19:57

Seairth wrote: »

Yes, if the PC is between $200 and $3FF, the cog is fetching instructions (at full speed) from LUT instead of COG. Like you suggest, you could execute from the LUT and use the COG ram purely as data registers. Combine that with shared LUT mode, where the paired cog could dynamically swap out executable code, and you end up with some really interesting execution options!

I wonder how elastic the LUT size is ?

If we are wildly optimistic for a moment, and presume the routed device has spare space after the 512k RAM is included, how easy is it to increase the LUT size to the next notch ?

potatohead · 2017-05-11 20:02

It's a 16x cost.

potatohead · 2017-05-11 20:04

IMHO, the paired COG LUT was really worth it. It may be that XMM can be done at some reasonable speed.

Need more time to play. This is one feature I've not used yet!

Seairth · 2017-05-11 20:09

jmg wrote: »

I wonder how elastic the LUT size is ?

If we are wildly optimistic for a moment, and presume the routed device has spare space after the 512k RAM is included, how easy is it to increase the LUT size to the next notch ?

No. Just... no.

potatohead · 2017-05-11 20:19

Agreed. I think we have managed to very fully exploit the fab process potential.

Synthesis may well tell us we've done too much. I keep harboring that expectation. I hope to be wrong, and likely am. Chip has followed the design rules learned from HOT. And regular reconsideration, like the buffer registers added, should play out well too.

But, there are a lot of systems untested, which we all can and should work on, as well as a first pass synthesis on all this. I find it hard to believe some guidance and or compromise won't come out of that stage in the project.

The tweaks, fine tuning instructions, XBYTE, all seem appropriate and moderate risk at best.

Something like expanding LUT seems very high risk, and we've had the talk before. Address space bits would need to be added, or some banking, segment type kludge would be.

Besides, I'll bet the space we have now can prove very efficient once some work has been put into big code execute, should people go there.

It's more than adequate for the other uses.

cgracey · 2017-05-11 23:48

Yeah, it's a little late to double the LUT memory. I think we have a nice balance, anyway.

Cluso99 · 2017-05-11 23:53

LUT size has been locked in because code executing from hub starts at long address $400 (byte address $1000), i.e. there is a hole for cog and LUT.
