pbtc -- an open source PBASIC tokenization toolchain

AJ Milne · 2014-10-19 05:08

As per the previous thread, this one has its own tokenizer, so should be fully buildable on any target system that has:

. a decent C++ compiler with STL support (this is any halfway recent g++/gcc)
. a libpcre port (Perl-compatible regular expressions library for C/C++; this dependency may go away before long, but I'm pretty sure the lib's pretty ubiquitous anyway)
. Flex and Bison ports

Note, however, that while it may be of interest at this stage for the deeply binary curious, it is _not_ at all ready for average end users as yet. It generates valid code, but only for a subset of the higher level language, which isn't even PBASIC-proper yet (though the migration to this should be pretty painless from here). It's just up in a pre-alpha state for anyone interested in contributing, mostly.

The repository is at:

http://sourceforge.net/p/pbtc/git/ci/master/tree/

... and to get some idea of the higher-level language support progress, see especially the Bison input at:

http://sourceforge.net/p/pbtc/git/ci/master/tree/pbp/pbas.y

Thanks all. Feedback appreciated, but, again, bear in mind: this is not at all yet expected to be an end-user-friendly tool.

Tracy Allen · 2014-10-20 12:01

This sounds great and I'll take a look. But I won't be much help I'm afraid.
C++? My ability to read C code stands somewhere between my Latin and my Greek; beyond that, ++?
STL support, g++/gcc, libpcre port, Perl support, no idea, despite ubiquity.
Flex and Bison, isn't that what you do at the gym, and what does it have to do with a shaggy prehistoric-looking mammal that hangs around bubbling geysers?! I stand in awe! But thanks again for taking it on.

AJ Milne · 2014-10-22 08:18

Thanks much.

Work's got a bit mad again, so I've had to slow down a bit, but I've been able to put bits and pieces of hours into it in the evenings, last little while, chipping away. At this rate, I figure I'll probably have full coverage for the BS2 command set from BASIC itself within a week or so. Then still lots of fit and finish things to make it a little less user-hostile, which only the very unwise would even try to predict, timewise, but it really does look like getting it to useful shouldn't be that big a thing.

Ken Gracey · 2014-10-22 10:01

Any examples of PBTC anywhere? This is most interesting.

Thanks,

Ken Gracey

signal · 2014-10-22 10:02

Yeah?....uh-huh?....duh?.

Yes I'm interested, and will take your advice about waiting awhile.

In the meanwhile I do have pbasic working and can well get along with that.

Thank You

P.S. This site really needs a Linux forum.

AJ Milne · 2014-10-22 17:41

Thanks, guys. And Ken, re examples, sure: explaining it in a bit more detail, with specifics, given the test program ex.bs2, as follows:

ex.bs2:
---

a var byte
b var byte
c var word

b = 1
c = 650

for a = 1 to 8 
  debug "test: ", dec a, cr
  c = c + 25
  gosub l3
next

freqout 4, 2000, 3000
end

l3:
  pulsout a, c
  pause 20
  return

... if you ran the standard tokenizer on it, generating ex.tok, you could then run pbttc (the encoder/decoder) in 'disassembly' mode on the .tok file, as follows:

pbttc d ex.tok > ex.detok

... yielding the following:

ex.detok:
---

. 1
. set_var_byte 09
vset
. 650
. set_var_word 03
vset
. 1
. set_var_byte 08
vset
label_1:
. 84
. 16
serout
    116
    ee
  lc
    101
    ee
  lc
    115
    ee
  lc
    116
    ee
  lc
    58
    ee
  lc
    32
    ee
  lc
    438
    . get_var_byte 08
    ee
  lc
    13
    ee
  le
. get_var_word 03
. 25
opr+
. set_var_word 03
vset
gosub label_0
. 8
. 1
loop_cmp_step_jmp
  get_var_byte 08
  . 1
  ee
  set_var_byte 08
  ee
  adr. label_1
. 4
. 2000
. 3000
freqout
end
label_0:
. get_var_word 03
. get_var_byte 08
pulsout
. 20
pause
return

... that's a 'portable' version of the mnemonic syntax, with the jump targets abstracted to labels. You can also request a 'literal' format, with the addresses and other bits left in place, like this:

pbttc ex.tok -n > ex.ndetok

... yielding:

ex.ndetok:
---

000.0  s_addr 003.4
001.6  adr. 02b.3
003.4  . 1
004.4  . set_var_byte 09
006.0  vset
006.7  . 650
009.0  . set_var_word 03
00a.3  vset
00b.2  . 1
00c.2  . set_var_byte 08
00d.6  vset
00e.5  . 84
010.3  . 16
011.3  serout
012.2      116
013.7      ee
014.0    lc
014.1      101
015.6      ee
015.7    lc
016.0      115
017.5      ee
017.6    lc
017.7      116
019.4      ee
019.5    lc
019.6      58
01b.2      ee
01b.3    lc
01b.4      32
01c.3      ee
01c.4    lc
01c.5      438
01e.4      . get_var_byte 08
020.0      ee
020.1    lc
020.2      13
021.4      ee
021.5    le
021.6  . get_var_word 03
023.1  . 25
024.5  opr+
025.4  . set_var_word 03
026.7  vset
027.6  gosub 03b.3 r. 02b.3
02b.3  . 8
02c.3  . 1
02d.3  loop_cmp_step_jmp
02e.2    get_var_byte 08
02f.5    . 1
030.5    ee
030.6    set_var_byte 08
032.1    ee
032.2    adr. 00e.5
034.0  . 4
035.0  . 2000
037.2  . 3000
039.5  freqout
03a.4  end
03b.3  . get_var_word 03
03c.6  . get_var_byte 08
03e.2  pulsout
03f.1  . 20
040.5  pause
041.4  return
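
(Editorial aside, a sketch only: the post doesn't spell out the address notation in the left column. Reading each address as a hex byte offset, a dot, and a bit index from 0 to 7 is an assumption, though every row of the listing is consistent with it. Under that assumption, the width of each token in bits falls out of the difference between consecutive addresses. The function names here are made up for illustration and are not part of pbttc.)

```python
def to_bits(addr):
    """Convert an address like '003.4' to an absolute bit position.

    ASSUMPTION: the part before the dot is a hex byte offset and the
    part after it is a bit index (0-7). The thread does not confirm this.
    """
    byte_hex, bit = addr.split(".")
    return int(byte_hex, 16) * 8 + int(bit)


def token_widths(listing):
    """Return (token text, width in bits) for every row except the last.

    Each non-blank row is expected to look like '<address>  <token text>'.
    The last row has no following address, so its width is unknown.
    """
    rows = []
    for line in listing.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            rows.append((to_bits(parts[0]), parts[1].strip()))
    return [(text, nxt - pos) for (pos, text), (nxt, _) in zip(rows, rows[1:])]


sample = """\
003.4  . 1
004.4  . set_var_byte 09
006.0  vset
006.7  . 650
"""
for text, width in token_widths(sample):
    print(f"{width:2d} bits  {text}")
```

If the reading is right, this makes it easy to see at a glance how tightly the token stream is packed, e.g. that a small literal push costs far fewer bits than a large one.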

... pbttc also can 'assemble' files written in the portable mnemonic format (I've taken to calling it pbt), either to the literal format (it just works out the 'header' with the start and return address, and resolves the jump addresses) or to a properly encoded .tok file. And this portable format, again, is the format another tool, pbp, the basic parser/compiler, _emits_, when reading BASIC input. So you can feed pbp a file like this:

ex.bss:
---

a var byte;
b var byte;
c var word;


b = 1;
c = 650;

for a = 1 to 8 {
  debug "test: ", dec a, cr;
  c = c + 25;
  gosub l3; }

freqout 4, 2000, 3000;
end;

l3:
  pulsout a, c;
  pause 20;
  return;

... like this:

pbpc ex.bss > ex.pbt

... and it will construct the following, from the BASIC:

ex.pbt:
---

    . 1
    . set_var_byte 09
    vset
    . 650
    . set_var_word 03
    vset
    . 1
    . set_var_byte 08
    vset
L000:
    . 84
    . 16
    serout
    116
    ee
    lc
    101
    ee
    lc
    115
    ee
    lc
    116
    ee
    lc
    58
    ee
    lc
    32
    ee
    lc
    438
    . get_var_byte 08
    ee
    lc
    13
    ee
    le
    . get_var_word 03
    . 25
    opr+
    . set_var_word 03
    vset
    gosub l3
    . 8
    . 1
    loop_cmp_step_jmp
    get_var_byte 08
    . 1
    ee
    set_var_byte 08
    ee
    adr. L000
    . 4
    . 2000
    . 3000
    freqout
    end
l3:
    . get_var_word 03
    . get_var_byte 08
    pulsout
    . 20
    pause
    return

... which, yes, is identical (excepting some different indents and names for jump target labels) to the detokenized output from the disassembly of the original .tok up there...
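
(Editorial aside: one way to check that "identical except for indents and label names" claim mechanically, as a minimal sketch. It assumes a label definition is a line consisting of a single word ending in a colon, and that a label reference only ever appears as an operand, i.e. not as the first word on a line. Both assumptions hold for the listings in this post but are not a documented property of the .pbt format; the function name is hypothetical.)

```python
def normalize(listing):
    """Strip indentation and rename labels by order of definition.

    ASSUMPTIONS (not confirmed by the thread): a label definition is a
    single word followed by ':', and label references are never the
    first word on a line.
    """
    lines = [ln.split() for ln in listing.splitlines() if ln.strip()]

    # First pass: map each label to a canonical name T0, T1, ...
    labels = {}
    for toks in lines:
        if len(toks) == 1 and toks[0].endswith(":"):
            labels.setdefault(toks[0][:-1], f"T{len(labels)}")

    # Second pass: rewrite definitions and operand references.
    out = []
    for toks in lines:
        if len(toks) == 1 and toks[0].endswith(":"):
            out.append(labels[toks[0][:-1]] + ":")
        else:
            out.append(" ".join([toks[0]] + [labels.get(t, t) for t in toks[1:]]))
    return out


detok = "label_1:\n. 1\ngosub label_0\n  adr. label_1\nend\nlabel_0:\nreturn\n"
pbt = "L000:\n    . 1\n    gosub l3\n    adr. L000\n    end\nl3:\n    return\n"
print(normalize(detok) == normalize(pbt))
```

Feeding it the full ex.detok and ex.pbt listings instead of the shortened fragments above would be the real test.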

So, unsurprisingly, if you then run the 'assembly' version of pbttc on this, like, say, this:

pbttc a ex.pbt ex.ntok

... that .ntok file will be identical (in this case) to the original .tok file emitted by the original tokenizer.

... so it's simply:

BASIC source -> parse with pbp -> emits .pbt stream
.pbt -> assemble with pbttc -> encodes .tok format file.

... and you can thereby encode valid .tok files entirely using the new toolchain, and this should work on any platform on which the two tools will themselves build.
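
(Editorial aside: the two-step flow and the "identical to the original .tok" check could be wrapped up roughly like this. It is a sketch only. The command lines are copied as quoted earlier in this post, including the pbpc spelling used there; nothing here runs the tools, it just builds the shell lines and compares two files byte for byte. The function names are hypothetical.)

```python
import filecmp
import shlex


def pipeline_commands(source, pbt, tok):
    """Build the two shell command lines quoted in the post.

    ASSUMPTION: the invocations are exactly as shown in the thread
    ('pbpc SRC > PBT' and 'pbttc a PBT TOK'); no other flags are known.
    """
    return [
        f"pbpc {shlex.quote(source)} > {shlex.quote(pbt)}",
        f"pbttc a {shlex.quote(pbt)} {shlex.quote(tok)}",
    ]


def same_bytes(path_a, path_b):
    """True if the two files have identical contents (full compare)."""
    return filecmp.cmp(path_a, path_b, shallow=False)


for cmd in pipeline_commands("ex.bss", "ex.pbt", "ex.ntok"):
    print(cmd)
```

After running the printed commands by hand, same_bytes("ex.tok", "ex.ntok") would tell you whether the new toolchain reproduced the reference tokenizer's output exactly.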

(... caveat: I say it's identical 'in this case' because, in fact, although it happens to be entirely identical in this example, pbp doesn't _always_ make quite the same choices about certain things as did the original--it deliberately organizes some jumps a bit differently, to try to save some extra/unnecessary code in certain cases, so on. But it mostly comes up with the same thing, bit for bit, so far (and where it doesn't, your end result should be the same, when it runs, at least, and I'm figuring I'll probably add a 'compatibility' mode, in which it really does crank out identical binaries, if this is what's preferred).)

(... and yes, you'll note the input syntax the pbp stage currently expects isn't _quite_ identical to standard PBASIC. But it's not going to be much trouble to make them the same, actually. I did that mostly because I'm a bit more familiar with languages that look like that myself, and it so happened the Bison example I started from was laid out more like that, and I've been focusing so far on making sure it emits good object code, given the same 'sense' of the input program. But if you take a look at the Bison grammar up there, yes, making it entirely the same should be pretty painless, mostly a matter of switching out the semicolons for end-of-lines, adding the terminal 'loop' and 'next' tokens to the grammar spec, so on. And this is on the todo list (though honestly, I've been getting a bit attached to the new syntax, may keep it around as an option, as well, anyway).)

That's it. That's pbtc.

AJ Milne · 2014-10-22 18:29

Adding: I don't really consider the .pbt format to be quite a finished thing, either. It's _workable_ for the encoder/decoder stages as it is, and the critical thing for me right now is just being able to see the sense of the flow of the token stream, to see that my parser is doing sensible things, to compare how it stacks up against the reference one, and so on. But, as one obvious example from the above i/o, I figure it would be nice if the disassembler took a shortcut on representing and encoding literal strings where these can be easily recognized in the context of debug/serout targets: instead of dumping the whole chain of ASCII values/chain links, it would just turn them back into the characters. So, instead of getting this:

... 
 . 84
 . 16
 serout
    116
    ee
    lc
    101
    ee
    lc
    115
    ee
    lc
    116
    ee
    lc
    58
    ee
    lc
    32
    ee
    lc
    438
    . get_var_byte 08
    ee
...

... you could maybe just get (and likewise compose for assembly) something like this:

... 
 . 84
 . 16
 serout
    "test: "
    lc
    dec
    . get_var_byte 08
    ee
...

... which, of course, would be a little more comprehensible, again, and definitely nicer for anyone actually trying to _write_ in .pbt, as opposed to using it as an intermediate format/analysis/optimization tool... (And, in fact, I've a Perl version of the disassembler that already does some of that (it does understand string formatters), but it's a bit down the priority list, again, really making that smooth in the working version.)

(ETA: I have now actually implemented half of this: the disassembler will now glob up a series of simple pushes of ints in the ASCII range in serout targets as strings, as above, and the assembler will expand these properly, so this part is solved. Figure it's an obvious enhancement. Formatters may be along in a bit; guess we'll see.)
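
(Editorial aside: a rough Python sketch of the globbing idea described above, not the actual pbttc implementation. It assumes the caller passes only the lines that belong to one serout target, already stripped of indentation, and that each plain character appears as the triplet value / ee / lc, as in the listings in this thread. The double-quote character is left un-globbed here to keep the quoting unambiguous, which is a choice of this sketch. Formatters such as the 438 entry are passed through untouched.)

```python
def glob_strings(tokens):
    """Collapse runs of '<printable ASCII code>', 'ee', 'lc' into a string.

    ASSUMPTION: 'tokens' holds the stripped lines of a single serout
    target. A run of k triplets becomes one quoted string followed by
    a single 'lc', matching the mock-up in the post.
    """
    out, run, i = [], [], 0
    while i < len(tokens):
        t = tokens[i]
        is_char = (
            t.isdigit()
            and 32 <= int(t) <= 126
            and int(t) != 34  # skip '"' so the quoted form stays unambiguous
            and tokens[i + 1:i + 3] == ["ee", "lc"]
        )
        if is_char:
            run.append(chr(int(t)))
            i += 3
            continue
        if run:  # flush the pending string before a non-character token
            out += ['"' + "".join(run) + '"', "lc"]
            run = []
        out.append(t)
        i += 1
    if run:
        out += ['"' + "".join(run) + '"', "lc"]
    return out


serout_args = [
    "116", "ee", "lc", "101", "ee", "lc", "115", "ee", "lc",
    "116", "ee", "lc", "58", "ee", "lc", "32", "ee", "lc",
    "438", ". get_var_byte 08", "ee", "lc", "13", "ee", "le",
]
print("\n".join(glob_strings(serout_args)))
```

The reverse step for the assembler would expand each quoted string back into one triplet per character.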
