pbtc -- an open source PBASIC tokenization toolchain — Parallax Forums

AJ Milne Posts: 12
edited 2014-10-22 18:29 in BASIC Stamp
As per the previous thread, this one has its own tokenizer, so should be fully buildable on any target system that has:

. a decent C++ compiler with STL support (this is any halfway recent g++/gcc)
. a libpcre port (Perl-compatible regular expressions library for C/C++; this dependency may go away before long, but I'm pretty sure the lib is ubiquitous anyway)
. Flex and Bison ports

Note, however, that while it may be of interest at this stage for the deeply binary curious, it is _not_ at all ready for average end users as yet. It generates valid code, but only for a subset of the higher level language, which isn't even PBASIC-proper yet (though the migration to this should be pretty painless from here). It's just up in a pre-alpha state for anyone interested in contributing, mostly.

The repository is at:

http://sourceforge.net/p/pbtc/git/ci/master/tree/

... and to get some idea of the higher-level language support progress, see especially the Bison input at:

http://sourceforge.net/p/pbtc/git/ci/master/tree/pbp/pbas.y

Thanks all. Feedback appreciated, but, again, bear in mind: this is not at all yet expected to be an end-user-friendly tool.

Comments

  • Tracy Allen Posts: 6,662
    edited 2014-10-20 12:01
    This sounds great and I'll take a look. But I won't be much help I'm afraid.
    c++ My ability to read c code stands somewhere between my latin and my greek, beyond that ++?.
    STL support, g++/gcc, libpcre port, Perl support, no idea, despite ubiquity.
    Flex and Bison, isn't that what you do at the gym, and what does it have to do with a shaggy prehistoric-looking mammal that hangs around bubbling geysers?! I stand in awe! But thanks again for taking it on.
  • AJ Milne Posts: 12
    edited 2014-10-22 08:18
    Thanks much.

    Work's got a bit mad again, so I've had to slow down a bit, but I've been able to put bits and pieces of hours into it in the evenings this last little while, chipping away. At this rate, I figure I'll probably have full coverage of the BS2 command set from BASIC itself within a week or so. Then there are still lots of fit and finish things to make it a little less user-hostile, which only the very unwise would even try to predict, timewise, but it really does look like getting it to useful shouldn't be that big a thing.
  • Ken Gracey Posts: 7,387
    edited 2014-10-22 10:01
    Any examples of PBTC anywhere? This is most interesting.

    Thanks,

    Ken Gracey
  • signal Posts: 6
    edited 2014-10-22 10:02
    Yeah?....uh-huh?....duh?.

    Yes I'm interested, and will take your advice about waiting awhile.

    In the meanwhile I do have pbasic working and can well get along with that.

    Thank You

    P.S. This site really needs a Linux forum.
  • AJ Milne Posts: 12
    edited 2014-10-22 17:41
    Thanks, guys. And Ken, re examples, sure: explaining it in a bit more detail, with specifics, given the test program ex.bs2, as follows:

    ex.bs2:
    ---
    a var byte
    b var byte
    c var word
    
    b = 1
    c = 650
    
    for a = 1 to 8 
      debug "test: ", dec a, cr
      c = c + 25
      gosub l3
    next
    
    freqout 4, 2000, 3000
    end
    
    l3:
      pulsout a, c
      pause 20
      return
    

    ... if you ran the standard tokenizer on it, generating ex.tok, you could then run pbttc (the encoder/decoder) in 'disassembly' mode on the .tok file, as follows:

    pbttc d ex.tok > ex.detok

    ... yielding the following:

    ex.detok:
    ---
    . 1
    . set_var_byte 09
    vset
    . 650
    . set_var_word 03
    vset
    . 1
    . set_var_byte 08
    vset
    label_1:
    . 84
    . 16
    serout
        116
        ee
      lc
        101
        ee
      lc
        115
        ee
      lc
        116
        ee
      lc
        58
        ee
      lc
        32
        ee
      lc
        438
        . get_var_byte 08
        ee
      lc
        13
        ee
      le
    . get_var_word 03
    . 25
    opr+
    . set_var_word 03
    vset
    gosub label_0
    . 8
    . 1
    loop_cmp_step_jmp
      get_var_byte 08
      . 1
      ee
      set_var_byte 08
      ee
      adr. label_1
    . 4
    . 2000
    . 3000
    freqout
    end
    label_0:
    . get_var_word 03
    . get_var_byte 08
    pulsout
    . 20
    pause
    return
    

    ... that's a 'portable' version of the mnemonic syntax, with the jump targets abstracted to labels. You can also request a 'literal' format, with the addresses and other bits left in place, like this:

    pbttc ex.tok -n > ex.ndetok

    ... yielding:

    ex.ndetok:
    ---
    000.0  s_addr 003.4
    001.6  adr. 02b.3
    003.4  . 1
    004.4  . set_var_byte 09
    006.0  vset
    006.7  . 650
    009.0  . set_var_word 03
    00a.3  vset
    00b.2  . 1
    00c.2  . set_var_byte 08
    00d.6  vset
    00e.5  . 84
    010.3  . 16
    011.3  serout
    012.2      116
    013.7      ee
    014.0    lc
    014.1      101
    015.6      ee
    015.7    lc
    016.0      115
    017.5      ee
    017.6    lc
    017.7      116
    019.4      ee
    019.5    lc
    019.6      58
    01b.2      ee
    01b.3    lc
    01b.4      32
    01c.3      ee
    01c.4    lc
    01c.5      438
    01e.4      . get_var_byte 08
    020.0      ee
    020.1    lc
    020.2      13
    021.4      ee
    021.5    le
    021.6  . get_var_word 03
    023.1  . 25
    024.5  opr+
    025.4  . set_var_word 03
    026.7  vset
    027.6  gosub 03b.3 r. 02b.3
    02b.3  . 8
    02c.3  . 1
    02d.3  loop_cmp_step_jmp
    02e.2    get_var_byte 08
    02f.5    . 1
    030.5    ee
    030.6    set_var_byte 08
    032.1    ee
    032.2    adr. 00e.5
    034.0  . 4
    035.0  . 2000
    037.2  . 3000
    039.5  freqout
    03a.4  end
    03b.3  . get_var_word 03
    03c.6  . get_var_byte 08
    03e.2  pulsout
    03f.1  . 20
    040.5  pause
    041.4  return
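
    (An aside on reading the literal listing: the left-hand column looks like a hex byte offset, a dot, then a bit offset from 0 to 7, which would mean tokens are packed at bit rather than byte granularity. That reading is inferred from the listing itself and is not confirmed against the pbtc sources. Under that assumption, a small Python sketch, with hypothetical helper names that are not part of pbtc, can turn the addresses into bit offsets and token widths.)

```python
# Sketch only. Assumes each address in the literal listing has the form
# "<hex byte offset>.<bit offset 0-7>". This is inferred from the listing,
# not taken from the pbtc sources; all function names are hypothetical.

def addr_to_bits(addr: str) -> int:
    """Convert an address such as '003.4' to an absolute bit offset."""
    byte_part, bit_part = addr.split(".")
    bit = int(bit_part)
    if not 0 <= bit <= 7:
        raise ValueError(f"bit offset out of range in {addr!r}")
    return int(byte_part, 16) * 8 + bit


def bits_to_addr(bits: int) -> str:
    """Convert an absolute bit offset back to the 'byte.bit' form."""
    return f"{bits // 8:03x}.{bits % 8}"


def token_widths(addresses):
    """Width in bits of each entry, measured to the start of the next one."""
    offsets = [addr_to_bits(a) for a in addresses]
    return [nxt - cur for cur, nxt in zip(offsets, offsets[1:])]


# The first six addresses from ex.ndetok above.
first = ["000.0", "001.6", "003.4", "004.4", "006.0", "006.7"]
print(token_widths(first))  # prints [14, 14, 8, 12, 7]
```

    Read that way, the two header entries (s_addr and adr.) each occupy 14 bits, and the bare vset at 006.0 occupies 7.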
    

    ... pbttc also can 'assemble' files written in the portable mnemonic format (I've taken to calling it pbt), either to the literal format (it just works out the 'header' with the start and return address, and resolves the jump addresses) or to a properly encoded .tok file. And this portable format, again, is the format another tool, pbp, the basic parser/compiler, _emits_, when reading BASIC input. So you can feed pbp a file like this:

    ex.bss:
    ---
    a var byte;
    b var byte;
    c var word;
    
    
    b = 1;
    c = 650;
    
    for a = 1 to 8 {
      debug "test: ", dec a, cr;
      c = c + 25;
      gosub l3; }
    
    freqout 4, 2000, 3000;
    end;
    
    l3:
      pulsout a, c;
      pause 20;
      return;
    

    ... like this:

    pbpc ex.bss > ex.pbt

    ... and it will construct the following, from the BASIC:

    ex.pbt:
    ---
        . 1
        . set_var_byte 09
        vset
        . 650
        . set_var_word 03
        vset
        . 1
        . set_var_byte 08
        vset
    L000:
        . 84
        . 16
        serout
        116
        ee
        lc
        101
        ee
        lc
        115
        ee
        lc
        116
        ee
        lc
        58
        ee
        lc
        32
        ee
        lc
        438
        . get_var_byte 08
        ee
        lc
        13
        ee
        le
        . get_var_word 03
        . 25
        opr+
        . set_var_word 03
        vset
        gosub l3
        . 8
        . 1
        loop_cmp_step_jmp
        get_var_byte 08
        . 1
        ee
        set_var_byte 08
        ee
        adr. L000
        . 4
        . 2000
        . 3000
        freqout
        end
    l3:
        . get_var_word 03
        . get_var_byte 08
        pulsout
        . 20
        pause
        return
    

    ... which, yes, is identical (excepting some different indents and names for jump target labels) to the detokenized output from the disassembly of the original .tok up there...
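
    One way to check that kind of equivalence mechanically is to normalize both listings before comparing them: strip the indentation and rename jump labels by order of definition. The Python sketch below is illustrative only. The normalize helper is hypothetical, not part of pbtc, and it assumes labels are defined as 'name:' on a line of their own and referenced as the last word of gosub and adr. lines, as in the listings above.

```python
import re

# Sketch only: a label definition is assumed to be "name:" alone on a line.
LABEL_DEF = re.compile(r"^([A-Za-z_]\w*):$")


def normalize(listing: str):
    """Strip indentation and rename labels to T0, T1, ... by definition order."""
    lines = [ln.strip() for ln in listing.splitlines() if ln.strip()]

    # Pass 1: assign a canonical name to each label in order of definition.
    rename = {}
    for ln in lines:
        m = LABEL_DEF.match(ln)
        if m:
            rename[m.group(1)] = f"T{len(rename)}"

    # Pass 2: rewrite definitions and the operands of jump-like lines.
    out = []
    for ln in lines:
        m = LABEL_DEF.match(ln)
        if m:
            out.append(rename[m.group(1)] + ":")
            continue
        words = ln.split()
        if words[0] in ("gosub", "adr."):
            words[-1] = rename.get(words[-1], words[-1])
        out.append(" ".join(words))
    return out


# Abbreviated fragments in the style of ex.detok and ex.pbt.
detok = """
label_1:
. get_var_word 03
gosub label_0
  adr. label_1
end
label_0:
return
"""
pbt = """
L000:
    . get_var_word 03
    gosub l3
    adr. L000
    end
l3:
    return
"""
print(normalize(detok) == normalize(pbt))  # prints True
```

    Renaming by definition order works here because both tools emit the same code in the same order; it would report a difference if one of them reordered blocks.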

    So, unsurprisingly, if you then run the 'assembly' version of pbttc on this, like, say, this:

    pbttc a ex.pbt ex.ntok

    ... that .ntok file will be identical (in this case) to the original .tok file emitted by the original tokenizer.

    ... so it's simply:

    BASIC source -> parse with pbp -> emits .pbt stream
    .pbt -> assemble with pbttc -> encodes .tok format file.

    ... and you can thereby encode valid .tok files entirely using the new toolchain, and this should work on any platform on which the two tools will themselves build.

    (... caveat: I say it's identical 'in this case' because, in fact, although it happens to be entirely identical in this example, pbp doesn't _always_ make quite the same choices about certain things as did the original--it deliberately organizes some jumps a bit differently, to try to save some extra/unnecessary code in certain cases, so on. But it mostly comes up with the same thing, bit for bit, so far (and where it doesn't, your end result should be the same, when it runs, at least, and I'm figuring I'll probably add a 'compatibility' mode, in which it really does crank out identical binaries, if this is what's preferred).)

    (... and yes, you'll note the input syntax the pbp stage currently expects isn't _quite_ identical to standard PBASIC. But it's not going to be much trouble to make them the same, actually. I did that mostly because I'm a bit more familiar with languages that look like that myself, and it so happened the Bison example I started from was laid out more like that, and I've been focusing so far on making sure it emits good object code, given the same 'sense' of the input program. But if you take a look at the Bison grammar up there, yes, making it entirely the same should be pretty painless, mostly a matter of switching out the semicolons for end-of-lines, adding the terminal 'loop' and 'next' tokens to the grammar spec, so on. And this is on the todo list (though honestly, I've been getting a bit attached to the new syntax, may keep it around as an option, as well, anyway).)

    That's it. That's pbtc.
  • AJ Milne Posts: 12
    edited 2014-10-22 18:29
    Adding: I don't really consider the .pbt format to be quite a finished thing, either. It's _workable_, for the encoder/decoder stages, as it is, right now, and the critical thing for me right now is just being able to see the sense of the flow of the token stream, be able to see my parser is doing sensible things, being able to compare how it stacks up against the reference one, so on, but, as one obvious example from the above i/o, I figure it would be nice if the disassembler just took a shortcut on representing and encoding literal strings where these can be easily recognized in the context of debug/serout targets, and instead of dumping the whole chain of ascii values/chain links, it just turned it back into the characters, so, instead of getting this:

    ... 
     . 84
     . 16
     serout
        116
        ee
        lc
        101
        ee
        lc
        115
        ee
        lc
        116
        ee
        lc
        58
        ee
        lc
        32
        ee
        lc
        438
        . get_var_byte 08
        ee
    ...
    


    ... you could maybe just get (and likewise compose for assembly) something like this:

    ... 
     . 84
     . 16
     serout
        "test: "
        lc
        dec
        . get_var_byte 08
        ee
    ...
    


    ... which, of course, would be a little more comprehensible, again, and definitely nicer for anyone actually trying to _write_ in .pbt, as opposed to using it as an intermediate format/analysis/optimization tool... (And, in fact, I've a Perl version of the disassembler that already does some of that (it does understand string formatters), but it's a bit down the priority list, again, really making that smooth in the working version.)

    (ETA: I have now actually implemented half of this: the disassembler will now glob up a series of simple pushes of ints in the ASCII range in serout targets as strings, as above, and the assembler will expand these properly, so this part is solved. Figure it's an obvious enhancement. Formatters may be along in a bit; guess we'll see.)
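
    As an illustration of the idea in that note, and not the actual pbtc implementation, here is a Python sketch that collapses runs of 'value, ee, lc' triples into a quoted string and expands them again. It assumes the listing has already been split into stripped lines, and it treats only printable ASCII (32 to 126, excluding the double quote) as string material; both choices, and the function names, are assumptions made for the example.

```python
# Sketch only: hypothetical helpers, not the pbtc disassembler/assembler.

def is_string_char(tok: str) -> bool:
    """True for a decimal literal in printable ASCII, excluding '"'."""
    return tok.isdigit() and 32 <= int(tok) <= 126 and int(tok) != 34


def collapse_strings(lines):
    """Replace runs of (<char code>, 'ee', 'lc') with '"text"', 'lc'."""
    out, i = [], 0
    while i < len(lines):
        chars = []
        while (i + 2 < len(lines) and is_string_char(lines[i])
               and lines[i + 1] == "ee" and lines[i + 2] == "lc"):
            chars.append(chr(int(lines[i])))
            i += 3
        if chars:
            out += ['"' + "".join(chars) + '"', "lc"]
        else:
            out.append(lines[i])
            i += 1
    return out


def expand_strings(lines):
    """Inverse of collapse_strings: turn '"text"', 'lc' back into triples."""
    out, i = [], 0
    while i < len(lines):
        ln = lines[i]
        if (len(ln) >= 2 and ln[0] == '"' and ln[-1] == '"'
                and i + 1 < len(lines) and lines[i + 1] == "lc"):
            for ch in ln[1:-1]:
                out += [str(ord(ch)), "ee", "lc"]
            i += 2
        else:
            out.append(ln)
            i += 1
    return out


# The serout target chain from the listing above, one stripped line per item.
chain = ["116", "ee", "lc", "101", "ee", "lc", "115", "ee", "lc",
         "116", "ee", "lc", "58", "ee", "lc", "32", "ee", "lc",
         "438", ". get_var_byte 08", "ee", "lc", "13", "ee", "le"]
print(collapse_strings(chain)[:3])  # prints ['"test: "', 'lc', '438']
```

    The 438 and the trailing 13 are left alone: the first is outside the ASCII range, and the second is a control character that this sketch chooses not to fold into a string.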