Usage

Purpose of this extension

StrScanner is a Ruby extension for fast string scanning.

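Ruby's Regexp class cannot treat an offset inside a string as the start of the string, so "\A" never matches in the middle of it. To scan a string piece by piece you must therefore create a new String for each step. For example:

p " I_want_to_match_this_word but can't".index( /\A\w+/, 1 )

This code prints "nil". Another way to match looks like this: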

str = " word word word"
while str.size > 0 do
  if /\A[ \t]+/ === str then
    str = $'
  elsif /\A\w+/ === str then
    str = $'
  end
end

But this method has a serious speed problem: $' creates a new string EVERY time. In this example, all of these strings are created:
" word word word"
"word word word"
" word word"
"word word"
" word"
"word"
""

This creates a heavy load. If 'str' is 50KB long, roughly 50KB ** 2 / 5 = 500MB of string data is created in total!!
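
The n ** 2 / 5 estimate can be checked on a small input. What follows is only a rough sketch, not part of StrScanner: the method name total_remainder_bytes is made up for illustration, and it counts the bytes of string contents only, not what the interpreter really allocates.

```ruby
# Sum the size of the original string and of every remainder string
# that the $' loop above produces while consuming it.
# (total_remainder_bytes is a hypothetical helper, not a library method.)
def total_remainder_bytes(str)
  total = str.size
  until str.empty?
    # Stop if neither pattern matches, so the sketch cannot loop forever.
    break unless /\A[ \t]+/ =~ str || /\A\w+/ =~ str
    str = $'               # the remainder after the match
    total += str.size
  end
  total
end

input = " word" * 1000                # 5000 bytes, same shape as the example
puts total_remainder_bytes(input)     # measured total of all string sizes
puts input.size ** 2 / 5              # the n ** 2 / 5 estimate
```

For this input the two printed numbers should differ by well under one percent.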

The StrScanner class solves this problem.
StrScanner holds a C string and a pointer into it. When scanning, StrScanner only advances the pointer and does not create a new string for the rest of the input. As a result, both running time and application memory size decrease.
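
The pointer movement can be watched directly. This sketch assumes the StringScanner class from Ruby's standard 'strscan' library as a stand-in for StrScanner; it may differ in details from the version described here. The method name scan_positions is made up for illustration.

```ruby
require 'strscan'

# Record where the scan pointer stands after each token.
# No remainder strings are built; only the position changes.
# (scan_positions is a hypothetical helper, not a library method.)
def scan_positions(text)
  s = StringScanner.new(text)
  positions = []
  until s.eos?
    # No \A in these patterns: the stdlib 'skip' matches only at the pointer.
    break unless s.skip(/[ \t]+/) || s.skip(/\w+/)
    positions << s.pos
  end
  positions
end

p scan_positions(" word word word")   # prints [1, 5, 6, 10, 11, 15]
```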

Simple examples and methods

Here are two short examples of a scanning routine.
The first is easy to write but slow. The second is also easy to write, but FAST, because it uses the StrScanner class.

First example:

ATOM = /\A\w+/
SPACE = /\A[ \t]+/

while str.size > 0 do
  if ATOM === str then
    str = $'
    return $&
  elsif SPACE === str then
    str = $'
    return $&
  end
end

Second example:

ATOM = /\A\w+/
SPACE = /\A[ \t]+/

s = StrScanner.new( str )
while s.rest? do
  if tmp = s.scan( ATOM ) then
    return tmp
  elsif tmp = s.scan( SPACE ) then
    return tmp
  end
end
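
The two fragments above can be turned into complete methods and compared. This is a sketch under assumptions: it uses the StringScanner class from Ruby's standard 'strscan' library in place of StrScanner, collects all tokens into an array instead of returning the first one, and the method names tokens_by_slicing and tokens_by_scanner are made up for illustration.

```ruby
require 'strscan'

# Slow variant: each token replaces str with a newly built remainder ($').
def tokens_by_slicing(str)
  tokens = []
  until str.empty?
    break unless /\A\w+/ =~ str || /\A[ \t]+/ =~ str
    tokens << $&
    str = $'
  end
  tokens
end

# Fast variant: the scanner only advances its pointer over the same string.
def tokens_by_scanner(str)
  s = StringScanner.new(str)
  tokens = []
  until s.eos?
    # No \A here: the stdlib 'scan' matches only at the scan pointer.
    tok = s.scan(/\w+/) || s.scan(/[ \t]+/)
    break unless tok
    tokens << tok
  end
  tokens
end

p tokens_by_slicing(" word word word")
p tokens_by_scanner(" word word word")
```

Both calls should print the same list of tokens; only the amount of string copying differs.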

Using StrScanner is simple.
First create a StrScanner object, then call the 'scan' method. It returns the matched string and, at the same time, advances the internally maintained "scan pointer", which is implemented simply as a pointer to char (char*).
The 'skip' method is similar to 'scan', but it returns the length of the matched string instead.

s = StrScanner.new( "abcdefg" )   # scan pointer is on 'a', index 0
puts s.scan( /\Aa/ )              # returns 'a'. scan pointer is on 'b', index 1
puts s.skip( /\Abc/ )             # returns 2. scan pointer is on 'd', index 3
(continued below)

At this point the previous "scan pointer" is still preserved in the StrScanner object. So str[ previous pointer...current pointer ] is the string that was just matched, the "matched string". We can get it with the 'matched' method.

puts s.matched                    # returns 'bc'. scan pointer does not move
puts s.scan( /\Aa/ )              # returns nil. scan pointer does not move, either
puts s.matched                    # returns 'bc'.
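
The successful steps of this walk-through can be reproduced as a runnable sketch. It assumes the StringScanner class from Ruby's standard 'strscan' library as a stand-in for StrScanner, so details (in particular what 'matched' returns after a failed scan) may differ from the version described here. The method name scan_skip_demo is made up for illustration.

```ruby
require 'strscan'

# Run 'scan', then 'skip', then ask for the last matched string.
# (scan_skip_demo is a hypothetical helper, not a library method.)
def scan_skip_demo
  s = StringScanner.new("abcdefg")
  first  = s.scan(/a/)     # the matched text; pointer moves to index 1
  length = s.skip(/bc/)    # the match length; pointer moves to index 3
  [first, length, s.matched, s.pos]
end

p scan_skip_demo           # prints ["a", 2, "bc", 3]
```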

Moving the scan pointer back is also permitted; the 'unscan' method does that. But 'unscan' can be done only ONCE per 'scan', because a StrScanner object cannot preserve more than one previous pointer.

puts s.scan( /\Ade/ )             # returns 'de'. scan pointer is on 'f', index 5
s.unscan                          # scan pointer is back on 'd', index 3
puts s.scan( /\Adef/ )            # returns 'def'. scan pointer is on 'g', index 6
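
The same back-and-forth can be checked by reading the pointer after each step. This sketch assumes the StringScanner class from Ruby's standard 'strscan' library as a stand-in for StrScanner; the method name unscan_demo is made up for illustration.

```ruby
require 'strscan'

# Scan, step back once with 'unscan', then scan a longer pattern.
# (unscan_demo is a hypothetical helper, not a library method.)
def unscan_demo
  s = StringScanner.new("abcdefg")
  s.skip(/abc/)                  # pointer is on 'd', index 3
  short = s.scan(/de/)           # pointer is on 'f', index 5
  s.unscan                       # pointer is back on 'd', index 3
  pos_after_unscan = s.pos
  long = s.scan(/def/)           # pointer is on 'g', index 6
  [short, pos_after_unscan, long, s.pos]
end

p unscan_demo                    # prints ["de", 3, "def", 6]
```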

Yes, all these regexps begin with "\A". This is important. If a regexp matches at a non-zero index, 'scan' (and the other methods) return the string from the CURRENT SCAN POINTER to the end of the match. For example:

str = StrScanner.new( 'aaaabbbbcccc' ).scan( /bbbb/ )
p str    # will print "aaaabbbb"
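
For comparison, here is how the same input behaves with the StringScanner class from Ruby's standard 'strscan' library, which is assumed here as a later relative of StrScanner and is not the class described above. There, 'scan' matches only at the scan pointer, and the pointer-to-match-end result comes from a separate method, 'scan_until'. The method name unanchored_demo is made up for illustration.

```ruby
require 'strscan'

# Compare a match attempt at the pointer with a search from the pointer.
# (unanchored_demo is a hypothetical helper, not a library method.)
def unanchored_demo
  text = 'aaaabbbbcccc'
  at_pointer   = StringScanner.new(text).scan(/bbbb/)        # no match at index 0
  from_pointer = StringScanner.new(text).scan_until(/bbbb/)  # pointer to match end
  [at_pointer, from_pointer]
end

p unanchored_demo   # prints [nil, "aaaabbbb"]
```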

For more details, see the reference manual. And of course, the source code is the most important documentation, I think :-)


Copyright (c) 1999,2000 Minero Aoki <aamine@dp.u-netsurf.ne.jp>