Chapter 18: Sets, Classes and Relations
In this chapter we look at more of the built-in functions of J.
The connecting theme
is, somewhat loosely, working with set, classes and relations.
Suppose that, for some list, for the purpose at hand,
the order of the items is irrelevant and the presence of duplicate items
is irrelevant.
Then we can regard the list as (representing) a finite set.
In the abstract, the set 3 1 2 1 is considered to be the same set as
1 2 3.
The word "class" we will use in the sense in which, for example,
each integer in a list belongs either to
the odd class or to the even class.
By "relation" is meant
a table of two or more columns,
expressing a relationship between a value in one column
and the corresponding value in another.
A relation with two columns, for example, is a set of pairs.
18.1 Sets
18.1.1 Membership
There is a built-in verb e. (lowercase e dot, called "Member").
The expresssion x e. y tests whether
x matches any item of y, that is, whether x is a member of the
list y.
For example:
y=: 'abcde' |
'a' e. y |
'w' e. y |
'ef' e. y |
abcde |
1 |
0 |
1 0 |
Evidently the order of items in y is irrelevant and so is the presence
of duplicates in y.
z=: 'edcbad' |
'a' e. z |
'w' e. z |
'ef' e. z |
edcbad |
1 |
0 |
1 0 |
We can test whether a table contains a particular row:
t =: 4 2 $ 'abcdef' |
'cd' e. t |
ab
cd
ef
ab |
1 |
18.1.2 Less
There is a built-in verb -. (minus dot, called "Less").
The expression x -. y produces a list of the items of
x except those which are members of y.
x =: 'consonant' |
y =: 'aeiou' |
x -. y |
consonant |
aeiou |
cnsnnt |
Evidently the order of items in y is irrelevant and so is the presence of
duplicates in y.
18.1.3 Nub
There is a built-in verb ~. (tilde dot, called "Nub"). The expression
~. y produces
a list of the items of y without duplicates.
nub =: ~. |
y =: 'hook' |
nub y |
~. |
hook |
hok |
We can apply nub to the rows of a table:
t |
nub t |
ab
cd
ef
ab |
ab
cd
ef |
18.1.4 Nub Sieve
The verb "nub sieve" (~:) gives a
boolean vector which is true only at
the nub.
y |
b =: ~: y |
b # y |
nub y |
hook |
1 1 0 1 |
hok |
hok |
18.1.5 Functions for Sets
The customary functions on sets, such as set-union,
set-intersection or set-equality, are easily defined
using the built-in functions available. For example
two sets are equal
if all members of one are members of the other,
and vice versa.
seteq =: *./ @: (e. , e.~)
1 2 3 seteq 3 1 2 1 |
1 2 3 seteq 1 2 |
1 |
0 |
18.2 The Table Adverb
Recall that the adverb / generates a verb; for example +/ is a verb
which sums lists.
More precisely, it is the monadic case of +/ which sums lists.
The dyadic case of +/ generates a table:
x =: 0 1 2 |
y =: 3 4 5 6 |
z =: x +/ y |
0 1 2 |
3 4 5 6 |
3 4 5 6
4 5 6 7
5 6 7 8 |
The general scheme is that if we have
z =: x f/ y
then z is a table such that the value at row i column j
is given by applying f dyadically
to the pair of arguments i{x
and j{y. That is,
z contains all possible pairings
of an item of x with an item of y.
Here is another example:
x =: 'abc' |
y =: 'face' |
x =/ y |
abc |
face |
0 1 0 0
0 0 0 0
0 0 1 0 |
The result shows, in the first row,
the value of 'a' = 'face', in the second row
the value of 'b' ='face'
and so on.
18.3 Classes
18.3.1 Self-Classify
Consider the problem of finding the counts
of letters occurring in a string (the
frequency-distribution of letters).
Here is one approach.
We form a table testing each letter for equality
with the nub.
y =: 'hook' |
nub y |
(nub y) =/ y |
hook |
hok |
1 0 0 0
0 1 1 0
0 0 0 1 |
The expression ((nub y) = / y) can be abbreviated as (= y).
The monadic case of the built-in verb = is called
"Self-classify").
y |
nub y |
(nub y) =/ y |
= y |
hook |
hok |
1 0 0 0
0 1 1 0
0 0 0 1 |
1 0 0 0
0 1 1 0
0 0 0 1 |
If we sum each row of = y we obtain
the counts, in the order of the letters in the nub.
y |
= y |
+/ " 1 =y |
hook |
1 0 0 0
0 1 1 0
0 0 0 1 |
1 2 1 |
The counts can be paired with the letters of the
nub:
y |
nub y |
(nub y) ;" 0 (+/ " 1 =y) |
hook |
hok |
+-+-+
|h|1|
+-+-+
|o|2|
+-+-+
|k|1|
+-+-+ |
18.3.2 Classification Schemes
Gardeners classify soil-types as acid, neutral or
alkaline, depending on the pH value. Suppose
that a pH less than 6 is classed as acid,
6 to 7 is neutral, and more than 7 as alkaline.
Here now is a verb to classify a pH value,
returning A for acid, N for neutral and L for alkaline (or limy).
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
classify 6 |
classify 4.8 5.1 6 7 7.1 8 |
N |
AANNLL |
The classify function we can regard as defining a
classification scheme.
The letters ANL, which are in effect
names of classes, are called the keys of the scheme.
18.3.3 The Key Adverb
Given some data (a list, say), we can classify each item
to produce a list of corresponding keys.
data =: 7 5 6 4 8 |
k =: classify data |
7 5 6 4 8 |
NANAL |
We can select and group together all the data in,
say, class A (all the data with key A):
data |
k |
k = 'A' |
(k = 'A') # data |
7 5 6 4 8 |
NANAL |
0 1 0 1 0 |
5 4 |
Now suppose we wish to count the items in each class.
That is, we aim to apply the monadic verb # separately
to each group of items all of the same key.
To do this we can use the built-in adverb /. (slash dot, called "Key").
data |
k =: classify data |
k # /. data |
7 5 6 4 8 |
NANAL |
2 2 1 |
For another example, instead of counting the members
we could exhibit the members, by applying the box verb <.
data |
k =: classify data |
k < /. data |
7 5 6 4 8 |
NANAL |
+---+---+-+
|7 6|5 4|8|
+---+---+-+ |
The verb we apply can discover for itself the class of
each separate argument,
by classifying the first member:
Here the verb u
produces a boxed list: the key and count:
u =: (classify @: {.) ; #
data |
k =: classify data |
k u /. data |
7 5 6 4 8 |
NANAL |
+-+-+
|N|2|
+-+-+
|A|2|
+-+-+
|L|1|
+-+-+ |
The general scheme for the "Key" adverb is as follows.
In the expression x u /. y, we take y to be a list,
and x is a list of keys of corresponding items of y according
to some classification scheme,
and u is the verb to be applied separately to each class.
The scheme is:
x u /. y means (= x) (u @ #) y
To illustrate:
y =: 4 5 6 7 8
x =: classify y
u =: <
y |
x |
= x |
(= x) (u @ #) y |
x u /. y |
4 5 6 7 8 |
AANNL |
1 1 0 0 0
0 0 1 1 0
0 0 0 0 1 |
+---+---+-+
|4 5|6 7|8|
+---+---+-+ |
+---+---+-+
|4 5|6 7|8|
+---+---+-+ |
We see that each row of =x selects items from y,
and u is applied to this selection.
18.3.4 Letter-Counts Revisited
Recall the example of finding the counts of letters in a string.
y =: 'LETTUCE' |
= y |
(nub y) ; " 0 +/ "1 (= y) |
LETTUCE |
1 0 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0 |
+-+-+
|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|
+-+-+ |
Here is a variation. We note that
we have in effect a classification scheme where
we have as many different classes as different letters: each letter
is (the key of) its own class. Thus we can write an
expression of the form y u /. y.
The applied verb u will see, each time, a list of letters, all the same.
It counts them, with #,
and takes the first, with {., to be a label for the class.
u =: {. ; #
y |
= y |
y u /. y |
LETTUCE |
1 0 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0 |
+-+-+
|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|
+-+-+ |
18.4 Relations
Suppose there are a number of publications,
such as:
- "Pigs" by Smith, on the subject of pigs
- "Pets" by Brown, on cats and dogs
- "Dogs" by Smith and James, on dogs
and we aim to catalog such publications.
A suitable data structure for such a catalog might be a table
relating authors to titles and another
table relating
titles to subjects. For example:
author | title |
Smith | "Pigs" |
Brown | "Pets" |
Smith | "Dogs" |
James | "Dogs" |
|
title | subject |
"Pigs" | pigs |
"Pets" | dogs |
"Pets" | cats |
"Dogs" | dogs |
|
Such tables we may call "relations".
The order of the rows is not significant.
Here,for the sake of simplicity, we will stick to
relations with two columns.
Now we choose a representation for our relations.
For a first approach, we choose tables of boxed strings.
The authors-titles relation is:
] AT =: (". ;. _2) 0 : 0
'Smith' ; 'Pigs'
'Brown' ; 'Pets'
'Smith' ; 'Dogs'
'James' ; 'Dogs'
)
+-----+----+
|Smith|Pigs|
+-----+----+
|Brown|Pets|
+-----+----+
|Smith|Dogs|
+-----+----+
|James|Dogs|
+-----+----+
and the titles-subjects relation is:
] TS =: (". ;. _2) 0 : 0
'Pigs' ; 'pigs'
'Pets' ; 'cats'
'Pets' ; 'dogs'
'Dogs' ; 'dogs'
)
+----+----+
|Pigs|pigs|
+----+----+
|Pets|cats|
+----+----+
|Pets|dogs|
+----+----+
|Dogs|dogs|
+----+----+
18.4.1 Join of Relations
From the authors-titles relation AT and the titles-subjects
relation TS we can compute an authors-subjects relation
showing which author has written a title on which subject.
We say that AT and TS are to be joined with respect to
titles, and we would expect the join to look like this:
+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+
The plan for this section is to look at a function for
computing joins, then at an improved version,
and then at the advantage of representing relations
as tables of symbols rather than boxed strings.
Finally we look at some performance comparisons.
A method is as follows.
We consider all possible pairs consisting of a row
x from table AT and a row y from table TS.
Each pair x,y is of the form:
author; title; title; subject
If title matches title, that is, item 1 matches item 2,
then we extract author and subject, that is, items 0 and 3.
Verbs for testing and extracting from x,y pairs
can be written as:
test =: 1&{ = 2&{
extr =: 0 3 & {
and these verbs can be plugged into a suitable
conjunction to do the pairing. In writing this conjunction,
we aim to avoid requiring the whole set of possible pairs
to be present at the same time, since this set may be large.
We also aim to avoid any duplicates in the result.
Here is a first attempt.
PAIR =: 2 : 0
:
z =. 0 0 $ ''
for_x. x. do.
for_y. y. do.
if. u. x,y do. z =. z, v. x,y end.
end.
end.
~. z
)
The join verb can now be written as:
join =: test PAIR extr
and we see:
AT |
TS |
AT join TS |
+-----+----+
|Smith|Pigs|
+-----+----+
|Brown|Pets|
+-----+----+
|Smith|Dogs|
+-----+----+
|James|Dogs|
+-----+----+ |
+----+----+
|Pigs|pigs|
+----+----+
|Pets|cats|
+----+----+
|Pets|dogs|
+----+----+
|Dogs|dogs|
+----+----+ |
+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+ |
The join verb as defined above is slow, because
the test and extr verbs are applied to a single x,y pair at
a time - they are scalar computations. Performance will
be better if we can give these verbs as much data as
possible to work on at one time. (This is a universal rule in J).
Vector
or array arguments are better. Here is a revised
vector-oriented version of PAIR and join, which still
avoids building the entire set of pairs.
VPAIR =: 2 : 0
:
z =. 0 0 $ ''
for_x. x. do.
z =. z , |: v. (#~"1 u.) |: x , "1 y.
end.
~. z
)
vjoin =: test VPAIR extr
giving the same result as before:
AT join TS |
AT vjoin TS |
+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+ |
+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+ |
Representing relations as tables of boxed strings,
as above, is less than efficient.
For a repeated value, the entire string is repeated.
Values are compared by comparing entire strings.
Now we look at another possibility. Rather than boxed strings,
a relation can be represented by a table of
symbols.
18.4.2 What are Symbols?
Symbols are for efficient computation with string data.
Symbols are a distinct data-type, in the same way
that characters, boxes and numbers are distinct data-types.
A symbol is a scalar which identifies, or refers to,
a string.
A symbol can be created by applying the built-in verb
s: (lowercase s colon) to a boxed string.
a =: s: <'hello'
Now the variable a has a value of type symbol. We inspect
this value in the usual way:
a
`hello
and see that the value is displayed
as the original string preceded by a left-quote.
Even though a looks like a string when displayed,
it is a scalar.
The original string is stored in a data-structure,
maintained automatically by the J system, called the
symbol-table.
Strings are not duplicated within the symbol-table.
Hence if another symbol b is created from the same string as a,
then b is equal to a.
a |
b =: s: <'hello' |
b = a |
`hello |
`hello |
1 |
Notice that the comparison is simple scalar equality, with
no need to compare the original strings.
Our relations above can be converted to arrays of
symbols, and joined as before.
sAT =: s: AT |
sTS =: s: TS |
sAT vjoin sTS |
`Smith `Pigs
`Brown `Pets
`Smith `Dogs
`James `Dogs |
`Pigs `pigs
`Pets `cats
`Pets `dogs
`Dogs `dogs |
`Smith `pigs
`Brown `cats
`Brown `dogs
`Smith `dogs
`James `dogs |
Symbols are lexicographically ordered to reflect the ordering
of the original strings. Hence
tables of symbols can be sorted:
sAT |
/:~ sAT |
`Smith `Pigs
`Brown `Pets
`Smith `Dogs
`James `Dogs |
`Brown `Pets
`James `Dogs
`Smith `Dogs
`Smith `Pigs |
18.4.3 Measurements Compared
Here is a utility verb giving time in seconds
to evaluate an expression,
averaged over say 4 executions.
time =: (8j5 & ":) @: (4 & (6!:2))
The examples of relations above are too small
for meaningful performance measurements,
so we make larger relations
by replicating each say 100 times.
AT =: 100 $ AT
TS =: 100 $ TS
sAT =: 100 $ sAT
sTS =: 100 $ sTS
There are 4 cases to compare:
t1 =: time 'AT join TS' NB. scalar method, boxed strings
t2 =: time 'sAT join sTS' NB. scalar method, symbols
t3 =: time 'AT vjoin TS' NB. vector method, boxed strings
t4 =: time 'sAT vjoin sTS' NB. vector method, symbols
and we see:
3 3 $ ' '; 'strings'; 'symbols';'scalar';t1;t2; 'vector';t3;t4
+------+--------+--------+
| |strings |symbols |
+------+--------+--------+
|scalar|10.76841| 0.49555|
+------+--------+--------+
|vector| 0.17658| 0.02611|
+------+--------+--------+
The built-in verb
7!:5 gives the size of its argument in bytes.
A table of symbols is
in itself smaller than the corresponding
array of boxed strings.
7!:5 <'sAT' |
7!:5 <'AT' |
1024 |
13824 |
However, we must also take into account
the size of the underlying "symbol table", which
may be considerable.
For more details, see the Dictionary.
This is the end of Chapter 18
|