Vectorized "in" function in Julia?
I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]
isin = falses(size(giant_list,1))
for i=1:size(giant_list,1)
    isin[i] = giant_list[i] in good_letters
end
Is there any vectorized (doubly-vectorized?) way to do this in Julia? In analogy with the basic operators, I want to do something like
isin = giant_list .in good_letters
I realize this may not be possible, but I just wanted to make sure I wasn't missing something. I know I could probably use DefaultDict from DataStructures to do something similar, but I don't know of anything in Base.
The indexin function does something similar to what you want:
indexin(a, b)
Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.
Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:
julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
true
false
false
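On newer releases the comparison against 0 no longer applies: from Julia 1.0 onward, indexin returns nothing (rather than 0) for values that have no match. A minimal sketch of the same idea under that assumption, reusing the arrays from the question (idx is just an illustrative name):

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# indexin yields an index into good_letters, or `nothing` when there is no match
idx = indexin(giant_list, good_letters)

# Turn the indices into a membership mask, one Bool per element of giant_list
isin = [i !== nothing for i in idx]
```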
The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:
function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end
Only a limited set of names may be used as infix operators, so it's not possible to use vectorin as an infix operator.
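As a quick illustration, here is how vectorin might be applied to the arrays from the question. The names mask and kept exist only for this example. Building the Set once means each membership test is a hash lookup rather than a scan of good_letters.

```julia
# Same definition as above: build the Set once, then test each element against it
function vectorin(a, b)
    bset = Set(b)
    return [x in bset for x in a]
end

giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

mask = vectorin(giant_list, good_letters)  # Bool vector, one entry per element of giant_list
kept = giant_list[mask]                    # logical indexing keeps only the members
```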
There are a handful of modern (i.e. Julia v1.0) solutions to this problem:
First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:
julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
true
false
false
This same result can be achieved by broadcasting the infix ∈ (\in TAB) operator:
julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
true
false
false
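The Ref wrapper matters because broadcasting would otherwise try to pair the elements of the two arrays one by one. A small sketch of that behavior, with a 1-element tuple shown as an equivalent way to protect good_letters, and a Set as an optional variation for long lookup lists (all variable names here are illustrative):

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Ref makes good_letters behave as a single scalar under broadcasting
with_ref = giant_list .∈ Ref(good_letters)

# A 1-element tuple protects it in the same way
with_tuple = in.(giant_list, (good_letters,))

# Without a wrapper, broadcast pairs a 3-element array with a 2-element one and throws
unwrapped_fails = try
    giant_list .∈ good_letters
    false
catch err
    err isa DimensionMismatch
end

# Optional variation: wrap a Set instead, so each test is a hash lookup
with_set = giant_list .∈ Ref(Set(good_letters))
```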
Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.
julia> is_good1 = in(good_letters);
julia> is_good2(x) = x in good_letters;
julia> is_good1.(giant_list)
3-element BitArray{1}:
true
false
false
julia> is_good2.(giant_list)
3-element BitArray{1}:
true
false
false
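One place where the one-argument form can still be handy is as a ready-made predicate for higher-order functions. A brief sketch, assuming Julia 1.0 or later (the names is_good, kept, and n_good are illustrative):

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

is_good = in(good_letters)           # partially applied `in`, acts like x -> x in good_letters

is_fix2 = is_good isa Base.Fix2      # the one-argument form returns a Base.Fix2 object
kept = filter(is_good, giant_list)   # keep only the members
n_good = count(is_good, giant_list)  # count the members
```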
All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.