Elements of a in b - Find matching strings in two columns using regex

Elements of a in b

abresult

 Dictionary mode

Regex a
Regex b

Elements of a in b - Find matching strings in two columns using regexEverything you need to know about Regular Expressions

This is a small JavaScript tool to solve a very specific problem: Imagine you have two columns of data, and potentially duplicate substrings that may exist in both columns. How do you quickly identify which of the elements (substrings) of column A are also included in column B?

Although this might seem like a fairly niche problem, it actually comes up all the time in various forms. Generally, the solution involves writing a quick throwaway script in Ruby, Python, Perl, or some other scripting language. Depending on the nature of the data, it may also be solvable with clever piping of command-line tools like sedsort, and uniq.

Nevertheless, being able to quickly find matching subsets of data without needing to write a custom script each time can be a great time-saver for situations where this kind of thing comes up frequently.

Supported features

  • Live search (results update as you type)
  • Works offline (just clone or download this repository and open index.html in any browser)
  • Online demo available
  • Group capturing using parentheses
  • Display total number of matches found
  • Dictionary mode (see below)

Usage

Enter some data in box a and box b. By default, values will appear in the result box as you type if there are sequences of numerals (/\d\d+/) that appear anywhere in both box a and box b.

For example, if you input the following into box a:

foo foo 54 bar bar
bar bar 96 foo foo
foo foo 21 bar bar

And the following into box b:

234
abc
96
foo bar

The result will show as 96, because this sequence of numerals occurs in both column a and column b.

The sequence matched in both columns is easily configurable, and does not have to be the same. The examples below demonstrate the flexibility possible using capture groups, character classes, metacharacters and other regex features.

Grouping

Use rounded brackets () to isolate capture groups. Characters outside of the capture group will be ignored.

This can be useful for data that is separated by regular delimiters, for example tabs, commas or other characters.

For example, input the following into column a:

abc@def#ghi
jkl@mno#pqr
stu@vwx#yza

And the following in column b:

abc
def
ghi
jkl
mno
pqr
stu
vwx
yza

For Regex a enter:

#([a-z]+)

And for Regex b enter:

[a-z]+

This gives the result:

ghi
pqr
yza

This is possible because the regular expression /#([a-z]+)/ matched the strings of letters (/[a-z]+/) following a hash/pound/number symbol (/#/) in column a, but only captured the group of letters in each line (without the # symbol) for the purposes of matching with the data in column b.

When capturing groups, anything outside of the parentheses is ignored. If Regex a had been @([a-z]+) instead (try it!), the result would be:

def
mno
vwx

This is because the regular expression /@([a-z]+)/ matches the "middle column" of data in box a, and matches it against the text in box b.

To match any sequence of characters in column b (not just letters), use .* instead of [a-z]+.

If the data in either column has been separated by tabs, the metacharacter \t can be used to match it.

For example, given the following data in column a:

abc	123	def
ghi	456	jkl
mno	789	pqr
stu	101	vwx

And the following data in column b:

980
765
432
123
987
765

Enter \t(.*)\t for Regex a, and .* for Regex b.

The result will be 123, because only sequences surrounded by tabs were matched in column a.

Note: An easier way to approach the above example in particular might be to simply use \d+ for Regex a, which will match all sequences of digits (which in this case happen to only occur in the middle column).

Dictionary mode

Clicking on the checkbox at the bottom of the page enables the optional dictionary mode. This mode is meant to handle a specific common subset of problems involving elements in a list (column b) that match keys in a key-value pair database (column a). You can think of it like a simple hash table or dictionary lookup.

Key features of dictionary mode:

  • All elements in the list remain in order
  • All keys are printed in the result, even if they do not have a matching value
  • Keys and values are returned together in the result
  • The delimiter (default TAB) can be changed arbitrarily
  • Duplicate values are retained in the result

In its simplest form, the dictionary in column a is a two-column list of values separated by tab spaces, for example:

apple	manzana
apricot	albaricoque
banana	plátano
peach	melocotón
pear	pera
plum	ciruela

If you then enter the following into column b:

pear
orange
plum
apple
pear
plum
kiwi

You would get the following result:

pear	pera
orange
plum	ciruela
apple	manzana
pear	pera
plum	ciruela
kiwi

(Note the blank values for orange and kiwi, which were not in the original dictionary list.)

By default, the delimiter for the input (the dictionary data in column a) and the output (the result in column b) is set to a TAB stop (\t). To change the delimiter, adjust the values in the Input delimiter and Output delimiter boxes.

See also

"Elements of a in b" is part of the tiny tools series.

Other tools for working with columns of data that might also be of interest:

Credits

License

MIT.

Post a Comment

Post a Comment (0)

Previous Post Next Post