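For a project I am working on, I wanted to identify mentions of brain natriuretic peptide (BNP) or N-terminal pro b-type natriuretic peptide (NT-proBNP) in a column containing clinical trial outcomes. I reviewed a bunch of the ones provided, and the best “catch all” regular expression was (BNP|(?i)natriuretic peptide), where (?i) provides case-insensitive matching. Note that Python's re module treats (?i) as a global flag, so it applies to the whole pattern rather than only to the second alternative, which is why a lowercase “bnp” also matches. From Python 3.11 onward the flag must sit at the start of the expression, e.g. (?i)(BNP|natriuretic peptide).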
I’ve slowly been migrating over to Polars, but pandas has a neat findall() method that accepts RegEx patterns, which I ended up using. Below is a simple example to highlight this functionality.
import pandas as pd

# RegEx pattern
pattern_bnp = r'(BNP|(?i)natriuretic peptide)'

# Simple example df
data = pd.DataFrame({"outcomes": ['This has KCCQ', 'This has NT-proBNP', None, 'This has Seattle Angina Questionnaire', 'bnp']})
data
                                outcomes
0                          This has KCCQ
1                     This has NT-proBNP
2                                   None
3  This has Seattle Angina Questionnaire
4                                    bnp
We can apply our RegEx pattern to identify which rows contain BNP/NT-proBNP as an outcome in the ‘outcomes’ column. I’ve used the findall() and contains() methods here (contains() uses re.search under the hood); other pandas string methods, such as str.match() and str.extract(), also accept regular expressions.
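A minimal sketch of this step is given here as a reconstruction, not necessarily the exact code behind the output that follows. It assumes the (?i) flag is moved to the front of the pattern so that it compiles on Python 3.11+, and the variable names matches and has_bnp are hypothetical.

```python
import pandas as pd

# Assumption: (?i) is placed at the start so the pattern compiles on Python 3.11+
pattern_bnp = r'(?i)(BNP|natriuretic peptide)'

data = pd.DataFrame({"outcomes": ['This has KCCQ', 'This has NT-proBNP', None,
                                  'This has Seattle Angina Questionnaire', 'bnp']})

# str.findall returns, for each row, the list of substrings that matched
matches = data['outcomes'].str.findall(pattern_bnp)

# str.contains returns True/False per row; na=False counts missing values
# as non-matches, and astype(int) turns the flags into 0/1
has_bnp = data['outcomes'].str.contains(pattern_bnp, na=False).astype(int)
has_bnp
```

Here findall() is useful when you want to see what was matched, while contains() is enough when you only need a 0/1 indicator per row.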
0 0
1 1
2 0
3 0
4 1
Name: outcomes, dtype: int64
<string>:3: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.
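The UserWarning above appears because the pattern contains a capture group. As a rough sketch (again assuming the flag-first form of the pattern, with hypothetical variable names), a non-capturing group (?:...) leaves the pattern without match groups for contains(), and str.extract() can be used when the matched text itself is wanted.

```python
import pandas as pd

outcomes = pd.Series(['This has KCCQ', 'This has NT-proBNP', None, 'bnp'],
                     name='outcomes')

# Non-capturing group: the pattern has no match groups, which is the
# condition the pandas warning message refers to
pattern_nc = r'(?i)(?:BNP|natriuretic peptide)'
flags = outcomes.str.contains(pattern_nc, na=False)

# Capturing group with str.extract: returns the first matched text per row,
# or a missing value when there is no match
extracted = outcomes.str.extract(r'(?i)(BNP|natriuretic peptide)', expand=False)
```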