Regex matches when spaces are removed: how do I delete the matched
characters from the original string, which still contains spaces?
(Disclaimer: this is my first Stack Overflow question, so forgive me in
advance if I'm not too clear.)
Expected results:
My task is to find company legal identifiers in a string representing a
company name, then separate them from it and save them in a separate
string. The company names have already been cleaned so that they only
contain alphanumeric lowercase characters.
Example:
company_1 = 'uber wien abcd gmbh'
company_2 = 'uber wien abcd g m b h'
company_3 = 'uber wien abcd ges mbh'
should result in
company_1_name = 'uber wien abcd'
company_1_legal = 'gmbh'
company_2_name = 'uber wien abcd'
company_2_legal = 'gmbh'
company_3_name = 'uber wien abcd'
company_3_legal = 'gesmbh'
Where I am right now:
I load the list of all company legal ids from a CSV file. Austria provides
a good example. Two legal ids are:
gmbh
gesmbh
I use a regex that tells me WHETHER the company name contains a legal
identifier. However, to identify the legal id, I first remove all spaces
from the string before searching:
company_1_nospace = 'uberwienabcdgmbh'
company_2_nospace = 'uberwienabcdgmbh'
company_3_nospace = 'uberwienabcdgesmbh'
Since I search the string without spaces, I am able to see that all three
companies have legal ids inside their names.
Where I am stuck:
I can say whether there is a legal id in company_1, company_2, and
company_3 but I can only remove it from company_1. In fact, I cannot
remove g m b h because it does not match, but I can say that it is a legal
id. The only way I could remove it is to also remove the spaces in the rest
of the company name, which I don't want to do (it would only be a
last-resort option).
Even if I were to insert spaces into gmbh to match it with g m b h, I
would then not pick up ges mbh or ges m b h. (Note that the same thing
happens for other countries.)
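A minimal sketch of one way around the spacing problem described above, offered as an assumption rather than a tested solution: instead of stripping spaces from the name, build a pattern that tolerates optional whitespace between every pair of characters of each legal id. The helper names (`spaced_pattern`, `build_legal_regex`, `split_legal_id`) and the `LEGAL_IDS` list are hypothetical, and the sketch assumes the id starts or ends at a word boundary.

```python
import re

# Sample legal ids, standing in for the values loaded from the CSV file.
LEGAL_IDS = ["gmbh", "gesmbh"]


def spaced_pattern(legal_id):
    """Allow optional whitespace between every pair of characters."""
    return r"\s*".join(re.escape(ch) for ch in legal_id)


def build_legal_regex(legal_ids):
    """Match a legal id at the very start or very end of a name."""
    # Longest ids first so 'gesmbh' is tried before 'gmbh'.
    alternatives = "|".join(
        spaced_pattern(i) for i in sorted(legal_ids, key=len, reverse=True)
    )
    # \b keeps the id from starting or ending in the middle of a word.
    return re.compile(rf"^(?:{alternatives})\b|\b(?:{alternatives})$")


def split_legal_id(name, legal_regex):
    """Return (name without the id, id with its spaces removed)."""
    match = legal_regex.search(name)
    if not match:
        return name, ""
    legal_id = re.sub(r"\s+", "", match.group())
    stripped = (name[:match.start()] + name[match.end():]).strip()
    return stripped, legal_id


legal_regex = build_legal_regex(LEGAL_IDS)
for company in ("uber wien abcd gmbh",
                "uber wien abcd g m b h",
                "uber wien abcd ges mbh"):
    print(split_legal_id(company, legal_regex))
```

Because the match is found in the original string, its span can be cut out directly and the spaces in the rest of the name are left alone. The word-boundary requirement is stricter than matching on the space-free string, so a name such as 'hamburg mbh' would not be read as ending in 'gmbh'.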
My code:
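import re

company_name = 'uber wien abcd gmbh'  # company_1 from the example above

# match a legal id at the beginning or end of the string
re_code = re.compile('^gmbh|gmbh$|^gesmbh|gesmbh$')
# search the name with all whitespace removed
comp_id_re = re_code.search(re.sub(r'\s+', '', company_name))
if comp_id_re:
    company_id = comp_id_re.group()
    # only removes the id when it appears unspaced in the original name
    company_name = re_code.sub('', company_name).strip()
else:
    company_id = ''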
Is there a way for python to understand which characters to remove from
the original string? Or would it just be easier if somehow (that's another
problem) I find all possible alternatives for legal id spacing? ie from
gmbh I create g mbh, gm bh, gmb h, g m bh, etc... and use that for
matching/extraction?
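To make the second idea concrete, here is a rough sketch of how the spacing alternatives could be enumerated. The function name `spacing_variants` is made up for illustration, and it assumes only single spaces are inserted between characters.

```python
from itertools import product


def spacing_variants(legal_id):
    """Yield every way of inserting single spaces between the characters."""
    if not legal_id:
        return
    gaps = len(legal_id) - 1
    # Each gap between two characters is either empty or a single space.
    for choice in product(("", " "), repeat=gaps):
        pieces = [legal_id[0]]
        for sep, ch in zip(choice, legal_id[1:]):
            pieces.append(sep + ch)
        yield "".join(pieces)


variants = list(spacing_variants("gmbh"))
print(len(variants))
print(variants)
```

An id with n characters has 2^(n-1) variants (8 for gmbh, 32 for gesmbh), so the list grows quickly for longer ids and would have to be generated for every country's identifiers.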
I hope I have been clear enough with my explanation. Thinking about a
title for this was rather difficult.
UPDATE 1: company ids are usually at the end of the company name string.
They can occasionally be at the beginning in some countries.
UPDATE 2: I think this takes care of the legal ids inside the company
name. It works for legal ids at the end of the company name, but it does
not work for legal ids at the beginning:
import re

legal_regex = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'

def foo(name, legal_regex):
    # compile regex that matches company ids at beginning/end of string
    re_code = re.compile(legal_regex)
    # remove spaces
    name_stream = name.replace(' ', '')
    # find regex matches for legal ids
    comp_id_re = re_code.search(name_stream)
    # save company_id, remove it from string
    if comp_id_re:
        company_id = comp_id_re.group()
        name_stream = re_code.sub('', name_stream).strip()
    else:
        company_id = ''
    # restore spaced string (only works if id is at the end)
    # next(..., '') avoids an error once name_stream is exhausted
    name_stream_it = iter(name_stream)
    company_name = ''.join(
        next(name_stream_it, '') if e != ' ' else ' ' for e in name
    ).strip()
    return (company_name, company_id)
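One possible way to handle ids at the beginning as well, sketched under assumptions: record where each non-space character sits in the original name, match on the space-free string as before, then translate the match boundaries back to the original and cut that span. The function name `split_with_index_map` is hypothetical, and the sketch assumes the pattern never matches an empty string.

```python
import re


def split_with_index_map(name, legal_regex):
    """Match on the space-free string, then cut the same span from `name`."""
    # positions[i] is the index in `name` of the i-th non-space character.
    positions = [i for i, ch in enumerate(name) if ch != " "]
    stream = "".join(name[i] for i in positions)

    match = re.search(legal_regex, stream)
    if not match:
        return name, ""

    # Translate the match boundaries back to indices in the original string.
    start = positions[match.start()]
    end = positions[match.end() - 1] + 1
    company_name = (name[:start] + name[end:]).strip()
    return company_name, match.group()


legal_regex = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'
print(split_with_index_map('uber wien abcd g m b h', legal_regex))
print(split_with_index_map('ges mbh abcd wien', legal_regex))
```

Since only the matched span is removed, the spacing in the rest of the name is untouched whether the id is at the start or the end. Like the space-free search itself, this can still match across a word boundary (for example a name ending in '... hamburg mbh'), so it may need an extra check for such cases.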