The following code shows some alternatives for removing special characters from a string:
from bs4 import BeautifulSoup
# Triple quotes are needed because the HTML spans several lines
raw_html = '''Dear Parent,
<span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span>
<span style="font-size: 1rem;">Thanks</span>
'''
# Parse the HTML and keep only the text content
clean_text = BeautifulSoup(raw_html, "lxml").text
print(clean_text)
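The above code leaves \xa0 characters (non-breaking spaces, which typically come from &nbsp; entities in the HTML) in the string. There are two ways to remove them. The first is BeautifulSoup's get_text method with the strip argument set to True, which trims leading and trailing whitespace, including \xa0, from each piece of text: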
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
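To see why stripping helps, it is useful to look at how plain Python strings treat \xa0. The sketch below uses only the standard library and assumes Python 3. The sample string and the helper name collapse_whitespace are made up for illustration and are not part of BeautifulSoup.

```python
# Sample text with non-breaking spaces (\xa0) at both ends and in the middle.
sample = "\xa0Dear Parent,\xa0kindly ignore it.\xa0"

# str.strip() treats \xa0 as whitespace, but it only trims the two ends.
stripped = sample.strip()


def collapse_whitespace(text):
    # Hypothetical helper: str.split() with no arguments splits on any
    # Unicode whitespace, so joining the parts replaces every run of
    # whitespace (non-breaking spaces included) with a single space.
    return " ".join(text.split())


print(repr(stripped))
print(repr(collapse_whitespace(sample)))
```

Since stripping only trims the ends of each text fragment, a \xa0 in the middle of a sentence survives it. Collapsing whitespace, as in the hypothetical helper above, is one way to handle that case.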
The other option is to use Python's unicodedata module from the standard library:
import unicodedata
clean_text = BeautifulSoup(raw_html, "lxml").text
print(clean_text)
# NFKD normalization maps \xa0 to a regular space
new_str = unicodedata.normalize("NFKD", clean_text)
print(new_str)
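As a rough illustration of what the normalization does, the following standard-library sketch (assuming Python 3, with invented sample strings) shows NFKD mapping \xa0 to an ordinary space. NFKD also rewrites other compatibility characters and splits accented letters into a base letter plus a combining mark, so a plain replace may be preferable when only \xa0 should change.

```python
import unicodedata

text = "Dear Parent,\xa0kindly ignore it."

# NFKD applies compatibility decomposition: U+00A0 becomes U+0020.
normalized = unicodedata.normalize("NFKD", text)

# Side effect: other characters are decomposed too, e.g. "é" -> "e" + U+0301.
decomposed = unicodedata.normalize("NFKD", "\u00e9")

# Narrower alternative that touches only the non-breaking space.
replaced = text.replace("\xa0", " ")

print(repr(normalized))
print(len(decomposed))
print(repr(replaced))
```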