Tuesday, 16 January 2018

Python - HTML to Text for sending SMS - SMS Safe characters - remove \xa0

The following code shows some alternatives for removing special characters such as \xa0 from a string:
from bs4 import BeautifulSoup

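# Sample message; the tags are illustrative, what matters is the &nbsp; entities
raw_html = (
    '<p>Dear Parent,&nbsp;</p>'
    '<p>This is a test message,&nbsp;<span>kindly ignore it.&nbsp;</span></p>'
    '<p>Thanks</p>'
)

clean_text = BeautifulSoup(raw_html, "lxml").text
print(repr(clean_text))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'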
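The resulting string still contains \xa0 characters (non-breaking spaces). There are two ways to get rid of them. The first is BeautifulSoup's get_text method with strip=True, which strips whitespace, including \xa0, from the start and end of each text fragment before joining the fragments: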
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)

print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks
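Note that strip=True only trims the edges of each text fragment, so a \xa0 in the middle of a fragment survives. The sketch below mimics that behaviour with plain string operations instead of calling BeautifulSoup; the fragments list and the helper name join_stripped are made up for illustration.

```python
# Rough stand-in for what get_text(separator, strip=True) does with the
# text fragments of a document. `fragments` and `join_stripped` are
# illustrative names, not part of BeautifulSoup.
def join_stripped(fragments, separator=""):
    stripped = (fragment.strip() for fragment in fragments)
    # Fragments that are empty after stripping are dropped
    return separator.join(s for s in stripped if s)

fragments = [
    "Dear Parent,\xa0",                      # \xa0 at the edge: stripped
    "This is a test,\xa0kindly ignore it.",  # \xa0 in the middle: kept
    "\xa0",                                  # whitespace only: dropped
    "Thanks",
]

joined = join_stripped(fragments)
spaced = join_stripped(fragments, " ")
print(repr(joined))
print(repr(spaced))
```

With the real library, passing a separator, e.g. get_text(" ", strip=True), keeps the words of adjacent fragments from running together.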

The other option is Python's unicodedata module. NFKD normalization replaces each \xa0 with a regular space:
import unicodedata

clean_text = BeautifulSoup(raw_html, "lxml").text
print(repr(clean_text))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

new_str = unicodedata.normalize("NFKD", clean_text)
print(repr(new_str))
# 'Dear Parent, This is a test message, kindly ignore it. Thanks'
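Since NFKD turns \xa0 into an ordinary space rather than deleting it, a final whitespace cleanup is often useful before sending the SMS. Here is a minimal sketch using only the standard library. The function name sms_safe_text and the ascii_only flag are hypothetical, and dropping non-ASCII characters is just one possible policy, not a requirement of any particular SMS gateway.

```python
import unicodedata

def sms_safe_text(text, ascii_only=False):
    # NFKD replaces compatibility characters, e.g. \xa0 becomes a plain space
    normalized = unicodedata.normalize("NFKD", text)
    if ascii_only:
        # Assumption: discard whatever still has no ASCII form
        normalized = normalized.encode("ascii", "ignore").decode("ascii")
    # split() with no arguments splits on any run of whitespace
    return " ".join(normalized.split())

message = sms_safe_text(
    "Dear Parent,\xa0This is a test message,\xa0 kindly ignore it.\xa0Thanks"
)
print(message)
```

Collapsing whitespace with split() and join() also removes stray newlines and tabs that HTML extraction tends to leave behind.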
