How to handle accented & special character strings in Python.

MP
2 min read · Sep 14, 2018

String operations on an array that mixes accented and/or special-character strings with regular ASCII strings can be quite an annoyance.

These days I am involved with web/mobile automation. The other day I had to parse all the strings on a page for a generic automation library I am writing.

Since I was supposed to write a generic library to parse all the strings on a page, I did not have the luxury of using ids for a specific control/component on the page. So I used the reliable XPath //*[@name] to parse strings in an Android application page. This would extract all the text attributes on the page, which was a good enough solution for me. (I would like to know a better solution using CSS selectors especially, if you have one!)

As the solution was so easy, I found it difficult to believe that the code had handled all the edge cases. To clear my doubts, I went about testing it on different applications with different inputs, until I hit a roadblock where the page was returning a mixture of accented strings, strings containing special characters and regular ASCII strings. Here is what the array looked like:

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]

If you have hit a similar challenge, read on for the solution.

Strings with accented or special characters are Unicode strings, while the regular ones are plain ASCII. So to handle the Unicode strings the same way as the regular ASCII strings, one has to convert them. (For a history of Unicode, read a detailed article.)
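To see which entries in the array actually contain non-ASCII characters, a quick check like the one below can help. It is a minimal sketch that assumes Python 3, where every str is Unicode text; has_non_ascii is a hypothetical helper name, not part of the library described here.

```python
# Sketch (Python 3 assumed): find the strings that contain non-ASCII characters.
strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]


def has_non_ascii(text):
    # Code points 0..127 are ASCII; anything above that is not.
    return any(ord(ch) > 127 for ch in text)


non_ascii = [s for s in strs if has_non_ascii(s)]
print(non_ascii)
```

Only the entries with "°", "ø", "ä" or "ö" should be reported; "hello", "boy" and "world" pass through as plain ASCII.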

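To work with the Unicode strings as ordinary byte strings, one has to encode them to UTF-8. (Strictly speaking, the result is UTF-8 bytes rather than ASCII; the two coincide only for pure-ASCII characters.)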

Here is how you do it in Python:

text = text.encode('utf-8')
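For illustration, here is what that encoding step produces for one of the strings from the array. The sketch assumes Python 3, where encode returns a bytes object; in Python 2 the result is a plain str holding the same bytes.

```python
# Sketch (Python 3 assumed): encode text to UTF-8 and inspect the result.
text = "tromsø"
encoded = text.encode("utf-8")

print(encoded)       # b'troms\xc3\xb8'
print(len(text))     # 6 characters
print(len(encoded))  # 7 bytes, because "ø" takes two bytes in UTF-8

# Decoding with the same codec gives back the original text.
assert encoded.decode("utf-8") == text
```

The two extra bytes shown as \xc3\xb8 are the UTF-8 form of "ø"; nothing has been turned into ASCII yet.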

Simple, isn't it!

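But wait: the encoded string now carries extra escape characters (the non-ASCII bytes, which show up as \x.. escapes), and you need to strip them out before doing string operations. Here is how you can strip those out: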

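import re

def extract_word(text):
    # Written for Python 2: text is the UTF-8 encoded byte string from above.
    print("Input Text::{}".format(text))
    # Without re.UNICODE, \w matches only ASCII letters, digits and "_",
    # so the bytes of accented or special characters are left out.
    regex = r"(\w|\s)*"
    matches = re.finditer(regex, text, re.DOTALL)
    newstr = ''
    # Join every matched run of word/whitespace characters.
    for match in matches:
        newstr = newstr + match.group()
    print("Output Text::{}".format(newstr))
    return newstr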

With the returned string, you are now good to go and can do other string operations on the array.
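On Python 3 the same idea can be sketched without the encoding step. This is a variant under stated assumptions, not the code used in the library: it swaps the match-and-join loop for re.sub with the re.ASCII flag, so that \w and \s match only ASCII characters, and keep_ascii_words is a hypothetical name.

```python
import re


# Sketch (Python 3 assumed): drop every character that is not an ASCII
# word character or ASCII whitespace. re.ASCII restricts \w and \s to ASCII.
def keep_ascii_words(text):
    return re.sub(r"[^\w\s]", "", text, flags=re.ASCII)


strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]
cleaned = [keep_ascii_words(s) for s in strs]
print(cleaned)
# ['hell', 'hello', 'troms', 'boy', 'stvnger', 'lut', 'world']
```

Note that accented letters are removed rather than replaced by their unaccented form, so "stävänger" becomes "stvnger". Whether that is acceptable depends on what the later string operations need.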

(If this has helped you guys, do let me know in the comments section…)
