String operations on a string array that mixes accented and/or special characters with regular ASCII strings can be quite an annoyance
These days I am involved with web/mobile automation. The other day I had to parse all the strings on a page for a generic automation library I am writing.
Since the library had to be generic and parse every string on a page, I did not have the luxury of using ids for specific controls/components on the page. So I used the reliable XPath //*[@name] to parse strings in an Android application page. This extracts all the text attributes on the page, which was a good enough solution for me. (I would like to know a better solution, especially one using CSS selectors, if you have one!!)
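To make the XPath idea concrete, here is a minimal sketch that runs the same [@name] predicate against a made-up XML page dump using Python's standard xml.etree.ElementTree module. The XML snippet and the helper name extract_names are assumptions for illustration only; in a real run the page source would come from your automation driver. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical page dump; a real one would come from the automation driver.
PAGE_SOURCE = """
<hierarchy>
  <android.widget.FrameLayout>
    <android.widget.TextView name="hello" />
    <android.widget.Button name="tromsø" />
    <android.widget.ImageView />
  </android.widget.FrameLayout>
</hierarchy>
"""

def extract_names(page_source):
    """Return the name attribute of every element that has one."""
    root = ET.fromstring(page_source)
    # ElementTree supports a limited XPath subset; ".//*[@name]"
    # selects every descendant element carrying a name attribute.
    return [el.get("name") for el in root.findall(".//*[@name]")]

names = extract_names(PAGE_SOURCE)
print(names)
```

The ImageView has no name attribute, so it is skipped, which is the behaviour that makes this selector generic.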
As the solution was so easy, I found it difficult to believe that the code had handled all the edge cases. To clear my doubts, I tested it on different applications with different inputs, until I hit a roadblock: a page that returned a mixture of accented strings, strings containing special characters and regular ASCII strings. Here is what the array looked like:
strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]
If you have hit a similar challenge, read on for the solution.
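Strings with accented or special characters contain code points outside the ASCII range, so they come back as Unicode strings, while the regular ones are plain ASCII. To run the same string operations on both kinds, the Unicode strings first have to be converted to ordinary byte strings. (For a history of Unicode, read a detailed article.)
To do that, encode the Unicode strings as UTF-8. This leaves ASCII characters unchanged and turns every other character into a multi-byte sequence.
Here is how you do it in Python: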
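As a quick way to see which entries in the array are affected, the following sketch flags every string that contains a character outside the 7-bit ASCII range. The helper name is_ascii is my own, and the snippet assumes Python 3, where every str is already Unicode.

```python
strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]

def is_ascii(text):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

# Collect only the strings that need special handling.
non_ascii = [s for s in strs if not is_ascii(s)]
print(non_ascii)
```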
text = text.encode('utf-8')
Simple isn’t it!!
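To show what the encoding step actually produces, here is a small sketch using one word from the array. It assumes Python 3, where encode returns a bytes object; under Python 2 the result would be a plain str holding the same bytes.

```python
word = "tromsø"
encoded = word.encode("utf-8")

# ASCII letters keep their single byte; "ø" becomes the two bytes C3 B8.
print(encoded)                   # b'troms\xc3\xb8'
print(len(word), len(encoded))   # 6 7

# A pure-ASCII string encodes to the same bytes in ASCII and UTF-8.
print("hello".encode("utf-8") == "hello".encode("ascii"))  # True
```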
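But wait: the encoded string now carries extra bytes for every non-ASCII character (they show up as escape sequences such as \xc3\xb8), and you need to strip those out before doing string operations. Here is how you can strip them out: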
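import re

def extract_word(text):
    # Expects a UTF-8 encoded byte string (Python 2 str).
    print("Input Text::{}".format(text))
    # Without re.UNICODE, \w and \s match ASCII only, so the
    # bytes of accented/special characters are skipped.
    regex = r"(\w|\s)*"
    matches = re.finditer(regex, text, re.DOTALL)
    newstr = ''
    # Join the matched ASCII runs back together.
    for match in matches:
        newstr = newstr + match.group()
    print("Output Text::{}".format(newstr))
    return newstr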
With the returned string, you are now good to go and can run other string operations on the array.
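For readers on Python 3, where the regex would otherwise treat accented letters as word characters, a rough equivalent might look like the sketch below. The name extract_ascii and the use of re.ASCII are my own choices, not part of the original code; the flag restricts \w and \s to ASCII so the behaviour mirrors the byte-string matching described above.

```python
import re

# re.ASCII limits \w and \s to ASCII characters only.
ASCII_RUN = re.compile(r"[\w\s]+", re.ASCII)

def extract_ascii(text):
    """Keep only ASCII word characters and whitespace."""
    return "".join(ASCII_RUN.findall(text))

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]
cleaned = [extract_ascii(s) for s in strs]
print(cleaned)
# ['hell', 'hello', 'troms', 'boy', 'stvnger', 'lut', 'world']
```

Note that the accented letters are dropped rather than replaced by a plain equivalent, so "stävänger" comes out as "stvnger". That is fine for matching and comparison, but not if you need the text to stay readable.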
(If this has helped you guys, do let me know in comment section…)