I needed a function to strip out html tags from a text input, but still let me specify allowable tags.
Instead of spending time figuring out the regular expressions needed to pull it off and becoming a better programmer, I figured why repeat work someone else has probably already done.. I mean I could be a busy man. Anyway I found this great function on Flexer.info [link]. But after trying it out I noticed that the one tag I really really wanted to be parsed out iframe wasn't. It seems because I had specified i as an allowable tag it was also accepting iframe.
So with all due respect to Andrei, below is the revised function with the security hole patched.
All I changed was near the bottom where it determines if it's an allowable tag or not the reg exp was
<\/?" + tagsToKeep[j] + "[^<>]*?>
which allowed any character to follow the allowed tag as long as it wasn't a nested tag, which included frame following i. This will also support self closing tags.
// strips htmltags
// @param html - string to parse
// @param tags - tags to ignore
public static function stripHtmlTags(html:String, tags:String = ""):String
{
var tagsToBeKept:Array = new Array();
if (tags.length > 0)
tagsToBeKept = tags.split(new RegExp("\\s*,\\s*"));
var tagsToKeep:Array = new Array();
for (var i:int = 0; i < tagsToBeKept.length; i++)
{
if (tagsToBeKept[i] != null && tagsToBeKept[i] != "")
tagsToKeep.push(tagsToBeKept[i]);
}
var toBeRemoved:Array = new Array();
var tagRegExp:RegExp = new RegExp("<([^>\\s]+)(\\s[^>]+)*>", "g");
var foundedStrings:Array = html.match(tagRegExp);
for (i = 0; i < foundedStrings.length; i++)
{
var tagFlag:Boolean = false;
if (tagsToKeep != null)
{
for (var j:int = 0; j < tagsToKeep.length; j++)
{
var tmpRegExp:RegExp = new RegExp("<\/?" + tagsToKeep[j] + " ?/?>", "i");
var tmpStr:String = foundedStrings[i] as String;
if (tmpStr.search(tmpRegExp) != -1)
tagFlag = true;
}
}
if (!tagFlag)
toBeRemoved.push(foundedStrings[i]);
}
for (i = 0; i < toBeRemoved.length; i++)
{
var tmpRE:RegExp = new RegExp("([\+\*\$\/])","g");
var tmpRemRE:RegExp = new RegExp((toBeRemoved[i] as String).replace(tmpRE, "\\$1"),"g");
html = html.replace(tmpRemRE, "");
}
return html;
}
There are a number of situations where you'd want to grab the urls from a block of text. For example you may be loading in some external or dynamic data and want to make the links clickable, or change their colour. Regular expressions are used in a multitude of languages; they define patterns that can be matched against a string, thus certain key characters used in defining a RegExp have to be escaped so they are interpreted as special characters like \d matches any digit. In Actionscript, you can define a RegExp by either wrapping it in double quotes "", or forward slashes//. In each case you would have to escape any characters that match the wrapping in addition to the characters that need to be escaped in the actual pattern. Further more Actionscript requires you to separate out the last part of the regular expression, called flags, and insert it as the second argument when defining a new RegExp object.
Here's how you find a url in text or html:
var str:String = new String('This is a url www.fightskillz.com, and this is another one: <a href="http://chalk-it-out.com">Chalk It Out</a>');
var reg:RegExp = new RegExp("\\b(((https?)://)|(www.))([a-z0-9-_.&=#/]+)", 'i');
var result:Object = reg.exec(str);
trace(result[0]);
First off if you're new to Flex/Actionscript you have to copy and paste this into a function and the variables created will only be accessable within that function while it's running as they are created and destroyed as it runs. If you wanted more permanence you'd just define the variables outside the function.
Now Let's break it down. The first \ is used as a character escape for Actionscript. In actionscript when defining a string within double quotes you'd escape a double that's part of the string like this "Look at this double quote \"". \b searches for a word boundary ie: a whitespace, or the beginning or end of a string.The next part ((https?)://)|(www.))defines the first part of a 'word' that passes for a url. It's made up of two substrings, the first looks for http, the question mark deems the preceding character optional, so it'll match to https as well. It then looks to see if the protocol is followed by ://. The |character means OR, so if there is no protocol specified, it checks for (www.). Next we have [a-z0-9-_.&=#/] which is a list of characters a to z, 0 to 9, and various others commonly found in urls. This is followed by a + which instructs the pattern to match the preceding list of characters until it can't anymore. It can't anymore when it reaches whitespace, a single or double quote, brackets, or any other non-url character. Finally the RegExp flag i informs the pattern to be case insensitive.
reg.exec(str); executes the pattern on the specified string and returns the results as an array. Since the example is only designed to match the first url it encounters and then stop, the array will only have one result. The method reg.exec(str) is interchangable withstr.match(reg)