Microsoft Windows Vista Community Forums - Vistaheads
Does the XML file for import to IE8's InPrivate filter actually use regex?

microsoft.public.internetexplorer.general






  #1 (permalink)  
Old 09-06-2009
VanguardLH
 

Posts: n/a
Does the XML file for import to IE8's InPrivate filter actually use regex?
The InPrivate Filter of IE8 has an install-time default of Off.
However, a registry edit can change its behavior to default to On when
you start IE8. That way, you don't have to keep remembering to turn it
on when you load IE8.

Always enable InPrivate Filtering:
http://www.pcmag.com/article2/0,2817,2346892,00.asp
http://blogs.msdn.com/dmart/archive/...y-default.aspx

With this filter mode enabled, IE8 watches for and records content
providers that show up repeatedly across the sites that you visit.
Even if a common content provider is found at the sites that you visit,
it isn't eligible for blocking unless the number of sites exceeds the
threshold that you configure for this filter (the default is 10 sites
for that same content provider). As you decrease this threshold, more
content providers will probably show up in the list (i.e., more become
eligible for blocking). However, that blocking is NOT a reasonable
ad-blocker since it merely looks for content common to multiple
sites, not at what the content is.

So I found where others had mentioned using the InPrivate Filter (after
configuring it to always start On) as an ad-block filter. One guy
converted the AdBlock list (a plug-in for Firefox) to an XML file that
you can import into IE8 (Internet Options -> Programs -> Manage Add-ons
-> InPrivate Filtering). Alas, Microsoft didn't let users directly
enter the URL strings on which they want to filter. You have to create
an XML list that you then import. I'm not keen on the huge list for the
AdBlock filter and decided to prune it down while also adding my prior
list of URL strings on which I was blocking (in the URL filter in
Avast's Web Shield, but there is other software, like some firewalls,
that lets you block on URLs). His list is mentioned at:

http://www.dslreports.com/forum/r221...lock-plus-list

I started to wonder about the syntax of his CDATA strings. You can use
anything you want for the description but the URL string should
supposedly follow some regex syntax. Well, Microsoft doesn't explain
much other than what I found at:

http://msdn.microsoft.com/en-us/libr...20(VS.85).aspx

I can understand why you need to escape the period character (if you're
actually testing for a period character at that position rather than for
any 1 character at that position) but I don't see why the forward
slashes in the path have to be escaped. So I have to wonder how valid
is Microsoft's implementation of regex.

In regex, ".*\.adbrite\.com.*" would match on zero or more of any
characters followed by a period followed by "adbrite" followed by a
period followed by "com" and followed by zero or more of any characters.
You need to escape the period character to use it as that character
rather than its regex use of "1 of any character". Since the function
seems to look for substrings (I'm not sure of this), the .* at the
beginning and end of the URL string are probably not needed, so I could
use "\.adbrite\.com" to find that substring anywhere in the URL string.
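
To make the substring question concrete, here is a minimal sketch using Python's re module as a stand-in. It is not IE8's matcher, whose behavior Microsoft doesn't document, and the sample URL is made up. It only shows that under substring-search semantics the leading and trailing .* add nothing, while under whole-string semantics they are required.

```python
import re

# Hypothetical URL, used only for illustration.
url = "http://ads.adbrite.com/banner.js"

wrapped = r".*\.adbrite\.com.*"   # pattern with leading/trailing .*
bare = r"\.adbrite\.com"          # same pattern without them

# Substring search: both patterns find a match somewhere in the URL.
search_wrapped = re.search(wrapped, url) is not None
search_bare = re.search(bare, url) is not None

# Whole-string match: only the pattern wrapped in .* can span the URL.
full_wrapped = re.fullmatch(wrapped, url) is not None
full_bare = re.fullmatch(bare, url) is not None

print(search_wrapped, search_bare, full_wrapped, full_bare)
```

So whether the .* wrappers matter depends entirely on which of the two behaviors IE8 uses, and that is the part I can't find documented.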

You need to escape the backslash character (as in "\\" for one
backslash character), but backslashes are *not* used in URLs, and I
don't see why you have to escape forward slash characters.
"http://www.intel.com/index.htm" has no backslash characters that need
to be escaped, and forward slashes don't need to be escaped in regex.

I have to wonder if the author of the rules.xml file (converted from
AdBlock's list) used legitimate regex syntax since none of the period
characters are escaped in the CDATA strings in his entries. They
probably work well enough since a period character at that position
qualifies as any 1 character at that position; however, "adbrite.com"
would also match on "adbritexcom".
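
The false positive is easy to reproduce in any ordinary regex engine. A small sketch in Python's re module (again only a stand-in, since I don't know what engine IE8 uses); the helper name hits is just for this example:

```python
import re

unescaped = "adbrite.com"            # period acts as "any one character"
escaped = re.escape("adbrite.com")   # becomes adbrite\.com

def hits(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(hits(unescaped, "adbritexcom"))    # unescaped period accepts the "x"
print(hits(escaped, "adbritexcom"))      # escaped period does not
print(hits(escaped, "www.adbrite.com"))  # the real host still matches
```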

Since Microsoft hasn't been keen on embracing regular expressions, it is
likely that they don't follow the PCRE conventions. Perhaps the forward
slashes do need to be escaped by backslashes in their implementation, but
that's not true in PCRE (Perl Compatible Regular Expressions). Going by
Microsoft's article, anony101's converted AdBlock list might happen to
work but its syntax is invalid. It looks like anony101 used the old DOS
wildcarding syntax rather than a valid regex syntax.

Even Microsoft's example of:

<wf:blockRegex> <![CDATA[ads.contoso.com\/.*]]> </wf:blockRegex>

is not a valid regular expression unless its author actually intended to
match on ANY character where the first 2 period characters show up. The
above regex would also match on "adsXcontosoYcom/" followed by any
characters (in regex, "\/" is just an escaped forward slash, so it
matches "/"). And why is a backslash shown in the pattern at all? The
path delimiter in a URL is the forward slash, which doesn't need
escaping. You do not go to http:\\www.intel.com\index.html. You go to
http://www.intel.com/index.html. There's something goofy in how
Microsoft says URLs are syntaxed.
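
For what it's worth, here is how an ordinary regex engine reads Microsoft's example. This is a sketch in Python's re module, not IE8's implementation, and the test strings are invented. In this flavor "\/" is simply an escaped forward slash, so the pattern wants a "/" after "com", and the two unescaped periods behave as wildcards.

```python
import re

# Pattern text copied from Microsoft's example.
ms_example = r"ads.contoso.com\/.*"

def found(text):
    """Return True if Microsoft's example pattern occurs in text."""
    return re.search(ms_example, text) is not None

with_slash = found("http://ads.contoso.com/banner.gif")  # ordinary URL
wildcard_dots = found("adsXcontosoYcom/anything")        # periods match X and Y
with_backslash = found("ads.contoso.com\\banner.gif")    # backslash, no "/"

print(with_slash, wildcard_dots, with_backslash)
```

The escaped slash is harmless here but also unnecessary, which fits the suspicion that the example was written without much care.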

When I've looked at some other XML RSS feed files, they'll specify the
CDATA as something like:

a.*\.contoso\.com.*

which is "a" followed by zero or more characters, a period, "contoso", a
period, "com", and then zero or more characters. The periods in the URL
are properly escaped. I'm not sure the .* is needed at the end, or at
the beginning as used in some regex strings that I've seen. That is,
what's the difference between ".*host\.domain\.tld.*" and
"host\.domain\.tld"? Is a substring search performed to find it
anywhere in the URL? Or is there an assumption that the regex string is
anchored after the URL scheme (as if http:// or https:// were implied,
since I did read this filtering only works on those URL schemes)? In
regex, if I were to anchor the left side of the string, I'd use
"^a.*\.contoso\.com.*" if this string follows immediately after the
http:// protocol prefix and must span the entire searched string rather
than match a substring (which is why the trailing .* would be needed).
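
The anchoring question can be sketched the same way. In the Python snippet below, the idea that the matcher might strip the scheme before testing is purely my assumption, and both sample strings are invented. It shows why the answer matters: the anchored pattern only works if the tested string really starts at the host.

```python
import re

anchored = r"^a.*\.contoso\.com.*"
unanchored = r"a.*\.contoso\.com"

# Assumption A: the engine strips "http://" before matching.
host_first = "ads.contoso.com/banner.gif"
# Assumption B: the engine matches against the full URL.
full_url = "http://ads.contoso.com/banner.gif"

anchored_on_host = re.search(anchored, host_first) is not None
anchored_on_full = re.search(anchored, full_url) is not None  # "^a" vs "h"
unanchored_on_full = re.search(unanchored, full_url) is not None

print(anchored_on_host, anchored_on_full, unanchored_on_full)
```

Importing a test rule with and without the ^ and seeing which one blocks would be one way to find out what IE8 really does.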

Is there better documentation on the XML RSS feed file (used for
subscriptions for the InPrivate Filter)? I'd like to know that what I'm
specifying to search on is what IE8 actually uses. I don't see a
problem in the XML used in the RSS feed file that anony101 came up with,
but it looks like he didn't employ proper regex syntax for the CDATA
values (which are the URL substrings on which to block).
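
If the CDATA values really are regex, the entries could be generated with the periods escaped rather than typed by hand. Below is a rough Python sketch that emits only the <wf:blockRegex> element shown in Microsoft's example. The function name is made up, the rest of the feed (header, items, descriptions) is deliberately left out because I have no documentation for it, and whether IE8 cares about the spaces around the CDATA section is unknown to me.

```python
import re

def block_regex_element(host: str) -> str:
    """Build one <wf:blockRegex> line for a literal host name.

    Hypothetical helper: re.escape() backslash-escapes regex
    metacharacters, so each period matches only a real period.
    """
    pattern = re.escape(host)
    return f"<wf:blockRegex><![CDATA[{pattern}]]></wf:blockRegex>"

print(block_regex_element("ads.contoso.com"))
```

That is only as good as the assumption that IE8 honors the backslash escapes, which is exactly what I'm asking about.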
  #2 (permalink)  
Old 09-07-2009
Trial_and_Error
 

Posts: n/a
RE: Does the XML file for import to IE8's InPrivate filter actually us
Just a user's comment back to you though your question is well over my
technical knowledge. anony101's xml file is functioning well (blocking about
98% of external ads) despite the syntax issue you mention.
  #3 (permalink)  
Old 09-07-2009
VanguardLH
 

Posts: n/a
Re: Does the XML file for import to IE8's InPrivate filter actually us
Trial_and_Error wrote:

> Just a user's comment back to you though your question is well over my
> technical knowledge. anony101's xml file is functioning well (blocking about
> 98% of external ads) despite the syntax issue you mention.


I commented on that. The period character, in regular expressions,
means to match on any single character in that position. Not just the
period character itself but also a-z, 0-9, $, &, -, and so on. So:

startrek.org

would match on "startrekborg" and that might be a parameter in a URL,
like in "sci-fi.com/characters&augmented=startrekborg". The regex
matches on the wrong substring in the URL. Yes, it'll work to match on
startrek.org but it'll match on ANY character at the position with the
non-escaped period character. Since it is searching for the substring
to appear anywhere in the URL, it might not be on the FQDN portion.
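
A quick illustration of the wrong-part-of-the-URL problem, in Python as a stand-in for whatever IE8 does. The host_matches helper is hypothetical and the second URL is invented; the helper just shows what "match on the FQDN only" would look like, not something the InPrivate Filter is known to support.

```python
import re
from urllib.parse import urlsplit

loose = "startrek.org"      # unescaped period
strict = r"startrek\.org"   # escaped period

url = "http://sci-fi.com/characters&augmented=startrekborg"

# Substring search over the whole URL.
loose_hit = re.search(loose, url) is not None    # hits "startrekborg"
strict_hit = re.search(strict, url) is not None  # no literal ".org" here

def host_matches(pattern, url):
    """Hypothetical check that searches only the host part of the URL."""
    host = urlsplit(url).hostname or ""
    return re.search(pattern, host) is not None

print(loose_hit, strict_hit, host_matches(loose, url))
```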

There was mention in an MS blog for IE8 that Microsoft was supposed to
come out with some documentation but they haven't yet, other than the
measly 2 articles that I've found. There was mention of the hAtom
microformat, and I looked at that. It just defines the structure for the
data, not its value, so what gets specified in the CDATA string is
specifically whatever IE8 will handle.

Considering how Microsoft has yet to embrace regular expressions
(probably because they smack too much of Unix origins), I wouldn't be
surprised if what Microsoft calls regex is just the old and severely
deficient DOS wildcarding syntax. Microsoft doesn't even comply with
their statements regarding how to escape the special characters in their
own examples.