CastleCops, Internet Crime Fighters
Need help with regex filter.
All -> FavForums -> Mailwasher - Troubleshooting / General
Cowboy

Guest
IP: 213.112.*.*






Posted: Mon Nov 24, 2003 6:02 pm

denn988, thank you very much for the filter. That is exactly what I wanted!
I too have tested it for a day now, and like you about the second thing I did was to remove the [!/] to see if I would get more hits.
I did get more hits but unfortunately I also got one bad hit, so I put the [!/] back in. I have had no bad hits with that whatsoever.
I'm currently trying to figure out if there is any other way to catch the non-comment garbage.

I've also got a new problem because I'm getting a lot more non-sex spam lately which doesn't contain much in the way of words you'd like filtered. Need to think about what to do about those.

Plus, I'm getting spam where the letters of the visible text come through in the raw source as strings of numbers like #233 and so on, and only the whited-out text is normal in the raw source.

Ikeb

Special Response Team
Forums Admin

Joined: Apr 20, 2003
Posts: 16543

Forums Admin Moderators MVP Premium SRT Team CC Committee Team F@H

Posted: Mon Nov 24, 2003 7:22 pm

Anonymous wrote:
I forgot to turn the HTML off when I posted

Those examples, if sent as HTML, would appear in the raw text as:

1 0 & l t ; 2 0 & l t ; 3 0

and

3 0 & g t ; 2 0 & g t ; 1 0

I had to place spaces between each character above to get them to post.

The brackets must be substituted when converting them to the HTML raw text in order to keep the translator from being confused.

Haven't you learned anything from this exercise? Razz

You should be able to put garbage tags between the key letters thusly:

10&<garbage> lt;20&<garbage> lt;30
and
30&<garbage> gt;20&<garbage> gt;10

But now I see that it isn't required here for some reason. The following was edited using the "<" and ">" characters:

10&<garbage>lt;20&<garbage>lt;30
and
30&<garbage>gt;20&<garbage>gt;10

I would have thought this should have made the <garbage> invisible. Apparently HTML tags aren't fully functional here?

An anti-SPAMing move no doubt! Rolling Eyes
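
Editor's note: the bracket substitution described in this post can be reproduced outside the forum software. The lines below are only a minimal sketch in Python (a language nobody in this thread uses, chosen here for illustration) and assume nothing beyond the standard `html` module. They show why a filter that reads raw HTML sees `&lt;` and `&gt;` where the reader sees `<` and `>`.

```python
import html

# The two comparisons from the post, as a person would type them.
plain = ["10<20<30", "30>20>10"]

# html.escape() swaps <, > and & for entities, which is the form
# found in the raw source of an HTML message.
raw = [html.escape(s) for s in plain]
print(raw)  # ['10&lt;20&lt;30', '30&gt;20&gt;10']

# html.unescape() reverses the substitution, giving back what the reader sees.
roundtrip = [html.unescape(s) for s in raw]
```

A filter written against the displayed text would therefore look for different characters than one written against the raw text.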

denn988

Guest
IP: 66.44.*.*






Posted: Mon Nov 24, 2003 7:52 pm

Cowboy wrote:
denn988, thank you very much for the filter. That is exactly what I wanted!
I too have tested it for a day now, and like you about the second thing I did was to remove the [!/] to see if I would get more hits.
I did get more hits but unfortunately I also got one bad hit, so I put the [!/] back in. I have had no bad hits with that whatsoever.
I'm currently trying to figure out if there is any other way to catch the non-comment garbage.

I've also got a new problem because I'm getting a lot more non-sex spam lately which doesn't contain much in the way of words you'd like filtered. Need to think about what to do about those.

Plus, I'm getting spam where the letters of the visible text come through in the raw source as strings of numbers like #233 and so on, and only the whited-out text is normal in the raw source.


Cowboy,

You could try setting up two versions of the filter.

The first one (higher in priority) would have the !/ included, while the second one would be made broader by not including those characters.

I would not auto delete based on either of them alone....but...there may be additional rules that you could add to one or the other that would give you two different keys to trap on.

That might make the possibility of false positives rare enough to auto-delete.

Give it some thought. You have some good strategy ideas...and if you could learn regex you might be able to devise some good filters.
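
Editor's note: the two-filter idea with "two different keys" can be sketched roughly as follows. This is Python rather than a MailWasher filter, and several things are assumptions: the strict pattern is a guess at what the filter with [!/] looks like (the original is not shown on this page), `triage` and `extra_rule` are hypothetical names, and the second key is whatever additional rule you decide to trust.

```python
import re

# Assumed strict form: "!" or "/" must follow the opening bracket.
STRICT = re.compile(r"[a-z]<[!/][^<]*?>[a-z]")
# Broad form: any tag squeezed between two letters.
BROAD = re.compile(r"[a-z]<[^<]*?>[a-z]")

def triage(body, extra_rule=None):
    """Return 'delete', 'flag' or 'keep' for a raw message body.

    extra_rule is a hypothetical second key: a compiled regex that must
    also match before anything is deleted automatically.
    """
    second_key = bool(extra_rule and extra_rule.search(body))
    if STRICT.search(body):
        # The strict filter alone only flags; both keys together delete.
        return "delete" if second_key else "flag"
    if BROAD.search(body):
        return "flag"
    return "keep"
```

With this layout neither filter deletes on its own, which matches the caution in the post: only a message that trips the strict filter and the extra rule is removed without review.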


As far as the number strings (#233)...

Could you post a few examples??

Cowboy

Guest
IP: 213.112.*.*






Posted: Mon Nov 24, 2003 9:25 pm

I already have the two versions set up like you say. (I put the second filter back when you suggested it.)
I'm running both of them to override the friends list, and they are the two top filters.

But I cannot see how the first filter does anything the second filter won't do. I'm rather sceptical of the second filter and would like something more precise, just like the first one.
If the first filter keeps working perfectly for an extended period of time I will switch it to auto. I won't do it right away.

I will think about combining the second filter with something else.
My hope of learning regex shrinks every time I try. The commands list in the help files seems to be incomplete, and I don't quite understand the examples either.

Here's an example of numbers for letters:

----------------------------------
<html?<br><font color=white>Viziertoaskhisson,whoownedthetruth,addingthat,dearlyALCIBIADES: You did.DICAEOPOLIS</font><font color=black><br>
Unleash The Power<font color=white>that have responded.</font><font color=black><br>
Of Your Digital Cable <font color=white>she always did, At night she would not come if it was dark, for she</font><font color=black><br>
-----------------------------------------

And so on...

Cowboy

Guest
IP: 213.112.*.*






Posted: Mon Nov 24, 2003 9:30 pm

Damn numbers come out as text whatever I do.
How do I post the code so you can see it?

stan_qaz

Premium Member


Joined: Mar 31, 2003
Posts: 10635

Premium

Posted: Mon Nov 24, 2003 11:48 pm

Pretty bogus but this works, add a space after every & sign.

& #115;& #097;& #108;& #101;& #115;& #064;& #115;& #116;& #097;& #110;& #109;& #105;& #108;& #108;& #101;& #114;& #046;& #105;& #110;& #102;& #111;

same thing with no spaces:

sales@stanmiller.info
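
Editor's note: the numbers in this example are decimal character codes (HTML numeric character references), and the conversion can be checked in both directions. The following is a small Python sketch using only the standard `html` module; the variable names are made up for illustration.

```python
import html

# The encoded string from the post, with the spaces after "&" removed.
encoded = ("&#115;&#097;&#108;&#101;&#115;&#064;&#115;&#116;&#097;&#110;"
           "&#109;&#105;&#108;&#108;&#101;&#114;&#046;&#105;&#110;&#102;&#111;")

# html.unescape() turns each numeric reference back into its character.
decoded = html.unescape(encoded)
print(decoded)  # sales@stanmiller.info

# Going the other way: three-digit decimal codes, with a space after
# each "&" so that a forum will show the code instead of the character.
postable = "".join(f"& #{ord(c):03d};" for c in decoded)
```

The `postable` string is the "add a space after every & sign" trick from the post, generated rather than typed by hand.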

Ikeb

Special Response Team
Forums Admin

Joined: Apr 20, 2003
Posts: 16543

Forums Admin Moderators MVP Premium SRT Team CC Committee Team F@H

Posted: Tue Nov 25, 2003 12:34 am

denn988 wrote:
As to the '[^<]' in the Regex...

It is there to prevent the filter from trapping a situation where there are two opening brackets prior to a closing bracket.

Example:

10<20<30
30>20>10

The above is NOT html, but represent two valid mathematical expressions.

You don't want the filter to trap on something like that.

Upon further thought, it occurs to me that valid expressions such as the examples you gave above could well include the "something<something_else>still_something_more" sequence.

Cowboy, you mentioned getting a false positive. Mind sharing what sequence triggered the filter?

Anyway I can think of another valid sequence that would trigger the filter, a math example again:
Code:
y<x,y>z

So.....

denn988 wrote:
Before you ask....

You could have another rule in the filter that looks for a "Content-Type: text/html"....but it would be something of a useless rule. There would be no easy way to write the filter so that it would only look at the message part that was HTML, in those cases that were multipart messages.

Anything that you would try to do with regex to try to do that would be even more CPU intensive than the 'multi-hit' filter would be.

Well now I'm asking! Surprised

True, the content type found may well not be associated with the portion of the message which triggers the match, but there are other techniques, no? Couldn't one look for the <html> tag, then finish the match on the </html> tag? Here's my untested regex to do that:
Code:
(?# words in html text, broken by html tags)<html>[^<]*?[a-z]<[^<]*?>[a-z][^<]*?</html>

Now I'm a little more mindful of processing time so this expression probably isn't practical because of the load. Sad

Well, if nothing else, this exercise drives home that working with regular expressions sure has its limitations. Surprised
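
Editor's note: the false-positive concern is easy to reproduce. The sketch below runs the broad pattern quoted later in this thread against a split-word spam fragment and against the two math examples. It uses Python's `re` module, which may not behave identically to the regex engine inside MailWasher, and the sample strings other than the math examples are invented.

```python
import re

# The broad filter as quoted in this thread.
BROAD = re.compile(r"[a-z]<[^<]*?>[a-z]")

samples = {
    "split word": "vi<!-- junk -->agra",  # invented spam-style fragment
    "math": "y<x,y>z",                    # the valid sequence from the post
    "numbers": "10<20<30",                # digits only, no closing bracket
}

# True means the filter would fire on that text.
hits = {name: bool(BROAD.search(text)) for name, text in samples.items()}
print(hits)  # {'split word': True, 'math': True, 'numbers': False}
```

The math example fires for the same reason the spam fragment does: a letter, something in angle brackets, then another letter.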

denn988

Guest
IP: 66.44.*.*






Posted: Tue Nov 25, 2003 1:33 am

Cowboy wrote:
Damn numbers come out as text whatever I do.
How do I post the code so you can see it?


stan_qaz wrote:
Posted: Mon Nov 24, 2003 6:48 pm Post subject:

--------------------------------------------------------------------------------

Pretty bogus but this works, add a space after every & sign.

& #115;& #097;& #108;& #101;& #115;& #064;& #115;& #116;& #097;& #110;& #109;& #105;& #108;& #108;& #101;& #114;& #046;& #105;& #110;& #102;& #111;

same thing with no spaces:

sales@stanmiller.info



This would be something that you could filter with a Regexp as follows:

The body contains...
Regular Expression...

Quote:

(&#\d\d\d;){5}


The above will look for 5 consecutive characters in the format specified.


You could also form the Regexp to look for a certain number of occurrences in total:

Quote:

(&#\d\d\d;.*?){15}


The above will look for 15 occurrences in the entire body.


By the way....I have not tested these Regexps in any way....
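
Editor's note: since these were posted untested, here is one way to try them. This is a Python sketch, so results in MailWasher's own engine may differ. Two things are assumptions added here: the `re.DOTALL` flag, so that `.*?` can cross line breaks, and the `count_entities` helper, a hypothetical alternative that counts codes of two or three digits instead of matching them all in one expression.

```python
import re

RUN_OF_5 = re.compile(r"(&#\d\d\d;){5}")
# Assumption: DOTALL so the lazy .*? may span several lines of the body.
TOTAL_15 = re.compile(r"(&#\d\d\d;.*?){15}", re.DOTALL)
# Hypothetical alternative: count the codes instead of matching them all.
ENTITY = re.compile(r"&#\d{2,3};")

def count_entities(body):
    """Number of two- or three-digit numeric codes in the body."""
    return len(ENTITY.findall(body))

encoded_run = "&#115;&#097;&#108;&#101;&#115;"  # five codes in a row
# Fifteen codes, one per line, each preceded by ordinary text.
scattered = "\n".join(f"word &#{100 + i};" for i in range(15))
ordinary = "Caf&#233; menu &#169; 2003"  # legitimate use of two codes

print(bool(RUN_OF_5.search(encoded_run)), bool(RUN_OF_5.search(ordinary)))
print(bool(TOTAL_15.search(scattered)), count_entities(scattered))
```

Counting and comparing against a threshold avoids the backtracking that the `{15}` form can do on a long body that contains many codes but falls just short of a match.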

Ikeb

Special Response Team
Forums Admin

Joined: Apr 20, 2003
Posts: 16543

Forums Admin Moderators MVP Premium SRT Team CC Committee Team F@H

Posted: Tue Nov 25, 2003 6:11 am

denn988 wrote:
This would be something that you could filter with a Regexp as follows:

The body contains...
Regular Expression...

Quote:

(&#\d\d\dWink{5}


The above will look for 5 consecutive characters in the format specified.

You could also form the Regexp to look for a certain number of occurrences in total:

Quote:

(&#\d\d\d;.*?){15}


The above will look for 15 occurrences in the entire body.

By the way....I have not tested these Regexps in any way....

That's OK. We'll test 'em for you! Right Cowboy? Smile

Actually I've been playing with encoded characters the last couple of days! I find a lot of spam tends to use this technique to obscure their links. Here's a regex to fire on http links with more than 10 encoded characters:
Code:
<\s*a[\s\w=]+(?s)href=(3D)??"?http://[^>]*?(&#\d{1,3};.*){10,}?[^>]*>

I haven't run this filter for very long yet so I can't vouch for its capability to sniff out SPAM.

P.S. Denn 988, I notice that in the part of your post I quoted, your regex ends up with a couple of the characters converted to a smiley. I could have fixed it I suppose but that would be changing what you posted so I left it be.....
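
Editor's note: the link filter above puts an inline `(?s)` in the middle of the pattern, which recent versions of Python's `re` module reject, so it cannot be pasted into Python unchanged. The sketch below is not a translation of that regex but a swapped-in two-step approach with the same goal: find anchor tags that point at an http address, then count the encoded characters inside each tag. The function name, the threshold parameter and the sample addresses are all hypothetical.

```python
import re

# Step 1: every opening <a ...> tag, even if it spans several lines.
A_TAG = re.compile(r"<\s*a\b[^>]*>", re.IGNORECASE | re.DOTALL)
# Step 2: numeric codes of one to three digits, as in the original filter.
ENTITY = re.compile(r"&#\d{1,3};")

def obscured_links(body, threshold=10):
    """Return the anchor tags with an http link and more than
    `threshold` encoded characters inside the tag."""
    flagged = []
    for tag in A_TAG.findall(body):
        if "http://" in tag.lower() and len(ENTITY.findall(tag)) > threshold:
            flagged.append(tag)
    return flagged

# Invented sample: one link with an encoded host name, one plain link.
encoded_host = "".join(f"&#{ord(c)};" for c in "example.invalid")
suspicious = f'<a href="http://{encoded_host}/buy">'
body = suspicious + 'click</a> or <a href="http://example.org/">this</a>'
print(obscured_links(body))
```

Unlike the single expression, this version ignores the quoted-printable `=3D` case and does not care where in the tag the codes appear, so treat it as a starting point for experiments only.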

Cowboy

Guest
IP: 213.112.*.*






Posted: Tue Nov 25, 2003 5:29 pm

The filter should also look for the code with 2 digits instead of 3. I don't know if 1 digit is possible. Don't think so.
I'm trying these out now. No bad hits on startup.

The bad hit mail you ask for is unfortunately gone. I didn't think to save it and hotmail deleted it for me. Sad
I looked it through first, and it looked like normal html to me, although I could not find what caused the hit. Should have looked harder.
I think the message had a yahoo groups sponsor message at the bottom but I'm not sure that did it. I haven't had any other hits from yahoo groups or that particular sender.

Ikeb

Special Response Team
Forums Admin

Joined: Apr 20, 2003
Posts: 16543

Forums Admin Moderators MVP Premium SRT Team CC Committee Team F@H

Posted: Tue Nov 25, 2003 5:42 pm

Cowboy wrote:
I looked it through first, and it looked like normal html to me, although I could not find what caused the hit. Should have looked harder.

Next time you check out what's causing the hit, use TRegExpr. Just save the message into a file, copy the filter's regex into TRegExpr and run it against the file you saved.
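
Editor's note: for anyone without TRegExpr at hand, the same check can be approximated in a few lines of Python. This is only a stand-in sketch. Python's regex engine is not the one MailWasher or TRegExpr uses, so an expression may behave slightly differently, and the function name and the file name in the comment are hypothetical.

```python
import re

def show_hits(pattern, text, context=15, flags=0):
    """Return (matched text, surrounding text) for every match, with up
    to `context` characters of the message on either side."""
    hits = []
    for m in re.finditer(pattern, text, flags):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        hits.append((m.group(0), text[start:end]))
    return hits

# Usage with a message saved to disk (the file name is hypothetical):
# with open("saved_message.txt", encoding="latin-1") as fh:
#     for matched, around in show_hits(r"[a-z]<[^<]*?>[a-z]", fh.read()):
#         print(repr(matched), "in", repr(around))
```

Seeing the few characters around each hit is usually enough to tell which tag set the filter off.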

TimeGhost

Major
Major


Joined: Apr 11, 2003
Posts: 750
Location: USA
Team F@H

Posted: Tue Nov 25, 2003 5:56 pm

I've been reading this thread with delight. The only thing I want to add is that Gary's "HTML Spam Tricks" and "Questionable Links" filters deal with some of the issues that you're discussing.

The whole set of filters can be downloaded at:

www.w5hq.com/MailWasher/MailWasherFilters.txt

Ikeb

Special Response Team
Forums Admin

Joined: Apr 20, 2003
Posts: 16543

Forums Admin Moderators MVP Premium SRT Team CC Committee Team F@H

Posted: Tue Nov 25, 2003 6:31 pm

I consider Gary's stuff to be the baseline to which I've added and subtracted based on other factors. For a while I resisted the regex allure because I run POPFile with great success. I subsequently turned off blacklisting and no longer use a friends list. I even disabled all of Gary's filters for a while.

But even POPFile gets the rare false positive so now I justify the regex effort as a means of reducing false positives. I don't clutter up my display with anything that a) passes all other than the POPFile "good" email filters or b) gets hit by any regex filter that has never had a false positive.

But to be perfectly honest, the rare false positive isn't a show stopper for me. In reality, improving the regex filters I run has become a bit of fun time for me .... something I now do instead of doing a crossword puzzle or playing a game of Solitaire on my computer.

And it's all Denn988's fault for egging me on! Razz Wink

BTW, in another topic, Denn988 alluded to problems with Gary's filters but he hasn't spilled the beans ... yet! Wink

Cowboy

Guest
IP: 213.112.*.*






Posted: Tue Nov 25, 2003 9:44 pm

I just received a mail (spam actually) that triggered the [a-z]<[^<]*?>[a-z] filter because of a line break < br> between two words that had no space between them.
Is there a way to prevent this? Seems like a possible source of bad hits.

denn988

Guest
IP: 66.44.*.*






Posted: Tue Nov 25, 2003 11:02 pm

Cowboy wrote:
I just received a mail (spam actually) that triggered the [a-z]<[^<]*?>[a-z] filter because of a line break < br> between two words that had no space between them.
Is there a way to prevent this? Seems like a possible source of bad hits.


You will need to put the ! back in there, at the least.

Quote:
[a-z]<![^<]*?>[a-z]


This whole thing could be solved if MWP would simply allow the choice of filtering on the RAW text...or the translated text (after it removes the HTML code).
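
Editor's note: both points in this reply can be tried out in a short Python sketch, with the usual caveat that Python's regex engine is standing in for MailWasher's. The first part compares the broad pattern with the version that has the ! put back. The second part imitates "translated text" with a crude tag-stripping substitution, which is an assumption made here and not how MWP actually renders mail. The sample strings are invented.

```python
import re

BROAD = re.compile(r"[a-z]<[^<]*?>[a-z]")
WITH_BANG = re.compile(r"[a-z]<![^<]*?>[a-z]")

cases = [
    "first line<br>second line",  # harmless line break, no space around it
    "che<!-- zq -->ap pills",     # word split by an HTML comment
]

# For each case: (broad filter fires, filter with ! fires)
results = [(bool(BROAD.search(c)), bool(WITH_BANG.search(c))) for c in cases]
print(results)  # [(True, False), (True, True)]

# Crude stand-in for the translated text: drop anything in angle brackets.
stripped = re.sub(r"<[^>]*>", "", cases[1])
print(stripped)  # cheap pills
```

On the stripped text a plain word filter would be enough, which is the advantage of being able to choose between raw and translated text.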
