Comment On Eliminating Shady Characters

"These characters are real troublemakers," writes Cody. "You know the guys. Hair slicked back, boots, leather jackets. Or else, pants hanging down to their knees, pagers going off. Wouldn't it be great if we could just get rid of them? Wouldn't that make life a whole lot easier?" [expand full text]
« PrevPage 1 | Page 2 | Page 3Next »

Re: Eliminating Shady Characters

2007-04-27 09:05 • by Welbog
If only we could get ridoff those guys who like to say "first".

Re: Eliminating Shady Characters

2007-04-27 09:05 • by timb (unregistered)
owned them good!

Re: Eliminating Shady Characters

2007-04-27 09:05 • by timb (unregistered)
owned them good!

Re: Eliminating Shady Characters

2007-04-27 09:06 • by Mcoder
Hey, what a lovely manual "quotation" system! But I'll still keep using prepared statements. At least they don't corrupt my data :)

Re: Eliminating Shady Characters

2007-04-27 09:13 • by ParkinT
"ridoff", is that related to "rip-off"?

Re: Eliminating Shady Characters

2007-04-27 09:14 • by WIldpeaks
So the wtf is the lame joke ?

Re: Eliminating Shady Characters

2007-04-27 09:15 • by Alex G. (unregistered)
I don't really get it. Either that or I just look too much into it.

Re: Eliminating Shady Characters

2007-04-27 09:21 • by tiro
133930 in reply to 133929
Alex G.:
I don't really get it. Either that or I just look too much into it.


It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

Re: Eliminating Shady Characters

2007-04-27 09:30 • by s (unregistered)
133931 in reply to 133930

It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.


Except when you know the input shouldn't contain any of these characters. Usually the replacement is more heavyweight then. But there happen situations when the target system has magic characters which can't be quoted.

This entry is not a WTF without context. At worst it's a "could be done better" practice.

Re: Eliminating Shady Characters

2007-04-27 09:31 • by Steve (unregistered)
133932 in reply to 133930
tiro:
. . .
It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

It may be something of a tautology but the general case is not the specific case.

Sometimes it makes perfect sense to just eat the character or replace it with something benign.

It all depends on context.


Re: Eliminating Shady Characters

2007-04-27 09:34 • by loose (unregistered)
Totally agree. The function does exactly what it says on the tin!

Re: Eliminating Shady Characters

2007-04-27 09:37 • by bonzombiekitty
133934 in reply to 133930
tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.


It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.


Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){
return a_sValue.replaceAll("[\\\\'|\\\\"|&|]", " ");
}

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.

Re: Eliminating Shady Characters

2007-04-27 09:40 • by Nelle (unregistered)
I don't get it ... Sorry, but I don't ...
It may be funny or WTF, but it is far beyond my comprehension ...
Perhaps when I pass a couple of Sun's Java Certification tests I may be able to laugh at this or to say WTF, but until then can someone please spell out what exactly is wrong with this peace of code ?

Re: Eliminating Shady Characters

2007-04-27 09:41 • by Licky Lindsay
133936 in reply to 133931
s:
Except when you know the input shouldn't contain any of these characters.


In which case the input should be totally rejected, not "sanitized".

Re: Eliminating Shady Characters

2007-04-27 09:44 • by TheJasper
133937 in reply to 133929
Alex G.:
I don't really get it. Either that or I just look too much into it.


Well, I think its just a little strange that the special characters gotten rid of are ' " and &. I would expect !@#$%^&*() or something to also be included. But then the submission says very little about what kinds of strings are being manipulated or for what purpose.

I certainly don't get the 'clever' flavor text surrounding the code. It just feels like someone trying to be funny and falling flat on his face. It certainly doesn't explain the situation, unless the situation was that someone wanted to get rid of some special characters and the submitter thought that was bad.

Re: Eliminating Shady Characters

2007-04-27 09:47 • by s (unregistered)
133938 in reply to 133934
Regexp is almost always slower than straight-off character replace, even repeated 3 times. The original code is enormously more readable too.

Sure if you can afford the comfort of discarding a piece of data and reporting an error instead, go ahead.

Re: Eliminating Shady Characters

2007-04-27 09:48 • by TheJasper
133939 in reply to 133934
bonzombiekitty:

Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){
return a_sValue.replaceAll("[\\\\'|\\\\"|&|]", " ");
}

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.


I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.

Re: Eliminating Shady Characters

2007-04-27 09:53 • by Stephen Cochran (unregistered)
133940 in reply to 133936
In which case the input should be totally rejected, not "sanitized".

There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

Re: Eliminating Shady Characters

2007-04-27 09:54 • by loose (unregistered)
It is all about context. The guy who wrote it may have had a boss that said "I'm sick of those goddam quotes and ampersands, get ridoff em!". In that case, the function is fine!

Re: Eliminating Shady Characters

2007-04-27 09:55 • by Markp
133942 in reply to 133932
Steve:
tiro:
. . .
It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.

It may be something of a tautology but the general case is not the specific case.

Sometimes it makes perfect sense to just eat the character or replace it with something benign.

It all depends on context.




Exactly, how do we know that the string isn't a ", ' or & delimited list of integers that the coder wants to generalize into a space-delimited list. That's obviously not likely, but man, we really can't comment on something as valid as this without any context.

Re: Eliminating Shady Characters

2007-04-27 10:00 • by Strider (unregistered)
133943 in reply to 133933
loose:
Totally agree. The function does exactly what it says on the tin!



Agreed
This is the first time I have felt that this was not WTF worthy.
I would write this code if I needed this functionality... although I'd never leave a typo in the comment, I'm too o.c. fourthat.

CAPTCHA: poindexter - a person who is intelligent but socially inept; a nerd...hehe*snort*

Re: Eliminating Shady Characters

2007-04-27 10:01 • by shadowman
Yeah, I'm not sure what's wrong here. Assuming the programmer had a good reason to remove single quotes, double quotes, and ampersands, that seems like a perfectly reasonable way to do it. Sure, I'm sure there are other ways, even more efficient ways to do it -- but the code itself really isn't bad or buggy or wtf'ey.

So I'm guessing the problem lies in my assumption that there was a good reason to remove those things. But we really need more information here.

Re: Eliminating Shady Characters

2007-04-27 10:06 • by Independent
I will really hate myself for what I'm about to say, but...

The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".

Re: Eliminating Shady Characters

2007-04-27 10:08 • by Dan (unregistered)
133947 in reply to 133936
Licky Lindsay:
s:
Except when you know the input shouldn't contain any of these characters.


In which case the input should be totally rejected, not "sanitized".


What? It is this type of thought that keeps some systems from being extremely easy to use.

A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???

Re: Eliminating Shady Characters

2007-04-27 10:09 • by .. (unregistered)
133948 in reply to 133939
[quote user="TheJasper"]
if(a_sValue != null){
return a_sValue.replaceAll("[\\\\'|\\\\"|&|]", " ");
}
[/quote]

I must admit I thought the same thing. But it is hardly a wtf. In fact, one could even argue that is more readable the way it is written. I wouldn't be surprised if the compiler could even optimise it to the same thing. At worst, it's just a bit inefficient, but not shockingly so.[/quote]

A single-character search&replace is almost always faster than a regexp. A smart compiler in the above case will realize the first argument is a static string and statically compile the replacement code. A somewhat dumber compiler will assume the replacement expression -could- be a variable, and so will first pass the first argument through a regexp parser, then generate copy&replace code, then parse the value using said code, all during runtime.

People think stuff like eval() or regexp replaces is harmless because it gets magically changed during original compilation. But what about regex=someExternalUserInput; somestring.replaceAll(regex,target); ?

Re: Eliminating Shady Characters

2007-04-27 10:09 • by Cassis (unregistered)
I had a similar experience @ work

we were pushing XML files to a server that was not under our control, and the software was not XML compliant...

following the request of the customer, we had to replace ">" by "greater than", "<" by "lower than", and so on in the text nodes...

that's the best way for sure, fuck the standards !

Re: Eliminating Shady Characters

2007-04-27 10:12 • by .. (unregistered)
133950 in reply to 133945
Independent:
I will really hate myself for what I'm about to say, but...

The real WTF is that all of you guys argue over the code while the real WTF is in the comments: there is no such an expression as "get ridoff". The idiom is "get rid of smt".


If the WTF is a poor wording of a comment to a valid piece of code, then this story is Worse Than Failure.

Re: Eliminating Shady Characters

2007-04-27 10:13 • by Zen (unregistered)
I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

Re: Eliminating Shady Characters

2007-04-27 10:16 • by akatherder
If you're building a filename or directory in Windows you have to do "something" with the invalid characters. A space or underscore is probably your best bet.

Although the code still doesn't make much sense.

Re: Eliminating Shady Characters

2007-04-27 10:19 • by anon (unregistered)
133953 in reply to 133947
Dan:
Licky Lindsay:
s:
Except when you know the input shouldn't contain any of these characters.


In which case the input should be totally rejected, not "sanitized".


What? It is this type of thought that keeps some systems from being extremely easy to use.

A user should never have to conform to the wills of the system, the system should have to conform to the will of the user. If the system doesn't need quotes and can work fine simply by removing them, why would that not be the obvious solution? You propose that the system should just reject input and ask the user to type it in again without quotes???


Context is always the key. Sometimes sanitizing by dropping/replacing characters is not really a valid option. In those situations one should reject the input (if we're not going to escape it) If this was a password value, we would not want silently to turn a password of &'""''&&" into a bunch of spaces.

I agree this may not be best way to deal with special characters, or maybe not the most efficient. But without know what they are doing with these strings or why they decided these characters are bad, ti's hardly a wtf

Re: Eliminating Shady Characters

2007-04-27 10:19 • by Art (unregistered)
133954 in reply to 133934
bonzombiekitty:
tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.


It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.


Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){
return a_sValue.replaceAll("[\\\\'|\\\\"|&|]", " ");
}

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.


A test run on my PC doing the same operation 10000000 times showed that the character replace outperforms the regexp by a factor of 10 to 1.

I would say the same about readability. The fact that you mistyped it in your post (missing a backslash) says something.


Re: Eliminating Shady Characters

2007-04-27 10:23 • by Thief^
133955 in reply to 133951
Zen:
I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.

Re: Eliminating Shady Characters

2007-04-27 10:25 • by s (unregistered)
133956 in reply to 133951
"cash&items".match(/shit/)

Re: Eliminating Shady Characters

2007-04-27 10:29 • by BA (unregistered)
133957 in reply to 133940
Stephen Cochran:
In which case the input should be totally rejected, not "sanitized".

There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.


Agreed. As an example consider the many ways a phone number can be written down:
800 555 1234
800.555.1234
(800) 555-1234
800-555-1234
555-1234
555.1234

And the internal format the application expects... 800/555-1234

The answer in this particular case would be: Sanitize when possible, reject when not.

And the real WTF is ragging on people for misspelling in comments when the meaning and intention are perfectly clear.
Maybe the guy is ESL and isn't familiar with English idioms.

For shame worsethanfailure.com, for shame.

Re: Eliminating Shady Characters

2007-04-27 10:39 • by s (unregistered)
133958 in reply to 133955
Thief^:
Zen:
I had a system "sanitize" my password on account creation. Everyting that was not [a-zA-Z0-9_] got replaced by an underscore. Took me some time to figure out why I couldn't log in.

I've had something similar where a system accepted but truncated the password I gave it on signup, but didn't tell me and didn't truncate the password on login. It's very annoying when a system changes things without even telling you.


I've used a system that truncated the password to 8 characters. I was quite annoyed when after a year of using the system just fine, typing my 12-letter password with special characters at the end I learned my password was really a common dictionary word, all-lowercase. Still beats the hell of a system which required the password to contain at least 2 digits, at least 5 lowercase characters, no other characters than [0-9a-z] and be between 8 and 10 characters long, then refuse anything else.

A basic place for use of the OP code would be a simple search engine entry preprocessing. Remove the quotes after switching the "sentence mode" on, remove '&' as AND operator which is default, leave spaces because otherwise the remaining words would get concatenated.

Re: Eliminating Shady Characters

2007-04-27 10:55 • by Alan (unregistered)
I remember back in the bad old days when I used vb4, access and csv. Quotes were the bane of my life and no matter the amount of times we tried to filter them out, some function somewhere would let em through and screw up the whole system.

Og course we just told off the users - "How dare you use a quote in your item descriptions - its your fault the application lost all your data"

Re: Eliminating Shady Characters

2007-04-27 10:56 • by AssimilatedByBorg
133961 in reply to 133940
Stephen Cochran:
In which case the input should be totally rejected, not "sanitized".

There are many, many systems in which you can't arbitrarily reject data. Many times, you have to accept the input and make sure it doesn't do any harm.

OTOH, what really annoys me is systems that "sanitize" input for no apparent reason. e.g., removes a hyphen from a hyphenated name, or lowercases all but the first letter in a name that has mixed capitalization.

I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...

Re: Eliminating Shady Characters

2007-04-27 10:58 • by dkf (unregistered)
133962 in reply to 133928
WIldpeaks:
So the wtf is the lame joke ?
Either that or Derrick Pallas's lame choice of article. (You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)

Re: Eliminating Shady Characters

2007-04-27 10:59 • by Luke (unregistered)
133963 in reply to 133939
> I wouldn't be surprised if the compiler could even optimise it to the same thing.

I would be surprised. And impressed.

I'm guessing that the direct character replace version (the one featured) is faster, but I'd be willing to bet that if the regex were precompiled then the regex would be faster by a factor of about 3.

As though it mattered. This isn't the kind of site where we are supposed to be picking over people's code for being cycles slower! Let's see some *bad* code snippets. Give me exponential time solutions to log (or constant) time problems! :-) I can go to YouTube if I want to see people who can't spell.

Luke

Re: Eliminating Shady Characters

2007-04-27 11:03 • by akatherder
The Sparc workstations at my University had the same password weakness where it only looked at the first 8 characters. I proclaimed my password un-hackable to my roommate. When he wasn't looking I copied my password. Then I told him to watch. I pasted my password then proceeded to slam on keys for about 5 full seconds. Lo and behold it worked when I hit enter.

Re: Eliminating Shady Characters

2007-04-27 11:05 • by Luke (unregistered)
133966 in reply to 133962
> (You might think that scanning three times is a WTF, but it isn't because computers scan for single characters really quickly, whereas scanning for something more complicated is slower.)

Nah. Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.

Luke

Re: Eliminating Shady Characters

2007-04-27 11:06 • by dkf (unregistered)
133968 in reply to 133961
AssimilatedByBorg:
I don't like, but can accept, a system that uppercases everything and leaves it that way, but if it's going to allow some sort of mixed case, e.g., Mckay, why not allow McKay? grrr...
You are aware that capitalization rules are locale-specific? That means if you work with the same data in the USA, the Netherlands and Turkey, you run the risk of getting three different answers if you don't take care...

Re: Eliminating Shady Characters

2007-04-27 11:07 • by Uberbandit (unregistered)
133969 in reply to 133934
bonzombiekitty:
tiro:
Alex G.:
I don't really get it. Either that or I just look too much into it.


It's generally not the best of ideas to just replace stuff with an empty space when you should be escaping it.


Plus it's done a pretty inefficient way. I assume this is java, so an easier way to have done it would be in a single replaceAll() statment. Lets assume he has a reason to replace the special characters with an empty space, he could have just done:

if(a_sValue != null){
return a_sValue.replaceAll("[\\\\'|\\\\"|&|]", " ");
}

But I really can't complain, I've done things like that before because I was lazy.

side note: the four backslashes thing always screws me up.


Why the last | ? I'm not very familiriazed with Java but Shouldn't it be:
return a_sValue.replaceAll("[\\\\'|\\\\"|&]", " ");

CAPTCHA: Again "Slashbot"? I filled my quota of creativity with you last time.

Re: Eliminating Shady Characters

2007-04-27 11:08 • by travisowens
For those who couldn't figure it out (in order of WTF-ness):

1. He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.

2. He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)

3. He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.

4. If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.

Re: Eliminating Shady Characters

2007-04-27 11:11 • by travisowens
Oh I forgot the Javascript syntax for RegEx.Replace. A MAJOR mistake he made is that global replace is not enabled, so he replaces only the first naughty char of each type, but no further repeating chars.

Re: Eliminating Shady Characters

2007-04-27 11:18 • by dkf (unregistered)
133974 in reply to 133966
Luke:
Most regexes (specifically, those without backreferences) will be just as fast as scanning for single characters, as long as you stay away from Unicode. They are asymptotically equivalent, and it takes a pretty big regex to get outside a modern CPU cache.
Regexps can do better when you are scanning for a long-enough multi-character string (the compiled RE should be able to take advantage of Boyer-Moore matching, though I suspect many RE compilers don't bother) but when you're just doing simple search for a single character, brute force works really well on modern hardware because it works with cache predictor units very nicely. This is the only way in which I can explain the fact that when I've tried out various ways of doing search/replace on strings in the past, the stupid brute-force approach worked well enough that it wasn't worth trying something more sophisticated and harder to maintain.

Now, if the original code had been doing the opposite (a complex RE to perform a simple one-character-for-one-character replacement) then we'd have had our WTF! for sure on the grounds of that being a stupidly expensive way to do a trivial operation. We'd have also known for sure that the original author was proving that (s)he could write Perl in any language.

Re: Eliminating Shady Characters

2007-04-27 11:18 • by xeen (unregistered)
I think the wtf here is that the double quote is escaped although it shouldn't be.

Re: Eliminating Shady Characters

2007-04-27 11:19 • by Bart (unregistered)
Without context, I don't see the wtf. It's not in javascript and the comment clearly states that the function removes "some" special characters. Never claimed to remove them all.

Re: Eliminating Shady Characters

2007-04-27 11:21 • by Art (unregistered)
133978 in reply to 133970
travisowens:
For those who couldn't figure it out (in order of WTF-ness):

1. He's doing it in JavaScript, which is ran on the client side, which can easily be disabled, allowing me to submit any naughty chars I want.

2. He's skipped other chars which can cause damage: \0 ; \n \r (I forget the whole list)

3. He's replacing the bad characters with spaces, when he should stop processing if they're that nasty, or convert them into HTML char codes if he just wants to make them safe to process.

4. If a user entered an HTML char code, which contain a &, the third replace will really foul it up.

PS: RegEx isn't always as slow as people think it is, but for a simple replace, Regex is probably overkill. I've found cases where RegEx was much faster than non RegEx methods, it depends on the language & compiler.



1. That is not JavaScript
2. Since we don't know the context, we don't know which characters can do what damage
3. See 2
4. Who says HTML codes are harmless our should be allowed

Until we know what the return value should be used for and where the input value comes from, I see nothing wrong with that code.

Re: Eliminating Shady Characters

2007-04-27 11:22 • by Justin (unregistered)
travisowens: QFT

on another note:
we're fond of doing all sorts of bizarre character replacements/sanitization here.
We really like: ' to become `
And we can't handle , because we do naive CSV parsing
And we translate = into ~ in some places because we're passing x=y type strings as values in GET or POST contexts z=x~y

It makes me giggle every time one of those comes back to bite us in the ass. I especially liked when we just removed ' from people's names, changing O'Reilly into Oreilly. I'm sure they loved that.
« PrevPage 1 | Page 2 | Page 3Next »

Add Comment