Don’t you blaspheme in here


Whether you are developing an MMO chat system or an HTTP proxy, someday you will have to implement a profanity filter in order to remove or prevent abusive language and profanities in the content.

The first idea that comes to mind is to read the text and compare each word to a list of profanities you stored somewhere. This kind of implementation is widely adopted, but it has some drawbacks:

First: filtering is not easy. I am not saying that it is difficult to code a function that does this kind of stuff: you just need to read all the text, compare it to your list of profanities, and replace any word found in the list with asterisks. What is difficult is deciding how to apply such a function. For example, what if in your MMO chat a user wants to ask another user his/her sex? Should you filter out “sex” and substitute it with “***”? I don’t think so.

Second: users use different languages. When I was in Finland, I learnt that some words that are pretty common in Italian are very offensive in Finnish, and vice versa. I bet you don’t want to build a list with all the profanities in all languages and devise a system to prevent this kind of situation: it would take a huge effort in terms of time and research.

Third: your filter can be circumvented. Come on guys, you know what I mean: the world is full of chat systems where “sex” is banned but “s_ex” is not. That’s the nature of this kind of filtering: someone will eventually find a way to bypass your work, since there is always someone smarter than us.
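To make the drawbacks above concrete, here is a minimal sketch of the naive word-by-word approach. The module name naive_filter and the one-word blacklist are hypothetical, chosen only for illustration; the sketch assumes OTP 20 or later for string:split/3 and string:lowercase/1.

```erlang
-module(naive_filter).
-export([filter/1]).

%% Hypothetical blacklist; a real one would be loaded from configuration.
banned_words() ->
    ["sex"].

%% Split the text on single spaces, mask every banned word,
%% then join the words back together with spaces.
filter(Text) ->
    Words = string:split(Text, " ", all),
    Masked = [mask(W) || W <- Words],
    lists:append(lists:join(" ", Masked)).

%% A word is masked only if it matches a blacklist entry exactly
%% (ignoring case); anything else passes through untouched.
mask(Word) ->
    case lists:member(string:lowercase(Word), banned_words()) of
        true  -> lists:duplicate(length(Word), $*);
        false -> Word
    end.
```

Calling naive_filter:filter("what is your sex") masks the last word even though the question is harmless, while naive_filter:filter("s_ex") returns the text unchanged: the first and third drawbacks in a dozen lines.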

Now, the good thing is that you can look at the problem from the opposite direction: what are the words that a user is allowed to insert? I’m talking about a whitelist, which is actually a pretty good solution if your system is used mostly by kids (for example, Lego Universe and others use whitelists). Again, it is pretty heavy work to build a complete list, and some smart guy will eventually be able to fool you without much effort: languages constantly evolve, and words that you considered inoffensive could be used with ambiguous meanings or as slang in some social group.
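A whitelist filter could be sketched along these lines. Again, the module name whitelist_filter and the tiny word list are hypothetical, and masking unknown words (rather than rejecting the whole message) is just one possible policy; string:lexemes/2 and string:lowercase/1 need OTP 20 or later.

```erlang
-module(whitelist_filter).
-export([filter/1]).

%% Hypothetical whitelist; a real one would contain thousands of entries.
allowed_words() ->
    sets:from_list(["hello", "how", "are", "you"]).

%% Split the text into words and mask every word that is not whitelisted.
%% Runs of spaces are collapsed, because lexemes/2 skips empty tokens.
filter(Text) ->
    Allowed = allowed_words(),
    Words = string:lexemes(Text, " "),
    Checked = [check(W, Allowed) || W <- Words],
    lists:append(lists:join(" ", Checked)).

%% Keep a word only if its lowercase form is in the whitelist.
check(Word, Allowed) ->
    case sets:is_element(string:lowercase(Word), Allowed) of
        true  -> Word;
        false -> lists:duplicate(length(Word), $*)
    end.
```

With this list, whitelist_filter:filter("Hello how are u") keeps the first three words and masks "u", which shows how quickly a whitelist gets in the way of ordinary chat unless the list is very complete.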

What comes out of these few lines is that, in my humble opinion, profanity filtering is not an easy topic. You should really understand which target audience will use your product, and then try to build the best product you can. It will not be bulletproof, but it will still be good enough.

Oh, I almost forgot: if you arrived at this post looking for a simple Erlang implementation of a profanity filter, I can show you some stuff.

Here’s part of the code proposed by Magnus Henoch in mod_shit.erl (a gen_mod that was used quite often by system administrators in the ejabberd circle):


bad_words() ->
    ["fuck", "shit"].

filter_string(String) ->
    BadWords = bad_words(),
    lists:foldl(fun filter_out_word/2, String, BadWords).

filter_out_word(Word, String) ->
    {ok, NewString, _} = regexp:gsub(String, Word, string:chars($\*, length(Word))),
    NewString.
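
A note for readers on recent Erlang/OTP releases: as far as I know, the regexp module used above is no longer shipped, and the re module is its replacement. The following is my own adaptation of the same idea using re:replace/4, not Magnus Henoch's code; the module name re_filter is hypothetical.

```erlang
-module(re_filter).
-export([filter_string/1]).

bad_words() ->
    ["fuck", "shit"].

%% Apply the masking step once per bad word, threading the string through.
filter_string(String) ->
    lists:foldl(fun filter_out_word/2, String, bad_words()).

%% Replace every occurrence of Word with asterisks of the same length.
%% Word is used as a regular expression, so this assumes the list
%% contains no regex metacharacters.
filter_out_word(Word, String) ->
    Mask = lists:duplicate(length(Word), $*),
    re:replace(String, Word, Mask, [global, {return, list}]).
```

Like the original, this matches substrings and is case-sensitive; adding the caseless option to the re:replace/4 call would be one way to also catch capitalised variants.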