Rasika's Blog

My personal knowledgebase

Regex replace for MySQL

Hi All,

Recently I came across a requirement where I had to cleanse the data that I’m inserting in to MySQL database using regular expressions, so I started searching for a function to do this on MySQL, but unfortunately I couldn’t find any. There was one called UDB but that’s also you need to install a module and stuff like that. I gave up searching finally and wrote my own “regex_replace” MySQL function.

Below is the source code of my function;

DELIMITER $$
CREATE FUNCTION  `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))

RETURNS VARCHAR(1000)
DETERMINISTIC
BEGIN 
 DECLARE temp VARCHAR(1000); 
 DECLARE ch VARCHAR(1); 
 DECLARE i INT;
 SET i = 1;
 SET temp = '';
 IF original REGEXP pattern THEN 
  loop_label: LOOP 
   IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;  
   END IF;
   SET ch = SUBSTRING(original,i,1);
   IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
   ELSE
    SET temp = CONCAT(temp,replacement);
   END IF;
   SET i=i+1;
  END LOOP;
 ELSE
  SET temp = original;
 END IF;
 RETURN temp;
END$$
DELIMITER ;

Note:

If you are using MySQL version 5.0.1 or higher, make sure you set the NO_BACKSLASH_ESCAPES mode ON, before you use the above function to replace any characters which are escaped with back slash “\”, ie: \A,\B,etc… See how to set the NO_BACKSLASH_ESCAPES mode here

Example on how to use this function
mysql> select regex_replace('[^a-zA-Z0-9\-]','','2my test3_text-to. check \\ my- sql (regular) ,expressions ._,');

Happy Coding!! :)

About these ads

Single Post Navigation

47 thoughts on “Regex replace for MySQL

  1. Thank you. That was a good one.

  2. bitseeker on said:

    when using:
    SELECT regex_replace(‘[^a-zA-Z0-9]‘, ”, “%$&?/’|test><+-,][)(" )
    then the result is:
    $?'test<-])

    seems that not all characters are being replace ;o)

    • Simon on said:

      Your regex is only matching a single character. Try this:

      SELECT regex_replace(‘[^a-zA-Z0-9]+‘, ”, “%$&?/’|test><+-,][)(" )

      (Note the "+")

  3. Hi bitseeker,

    You are right !! I don’t know why it’s doing that, any way I fixed the problem with the following;

    FUNCTION `regex_replace`(pattern varchar(1000),replacement varchar(1000),original varchar(1000))
    RETURNS varchar(1000)
    DETERMINISTIC
    BEGIN
    DECLARE temp VARCHAR(1000);
    DECLARE ch VARCHAR(1);
    DECLARE i INT;
    SET i = 1;
    SET temp =”;
    IF original REGEXP pattern THEN
    loop_label: LOOP
    IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;
    END IF;
    SET ch = SUBSTRING(original,i,1);
    IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
    ELSE
    SET temp = CONCAT(temp,replacement);
    END IF;
    SET i=i+1;
    END LOOP;
    END IF;
    RETURN temp;
    END$$

    Thanks for your comment :)

  4. Misha on said:

    replacement – don’t work

    • Hi Misha,

      What is the problem you face? are you saying the above modification doesn’t replace the character that you have specified in “replacement” parameter?

      Cheers

  5. Misha on said:

    This function regular replace by chars

    FUNCTION regex_replace_char(pattern VARCHAR(255), replacement VARCHAR(255), subject VARCHAR(1000))
    RETURNS varchar(1000)
    BEGIN
    # ver 0.1
    # autor: Bunakov Mykhaylo

    DECLARE i INT;

    SET i = 1;
    SET @temp = `subject`;
    SET @len_prn = CHAR_LENGTH(pattern);

    WHILE i <= @len_prn DO
    SET @ch_ptn = SUBSTRING(pattern,i,1);
    SET @ch_rpl = SUBSTRING(replacement,i,1);

    SET @temp = REPLACE(@temp, @ch_ptn, @ch_rpl);

    SET i=i+1;
    END WHILE;

    RETURN @temp;
    END

  6. Alex on said:

    I modified your function, now it works and replace by pattern:

    FUNCTION regex_replace(pattern varchar(1000), replacer varchar(1000), original varchar(1000))
    RETURNS varchar(1000)
    DETERMINISTIC
    BEGIN
    DECLARE temp VARCHAR(1000);
    DECLARE ch VARCHAR(1);
    DECLARE i INT;
    SET i = 1;
    SET temp =”;
    IF original REGEXP pattern THEN
    loop_label: LOOP
    IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;
    END IF;
    SET ch = SUBSTRING(original,i,1);
    IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp, ch);
    ELSE
    SET temp = CONCAT(temp,replacer);
    END IF;
    SET i=i+1;
    END LOOP;
    ELSE
    SET temp = original;
    END IF;
    RETURN temp;
    END

    • Hi Alex,

      Great work, I missed that bit. I was mainly focusing on removing the characters somehow I got carried away from replacing the unwanted characters, but you spotted it, that’s great. I’ll update the main post with the change.

      Thanks :)

  7. Hi,
    Can you please elaborate on how to create this function in the mysql monitor. mysql> create function… gives all kinds of errors, the first one being: “No database selected”?

    • Hi Jimmy,

      I haven’t tried the MySQL Monitor to create any functions, but I wonder why it would be difficult to create a function any way.
      Could you tell you what you tried to do ? or try with SQLyog

    • I don’t know if you have figured this out yet since it has been over a year since you have posted this comment but the you first need to select a database when you log into MySQL. >> use [db_name]

  8. Hi, I always get following error when trying to create your function:
    #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ” at line 5

  9. When there’s nothing to replace is returing an empty string. I’ve fixed it to return the original.

    DROP FUNCTION IF EXISTS `tdoperr0sv1db`.`regex_replace`;
    CREATE FUNCTION `tdoperr0sv1db`.`regex_replace`(pattern varchar(1000),replacement varchar(1000),original varchar(1000))
    RETURNS varchar(1000)
    DETERMINISTIC
    BEGIN
    declare temp varchar(1000);
    declare ch varchar(1);
    declare i int;

    set i = 1;

    set temp = “”;
    if original REGEXP pattern then
    loop_label: LOOP
    if i>char_length(original) then
    LEAVE loop_label;
    end if;
    set ch = substring(original,i,1);
    if NOT ch REGEXP pattern then
    set temp = concat(temp,ch);

    else
    set temp = concat(temp,replacement);

    end if;
    set i=i+1;
    END LOOP;
    else
    return original;
    end if;
    return temp;

    END$$

  10. I do always get this error when trying to regex_replace (with your example in the original post):

    Unknown column ‘”’ in ‘field list’

    • Got it – I copy/pasted your example code and the

      set temp = ”

      was making some problems. I replaced the quotation marks with normal ones.

  11. When I do
    SELECT regex_replace(‘/A/’, ”, `firstname`) FROM `contacts`
    I have the following error:
    #1054 – Unknown column ‘â€

    Seems i’m mussing somethign, buyt what? I’m using utf8.

    • I suppose, it’s the same problem I had. If you copy/paste the code above be sure to replace the fancy quotation marks with normal ones and redeclare the function.

      (Drop it before with DROP FUNCTION regex_replace)

      • Yes Simon, you are correct. Kissifrot, make sure you replace the ” (double quotation) with ” (two single quotations) in the line set temp = ”

  12. I used two single quotes (ASCII code 39), with the same result.

    • Hi Kissifrot,

      I’ve updated the original post, so that it correctly display the single quotes. Copy the code in to a notepad and then copy it from there to your MySql environment and drop the existing function and run the script to create the function again.

      Then check whether you can run the following query and get the result.

      SELECT regex_replace(‘[^a-zA-Z0-9\-]‘,’ ‘,’2my test3_text-to. check \\ my- sql (regular) ,expressions ._,’);

      make sure to replace the single quotes properly.

      Cheers,

  13. yourmomma on said:

    FYI—BIG WARNING FOUND AFTER A LOT OF BANGING HEAD…This works great EXCEPT FOR:
    This only universally works 1) if NO_BACKSLASH_ESCAPES is set in sql_mode (Server Variable) or 2) if the input string doesn’t contain any character combinations that may be interpreted as escaped commands. IE: \Z \n \t \b etc.
    If the input string contains these escaped commands (intentionally or unintentionally) and NO_BACKSLASH_ESCAPES is not set, they are executed as commands and pollute/change the expected return value. Without the No_BACKSLASH_ESCAPES set, there is no way the regex_replace will work all the time for everything.

    • Hi Yourmomma,

      Thanks for the information. I’m sure it would help most of the readers who are going to use this function to regex_replace their text.
      If the NO_BACKSLASH_ESCAPES is unset you have to use STDSQL systax of escaping problamatic characters in your SQL query, ie: use double single quote ("''") to escape a single quote in your sql query instead of using backslash and single quote ("\'").

      I’ll add a note for this in my original post.

      Once again thanks very much for your valuable input :)

  14. As far as I could understand, this function can replace single characters only. It can’t replace pattern. F.ex, I need to remove HTML tags. To simplify regex let’s try to remove all starting from :

    select regex_replace(”,”,’aaappp’)

    But it does not work.

    And even select regex_replace(”,”,’aaappp’) does not work!
    If I’m right, it’s almost useless function. It’s functionality can be reached with several nested “replace” functions.

    • what do you mean it doesnt work
      try this:

      mysql> select regex_replace(‘[a-z]‘,”,’aaappp2′) as result;

      it show result 2, and with pattern replace any character a-z
      ——–
      |result|
      ——–
      |2 |

      you problem is understand about how regular expression work . :D

  15. I confirmed what Sergy said. It only work to replace single character.

    Example:

    SELECT regex_replace(‘test’, ”, ‘test test test’)

    will return unmodified original, which is wrong result.

  16. Despite your protestations, this function in its current form does not, replace a regular expression. It replaces all instances of single letters that match a regular expression, which is a very different thing.

    SET ch = SUBSTRING(original,i,1);
    IF NOT ch REGEXP pattern THEN

    Those two lines extract a single character, test that single character, and replace only that one character.

    As I understand it, (ans as my testing below shows) it is impossible, using this, to replace a string of more than one character.

    Here are some examples to explain the problem. In these examples, the goal is to replace the “foo”s in “foo1 foo2 foo3″ with “bar”s, to get “bar1 bar2 bar3″.

    ===

    The simplest regex, which should work, will not:
    SELECT regex_replace(‘foo’, ‘bar’, ‘foo1 foo2 foo3’);
    Output: foo1 foo2 foo3

    In PHP, which has regex replacements built in, it comes out correctly:
    echo preg_replace(‘/foo/’, ‘bar’, ‘foo1 foo2 foo3’);
    Output: bar1 bar2 bar3

    ===

    Just to show that the ‘/’ required by php is not relevant here:
    SELECT regex_replace(‘/foo/’, ‘bar’, ‘foo1 foo2 foo3′);
    Output: foo1 foo2 foo3

    ===

    Here are some more complex regexes, that illustrate the issue hopefully a little more clearly:

    SELECT regex_replace(‘[^0-9 ]+’, ‘bar’, ‘foo1 foo2 foo3′);
    Output: barbarbar1 barbarbar2 barbarbar3

    The SQL function is acting here like the regex for:
    [^0-9 ](?:[^0-9 ]*)

    In PHP:
    php -r “echo preg_replace(‘/[^0-9 ]+/’, ‘bar’, ‘foo1 foo2 foo3′);”
    Output: bar1 bar2 bar3

    ===

    SELECT regex_replace(‘fo*’, ‘bar’, ‘foo foo foo’);
    Output: baroo1 baroo2 baroo3

    The SQL function is acting here like the regex for:
    f(?:o*)

    In PHP:
    SELECT regex_replace(‘fo*’, ‘bar’, ‘foo foo foo’);
    Output: bar1 bar2 bar3

    Clearly illustrates how the SQL function matches only a single character, but the regular expression should match the whole string.

    ===

    The following, which is how you use it in your examples, is a character set (http://www.regular-expressions.info/charclass.html). It works, as a character set specifying a single character, but does not do what we want:
    SELECT regex_replace(‘[foo]‘, ‘bar’, ‘foo1 foo2 foo3’);

    Output: barbarbar1 barbarbar2 barbarbar3
    …which is correct, and the same as PHP gives, but is not what we are after.

    I hope that these examples have helped clarify the issue.

  17. Hii,

    I want to modify this function to get the return value with a alias name just like -> SELECT regexp_replace(‘[E\'\\!:\\?\'\]‘, ‘\’\’)||\’ \’,P.description AS vals)

    How can I change the function. Highly appreciated

    Thanks.This post healed my great headache :)

  18. Jarek on said:

    Hi everybody,
    I wrote my own function because all previous didn’t work for me.
    Maybe it isn’t fast but it works like a hell for me.
    I also think it only works for one occurrence of pattern in string but that’s fine by me.

    It simply cuts from the right of the original till it match the pattern
    Then it does the same from the left.
    It leaves you only the part of the string that math pattern.
    Then simple replace.

    CREATE FUNCTION `regex_replace`(`original` vARCHAR(1000), `pattern` vARCHAR(1000), `replacement` vARCHAR(1000))
    RETURNS varchar(1000)
    LANGUAGE SQL
    NOT DETERMINISTIC
    NO SQL
    SQL SECURITY DEFINER
    COMMENT ”
    BEGIN
    declare temp varchar(1000);
    declare temp2 varchar(1000);
    declare ch varchar(1);
    declare i int;
    set i = 1;
    set temp = ”;
    set temp2 = ”;

    if original REGEXP pattern then
    loop_label: LOOP
    if (right(original,length(original)-i) REGEXP pattern)=0 then
    set temp=right(original,length(original)-i+1);
    LEAVE loop_label;
    end if;
    if (i>length(original)) then
    LEAVE loop_label;
    end if;
    set i = i+1;
    END LOOP;
    end if; #koniec ifa dla tego czy spelnia pattern
    set i=1;
    if temp” then
    if temp REGEXP pattern then
    loop_label2: LOOP
    if (left(temp,length(temp)-i) REGEXP pattern)=0 then
    set temp2=left(temp,length(temp)-i+1);
    LEAVE loop_label2;
    end if;
    if (i>length(temp)) then
    LEAVE loop_label2;
    end if;
    set i = i+1;
    END LOOP;
    end if; #koniec ifa dla tego czy spelnia pattern
    end if; # koniec ifa na wszelki wypadek

    if temp2=” then
    return oryginal;
    else
    return replace(original,temp2,replacement);
    end if;

    END

  19. Thank you for the post — but PLEASE consider updating the css for your code blocks — impossible to read.

  20. Interesting idea, but I want to warn anyone who wants to use this on large data sets – the performance of this approach is very poor on long strings. It does thousands of regex comparisons until it finds the correct position of the replacement location. A potential improvement here would be to do a binary search instead, i.e. instead of doing i=i+1 start with length/2, select either the first or the second half, then repeat until length is < 2.

  21. Thanks a tonne bro.

  22. Thanks ! Was very useful

  23. Amit Shah on said:

    Hey Thanks for useful function. I have done my work very efficiently with the same !

  24. Thanks for the post! I cannot get the function to work (maybe because I’m using the code in the original post, not one of the iterations in the comments). Here’s my test:


    SET sql_mode='NO_BACKSLASH_ESCAPES';
    SELECT regex_replace(' AV ',' AVE ','GOOBER AV AT AVERY ST');
    SELECT regex_replace('[::]]','AVE','GOOBER AV AT AVERY ST');

    Does not produce the expected results:

    GOOBER AVE AT AVERY ST
    

    Can you point me to which code to use to make this work? (or update the original post)

  25. I still can’t use this function on a column. Anyone else getting this? I checked there’s no double quotes.

    select regex_replace(‘[^a-zA-Z0-9]‘,”,`user_login`);
    ERROR 1054 (42S22): Unknown column ‘user_login’ in ‘field list’

  26. I did some modifications so that I can use sort of a “last found” $_ flag and “replace only in the beginning” ^ operator:

    Examples of usage:

    # String containing letters, numbers and non-alphanumeric:
    set @str = ‘abacaxi – 1234′;

    # Replaces all non-alphanumerics for empty:
    set @str = regex_replace( ‘[^a-z0-9]‘, ”, @str);

    # Surrounds all leters with a ‘.’
    set @str = regex_replace( ‘[a-z]‘, ‘.$_.’, @str);

    #Corrects .. to .
    set @str = replace(@str, ‘..’, ‘.’);

    # Takes out the first dot ‘.’ if exists:
    set @str = regex_replace( ‘^\.’, ”, @str);

    # shows the result: (a.b.a.c.a.x.i.1234)
    select @str;

    drop function if exists regex_replace;
    DELIMITER $$
    CREATE FUNCTION `regex_replace`( pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))

    RETURNS VARCHAR(1000)
    DETERMINISTIC
    BEGIN
    DECLARE temp VARCHAR(1000);
    DECLARE ch VARCHAR(1);
    DECLARE replacement_2 varchar(1000);
    DECLARE i, onlyfirst INT;
    SET i = 1;
    SET onlyfirst = 0;
    SET temp = ”;
    IF original REGEXP pattern THEN

    # Sergio Abreu begin
    # Set a flag to indicate that should leave after the first check,
    # concatenating the remaining original chars to the first replaced char.
    IF INSTR( pattern, ‘^’) = 1 THEN
    set pattern = SUBSTR(pattern,2);
    set onlyfirst = 1;
    END IF;
    # Sergio Abreu end

    loop_label: LOOP
    IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;
    END IF;
    SET ch = SUBSTRING(original,i,1);
    IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
    ELSE
    IF INSTR(replacement , ‘$_’) > 0 THEN
    set replacement_2 = REPLACE( replacement, ‘$_’, ch);
    ELSE
    set replacement_2 = replacement;
    END IF;
    SET temp = CONCAT(temp,replacement_2);
    END IF;
    SET i=i+1;

    # Sergio Abreu begin
    # To remove or replace just the first occurence.
    # It will concatenate the original remaining chars.
    IF onlyfirst = 1 THEN
    SET temp = CONCAT( temp, SUBSTR( original, i));
    LEAVE loop_label;
    END IF;
    # Sergio Abreu end

    END LOOP;
    ELSE
    SET temp = original;
    END IF;
    RETURN temp;
    END$$
    DELIMITER ;

  27. hey thanks techras…

  28. Mitch Golden on said:

    This is not a real implementation of regexp, because it is going character by character. For example, try

    SELECT regex_replace(‘abc’, ‘xyz’, ‘rrrabcsss’);

    This should return ‘rrrxyzsss’ but it doesn’t. Only if the regexp matches a single character will it work:

    SELECT regex_replace(‘a’, ‘xyz’, ‘rrrabcsss’);

    returns

    ‘rrrxyzbcsss’

    There is a mysly implementation of regexp out there in C, but it is harder to use.

    • Mitch,

      You are more than welcome to post a better version of this function to run on MYSQL and do string replacement rather than character replacement. I honestly couldn’t think of a more generic way of replacing strings. I could slightly modify this function replace words.

      Please feel free to post your implementation, which could help the readers of this blog.

      Thanks,
      Techras

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: