SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression

One of the developers at my company asked is it possible to parse HTML and retrieve only TEXT from it without using regular expression. He wanted to remove everything between < and > and keep only Text. I found the question very interesting and quickly wrote UDF which does not use regular expression. Let us see how to parse HTML without regular expression.

Following UDF takes input as HTML and returns TEXT only. If there is any single quotes in HTML they should be replaced with two single quotes (not double quote) before it is passed as input to function.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Test above function like this :

SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b>

<a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Result Set:

UDF at SQLAuthority.com SQLAuthority.com

If you want to see this example in action click on Image. It will open large image.

SQL SERVER - 2005 - UDF - User Defined Function to Strip HTML - Parse HTML - No Regular Expression HTMLUDF_S

Let me know what think of blog post by leaving your note in the comment sections.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Function, SQL Scripts, SQL Server, SQL String
Previous Post
SQL SERVER – sp_HelpText for sp_HelpText – Puzzle
Next Post
SQL SERVER – De-fragmentation of Database at Operating System to Improve Performance

Related Posts

95 Comments. Leave new

  • Wilson Gunanithi . J
    June 16, 2007 9:00 pm

    Hi ,

    Really fantastic and jore hey, but I tried this in sql server query analyzer…

    I got the errors like this :

    Server: Msg 170, Level 15, State 1, Procedure udf_StripHTML, Line 2
    Line 2: Incorrect syntax near ‘MAX’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 11
    Must declare the variable ‘@HTMLText’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 12
    Must declare the variable ‘@HTMLText’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 17
    Must declare the variable ‘@HTMLText’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 18
    Must declare the variable ‘@HTMLText’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 19
    Must declare the variable ‘@HTMLText’.
    Server: Msg 137, Level 15, State 1, Procedure udf_StripHTML, Line 22
    Must declare the variable ‘@HTMLText’.

    Send the remedy by query

    With regards…

    Wilson Gunanthi . J

    Reply
  • This code is interesting. How would one go about using this as a procedure instead of a function ?

    Reply
  • Wilson Gunanithi . J
    you are getting an error coz of copy paste

    you need to replace the ‘ in your text when you copy paste.

    Reply
  • oops you are using sql server 2000 and he is using 2005
    so you cannot use varchar(max) which is only for 2005.

    Replace varchar(max) with varchar(8000) and you will get rid of errors.

    sometimes the ‘ marks are not copied properly thats what I thought initially. please who post these articles should point out for which version this is compatible with.

    Reply
  • Does this code allow for HTML characters like the non-breaking space? &nbsp and/or other HTML encoded characters ?

    Thanks
    Tim

    Reply
  • I tried the modifyed code from Lazy coder’s site and the result shows the CHAR(13) + CHAR(10) in the data instead of a BR.. any way to change that?

    Reply
  • Hi,

    I wanted to know, if I’m inserting this record from this

    control, how thw spaces are getting also inserted.

    Regards,
    SAROJ

    Reply
  • Hi,

    I wish to return a table.

    how do i go about it ??

    Reply
  • Hi,

    How can I use this procedure to insert multiple html files in my existing table column.

    Regards
    Bobby

    Reply
  • Fantastic!!!I wish I could be your student..

    Reply
  • Hi Dave,

    Thanks for nice example. I’ve also as a Lazy Coder wanted just a little bit more from the function. Looking into his code – seems he’s got small mistake – replacing < and > before actually stripping the HTML tags out will result in a different result then it could actually be, i.e.:
    ‘Check out the content: And here should be the end’
    and
    ‘Check out the content: And here should be the end’

    Here is my version – a bit modified, but does same functionality:

    CREATE FUNCTION [dbo].[uf_strip_html] (
    @stringWithHTML varchar(max)
    ) RETURNS varchar(max)
    AS
    BEGIN
    /* first part is to replace HTML entities with appropriate (for my needs) characters */
    — replacing & with &
    set @stringWithHTML = replace(@stringWithHTML, ‘&’, ‘&’);
    — replacing   with space
    set @stringWithHTML = replace(@stringWithHTML, ‘ ‘, ‘ ‘);
    — replacing with space – I don’t need to count these ones
    set @stringWithHTML = replace(@stringWithHTML, ”, ‘ ‘);

    declare @start int, @end int;
    set @start = charindex(”, @stringWithHTML);

    — by now should already have the text with almost clean html tags and need to exctract content between those
    while (@start > 0 and @end > 0)
    begin
    set @stringWithHTML = stuff(@stringWithHTML, @start, @end – @start + 1, ”);
    set @start = charindex(”, @stringWithHTML);
    end;

    — removing front/trailing spaces if any
    set @stringWithHTML = ltrim(rtrim(@stringWithHTML));

    /* replacing these entities just now in order not to strip those during text extraction */
    — replacing < with <
    set @stringWithHTML = replace(@stringWithHTML, ‘<‘, ‘
    set @stringWithHTML = replace(@stringWithHTML, ‘>’, ‘>’);

    RETURN @stringWithHTML;
    END

    Thanks again,
    Vlad

    Reply
  • P.S. Seems that the form for comment submission modified the characters within the quotes. I’ll try to include the text again – please update the previous posting if possible and remove that one.

    examples:
    `Check out the content: And here should be the end`
    `Check out the content: And here should be the end`

    and function:

    CREATE FUNCTION [dbo].[uf_strip_html] (
    @stingWithHTML varchar(max)
    ) RETURNS varchar(max)
    AS
    BEGIN
    /* first part is to replace HTML entities with appropriate (for my needs) characters */
    — replacing & with &
    set @stingWithHTML = replace(@stingWithHTML, `&`, `&`);
    — replacing   with space
    set @stingWithHTML = replace(@stingWithHTML, ` `, ` `);
    — replacing with space – I don`t need to count these ones
    set @stingWithHTML = replace(@stingWithHTML, “, ` `);

    declare @start int, @end int;
    set @start = charindex(“, @stingWithHTML);

    — by now should already have the text with almost clean html tags and need to exctract content between those
    while (@start > 0 and @end > 0)
    begin
    set @stingWithHTML = stuff(@stingWithHTML, @start, @end – @start + 1, “);
    set @start = charindex(“, @stingWithHTML);
    end;

    — removing front/trailing spaces if any
    set @stingWithHTML = ltrim(rtrim(@stingWithHTML));

    /* replacing these entities just now in order not to strip those during text extraction */
    — replacing < with <
    set @stingWithHTML = replace(@stingWithHTML, `<`, `
    set @stingWithHTML = replace(@stingWithHTML, `>`, `>`);

    RETURN @stingWithHTML;
    END
    go

    Reply
  • Hello. And Bye. :)

    Reply
  • Thanks for posting this function. Very useful.

    Reply
  • Priya Srinivas
    June 10, 2009 9:11 pm

    This does not strip &nbsp.

    Reply
    • add
      SET @HTMLText = REPLACE(@HTMLText,’ ‘,”)

      Right before this

      RETURN LTRIM(RTRIM(@HTMLText))

      Reply
  • Hi,

    Here i am update this query for removing style & script….

    Just replacing chars with blank between this TAG’s [] – [] & [<script *] – []

    Alter FUNCTION [dbo].[udf_StripHTML]
    (@HTMLText VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
    AS
    BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT

    SET @Start = CHARINDEX(”,@HTMLText)
    SET @End = (len(”)-1)+CHARINDEX(”,@HTMLText,CHARINDEX(”,@HTMLText))
    SET @Length = (@End – @Start) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
    BEGIN

    SET @HTMLText = STUFF(@HTMLText,@Start,@Length,”)
    SET @Start = CHARINDEX(”,@HTMLText)
    SET @End = (len(”)-1) + CHARINDEX(”,@HTMLText,CHARINDEX(”,@HTMLText))
    SET @Length = (@End – @Start) + 1
    END

    SET @Start = CHARINDEX(‘<script',@HTMLText)
    SET @End = (len('’)-1)+CHARINDEX(”,@HTMLText,CHARINDEX(‘ 0 AND @End > 0 AND @Length > 0
    BEGIN

    SET @HTMLText = STUFF(@HTMLText,@Start,@Length,”)
    SET @Start = CHARINDEX(‘<script',@HTMLText)
    SET @End = (len('’)-1) + CHARINDEX(”,@HTMLText,CHARINDEX(‘<script',@HTMLText))
    SET @Length = (@End – @Start) + 1
    END

    SET @Start = CHARINDEX('’,@HTMLText,CHARINDEX(‘ 0 AND @End > 0 AND @Length > 0
    BEGIN
    SET @HTMLText = STUFF(@HTMLText,@Start,@Length,”)
    SET @Start = CHARINDEX(”,@HTMLText,CHARINDEX(‘<',@HTMLText))
    SET @Length = (@End – @Start) + 1
    END

    RETURN LTRIM(RTRIM(@HTMLText))
    END
    GO

    Reply
  • B”H

    YESSSS! Thank you for making this available. Works like a charm!

    Reply
  • Thanks for the code.
    easiest way to do it :)

    Reply
  • Nice solution, i have been searching for something that remove the html code embbebed in a string, you usually think in a complex solution, i thougth to solve this using regular expresions, but this is a easy and beauty way to do it.

    Thanks and regards from Mexico.

    Reply
  • Thanks, saved me some time!

    Reply

Leave a Reply