SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression

One of the developers at my company asked is it possible to parse HTML and retrieve only TEXT from it without using regular expression. He wanted to remove everything between < and > and keep only Text. I found the question very interesting and quickly wrote UDF which does not use regular expression. Let us see how to parse HTML without regular expression.

Following UDF takes input as HTML and returns TEXT only. If there is any single quotes in HTML they should be replaced with two single quotes (not double quote) before it is passed as input to function.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Test above function like this :

SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b>

<a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Result Set:

UDF at SQLAuthority.com SQLAuthority.com

If you want to see this example in action click on Image. It will open large image.

SQL SERVER - 2005 - UDF - User Defined Function to Strip HTML - Parse HTML - No Regular Expression HTMLUDF_S

Let me know what think of blog post by leaving your note in the comment sections.

Reference: Pinal Dave (https://blog.sqlauthority.com)

Solarwinds
, , ,
Previous Post
SQL SERVER – sp_HelpText for sp_HelpText – Puzzle
Next Post
SQL SERVER – De-fragmentation of Database at Operating System to Improve Performance

Related Posts

89 Comments. Leave new

  • james mcfarland
    January 29, 2014 2:03 am

    Thanks Pinal, and lazycoders – I used the lazycoders version with a few typo fixes (See this StackOverflow post: – all of that together provided me with a great and easy-to-understand solution. Well done generous ones.

    Reply
  • Hi Pinal,

    I have taken a look at the LazyCoder’s SQL function. Could you please explain why can’t we just replace the BREAK or tags with a REPLACE? Why do we need to use a WHILE loop to achieve this replacement?

    Why this?
    – Replace any tags with a newline
    — Replace any tags with a newline
    SET @Start = CHARINDEX(”, @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End – @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ‘CHAR(13) + CHAR(10)’)
    SET @Start = CHARINDEX(”, @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End – @Start) + 1
    END

    Why not just?

    SET @HTMLText = REPLACE(@HTMLText, ”, ‘CHAR(13) + CHAR(10)’)

    Thanks,
    Sathish

    Reply
  • Hi,

    In database i have stored the values like (&gt, ;<, /p, >) this , unable to remove the tags. Any solution is there.

    Reply
  • Pinal, you are the best!!! Every time I have a question, you have the answer. You are becoming very popular at the office! Thank you!

    Reply
  • Hi Pinal,

    I am having chinese text in the html tags… But the above function is not working for chinese characters.

    Reply
  • As I don’t have write access, can I have the same code with temp procedure

    Reply
  • how do i get only “change abc contents” as output and trim other html contents

    {\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\fnil MS Sans Serif;}}
    {\colortbl ;\red0\green0\blue0;}
    \viewkind4\uc1\pard\cf1\lang1033\f0\fs16 change abc contents
    \par \f1
    \par }

    Reply
  • very Nice Idea !!

    Thanks you for share

    Reply
  • An issue…run against the following: ‘This number 4 is < this number 5. ‘

    Reply
    • if there are characters which can confuse HTML then you need to handle them separately.

      Reply
      • For what little its worth, here’s a modified edition that will recognize a content (non-HTML) “<" symbol. Might be useful, or someone might provide a better method:

        DECLARE @String nvarchar(max) = 'AB<CD’ –>Should yield ‘AB<CD'
        DECLARE @Start INT,
        @NextStart INT,
        @End INT,
        @Length INT,
        @BeginSearch int = 0

        SELECT @Start = CHARINDEX('<', @String)
        SELECT @NextStart = CHARINDEX('’, @String, @Start)
        IF @NextStart < @END –Second "”?
        BEGIN
        SET @Start = @NextStart –Skip this orphan “<"
        SET @BeginSearch = @NextStart –Resume with second " 0 AND @END > 0 –Must have one of each
        BEGIN
        SELECT @Length = (@End – @Start) + 1
        IF @Length > 0
        BEGIN
        SELECT @String = STUFF(@String, @Start, @Length, ”)
        SET @BeginSearch = @BeginSearch – @Length
        END
        SELECT @Start = CHARINDEX(‘<', @String, @BeginSearch)
        SELECT @NextStart = CHARINDEX('’, @String, @BeginSearch)
        IF @NextStart < @END
        BEGIN
        SET @Start = @NextStart
        SET @BeginSearch = @NextStart
        END
        ELSE
        SET @BeginSearch = @END
        END

        SELECT @String

  • How do i get around:
    Msg 206, Level 16, State 2, Line 39
    Operand type clash: table is incompatible with varchar(max)

    Reply
  • Awesome tid bit.. Thanks a million…..

    Reply
  • worked perfectly. thanks!

    Reply
  • doesn’t work for AWS redshift, kept giving me syntax errors regarding @, any suggested modifications? i use aginity for redshift workbench

    Reply

Leave a Reply

Menu