SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression

One of the developers at my company asked is it possible to parse HTML and retrieve only TEXT from it without using regular expression. He wanted to remove everything between < and > and keep only Text. I found the question very interesting and quickly wrote UDF which does not use regular expression. Let us see how to parse HTML without regular expression.

Following UDF takes input as HTML and returns TEXT only. If there is any single quotes in HTML they should be replaced with two single quotes (not double quote) before it is passed as input to function.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Test above function like this :

SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b>

<a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Result Set:

UDF at SQLAuthority.com SQLAuthority.com

If you want to see this example in action click on Image. It will open large image.

SQL SERVER - 2005 - UDF - User Defined Function to Strip HTML - Parse HTML - No Regular Expression HTMLUDF_S

Let me know what think of blog post by leaving your note in the comment sections.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Function, SQL Scripts, SQL Server, SQL String
Previous Post
SQL SERVER – sp_HelpText for sp_HelpText – Puzzle
Next Post
SQL SERVER – De-fragmentation of Database at Operating System to Improve Performance

Related Posts

95 Comments. Leave new

  • Hi Pinal,

    I am having chinese text in the html tags… But the above function is not working for chinese characters.

    Reply
  • As I don’t have write access, can I have the same code with temp procedure

    Reply
  • how do i get only “change abc contents” as output and trim other html contents

    {\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\fnil MS Sans Serif;}}
    {\colortbl ;\red0\green0\blue0;}
    \viewkind4\uc1\pard\cf1\lang1033\f0\fs16 change abc contents
    \par \f1
    \par }

    Reply
  • very Nice Idea !!

    Thanks you for share

    Reply
  • An issue…run against the following: ‘This number 4 is < this number 5. ‘

    Reply
    • if there are characters which can confuse HTML then you need to handle them separately.

      Reply
      • For what little its worth, here’s a modified edition that will recognize a content (non-HTML) “<" symbol. Might be useful, or someone might provide a better method:

        DECLARE @String nvarchar(max) = 'AB<CD’ –>Should yield ‘AB<CD'
        DECLARE @Start INT,
        @NextStart INT,
        @End INT,
        @Length INT,
        @BeginSearch int = 0

        SELECT @Start = CHARINDEX('<', @String)
        SELECT @NextStart = CHARINDEX('’, @String, @Start)
        IF @NextStart < @END –Second "”?
        BEGIN
        SET @Start = @NextStart –Skip this orphan “<"
        SET @BeginSearch = @NextStart –Resume with second " 0 AND @END > 0 –Must have one of each
        BEGIN
        SELECT @Length = (@End – @Start) + 1
        IF @Length > 0
        BEGIN
        SELECT @String = STUFF(@String, @Start, @Length, ”)
        SET @BeginSearch = @BeginSearch – @Length
        END
        SELECT @Start = CHARINDEX(‘<', @String, @BeginSearch)
        SELECT @NextStart = CHARINDEX('’, @String, @BeginSearch)
        IF @NextStart < @END
        BEGIN
        SET @Start = @NextStart
        SET @BeginSearch = @NextStart
        END
        ELSE
        SET @BeginSearch = @END
        END

        SELECT @String

  • How do i get around:
    Msg 206, Level 16, State 2, Line 39
    Operand type clash: table is incompatible with varchar(max)

    Reply
  • Awesome tid bit.. Thanks a million…..

    Reply
  • worked perfectly. thanks!

    Reply
  • doesn’t work for AWS redshift, kept giving me syntax errors regarding @, any suggested modifications? i use aginity for redshift workbench

    Reply
  • Amitesh sharma
    July 1, 2021 12:53 pm

    Hello Vinay please change the data type from varchar to nvarchar as below

    ALTER FUNCTION [dbo].[udf_StripHTML] (@HTMLText NVARCHAR(MAX))
    RETURNS NVARCHAR(MAX)

    Reply
  • If you prefer a version that uses a recursive CTE to declaring variables and having a WHILE loop, I reorganized the functionality to do it this way.

    WITH cte (n, HtmlText) AS (
    SELECT 0, CONVERT(nvarchar(MAX), N’some test htmlxxx’)

    UNION ALL

    SELECT N+1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N”, HtmlText, CHARINDEX(N'<', HtmlText)) – CHARINDEX(N'<', HtmlText) + 1, ''))
    FROM cte
    WHERE CHARINDEX(' 0
    AND CHARINDEX(‘>’, HtmlText, CHARINDEX(‘ 0
    AND CHARINDEX(‘>’, HtmlText, CHARINDEX(‘<', HtmlText)) – CHARINDEX(' 0
    )
    SELECT TOP 1 HtmlText
    FROM cte
    ORDER BY N DESC

    Reply
  • Thanks a bunch, by far the best solution I’ve found; a minor follow-up: I find the stripped text has a few HTML entities, like &nbsp – any tips to also remove those?

    Reply
  • after remove the text bounded by , we might want to rermove html key words bound by & and ; ?

    Reply
  • Can this be written to allow and ?

    Reply
  • Excellent sir it works for me.

    Reply

Leave a Reply