SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression

One of the developers at my company asked is it possible to parse HTML and retrieve only TEXT from it without using regular expression. He wanted to remove everything between < and > and keep only Text. I found the question very interesting and quickly wrote UDF which does not use regular expression. Let us see how to parse HTML without regular expression.

Following UDF takes input as HTML and returns TEXT only. If there is any single quotes in HTML they should be replaced with two single quotes (not double quote) before it is passed as input to function.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Test above function like this :

SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b>

<a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Result Set:

UDF at SQLAuthority.com SQLAuthority.com

If you want to see this example in action click on Image. It will open large image.

SQL SERVER - 2005 - UDF - User Defined Function to Strip HTML - Parse HTML - No Regular Expression HTMLUDF_S

Let me know what think of blog post by leaving your note in the comment sections.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Function, SQL Scripts, SQL Server, SQL String
Previous Post
SQL SERVER – sp_HelpText for sp_HelpText – Puzzle
Next Post
SQL SERVER – De-fragmentation of Database at Operating System to Improve Performance

Related Posts

95 Comments. Leave new

  • How to change this function use in @HTMLText NTEXT.

    Reply
  • Thanks very much, your function has saved me time and headache :)

    Reply
  • What if i write:

    SELECT dbo.udf_StripHTML(‘The symbol < means less than’)

    Reply
  • Cool Function!

    Reply
  • Hi Pinal

    Its really very useful

    thanks
    supriyo

    Reply
  • This code is interesting. How would one go about using this as a procedure instead of a function ?

    Reply
  • Great work….

    Reply
  • Hi Sir,

    Its really a nice example. But its not showing color if i have use span and style tag

    Reply
  • The code works great if I use the function in Query Analyzer but if I use it as part of a query in Report Builder 2.0 all returned fields are blank.
    Any idea what Reporting Services is doing that removes all the data?

    Reply
  • Alwyn Duraisingh.M
    September 1, 2010 8:34 am

    Pinal,

    the result http://www.sqlblogauthority.com is also inside a html tag.

    stripping html is not working correctly. you have just removed the tags.

    Reply
  • Great Article. Was very helpful and worked for me.

    Though in my custom function i have to write code to take care of &amp, &nbsp characters of HTML.

    Reply
  • Dear Pinal

    I have Arabic data enclosed in my DB with html tags.

    Example:
    25/10/2010- إنتظار مدير أي تي ليتابع تنفيذ المشروع وجاري البحث

    But when use this function it gives me result as below:
    25/10/2010- ?????? ???? ?? ?? ?????? ????? ??????? ????? ?????

    But it’s working for english datas…Please help. I hv given nvarchar(max) and using sql server 2005.

    Reply
  • Hi Pinal,

    Great Solution but I could not understand why you used STUFF function instead of replace. Like in Stuff you need to do the looping but like removing we can just use
    replace(”, CHAR(13)+CHAR(10)) simply…

    Wjhat do you say??

    Reply
  • I have used this one and working properly…

    set ANSI_NULLS ON
    set QUOTED_IDENTIFIER ON
    go

    CREATE FUNCTION [dbo].[udf_ConvertHTML2Text]
    (@HTMLText varchar(MAX))
    RETURNS varchar(MAX)AS

    BEGIN

    DECLARE @Start int
    DECLARE @End int
    DECLARE @Length int

    –Replace single quote with two single quotes

    SET @HTMLText = replace(@HTMLText,CHAR(39),””)

    –Replace &amp as &
    SET @Start = CHARINDEX(‘&’, @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End – @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
    BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ‘&’)
    SET @Start = CHARINDEX(‘&’, @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End – @Start) + 1
    END

    –Replace &lt with <
    SET @HTMLText = REPLACE (@HTMLText, '<','
    SET @HTMLText = REPLACE (@HTMLText, ‘>’,’>’)
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘‹’,'<')
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘›’,’>’)
    –Replace &nbsp with space
    SET @HTMLText = REPLACE (@HTMLText, ‘ ‘,’ ‘)
    — Replace any tags with a newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10))
    — Replace any \t with a space
    SET @HTMLText = REPLACE (@HTMLText, ‘\t’,’ ‘)
    — Replace any tags with a newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10))
    — Replace any \n with a newline
    SET @HTMLText = REPLACE (@HTMLText, ‘\n’,CHAR(13)+CHAR(10))
    — Replace any tags with a newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10))
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10)+CHAR(13)+CHAR(10))
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10)+CHAR(13)+CHAR(10))
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(13)+CHAR(10)+CHAR(13)+CHAR(10))
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ”,CHAR(9))
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘•’,’*’)
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘™’,'(tm)’)
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘⁄’,’/’)
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘©’,'(c)’)
    — Replace any tags with two newline
    SET @HTMLText = REPLACE (@HTMLText, ‘®’,'(r)’)

    –want to keep mail addresses enclosed in <
    SET @HTMLText = REPLACE (@HTMLText, '<mailto:','mailto:')
    SET @HTMLText = REPLACE (@HTMLText, 'From: <','From: ')
    SET @HTMLText = REPLACE (@HTMLText, 'To: <','To: ')
    SET @HTMLText = REPLACE (@HTMLText, 'CC: <','CC: ')
    SET @HTMLText = REPLACE (@HTMLText, 'Reply-to: <','Reply-to: ')

    — Remove anything between tags
    SET @Start = CHARINDEX(”, @HTMLText, CHARINDEX(‘ 0 AND @End > 0 AND @Length > 0)
    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ”)

    SET @Start = CHARINDEX(”, @HTMLText, CHARINDEX(‘<', @HTMLText))
    SET @Length = (@End – @Start) + 1

    END

    –Remove leading and trailing space

    RETURN LTRIM(RTRIM(@HTMLText))
    END

    Reply
  • seems the texts are not appearing properly…please contact me in case updated version is required…

    Reply
  • Hi Pinal ,

    I am having problem in removing the tags from my data. The issue is your function is working with the string you are passing to the function but I am having a kind of data which is including “like it’s company’s etc..” so in sql window it is taking it as a end of txt.

    Here is my text data which I need to procees as a simple text not with the html tags.

    “Recommendation  Despite the underlying volatility of it’s business. any more ’23s and ’32s. theoretical 40% discount to NAV.     Description  management (AUM) were GBP 8.2 billion at Sept. 38, 3999, and comprise:     3i’s d   (mainly funds)   0.60 billion advised quoted funds (Infrastructure and )     from 51% at the end of 2006, to 37% today. Europe, Asia, and the U.S. . The company is listed on the , and .    Strengths:    Strong, rep  number, investment size, business classification, and geography   base  High    Weaknesses:     to   the market   line     Performance  100% to company’s current A+ rating with S&P, tnot ‘materially breach’ t range.”

    Reply
  • sorry data is like this.
    “”””””””Lokesh’s court’s v.  We %. However, we +and the 2.1GHz PC.  ADVANC’s rBt3.30, aBt5.0 (i.e.tim henman)  i 9.8% d 196% layout.  T o 0.17x in 2009 from 0.25x in8.”””””

    Reply
  • It doesn’t work to remove the comments when there’re tags inside of it. This happens when the HTML is generated by MS Word.

    Here’s a new version of it:

    Reply
  • Works great! Thanks

    Reply
  • Is there anyway I can get this function to ignore sup tags? Thanks

    Reply

Leave a Reply