SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression

June 16, 2007

One of the developers at my company asked whether it is possible to parse HTML and retrieve only the text from it without using regular expressions. He wanted to remove everything between < and > and keep only the text. I found the question very interesting and quickly wrote a UDF that does not use regular expressions. Let us see how to parse HTML without regular expressions.

The following UDF takes HTML as input and returns only the text. If the HTML is passed as a string literal, any single quote in it must be written as two single quotes (not a double quote) before it is passed to the function.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    -- Locate the first '<' and the first '>' that follows it
    SET @Start = CHARINDEX('<', @HTMLText)
    SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0
        AND @End > 0
        AND @Length > 0
    BEGIN
        -- Remove the tag, then look for the next one
        SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
        SET @Start = CHARINDEX('<', @HTMLText)
        SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN LTRIM(RTRIM(@HTMLText))
END
GO
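
To see what the loop does on each pass, the same logic can be run as a standalone batch that prints the intermediate string. This is only a sketch: the sample input and the PRINT statement are illustrative additions and are not part of the function itself.

```sql
-- Standalone trace of the tag-removal loop (sample input is illustrative)
DECLARE @HTMLText VARCHAR(MAX)
DECLARE @Start INT
DECLARE @End INT
SET @HTMLText = '<p>Hello <b>World</b></p>'
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start)
WHILE @Start > 0 AND @End > 0
BEGIN
    -- Remove the tag spanning positions @Start through @End
    SET @HTMLText = STUFF(@HTMLText, @Start, @End - @Start + 1, '')
    PRINT @HTMLText
    SET @Start = CHARINDEX('<', @HTMLText)
    SET @End = CHARINDEX('>', @HTMLText, @Start)
END
-- Prints, one line per pass:
--   Hello <b>World</b></p>
--   Hello World</b></p>
--   Hello World</p>
--   Hello World
```

Each pass removes exactly one tag, so the number of iterations equals the number of tags in the input.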

Test the above function like this:

SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b>

<a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')

Result Set:

UDF at SQLAuthority.com SQLAuthority.com
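
The note about single quotes applies only to string literals typed into a query. A short sketch of both cases follows, assuming dbo.udf_StripHTML from this post already exists in the current database; the table variable @Pages and its rows are hypothetical and serve only as an illustration.

```sql
-- Assumes dbo.udf_StripHTML from this post has already been created.

-- In a string literal, a single quote inside the HTML is written as two single quotes.
SELECT dbo.udf_StripHTML('<p>It''s a <i>test</i></p>') AS PlainText

-- Values read from a column need no escaping. @Pages is a hypothetical table variable.
DECLARE @Pages TABLE (Id INT, Body VARCHAR(MAX))
INSERT INTO @Pages (Id, Body) VALUES (1, '<h1>Title</h1><p>It''s here</p>')
INSERT INTO @Pages (Id, Body) VALUES (2, 'No tags at all')

SELECT Id, dbo.udf_StripHTML(Body) AS PlainText
FROM @Pages
```

Note that the function removes tags without inserting anything in their place, so text from adjacent elements such as the heading and paragraph in row 1 is joined directly.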


Let me know what you think of this blog post by leaving a note in the comments section.

Reference: Pinal Dave (https://blog.sqlauthority.com)


95 Comments

Vinay Patil
September 24, 2014 12:47 pm
Hi Pinal,
I have Chinese text inside the HTML tags, but the above function does not work for Chinese characters.
Arun
February 18, 2016 5:37 pm
As I don't have write access, can I have the same code as a temporary procedure?
mm
July 22, 2016 10:27 pm
How do I get only "change abc contents" as output and trim the other HTML contents?
{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\fnil MS Sans Serif;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1\pard\cf1\lang1033\f0\fs16 change abc contents
\par \f1
\par }
RAM KARAD
December 18, 2016 5:19 pm
Very nice idea!
Thank you for sharing.
- Pinal Dave
  December 24, 2016 5:06 pm
  Thanks @Ram
Dave
January 6, 2017 11:58 pm
An issue…run against the following: ‘This number 4 is < this number 5. ‘
- Pinal Dave
  January 10, 2017 8:37 pm
  If there are characters that can be confused with HTML, then you need to handle them separately.
  - Dave
    January 10, 2017 10:20 pm
    For what little it's worth, here's a modified edition that will recognize a content (non-HTML) "<" symbol. Might be useful, or someone might provide a better method:
    DECLARE @String nvarchar(max) = N'AB<CD' -- Should yield 'AB<CD'
    DECLARE @Start INT,
            @NextStart INT,
            @End INT,
            @Length INT,
            @BeginSearch INT = 1
    SELECT @Start = CHARINDEX('<', @String, @BeginSearch)
    SELECT @End = CHARINDEX('>', @String, @Start)
    WHILE @Start > 0 AND @End > 0 -- Must have one of each
    BEGIN
        SELECT @NextStart = CHARINDEX('<', @String, @Start + 1)
        IF @NextStart > 0 AND @NextStart < @End -- Second "<" before ">"?
            SET @BeginSearch = @NextStart -- Skip this orphan "<"; resume with second "<"
        ELSE
        BEGIN
            SELECT @Length = (@End - @Start) + 1
            SELECT @String = STUFF(@String, @Start, @Length, '')
            SET @BeginSearch = @Start -- Text after the removed tag now begins here
        END
        SELECT @Start = CHARINDEX('<', @String, @BeginSearch)
        SELECT @End = CHARINDEX('>', @String, @Start)
    END
    SELECT @String
Jerry Field
March 28, 2017 1:27 am
How do I get around:
Msg 206, Level 16, State 2, Line 39
Operand type clash: table is incompatible with varchar(max)
- Pinal Dave
  March 29, 2017 6:42 am
  What's the data type? We need to find the query causing this. I generally use Profiler.
DMiller
September 27, 2017 6:29 pm
Awesome tidbit. Thanks a million!
Jim Sawyers
November 9, 2018 1:52 am
worked perfectly. thanks!
nicholas
January 16, 2019 2:47 am
Doesn't work for AWS Redshift; it kept giving me syntax errors regarding @. Any suggested modifications? I use Aginity Workbench for Redshift.
Amitesh sharma
July 1, 2021 12:53 pm
Hello Vinay, please change the data type from varchar to nvarchar as below:
ALTER FUNCTION [dbo].[udf_StripHTML] (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
Ben
December 29, 2021 7:12 pm
If you prefer a version that uses a recursive CTE instead of declaring variables and using a WHILE loop, I reorganized the functionality to do it this way.
WITH cte (n, HtmlText) AS (
    SELECT 0, CONVERT(nvarchar(MAX), N'some test htmlxxx')
    UNION ALL
    -- Each recursion step removes the first remaining tag
    SELECT n + 1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N'<', HtmlText), CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) - CHARINDEX(N'<', HtmlText) + 1, N''))
    FROM cte
    WHERE CHARINDEX(N'<', HtmlText) > 0
      AND CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) > 0
      AND CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) - CHARINDEX(N'<', HtmlText) + 1 > 0
)
SELECT TOP 1 HtmlText
FROM cte
ORDER BY n DESC
OPTION (MAXRECURSION 0) -- The default limit of 100 recursions would fail on input with more than 100 tags
John Dawson
April 14, 2022 5:31 pm
Thanks a bunch, by far the best solution I've found. A minor follow-up: I find the stripped text has a few HTML entities, like &nbsp;. Any tips to also remove those?
Joe Chang
June 8, 2022 9:58 pm
After removing the text bounded by < and >, we might want to remove the HTML keywords bounded by & and ;?
M
November 19, 2022 2:35 am
Can this be written to allow and ?
Azhar
July 12, 2023 3:46 am
Excellent sir it works for me.