One of the developers at my company asked is it possible to parse HTML and retrieve only TEXT from it without using regular expression. He wanted to remove everything between < and > and keep only Text. I found the question very interesting and quickly wrote UDF which does not use regular expression. Let us see how to parse HTML without regular expression.
Following UDF takes input as HTML and returns TEXT only. If there is any single quotes in HTML they should be replaced with two single quotes (not double quote) before it is passed as input to function.
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 WHILE @Start > 0 AND @End > 0 AND @Length > 0 BEGIN SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'') SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END GO
Test above function like this :
SELECT dbo.udf_StripHTML('<b>UDF at SQLAuthority.com </b> <a href="http://www.SQLAuthority.com">SQLAuthority.com</a>')
Result Set:
UDF at SQLAuthority.com SQLAuthority.com
If you want to see this example in action click on Image. It will open large image.
Let me know what think of blog post by leaving your note in the comment sections.
Reference: Pinal Dave (https://blog.sqlauthority.com)
95 Comments. Leave new
Hi Pinal,
I am having chinese text in the html tags… But the above function is not working for chinese characters.
As I don’t have write access, can I have the same code with temp procedure
how do i get only “change abc contents” as output and trim other html contents
{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\fnil MS Sans Serif;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1\pard\cf1\lang1033\f0\fs16 change abc contents
\par \f1
\par }
very Nice Idea !!
Thanks you for share
Thanks @Ram
An issue…run against the following: ‘This number 4 is < this number 5. ‘
if there are characters which can confuse HTML then you need to handle them separately.
For what little its worth, here’s a modified edition that will recognize a content (non-HTML) “<" symbol. Might be useful, or someone might provide a better method:
DECLARE @String nvarchar(max) = 'AB<CD’ –>Should yield ‘AB<CD'
DECLARE @Start INT,
@NextStart INT,
@End INT,
@Length INT,
@BeginSearch int = 0
SELECT @Start = CHARINDEX('<', @String)
SELECT @NextStart = CHARINDEX('’, @String, @Start)
IF @NextStart < @END –Second "”?
BEGIN
SET @Start = @NextStart –Skip this orphan “<"
SET @BeginSearch = @NextStart –Resume with second " 0 AND @END > 0 –Must have one of each
BEGIN
SELECT @Length = (@End – @Start) + 1
IF @Length > 0
BEGIN
SELECT @String = STUFF(@String, @Start, @Length, ”)
SET @BeginSearch = @BeginSearch – @Length
END
SELECT @Start = CHARINDEX(‘<', @String, @BeginSearch)
SELECT @NextStart = CHARINDEX('’, @String, @BeginSearch)
IF @NextStart < @END
BEGIN
SET @Start = @NextStart
SET @BeginSearch = @NextStart
END
ELSE
SET @BeginSearch = @END
END
SELECT @String
How do i get around:
Msg 206, Level 16, State 2, Line 39
Operand type clash: table is incompatible with varchar(max)
What’s the datatype? We need to find query causing this.. I generally use profiler.
Awesome tid bit.. Thanks a million…..
worked perfectly. thanks!
doesn’t work for AWS redshift, kept giving me syntax errors regarding @, any suggested modifications? i use aginity for redshift workbench
Hello Vinay please change the data type from varchar to nvarchar as below
ALTER FUNCTION [dbo].[udf_StripHTML] (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
If you prefer a version that uses a recursive CTE to declaring variables and having a WHILE loop, I reorganized the functionality to do it this way.
WITH cte (n, HtmlText) AS (
SELECT 0, CONVERT(nvarchar(MAX), N’some test htmlxxx’)
UNION ALL
SELECT N+1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N”, HtmlText, CHARINDEX(N'<', HtmlText)) – CHARINDEX(N'<', HtmlText) + 1, ''))
FROM cte
WHERE CHARINDEX(' 0
AND CHARINDEX(‘>’, HtmlText, CHARINDEX(‘ 0
AND CHARINDEX(‘>’, HtmlText, CHARINDEX(‘<', HtmlText)) – CHARINDEX(' 0
)
SELECT TOP 1 HtmlText
FROM cte
ORDER BY N DESC
Thanks a bunch, by far the best solution I’ve found; a minor follow-up: I find the stripped text has a few HTML entities, like   – any tips to also remove those?
after remove the text bounded by , we might want to rermove html key words bound by & and ; ?
Can this be written to allow and ?
Excellent sir it works for me.