SQL SERVER – Stripping HTML Tags from Text

SQL SERVER - Stripping HTML Tags from Text htmltags-800x464 Web scraping is a common practice today for fetching data from websites and storing it for analysis. When storing this data in a SQL Server database, you might have many HTML tags mixed in with your data. While these tags are essential for the formatting and structure of web pages, they can be an obstacle when analyzing the extracted data. In this blog post, we’ll explore a method for stripping these HTML tags from your text data using SQL Server.

The Problem

Consider the following SQL table:

CREATE TABLE HTMLTable
(
    id int,
    htmlText varchar(max)
);

Let’s say we’ve filled this table with some data extracted from web pages:

INSERT INTO HTMLTable VALUES
(1,'<p>Hello <b>World</b>!</p>'),
(2,'<div>Welcome to my <span>blog</span></div>'),
(3,'Goodbye');

Now, if you select the data from this table, you’ll see that the htmlText field contains a lot of HTML tags. If you want to analyze this text data, these HTML  can be an obstacle.

The Solution

To strip all the HTML tags from the HTML text in your HTMLTable, you can create a user-defined function that uses a pattern to replace the HTML with an empty string.

Here is the function:

CREATE FUNCTION dbo.udf_StripHTML
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT

    SET @Start = CHARINDEX('<', @HTMLText)
    
    WHILE @Start > 0
        AND @Start < LEN(@HTMLText)
        AND CHARINDEX('>', @HTMLText, @Start) > 0
    BEGIN
        SET @End = CHARINDEX('>', @HTMLText, @Start)
        SET @Length = (@End - @Start) + 1

        IF @Length > 0
            SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
        
        SET @Start = CHARINDEX('<', @HTMLText)
    END

    RETURN @HTMLText
END;
GO

This function loops through the HTML text, and for each text that starts with < and ends with >, it replaces it with an empty string. This effectively removes all HTML tags from the text.

You can then apply this function to your HTMLTable like so:

UPDATE HTMLTable
SET htmlText = dbo.udf_StripHTML(htmlText);

This updates the htmlText column in your HTMLTable, stripping all HTML  from the text.

You can also check the data by running the following script:

SELECT *
FROM HTMLTable

Conclusion

This method provides a simple way to clean your text data from HTML tags in SQL Server. However, it’s important to note that this function won’t decode HTML entities, and it won’t handle incorrectly formatted HTML tags. Moreover, regular expressions are not recommended for parsing HTML in a production environment, as there are many edge cases that regular expressions cannot handle correctly. For complex HTML parsing, consider using a dedicated HTML parser.

Source of the Function

The function dbo.udf_StripHTML is a user-defined function in SQL Server and is not built into the SQL Server system. It is a commonly used function in the SQL Server community for stripping HTML tags from a string and can be found in various SQL Server forums and blogs online.

Remember, the function should be used wisely and tested thoroughly in a controlled environment before applying it to production systems. Happy coding!

You can always reach out to me on Twitter.

Reference: Pinal Dave (https://blog.sqlauthority.com)

HTML, SQL Function
Previous Post
SQL SERVER – Query Store Feature for Secondary Replicas
Next Post
SQL SERVER – Modern Explicit JOIN Syntax – A Brief Note

Related Posts

1 Comment. Leave new

Leave a Reply