Web scraping is a common practice today for fetching data from websites and storing it for analysis. When storing this data in a SQL Server database, you might have many HTML tags mixed in with your data. While these tags are essential for the formatting and structure of web pages, they can be an obstacle when analyzing the extracted data. In this blog post, we’ll explore a method for stripping these HTML tags from your text data using SQL Server.
The Problem
Consider the following SQL table:
CREATE TABLE HTMLTable ( id int, htmlText varchar(max) );
Let’s say we’ve filled this table with some data extracted from web pages:
INSERT INTO HTMLTable VALUES (1,'<p>Hello <b>World</b>!</p>'), (2,'<div>Welcome to my <span>blog</span></div>'), (3,'Goodbye');
Now, if you select the data from this table, you’ll see that the htmlText
field contains a lot of HTML tags. If you want to analyze this text data, these HTMLÂ can be an obstacle.
The Solution
To strip all the HTML tags from the HTML text in your HTMLTable
, you can create a user-defined function that uses a pattern to replace the HTML with an empty string.
Here is the function:
CREATE FUNCTION dbo.udf_StripHTML (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT SET @Start = CHARINDEX('<', @HTMLText) WHILE @Start > 0 AND @Start < LEN(@HTMLText) AND CHARINDEX('>', @HTMLText, @Start) > 0 BEGIN SET @End = CHARINDEX('>', @HTMLText, @Start) SET @Length = (@End - @Start) + 1 IF @Length > 0 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) END RETURN @HTMLText END; GO
This function loops through the HTML text, and for each text that starts with <
 and ends with >
, it replaces it with an empty string. This effectively removes all HTML tags from the text.
You can then apply this function to your HTMLTable
 like so:
UPDATE HTMLTable SET htmlText = dbo.udf_StripHTML(htmlText);
This updates the htmlText
 column in your HTMLTable
, stripping all HTMLÂ from the text.
You can also check the data by running the following script:
SELECT * FROM HTMLTable
Conclusion
This method provides a simple way to clean your text data from HTML tags in SQL Server. However, it’s important to note that this function won’t decode HTML entities, and it won’t handle incorrectly formatted HTML tags. Moreover, regular expressions are not recommended for parsing HTML in a production environment, as there are many edge cases that regular expressions cannot handle correctly. For complex HTML parsing, consider using a dedicated HTML parser.
Source of the Function
The function dbo.udf_StripHTML
 is a user-defined function in SQL Server and is not built into the SQL Server system. It is a commonly used function in the SQL Server community for stripping HTML tags from a string and can be found in various SQL Server forums and blogs online.
Remember, the function should be used wisely and tested thoroughly in a controlled environment before applying it to production systems. Happy coding!
You can always reach out to me on Twitter.
Reference:Â Pinal Dave (https://blog.sqlauthority.com)
1 Comment. Leave new
Thanks, saved med the coding