Web scraping is a common practice today for fetching data from websites and storing it for analysis. When storing this data in a SQL Server database, you might have many HTML tags mixed in with your data. While these tags are essential for the formatting and structure of web pages, they can be an obstacle when analyzing the extracted data. In this blog post, we’ll explore a method for stripping these HTML tags from your text data using SQL Server.
Consider the following SQL table:
CREATE TABLE HTMLTable ( id int, htmlText varchar(max) );
Let’s say we’ve filled this table with some data extracted from web pages:
INSERT INTO HTMLTable VALUES (1,'<p>Hello <b>World</b>!</p>'), (2,'<div>Welcome to my <span>blog</span></div>'), (3,'Goodbye');
Now, if you select the data from this table, you’ll see that the
htmlText field contains a lot of HTML tags. If you want to analyze this text data, these HTML can be an obstacle.
To strip all the HTML tags from the HTML text in your
HTMLTable, you can create a user-defined function that uses a pattern to replace the HTML with an empty string.
Here is the function:
CREATE FUNCTION dbo.udf_StripHTML (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT SET @Start = CHARINDEX('<', @HTMLText) WHILE @Start > 0 AND @Start < LEN(@HTMLText) AND CHARINDEX('>', @HTMLText, @Start) > 0 BEGIN SET @End = CHARINDEX('>', @HTMLText, @Start) SET @Length = (@End - @Start) + 1 IF @Length > 0 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) END RETURN @HTMLText END; GO
This function loops through the HTML text, and for each text that starts with
< and ends with
>, it replaces it with an empty string. This effectively removes all HTML tags from the text.
You can then apply this function to your
HTMLTable like so:
UPDATE HTMLTable SET htmlText = dbo.udf_StripHTML(htmlText);
This updates the
htmlText column in your
HTMLTable, stripping all HTML from the text.
You can also check the data by running the following script:
SELECT * FROM HTMLTable
This method provides a simple way to clean your text data from HTML tags in SQL Server. However, it’s important to note that this function won’t decode HTML entities, and it won’t handle incorrectly formatted HTML tags. Moreover, regular expressions are not recommended for parsing HTML in a production environment, as there are many edge cases that regular expressions cannot handle correctly. For complex HTML parsing, consider using a dedicated HTML parser.
Source of the Function
dbo.udf_StripHTML is a user-defined function in SQL Server and is not built into the SQL Server system. It is a commonly used function in the SQL Server community for stripping HTML tags from a string and can be found in various SQL Server forums and blogs online.
Remember, the function should be used wisely and tested thoroughly in a controlled environment before applying it to production systems. Happy coding!
You can always reach out to me on Twitter.
Reference: Pinal Dave (https://blog.sqlauthority.com)