Counter of Festivals

Ashok Blog for SQL Learners and Beginners and Experts

Thursday, 3 January 2013

Understanding Full-Text Indexing in SQL Server

Understanding Full-Text Indexing in SQL Server:


Article 1:

SQL SERVER – 2008 – Creating Full Text Catalog and Full Text Search

Full Text Index helps to perform complex queries against character data.  These queries can include word or phrase searching. We can create a full-text index on a table or indexed view in a database. Only one full-text index is allowed per table or indexed view. The index can contain up to 1024 columns.This feature works with RTM (Ready to Manufacture) version of SQL Server 2008 and does not work on CTP (Community Technology Preview) versions.
To create an Index, follow the steps:
  1. Create a Full-Text Catalog
  2. Create a Full-Text Index
  3. Populate the Index
1) Create a Full-Text Catalog

Full – Text can also be created while creating a Full-Text Index in its Wizard.
2) Create a Full-Text Index
3) Populate the Index
As the Index Is created and populated, you can write the query and use in searching records on that table which provides better performance.
For Example,
We will find the Employee Records who has “Marking “in their Job Title.
FREETEXT( ) Is predicate used to search columns containing character-based data types. It will not match the exact word, but the meaning of the words in the search condition. When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches.
  • Separates the string into individual words based on word boundaries (word-breaking).
  • Generates inflectional forms of the words (stemming).
  • Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.
CONTAINS( ) is similar to the Freetext but with the difference that it takes one keyword to match with the records, and if we want to combine other words as well in the search then we need to provide the “and” or “or” in search else it will throw an error.
USE AdventureWorks2008
GO
SELECT BusinessEntityID, JobTitle
FROM HumanResources.Employee
WHERE FREETEXT(*, 'Marketing Assistant');
SELECT BusinessEntityID,JobTitle
FROM HumanResources.Employee
WHERE CONTAINS(JobTitle, 'Marketing OR Assistant');
SELECT BusinessEntityID,JobTitle
FROM HumanResources.Employee
WHERE CONTAINS(JobTitle, 'Marketing AND Assistant');
GO
Conclusion
Full text indexing is a great feature that solves a database problem, the searching of textual data columns for specific words and phrases in SQL Server databases. Full Text Index can be used to search words, phrases and multiple forms of word or phrase using FREETEXT() and CANTAINS() with “and” or “or” operators.




Article 2:

Microsoft has quietly been improving full-text indexing in SQL Server. It is time to take a good look at what it offers. Who better to give us that look than Robert Sheldon, in the first of a series.
The most commonly used indexes in a SQL Server database are clustered and nonclustered indexes that are organized in a B-tree structure. You can create these types of indexes on most columns in a table or a view, except those columns configured with large object (LOB) data types, such as text and varchar(max). Although this limitation is not a problem in many cases, there will be times when you’ll want to query such column types. However, without indexes defined on the columns, the query engine must perform a full table scan to locate the necessary data. But there is a solution—full-text search.
Full-text search refers to the functionality in SQL Server that supports full-text queries against character-based data. These types of queries can include words and phrases as well as multiple forms of a word or phrase. To support full-text queries, full-text indexes must be implemented on the columns referenced in the query. The columns can be configured with character data types (such as char and varchar) or with binary data types (such as varbinary and image). A full-text index is made up of word tokens that are derived from the text being indexed. For example, if the indexed text contains the phrase “tables can include indexes,” the full-text index would contain four tokens: “tables,” “can,” “include,” and “indexes.” Because the list of tokens can be easily searched, full-text queries can quickly locate the necessary records.
In this article, I explain how to implement full-text indexing in your SQL Server 2005 or 2008 database, and I provide a number of examples to demonstrate how this is done. For these examples, I used the following script to create and populate the ProductDocs table in the AdventureWorks2008 sample database in SQL Server 2008:

USE AdventureWorks2008  
GO 
 IF OBJECT_ID (N'ProductDocs', N'U') IS NOT NULL DROP TABLE ProductDocs  
GO 
 CREATE TABLE ProductDocs (   DocID INT NOT NULL IDENTITY,   DocTitle NVARCHAR(50) NOT NULL,   DocFilename NVARCHAR(400) NOT NULL,   FileExtension NVARCHAR(8) NOT NULL,   DocSummary NVARCHAR(MAX) NULL,   DocContent VARBINARY(MAX) NULL,   CONSTRAINT [PK_ProductDocs_DocID] PRIMARY KEY CLUSTERED (DocID ASC) )  
GO 
 INSERT INTO ProductDocs (DocTitle, DocFilename, FileExtension, DocSummary, DocContent) SELECT Title, FileName, FileExtension, DocumentSummary, Document FROM Production.Document 
GO
If you’re using the AdventureWorks sample database in SQL Server 2005, simply change the name of the database in the USE statement.
To implement full-text indexing in SQL Server, you should take the following steps:
  1. Create a full-text catalog, if necessary.
  2. Create the full-text index.
  3. Modify the list of noise words (SQL Server 2005) or stop words (SQL Server 2008), if necessary.
  4. Modify the thesaurus for the language being used, if necessary.
These steps are the same in both SQL Server 2005 and 2008, although the specific features supported in each version differ. Part of that difference results from the fact that full-text search is now integrated in the SQL Server 2008 database engine. Prior to 2008, full-text search was based on the MSSearch engine, which sits outside SQL Server and is part of the Microsoft Office group.

Creating the Full-Text Catalog

A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:
  • SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
  • SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.
To create a full-text catalog in either version of SQL Server, you can use the CREATE FULLTEXT CATALOG statement, as shown in the following example:
USE AdventureWorks2008  
GO  
CREATE FULLTEXT CATALOG ProductFTS
WITH ACCENT_SENSITIVITY = OFF
Full-text catalogs are associated with specific databases. In this case, I added the catalog to the AdventureWorks2008 database, but you can create the catalog on any user-defined database.
The only required clause in a CREATE FULLTEXT CATALOG statement is the CREATE FULLTEXT CATALOG clause, which requires that you provide a name for the catalog (in this case, ProductFTS). You can also specify whether this catalog is the default catalog, the catalog owner, or, as in the example above, whether the indexes in the catalog are accent sensitive or insensitive.
If the WITH ACCENT_SENSITIVITY clause is not specified, accent sensitivity is based on the default database collation. You can determine a database’s collation by querying the sys.databases catalog view, as shown in the following example:
SELECT name, collation_name FROM sys.databases
WHERE name = 'AdventureWorks2008'
The follow table shows the statement’s results:
Name
collation_name
AdventureWorks2008
SQL_Latin1_General_CP1_CI_AS
Note: The results shown above, along with the results shown for other examples in this article, are based on queries issued in SQL Server 2008 against the AdventureWorks2008 database. You results may be different depending on the SQL Server version and the database used.
As mentioned above, SQL Server 2005 full-text catalogs are part of a filegroup. If you’re running the CREATE FULLTEXT CATALOG statement against SQL Server 2005, you can specify the name of the filegroup by using the ON FILEGROUP option. You can also specify the directory on which to store the catalog by using the IN PATH option. Both of these options are ignored in SQL Server 2008. For details about the ON FILEGROUP and IN PATH options, see the topic “CREATE FULLTEXT CATALOG (Transact-SQL)” in SQL Server 2005 Books Online.
After you create a full-text catalog, you can use the sys.fulltext_catalogs catalog view to verify that the catalog has been created:
SELECT fulltext_catalog_id, name FROM sys.fulltext_catalogs
In this case, the statement returns the results shown in the following table:
fulltext_catalog_id
name
5
AW2008FullTextCatalog
10
ProductFTS
Notice that the ProductFTS full-text catalog is included in the results. Also notice that in this case the fulltext_catalog_id value is 10. The ID is a good number to note because you can use it in other statements that are related to the full-text catalog.

Creating the Full-Text Index

After you create your full-text catalog, you’re ready to create your full-text index. You can then associate the index with the new catalog. If you don’t specify a catalog when you create the index, the index is associated with the database’s default catalog, whether it is the system catalog or a user-defined catalog that has been configured as the default.
A full-text index is defined at the table level, and only one full-text index can be defined per table. For a table to support a full-text index, a unique index must be defined on that table. In addition, the index must be defined on a single column and be non-nullable. This column is referred to as the key index in the full-text index definition. For best performance, the key index should be defined on a column configured with an integer data type. Often, the primary key is a good candidate for a key index.
You can define a full-text index by using the CREATE FULLTEXT INDEX statement, as shown in the following example.
CREATE FULLTEXT INDEX ON ProductDocs (DocSummary, DocContent TYPE COLUMN FileExtension LANGUAGE 1033) KEY INDEX PK_ProductDocs_DocID ON ProductFTS
WITH STOPLIST = SYSTEM
The first line of the statement includes the ON clause, which specifies the table name (in this case, ProductDocs). The statement’s next line is a list of the columns that should be indexed (DocSummary and DocContent).
Notice that the DocContent column definition includes the TYPE COLUMN clause and the LANGUAGE clause. The TYPE COLUMN clause is necessary when the indexed column stores binary data. The TYPE COLUMN clause specifies the name of a different column in the table that stores the file extension for the binary data. For example, the binary data might be a .doc file or .xls file. SQL Server uses the column specified in the TYPE COLUMN clause to associate the binary data with the program, such as Word or Excel.
You can use the sys.fulltext_document_types catalog view to return a list of the document types supported by full-text search, as shown in the following statement:
SELECT document_type, version, manufacturer
FROM sys.fulltext_document_types
The following table provides a sample of some of the document types (indicated by their file extension) supported by full-text search:
document_type
Version
.ascx
12.0.6828.0
.asm
12.0.6828.0
.asp
12.0.6828.0
.aspx
12.0.6828.0
.bat
12.0.6828.0
.c
12.0.6828.0
.cmd
12.0.6828.0
.cpp
12.0.6828.0
.cxx
12.0.6828.0
.def
12.0.6828.0
.dic
12.0.6828.0
.doc
2006.0.6001.16503
The second clause in the DocContent column definition is LANGUAGE. For each column that you include in your index, you can specify the language of the documents in that column. You reference the language by using the locale identifier (LCID). You can view a list of the identifiers and their associated languages by using the sys.fulltext_languages catalog view:
SELECT * FROM sys.fulltext_languages
ORDER BY lcid
The following table shows a part of the result set. As you can see, 1033, the ID I use in the LANGUAGE clause, is English.
lcid
Name
0
Neutral
1025
Arabic
1026
Bulgarian
1027
Catalan
1028
Traditional Chinese
1031
German
1033
English
1036
French
1037
Hebrew
1039
Icelandic
1040
Italian
The next line of the CREATE FULLTEXT INDEX statement in the preceding example is the KEY INDEX clause. This is the name of the unique key index (in this case, PK_ProductDocs_DocID) that is defined on the ProductDocs table. Be sure to specify the index name, and not the column name, when defining your full-text index.
Following the KEY INDEX clause in the full-text index definition is the ON clause, which specifies the name of the full-text catalog (ProductFTS) that the index will join. In SQL Server 2008, you can also specify a filegroup where the index will be stored. However, this option isn’t available in SQL Server 2005 because filegroup association is at the catalog level.
The final clause in the example CREATE FULLTEXT INDEX statement is WITH STOPLIST. This option, available only in SQL Server 2008, lets you specify the name of the stoplist that will be used for this index. In this case, the system stoplist is used, but you can instead specify a user-defined stoplist. (Stoplists are covered in more detail later in the article.)
After you create a full-text index on a table, you can use several catalog views to verify the index has been created and is associated with the correct catalog. The following SELECT statement joins the sys.tables, sys.fulltext_indexes, and sys.fulltext_catalogs views:
SELECT t.name AS TableName, c.name AS FTCatalogName FROM sys.tables t JOIN sys.fulltext_indexes i   ON t.object_id = i.object_id JOIN sys.fulltext_catalogs c
  ON i.fulltext_catalog_id = c.fulltext_catalog_id
As you can see in the following results, the ProductDocs table is associated with the ProductFTS catalog. Only tables with full-text indexes are listed in the results.
TableName
FTCatalogName
ProductReview
AW2008FullTextCatalog
Document
AW2008FullTextCatalog
JobCandidate
AW2008FullTextCatalog
ProductDocs
ProductFTS
One of the advantages of SQL Server 2008 is that the index is now stored within the database. That means you can issue a query that lists the contents of the index, something you cannot do in SQL Server 2005. The following SELECT statement uses the sys.dm_fts_index_keywords dynamic management function to return the list of terms stored in the full-text index created on the ProductDocs table:
SELECT display_term, column_id, document_count FROM sys.dm_fts_index_keywords
  (DB_ID('AdventureWorks2008'), OBJECT_ID('ProductDocs'))
The results returned by the statement include the indexed terms, along with the column ID and document count (number of rows) that contain the term. The column ID is based on the order the columns are defined in the table definition. In the ProductDocs table, the two indexed columns are DocSummary (the fifth column defined) and DocContent (the sixth column defined). The following table shows the first 25 terms stored in the ProductDocs full-text index.
display_term
column_id
document_count
100
5
1
100
6
1
1000
6
1
150
6
1
16
6
2
20
6
1
2000
6
1
20w
6
1
23ft
6
1
248
6
1
250
6
1
3000
6
1
44
6
1
48
6
1
500
5
1
500
6
1
618
6
1
619
6
1
620
6
1
Above
6
2
Absorbing
6
1
Acceptable
6
1
Accessories
6
1
Add
6
1
Adding
6
1

Modifying the List of Noise Words or Stop Words

When implementing full-text indexing in SQL Server, the area in which you will probably see the greatest differences between SQL Server 2005 and 2008 is in the way each version handles noise words or stop word.

Noise Words in SQL Server 2005

Noise words are those words that are automatically removed from a full-text index when that index is created. For example, in the phrase “an apple and an orange,” the words “an” and “and” are considered noise words and are not be included in the index. Only “apple” and “orange” are tokenized and added to the index.
SQL Server 2005 defines noise words in a set of text files, which by default, are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. Each text file is associated with a specific language and contains the noise words associated in that language. For example, the English noise words are stored in the noiseENG.txt file. The following list shows a sample of some of the words in the file:
about
1
after
2
all
also
3
an
4
and
5
another
6
any
7
are
8
as
9
at
0
be
$
because
been
before
being
between
both
but
by
came
can
come
could
did
do
You can edit a noise word file in any text editor. Simply add or remove words as necessary, and then save your changes. (Be sure to save a copy of the original file before changing it.)
Note: After you modify a noise word file, you must repopulate the applicable full-text indexes for the changes to take effect.

Stop Words in SQL Server 2008

SQL Server 2008 uses stop words, not noise words. Stop words are saved to stoplists, which are stored within SQL Server. You can create a stoplist by using a CREATE FULLTEXT STOPLIST statement, as shown in the following example:
CREATE FULLTEXT STOPLIST ProductSL
FROM SYSTEM STOPLIST;
In the first line, the statement creates a stoplist named ProductSL. The second line retrieves the stop words from the system stoplist and uses those stop words to populate the new stoplist. You can verify that the stoplist has been created by querying the sys.fulltext_stoplists catalog view:
SELECT stoplist_id, name FROM sys.fulltext_stoplists
As you can see in the following results, the ProductSL stoplist has been added to the database, and the stoplist ID is 5.
stoplist_id
Name
5
ProductSL
After you create a stoplist, you can view its contents by querying the sys.fulltext_stopwords catalog view, as shown in the following statement:
SELECT stopword FROM sys.fulltext_stopwords
WHERE stoplist_id = 5 AND language_id = 1033
Notice that I specified the stoplist ID and language ID to retrieve only the necessary content. The following table shows a partial list of the stop words saved to the ProductSL stoplist.
Stopword
About
After
All
Also
An
And
another
Any
Are
As
At
Be
because
been
before
being
between
both
But
By
came
Can
come
could
Did
Do
does
each
else
For
from
Get
Got
Had
Has
have
He
Her
here
Him
Himself
His
How
In SQL Server 2008, you can add words to or remove words from a stoplist by using the ALTER FULLTEXT STOPLIST statement. The following statement adds the word “nuts” to the stoplist:
ALTER FULLTEXT STOPLIST ProductSL
ADD 'nuts' LANGUAGE 1033;
When you generate a full-text index based on this stoplist, any occurrence of “nuts” will be treated as a stop word and be removed from the index.
SQL Server 2008 also includes the sys.dm_fts_parser dynamic management function. The function lets you test how SQL Server will tokenize a string based on a specific language and stoplist. In the following SELECT statement, the function parses the phrase “testing for fruit and nuts, any type of nut.”
SELECT special_term, display_term FROM sys.dm_fts_parser
  (' "testing for fruit and nuts, any type of nut" ', 1033, 5, 0)
The first argument in the function is the string that will be parsed, the second argument is the language ID, the third argument is the stoplist ID, and the fourth argument specifies whether the parsing should be accent insensitive (0) or accent sensitive (1). The following table shows the query results. Notice that “nuts” is considered a noise word.
special_term
display_term
Exact Match
Testing
Noise Word
For
Exact Match
Fruit
Noise Word
And
Noise Word
Nuts
Noise Word
Any
Exact Match
Type
Noise Word
Of
Exact Match
Nut
If you want to drop “nuts” from the stoplist, you can use the following statement:
ALTER FULLTEXT STOPLIST ProductSL
DROP 'nuts' LANGUAGE 1033;
Now when you use the sys.dm_fts_parser function to view the stoplist, you will receive the following results, which show that “nuts” is now considered an exact match.
special_term
display_term
Exact Match
Testing
Noise Word
For
Exact Match
Fruit
Noise Word
And
Exact Match
Nuts
Noise Word
Any
Exact Match
Type
Noise Word
Of
Exact Match
Nut
Earlier in the article, when I created the full-text index on the ProductDocs table, I specified that the index should use the system stoplist. I can now modify the index definition to instead use the ProductSL index. In the following ALTER FULLTEXT INDEX statement, I set the stoplist to ProductSL:
ALTER FULLTEXT INDEX ON ProductDocs
SET STOPLIST ProductSL
Note: After you modify a stoplist or full-text index definition, you must repopulate the applicable full-text indexes for the changes to take effect.

Modifying the Full-Text Thesaurus

Both SQL Server 2005 and SQL Server 2008 provide a set of XML thesaurus files that let you define synonyms to support full-text queries. For example, you can define a set of synonyms for “song,” “tune,” and “music.” That way, whenever a query is issued against any one of these terms, the results include every other term defined in the set.
SQL Server includes a thesaurus file for each language supported by full-text search. The files are named according to the language they support. For instance, the thesaurus file that supports English synonyms is named tseng.xml.
In a default installation of SQL Server 2005, the thesaurus files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. In a default installation of SQL Server 2008, the files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData.
Each thesaurus file comes preconfigured with the following XML code. For example, the following code is included in the tseng.xml file:
<XML ID="Microsoft Search Thesaurus">

<!--  Commented out (SQL Server 2008)

    <thesaurus xmlns="x-schema:tsSchema.xml">
   <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
As you can see, most of the code is commented out, but you can modify the file as necessary. The first element to note is <diacritics_sensitive>. This element determines whether the thesaurus file is accent sensitive (1) or insensitive (0). By default, a thesaurus file is accent insensitive.
The other two elements worth noting are <expansion> and <replacement>. The first lets you define expansion sets, and the second lets you define replacement sets:
  • Expansion set: A set of terms that can be substituted for each other. For example, in the sample <expansion> element in the tseng.xml file, three terms are included: “Internet Explorer,” “IE,” and “IE5.” As a result, if a full-text query includes the term “Internet Explorer,” the query results can include content that contains any of the three terms.
  • Replacement set: A set of terms in which one term is substituted for another. For example, in the example <replacement> element in the tseng.xml file, the terms “NT5” and “W2K” are replaced by “Windows 2000.” As a result, if a full-text query includes the term “NT5” or “W2K,” the search results include only content that contains “Windows 2000.”
You can edit a thesaurus file in any XML or text editor. After you save your changes, you must take steps to ensure that the thesaurus is applied to your full-text searches. In SQL Server 2005, this means you must restart the full-text search engine. However, in SQL Server 2008, you need only run the sys.sp_fulltext_load_thesaurus_file system stored procedure, as shown in the following example:
EXEC sys.sp_fulltext_load_thesaurus_file 1033
As you can see, when you run the stored procedure, you must specify the LCID. The stored procedure then parses and loads the data from the applicable thesaurus file, without having to restart the full-text engin

No comments:

Post a Comment