Understanding Full-Text Indexing in SQL Server:
Article 1:SQL SERVER – 2008 – Creating Full Text Catalog and Full Text Search
September 5, 2008 by pinaldave
Full Text Index helps to perform
complex queries against character data. These queries can include word
or phrase searching. We can create a full-text index on a table or
indexed view in a database. Only one full-text index is allowed per
table or indexed view. The index can contain up to 1024 columns.This feature works with RTM (Ready to
Manufacture) version of SQL Server 2008 and does not work on CTP
(Community Technology Preview) versions.
To create an Index, follow the steps:
- Create a Full-Text Catalog
- Create a Full-Text Index
- Populate the Index
1) Create a Full-Text Catalog
Full – Text can also be created while creating a Full-Text Index in its Wizard.
2) Create a Full-Text Index
3) Populate the Index
As the Index Is created and populated,
you can write the query and use in searching records on that table which
provides better performance.
For Example,
We will find the Employee Records who has “Marking “in their Job Title.
FREETEXT( ) Is
predicate used to search columns containing character-based data types.
It will not match the exact word, but the meaning of the words in the
search condition. When FREETEXT is used, the full-text query engine
internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches.
- Separates the string into individual words based on word boundaries (word-breaking).
- Generates inflectional forms of the words (stemming).
- Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.
CONTAINS( ) is
similar to the Freetext but with the difference that it takes one
keyword to match with the records, and if we want to combine other words
as well in the search then we need to provide the “and” or “or” in
search else it will throw an error.
USE AdventureWorks2008
GO
SELECT BusinessEntityID, JobTitle
FROM HumanResources.Employee
WHERE FREETEXT(*, 'Marketing Assistant');
SELECT BusinessEntityID,JobTitle
FROM HumanResources.Employee
WHERE CONTAINS(JobTitle, 'Marketing OR Assistant');
SELECT BusinessEntityID,JobTitle
FROM HumanResources.Employee
WHERE CONTAINS(JobTitle, 'Marketing AND Assistant');
GO
Conclusion
Full text indexing is a great feature
that solves a database problem, the searching of textual data columns
for specific words and phrases in SQL Server databases. Full Text Index
can be used to search words, phrases and multiple forms of word or
phrase using FREETEXT() and CANTAINS() with “and” or “or” operators.
Article 2:
Microsoft has quietly been improving full-text indexing
in SQL Server. It is time to take a good look at what it offers. Who
better to give us that look than Robert Sheldon, in the first of a
series.
The most commonly used indexes in a SQL Server database
are clustered and nonclustered indexes that are organized in a B-tree
structure. You can create these types of indexes on most columns in a
table or a view, except those columns configured with large object (LOB)
data types, such as text and varchar(max). Although this limitation is
not a problem in many cases, there will be times when you’ll want to
query such column types. However, without indexes defined on the
columns, the query engine must perform a full table scan to locate the
necessary data. But there is a solution—full-text search.
Full-text search refers to the functionality in SQL Server that
supports full-text queries against character-based data. These types of
queries can include words and phrases as well as multiple forms of a
word or phrase. To support full-text queries, full-text indexes must be
implemented on the columns referenced in the query. The columns can be
configured with character data types (such as char and varchar) or with
binary data types (such as varbinary and image). A full-text index is
made up of word tokens that are derived from the text being indexed. For
example, if the indexed text contains the phrase “tables can include
indexes,” the full-text index would contain four tokens: “tables,”
“can,” “include,” and “indexes.” Because the list of tokens can be
easily searched, full-text queries can quickly locate the necessary
records.In this article, I explain how to implement full-text indexing in your SQL Server 2005 or 2008 database, and I provide a number of examples to demonstrate how this is done. For these examples, I used the following script to create and populate the ProductDocs table in the AdventureWorks2008 sample database in SQL Server 2008:
USE AdventureWorks2008
GO
IF OBJECT_ID (N'ProductDocs', N'U') IS NOT NULL
DROP TABLE ProductDocs
GO
CREATE TABLE ProductDocs (
DocID INT NOT NULL IDENTITY,
DocTitle NVARCHAR(50) NOT NULL,
DocFilename NVARCHAR(400) NOT NULL,
FileExtension NVARCHAR(8) NOT NULL,
DocSummary NVARCHAR(MAX) NULL,
DocContent VARBINARY(MAX) NULL,
CONSTRAINT [PK_ProductDocs_DocID] PRIMARY KEY CLUSTERED (DocID ASC)
)
GO
INSERT INTO ProductDocs
(DocTitle, DocFilename, FileExtension, DocSummary, DocContent)
SELECT Title, FileName, FileExtension, DocumentSummary, Document
FROM Production.Document
GO
If you’re using the AdventureWorks sample database in SQL Server 2005, simply change the name of the database in the USE statement.To implement full-text indexing in SQL Server, you should take the following steps:
- Create a full-text catalog, if necessary.
- Create the full-text index.
- Modify the list of noise words (SQL Server 2005) or stop words (SQL Server 2008), if necessary.
- Modify the thesaurus for the language being used, if necessary.
Creating the Full-Text Catalog
A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:- SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
- SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.
USE AdventureWorks2008
GO
CREATE FULLTEXT CATALOG ProductFTS
Full-text catalogs are associated with specific databases. In this case, I added the catalog to the AdventureWorks2008 database, but you can create the catalog on any user-defined database.
WITH ACCENT_SENSITIVITY = OFF
The only required clause in a CREATE FULLTEXT CATALOG statement is the CREATE FULLTEXT CATALOG clause, which requires that you provide a name for the catalog (in this case, ProductFTS). You can also specify whether this catalog is the default catalog, the catalog owner, or, as in the example above, whether the indexes in the catalog are accent sensitive or insensitive.
If the WITH ACCENT_SENSITIVITY clause is not specified, accent sensitivity is based on the default database collation. You can determine a database’s collation by querying the sys.databases catalog view, as shown in the following example:
SELECT name, collation_name FROM sys.databases
The follow table shows the statement’s results:
WHERE name = 'AdventureWorks2008'
Name
|
collation_name
|
AdventureWorks2008
|
SQL_Latin1_General_CP1_CI_AS
|
Note: The results shown above, along with the
results shown for other examples in this article, are based on queries
issued in SQL Server 2008 against the AdventureWorks2008 database. You
results may be different depending on the SQL Server version and the
database used.
After you create a full-text catalog, you can use the sys.fulltext_catalogs catalog view to verify that the catalog has been created:
SELECT fulltext_catalog_id, name FROM sys.fulltext_catalogs
fulltext_catalog_id
|
name
|
5
|
AW2008FullTextCatalog
|
10
|
ProductFTS
|
Creating the Full-Text Index
After you create your full-text catalog, you’re ready to create your full-text index. You can then associate the index with the new catalog. If you don’t specify a catalog when you create the index, the index is associated with the database’s default catalog, whether it is the system catalog or a user-defined catalog that has been configured as the default.A full-text index is defined at the table level, and only one full-text index can be defined per table. For a table to support a full-text index, a unique index must be defined on that table. In addition, the index must be defined on a single column and be non-nullable. This column is referred to as the key index in the full-text index definition. For best performance, the key index should be defined on a column configured with an integer data type. Often, the primary key is a good candidate for a key index.
You can define a full-text index by using the CREATE FULLTEXT INDEX statement, as shown in the following example.
CREATE FULLTEXT INDEX ON ProductDocs
(DocSummary, DocContent TYPE COLUMN FileExtension LANGUAGE 1033)
KEY INDEX PK_ProductDocs_DocID
ON ProductFTS
The first line of the statement includes the ON clause, which specifies the table name (in this case, ProductDocs). The statement’s next line is a list of the columns that should be indexed (DocSummary and DocContent).
WITH STOPLIST = SYSTEM
Notice that the DocContent column definition includes the TYPE COLUMN clause and the LANGUAGE clause. The TYPE COLUMN clause is necessary when the indexed column stores binary data. The TYPE COLUMN clause specifies the name of a different column in the table that stores the file extension for the binary data. For example, the binary data might be a .doc file or .xls file. SQL Server uses the column specified in the TYPE COLUMN clause to associate the binary data with the program, such as Word or Excel.
You can use the sys.fulltext_document_types catalog view to return a list of the document types supported by full-text search, as shown in the following statement:
SELECT document_type, version, manufacturer
The following table provides a sample of some of the document types
(indicated by their file extension) supported by full-text search:
FROM sys.fulltext_document_types
document_type
|
Version
|
.ascx
|
12.0.6828.0
|
.asm
|
12.0.6828.0
|
.asp
|
12.0.6828.0
|
.aspx
|
12.0.6828.0
|
.bat
|
12.0.6828.0
|
.c
|
12.0.6828.0
|
.cmd
|
12.0.6828.0
|
.cpp
|
12.0.6828.0
|
.cxx
|
12.0.6828.0
|
.def
|
12.0.6828.0
|
.dic
|
12.0.6828.0
|
.doc
|
2006.0.6001.16503
|
SELECT * FROM sys.fulltext_languages
The following table shows a part of the result set. As you can see, 1033, the ID I use in the LANGUAGE clause, is English.
ORDER BY lcid
lcid
|
Name
|
0
|
Neutral
|
1025
|
Arabic
|
1026
|
Bulgarian
|
1027
|
Catalan
|
1028
|
Traditional Chinese
|
1031
|
German
|
1033
|
English
|
1036
|
French
|
1037
|
Hebrew
|
1039
|
Icelandic
|
1040
|
Italian
|
Following the KEY INDEX clause in the full-text index definition is the ON clause, which specifies the name of the full-text catalog (ProductFTS) that the index will join. In SQL Server 2008, you can also specify a filegroup where the index will be stored. However, this option isn’t available in SQL Server 2005 because filegroup association is at the catalog level.
The final clause in the example CREATE FULLTEXT INDEX statement is WITH STOPLIST. This option, available only in SQL Server 2008, lets you specify the name of the stoplist that will be used for this index. In this case, the system stoplist is used, but you can instead specify a user-defined stoplist. (Stoplists are covered in more detail later in the article.)
After you create a full-text index on a table, you can use several catalog views to verify the index has been created and is associated with the correct catalog. The following SELECT statement joins the sys.tables, sys.fulltext_indexes, and sys.fulltext_catalogs views:
SELECT t.name AS TableName, c.name AS FTCatalogName
FROM sys.tables t JOIN sys.fulltext_indexes i
ON t.object_id = i.object_id
JOIN sys.fulltext_catalogs c
As you can see in the following results, the ProductDocs table is associated with the ProductFTS catalog. Only tables with full-text indexes are listed in the results.
ON i.fulltext_catalog_id = c.fulltext_catalog_id
TableName
|
FTCatalogName
|
ProductReview
|
AW2008FullTextCatalog
|
Document
|
AW2008FullTextCatalog
|
JobCandidate
|
AW2008FullTextCatalog
|
ProductDocs
|
ProductFTS
|
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords
The results returned by the statement include the indexed terms,
along with the column ID and document count (number of rows) that
contain the term. The column ID is based on the order the columns are
defined in the table definition. In the ProductDocs table, the two indexed columns are DocSummary (the fifth column defined) and DocContent (the sixth column defined). The following table shows the first 25 terms stored in the ProductDocs full-text index.
(DB_ID('AdventureWorks2008'), OBJECT_ID('ProductDocs'))
display_term
|
column_id
|
document_count
|
100
|
5
|
1
|
100
|
6
|
1
|
1000
|
6
|
1
|
150
|
6
|
1
|
16
|
6
|
2
|
20
|
6
|
1
|
2000
|
6
|
1
|
20w
|
6
|
1
|
23ft
|
6
|
1
|
248
|
6
|
1
|
250
|
6
|
1
|
3000
|
6
|
1
|
44
|
6
|
1
|
48
|
6
|
1
|
500
|
5
|
1
|
500
|
6
|
1
|
618
|
6
|
1
|
619
|
6
|
1
|
620
|
6
|
1
|
Above
|
6
|
2
|
Absorbing
|
6
|
1
|
Acceptable
|
6
|
1
|
Accessories
|
6
|
1
|
Add
|
6
|
1
|
Adding
|
6
|
1
|
Modifying the List of Noise Words or Stop Words
When implementing full-text indexing in SQL Server, the area in which you will probably see the greatest differences between SQL Server 2005 and 2008 is in the way each version handles noise words or stop word.Noise Words in SQL Server 2005
Noise words are those words that are automatically removed from a full-text index when that index is created. For example, in the phrase “an apple and an orange,” the words “an” and “and” are considered noise words and are not be included in the index. Only “apple” and “orange” are tokenized and added to the index.SQL Server 2005 defines noise words in a set of text files, which by default, are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. Each text file is associated with a specific language and contains the noise words associated in that language. For example, the English noise words are stored in the noiseENG.txt file. The following list shows a sample of some of the words in the file:
about
1
after
2
all
also
3
an
4
and
5
another
6
any
7
are
8
as
9
at
0
be
$
because
been
before
being
between
both
but
by
came
can
come
could
did
do
Note: After you modify a noise word file, you must repopulate the applicable full-text indexes for the changes to take effect.
Stop Words in SQL Server 2008
SQL Server 2008 uses stop words, not noise words. Stop words are saved to stoplists, which are stored within SQL Server. You can create a stoplist by using a CREATE FULLTEXT STOPLIST statement, as shown in the following example:
CREATE FULLTEXT STOPLIST ProductSL
In the first line, the statement creates a stoplist named ProductSL.
The second line retrieves the stop words from the system stoplist and
uses those stop words to populate the new stoplist. You can verify that
the stoplist has been created by querying the sys.fulltext_stoplists catalog view:
FROM SYSTEM STOPLIST;
SELECT stoplist_id, name FROM sys.fulltext_stoplists
stoplist_id
|
Name
|
5
|
ProductSL
|
SELECT stopword FROM sys.fulltext_stopwords
Notice that I specified the stoplist ID and language ID to retrieve
only the necessary content. The following table shows a partial list of
the stop words saved to the ProductSL stoplist.
WHERE stoplist_id = 5 AND language_id = 1033
Stopword
|
About
|
After
|
All
|
Also
|
An
|
And
|
another
|
Any
|
Are
|
As
|
At
|
Be
|
because
|
been
|
before
|
being
|
between
|
both
|
But
|
By
|
came
|
Can
|
come
|
could
|
Did
|
Do
|
does
|
each
|
else
|
For
|
from
|
Get
|
Got
|
Had
|
Has
|
have
|
He
|
Her
|
here
|
Him
|
Himself
|
His
|
How
|
ALTER FULLTEXT STOPLIST ProductSL
When you generate a full-text index based on this stoplist, any
occurrence of “nuts” will be treated as a stop word and be removed from
the index.
ADD 'nuts' LANGUAGE 1033;
SQL Server 2008 also includes the sys.dm_fts_parser dynamic management function. The function lets you test how SQL Server will tokenize a string based on a specific language and stoplist. In the following SELECT statement, the function parses the phrase “testing for fruit and nuts, any type of nut.”
SELECT special_term, display_term
FROM sys.dm_fts_parser
The first argument in the function is the string that will be parsed,
the second argument is the language ID, the third argument is the
stoplist ID, and the fourth argument specifies whether the parsing
should be accent insensitive (0) or accent sensitive (1). The following
table shows the query results. Notice that “nuts” is considered a noise
word.
(' "testing for fruit and nuts, any type of nut" ', 1033, 5, 0)
special_term
|
display_term
|
Exact Match
|
Testing
|
Noise Word
|
For
|
Exact Match
|
Fruit
|
Noise Word
|
And
|
Noise Word
|
Nuts
|
Noise Word
|
Any
|
Exact Match
|
Type
|
Noise Word
|
Of
|
Exact Match
|
Nut
|
ALTER FULLTEXT STOPLIST ProductSL
Now when you use the sys.dm_fts_parser function to view the stoplist, you will receive the following results, which show that “nuts” is now considered an exact match.
DROP 'nuts' LANGUAGE 1033;
special_term
|
display_term
|
Exact Match
|
Testing
|
Noise Word
|
For
|
Exact Match
|
Fruit
|
Noise Word
|
And
|
Exact Match
|
Nuts
|
Noise Word
|
Any
|
Exact Match
|
Type
|
Noise Word
|
Of
|
Exact Match
|
Nut
|
ALTER FULLTEXT INDEX ON ProductDocs
SET STOPLIST ProductSL
Note: After you modify a stoplist or full-text
index definition, you must repopulate the applicable full-text indexes
for the changes to take effect.
Modifying the Full-Text Thesaurus
Both SQL Server 2005 and SQL Server 2008 provide a set of XML thesaurus files that let you define synonyms to support full-text queries. For example, you can define a set of synonyms for “song,” “tune,” and “music.” That way, whenever a query is issued against any one of these terms, the results include every other term defined in the set.SQL Server includes a thesaurus file for each language supported by full-text search. The files are named according to the language they support. For instance, the thesaurus file that supports English synonyms is named tseng.xml.
In a default installation of SQL Server 2005, the thesaurus files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. In a default installation of SQL Server 2008, the files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData.
Each thesaurus file comes preconfigured with the following XML code.
For example, the following code is included in the tseng.xml file:
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out (SQL Server 2008)
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
-->
</XML>
The other two elements worth noting are <expansion> and <replacement>. The first lets you define expansion sets, and the second lets you define replacement sets:
- Expansion set: A set of terms that can be substituted for each other. For example, in the sample <expansion> element in the tseng.xml file, three terms are included: “Internet Explorer,” “IE,” and “IE5.” As a result, if a full-text query includes the term “Internet Explorer,” the query results can include content that contains any of the three terms.
- Replacement set: A set of terms in which one term is substituted for another. For example, in the example <replacement> element in the tseng.xml file, the terms “NT5” and “W2K” are replaced by “Windows 2000.” As a result, if a full-text query includes the term “NT5” or “W2K,” the search results include only content that contains “Windows 2000.”
You can edit a thesaurus file in any XML or
text editor. After you save your changes, you must take steps to ensure
that the thesaurus is applied to your full-text searches. In SQL Server
2005, this means you must restart the full-text search engine. However,
in SQL Server 2008, you need only run the sys.sp_fulltext_load_thesaurus_file system stored procedure, as shown in the following example:
EXEC sys.sp_fulltext_load_thesaurus_file 1033
No comments:
Post a Comment