Home » Server Options » Text & interMedia » Problem with special chars in BLOB data type using contains keyword (oracle,9i, windows XP professional 2002)
Problem with special chars in BLOB data type using contains keyword [message #446334] Sun, 07 March 2010 22:38 Go to next message
hasham.alis
Messages: 1
Registered: March 2010
Junior Member
Facing problem, when part searching with special chars in BLOB datatype. It is considering the non alpha-numeric chars as a separtor in a provided string
EX:
SELECT *
FROM RESUME_TEST P,grst_candidate d
WHERE d.candidate_id = p.candidate_id
AND CONTAINS(P.CAND_RESUME,'%VB.NET%',1) > 0

Strings: , VB.NET , PL/SQL AS/400 , C etc..

Followed the below approaches
1) created a table:
Syntax: create table resume_Test(cand_id number(10),cand_resume blob);
2) inserted the values into this table upto 60,000
3) created a context index
3.1 created preferences

Syntax:
BEGIN
ctx_ddl.create_preference('try_lexer3','BASIC_LEXER');
ctx_ddl.set_attribute('try_lexer3','printjoins','-_~!@#$%^&*(){}[],=?\;|><.+');
END;

3.2 created context index
Syntax:

CREATE INDEX CANDRESUME_CTX_IDX ON resume_test (cand_resume)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER try_lexer3 memory 500M');


4) while executing this index, it is taking much time approx 6 hrs(plz explain why it is taking time)

5) Problems:
5.1 when searching with string(VB.NET , PL/SQL AS/400 , C etc..) it is considering the special char as a separator
5.2 used escape char (\) also, but no effect
5.3 when searching with single char, it is giving error (ORA-29902,ORA-20000,DRG-51030)
5.4 getting the above error with wild card chars (& ,_, (),{},[])


So, please explain the clear scenarios, why am getting this error , and how to get the proper results.
Re: Problem with special chars in BLOB data type using contains keyword [message #446449 is a reply to message #446334] Mon, 08 March 2010 08:20 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
Since the two tables in your query are joined, there should be indexes on the columns (cand_id) used in joining, or even better primary and foreign keys if appropriate. That enables the query to execute faster.

You left the forward slash "/" out of your printjoins values, so that would be why the "PL" and "SQL" of "PL/SQL" were tokenized separately, since it considered "/" as a separator.

If you want to search on single characters, then you need to use an empty stoplist when creating the index. If you do not specify a stoplist, then the default stoplist is used, which considers all single letters as stopwords, so searching for just a single letter, would be like searching for an empty string, which would cause an error.

Please see the following section of the online documentation for tips on how long to expect indexing to take and how to decrease the time:

http://download.oracle.com/docs/cd/B10501_01/text.920/a96517/aoptim.htm#1008973

After indexing, you can select the token_text column from the dr$...$i domain index table that is created by the indexing, to see how things are tokenized and check whether your sinlge characters are tokenized and whether your printjoins are working as expected.

You should gather current statistics on both tables, to enable the optimizer to choose the best execution plan. You can use dbms_stats.gather_table_stats to do this.

When searching, it is not necessary to use %. If you search for 'VB.NET' without the %, it will locate the row with 'VB.NET' anywhere in the text of the column. The % would be for searching for words that have that somwhere in the middle, like 'AVB.NETC'. The %, especially the leading one, slows down the query greatly. If you do need that leading wildcard, then you shold look into substring indexing to optimize those querie.

Instead of "select *", you should select the individual columns or your blob column will not be legible. You can use utl_raw.cast_to_varchar2 or ctx_doc.snippet or some such thing to display the value of the blob column in a legible manner.

You should use a bind variable for your search and assign the different values to be searched for to your bind variable, instead of searching for the literale values. That way the query in the SGA can be re-used, instead of re-parsing, which can save a lot of execution time.

Pleae see the demonstration below.

SCOTT@orcl_11g> -- create tables with primary and foreign keys
SCOTT@orcl_11g> -- (the keys will speed queries that join the tables by them):
SCOTT@orcl_11g> CREATE TABLE grst_candidate
  2    (cand_id      NUMBER (10),
  3  	cand_name    VARCHAR2 (10),
  4  	CONSTRAINT   cand_pk PRIMARY KEY (cand_id))
  5  /

Table created.

SCOTT@orcl_11g> CREATE TABLE resume_Test
  2    (cand_id      NUMBER (10),
  3  	cand_resume  BLOB,
  4  	CONSTRAINT   cand_fk FOREIGN KEY (cand_id)
  5  		     REFERENCES grst_candidate (cand_id))
  6  /

Table created.

SCOTT@orcl_11g> -- insert data:
SCOTT@orcl_11g> INSERT ALL
  2  INTO grst_candidate VALUES (1, 'cand1')
  3  INTO grst_candidate VALUES (2, 'cand2')
  4  INTO grst_candidate VALUES (3, 'cand3')
  5  INTO grst_candidate VALUES (4, 'cand4')
  6  SELECT * FROM DUAL
  7  /

4 rows created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO resume_test VALUES (1, UTL_RAW.CAST_TO_RAW ('word1 VB.NET word3'))
  3  INTO resume_test VALUES (2, UTL_RAW.CAST_TO_RAW ('word1 PL/SQL word3'))
  4  INTO resume_test VALUES (3, UTL_RAW.CAST_TO_RAW ('word1 AS/400 word3'))
  5  INTO resume_test VALUES (4, UTL_RAW.CAST_TO_RAW ('word1 C word3'))
  6  SELECT * FROM DUAL
  7  /

4 rows created.

SCOTT@orcl_11g> -- create lexer preference and set printjoins attribute
SCOTT@orcl_11g> -- (added / to prinjoin values):
SCOTT@orcl_11g> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE
  3  	 ('try_lexer3',
  4  	  'BASIC_LEXER');
  5    CTX_DDL.SET_ATTRIBUTE
  6  	 ('try_lexer3',
  7  	  'PRINTJOINS',
  8  	  '-_~!@#$%^&*(){}[],=?\;|><.+/');
  9  END;
 10  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> -- create index using lexer
SCOTT@orcl_11g> -- and empty stoplist (so it won't treat single letters as stopwords):
SCOTT@orcl_11g> CREATE INDEX candresume_ctx_idx
  2  ON resume_test (cand_resume)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS
  5    ('LEXER	   try_lexer3
  6  	 STOPLIST  CTXSYS.EMPTY_STOPLIST')
  7  /

Index created.

SCOTT@orcl_11g> -- view what is indexed;
SCOTT@orcl_11g> SELECT token_text
  2  FROM   dr$candresume_ctx_idx$i
  3  /

TOKEN_TEXT
----------------------------------------------------------------
AS/400
C
PL/SQL
VB.NET
WORD1
WORD3

6 rows selected.

SCOTT@orcl_11g> -- gather statistics
SCOTT@orcl_11g> -- (will enable the optimizer to choose the best execution plan):
SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS (USER, 'GRST_CANDIDATE')

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS (USER, 'RESUME_TEST')

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> -- select without % (% is not necessary and slows it down greatly)
SCOTT@orcl_11g> -- (* is not selected; cand_resume is converted for display)
SCOTT@orcl_11g> -- (a bind variable is used, so that the query in the SGA can be re-used with re-parsing):
SCOTT@orcl_11g> VARIABLE search_string VARCHAR2 (100)
SCOTT@orcl_11g> EXEC :search_string := 'VB.NET'

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> COLUMN cand_resume FORMAT A45 WORD_WRAPPED
SCOTT@orcl_11g> SELECT p.cand_id,
  2  	    d.cand_name,
  3  	    UTL_RAW.CAST_TO_VARCHAR2 (p.cand_resume) AS cand_resume
  4  FROM   resume_test p, grst_candidate d
  5  WHERE  d.cand_id = p.cand_id
  6  AND    CONTAINS (p.cand_resume, :search_string, 1) > 0
  7  /

   CAND_ID CAND_NAME  CAND_RESUME
---------- ---------- ---------------------------------------------
         1 cand1      word1 VB.NET word3

SCOTT@orcl_11g> EXEC :search_string := 'PL/SQL'

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> /

   CAND_ID CAND_NAME  CAND_RESUME
---------- ---------- ---------------------------------------------
         2 cand2      word1 PL/SQL word3

SCOTT@orcl_11g> EXEC :search_string := 'AS/400'

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> /

   CAND_ID CAND_NAME  CAND_RESUME
---------- ---------- ---------------------------------------------
         3 cand3      word1 AS/400 word3

SCOTT@orcl_11g> EXEC :search_string := 'C'

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> /

   CAND_ID CAND_NAME  CAND_RESUME
---------- ---------- ---------------------------------------------
         4 cand4      word1 C word3

SCOTT@orcl_11g>

Re: Problem with special chars in BLOB data type using contains keyword [message #478305 is a reply to message #446449] Thu, 07 October 2010 14:49 Go to previous messageGo to next message
ppatel
Messages: 3
Registered: October 2010
Junior Member
I had a question.

I am currently on Oracle 10g. I followed the same details provided by you, but I am not getting desired results.

I do see the index contains that word, but when I search for it using CONTAINS, get no rows.

Can you help out.

Thanks.
Re: Problem with special chars in BLOB data type using contains keyword [message #478307 is a reply to message #478305] Thu, 07 October 2010 14:53 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
Please post a copy and paste of a complete run of what you did including line numbers, any results, and any error messages.


Re: Problem with special chars in BLOB data type using contains keyword [message #478309 is a reply to message #478307] Thu, 07 October 2010 15:08 Go to previous messageGo to next message
ppatel
Messages: 3
Registered: October 2010
Junior Member
Thanks for you quick reply.

I did the following:

1. Created the lexer preference and attribute

BEGIN
CTX_DDL.CREATE_PREFERENCE ('LEXER1', 'BASIC_LEXER');
CTX_DDL.SET_ATTRIBUTE ('LEXER1',
'PRINTJOINS',
'-_~!@#$%^&*(){}[]=?\;|><.+/');
END;

2. Created the index

CREATE INDEX tab_idx03
ON test_table (test_data)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS
('LEXER LEXER1
STOPLIST CTXSYS.EMPTY_STOPLIST');

3. I query the index for certain value:

select * from dr$tab_idx03$i where TOKEN_TEXT = ( '#05202-BLUE');

This is what I get:

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
#05202-BLUE 0 140525 140525 1

4. I run the queries across my test table:

Here is what I ran:

SELECT * from test_table
WHERE CONTAINS (test_data,
'#05202-BLUE',1)>0;

--
No rows returned
--

SELECT * from test_table
WHERE test_data like '%#05202-BLUE%';

output:

----
PAINT-DABBER ARRO-MARK #05202-BLUE, 2
----

Appreciate all of your help.

Thanks.


Re: Problem with special chars in BLOB data type using contains keyword [message #478310 is a reply to message #478309] Thu, 07 October 2010 15:55 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
The hyphen has special meaning in Oracle Text; It means MINUS. So, if you search for 'a-b' it takes the score from a and subtracts the score from b from it, so if the string contains a and b, then the score is 0, so using contains ... > 0 does not find it. In order for Oracle Text to treat such special characters in a search string as normal text, you have to escape them, either by placing a \ in front of each such character or putting { and } around the whole string. Please see the reproduction and solutions below.

-- test environment:
SCOTT@orcl_11gR2> CREATE TABLE test_table
  2    (test_data  VARCHAR2 (60))
  3  /

Table created.

SCOTT@orcl_11gR2> INSERT INTO test_table (test_data)
  2  VALUES ('PAINT-DABBER ARRO-MARK #05202-BLUE, 2')
  3  /

1 row created.

SCOTT@orcl_11gR2> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('LEXER1', 'BASIC_LEXER');
  3    CTX_DDL.SET_ATTRIBUTE
  4  	 ('LEXER1',
  5  	  'PRINTJOINS',
  6  	  '-_~!@#$%^&*(){}[]=?\;|><.+/');
  7  END;
  8  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11gR2> CREATE INDEX tab_idx03
  2  ON test_table (test_data)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS
  5    ('LEXER	   LEXER1
  6  	 STOPLIST  CTXSYS.EMPTY_STOPLIST')
  7  /

Index created.

SCOTT@orcl_11gR2> select token_text from dr$tab_idx03$i
  2  /

TOKEN_TEXT
----------------------------------------------------------------
#05202-BLUE
2
ARRO-MARK
PAINT-DABBER

4 rows selected.


-- reproduction of problem:
SCOTT@orcl_11gR2> SELECT * from test_table
  2  WHERE   CONTAINS (test_data, '#05202-BLUE', 1) > 0
  3  /

no rows selected


-- solutions:
SCOTT@orcl_11gR2> SELECT * from test_table
  2  WHERE   CONTAINS (test_data, '#05202\-BLUE', 1) > 0
  3  /

TEST_DATA
------------------------------------------------------------
PAINT-DABBER ARRO-MARK #05202-BLUE, 2

1 row selected.

SCOTT@orcl_11gR2> SELECT * from test_table
  2  WHERE   CONTAINS (test_data, '{#05202-BLUE}', 1) > 0
  3  /

TEST_DATA
------------------------------------------------------------
PAINT-DABBER ARRO-MARK #05202-BLUE, 2

1 row selected.

SCOTT@orcl_11gR2> 

Re: Problem with special chars in BLOB data type using contains keyword [message #478403 is a reply to message #478310] Fri, 08 October 2010 08:50 Go to previous message
ppatel
Messages: 3
Registered: October 2010
Junior Member
That worked.

Thanks so much for your detailed explanation.
Previous Topic: Search for Boolean Operator
Next Topic: Special characters issues with CATSEARCH query
Goto Forum:
  


Current Time: Thu Mar 28 04:47:16 CDT 2024