Title:
SYSTEM AND METHOD OF IDENTIFYING SHARED RESOURCES ON A NETWORK
Document Type and Number:
WIPO Patent Application WO/2007/121490
Kind Code:
A3
Abstract:
A networked information indexing and search apparatus and method provide access, including indexing and search access, to information located on one or more intranets, the Internet, or both. The networked search apparatus, also referred to herein as a network search device or network search appliance, and method comprise configuration, indexing, and searching capabilities to facilitate networked information search and retrieval.

Inventors:
ERICKSON ROBERT (US)
FOX DAVID (US)
Application Number:
PCT/US2007/066969
Publication Date:
November 27, 2008
Filing Date:
April 19, 2007
Assignee:
DEEPDIVE TECHNOLOGIES INC (US)
ERICKSON ROBERT (US)
FOX DAVID (US)
International Classes:
G06F7/00; G06F17/30
Foreign References:
US20040205244A1 2004-10-14
US20030078987A1 2003-04-24
Attorney, Agent or Firm:
GLUCK, Peter, J. (LLP 2450 Colorado Avenue, Suite 400, Santa Monica CA, US)
Claims:

CLAIMS

1. In a networked information and search apparatus system, the implement comprising:

an improved network search appliance operatively linked to at least a configuration component, an index component and a search component;

whereby networked information may be selectively accessed from at least one intranet, the internet and simultaneously from at least one intranet and the internet.

2. The networked information and search apparatus system of claim 1, further comprising:

said configuration component including dynamically established network settings comprised of network addresses and related data.

3. The networked information and search apparatus of claim 2, wherein said indexing component is comprised of at least a repository of search data including identifiable sharable network resources.

4. The networked information and search apparatus system of claim 3, whereby a mechanism for ranking search results resides in the index component.

5. A method executed by a network search device, comprising, in combination: configuring the network device by dynamically establishing network settings corresponding to the network device; creating an index of sharable resources on the network; maintaining a search repository database; and, searching for information on the network using the search repository database;

whereby at least one mechanism for ranking search results sorts, scores and ranks the search result.

6. The method of claim 5, further comprising the network search device being configured using a user datagram protocol (UDP) client/server model wherein messages are transmitted between the search device and a network device to assign internet protocol (IP) settings including an IP address for the search appliance.

7. The method of claim 6, further comprising a bootstrap client which executes on the network server, polling the network devices physically connected to the network.

8. The method of claim 7, whereby each network search device further comprises a plurality of crawling mechanisms providing identification information including Medium Access Control (MAC) addresses and hostnames, whereby the bootstrap client on the network server uses the network search device's identification information to communicate with the network search device to set an IP state of the search appliance.

9. An improved system and method for identifying shareable resources comprising: generating a list of network servers; queuing the list to identify shareable resources; indexing the results of the queued list; displaying the results; and searching further and repeating.

10. The system of claim 9, further comprised of at least one network environment selected from the group of NetBIOS workgroups, Windows NT domains, Windows 2002/2003 domains in backward-compatible modes with Windows name-service enabled network environments.

Description:

SYSTEM AND METHOD OF IDENTIFYING SHARED RESOURCES ON A NETWORK

[0001] This provisional patent application is related to U.S. Provisional Patent Application No. 60/793,431, entitled System and Method of Identifying Shared Resources on a Network, filed on April 19, 2006, the content of which is herein incorporated by reference in its entirety.

FIELD OF INVENTION

[0002] The present invention relates generally to the field of searching for and identifying shared resources, and more particularly to identifying shared resources accessible via a network for search and retrieval, and to an apparatus and method for same.

BACKGROUND

[0003] Computer systems are typically used for various business, education, and entertainment-related applications, many of which store, retrieve and process information. The increased availability of computer systems and computer networks, such as the Internet, has made vast repositories of information available to a huge segment of our population.

[0004] While computers have undoubtedly provided enhanced abilities for information accessibility and information processing, the sheer volume of information available in electronic form has, in many ways, exceeded our ability to manage it. This phenomenon has been termed "information overload" and means that, now awash in information, it is becoming increasingly difficult to find information when desired. Accordingly, many new tools are being developed to deal with the ever-expanding volume of information that is now available for consumption in an electronic form.

[0005] For example, the World Wide Web ("WWW" or "web") can provide access to a vast amount of information. Locating the desired information, however, can be quite challenging. This problem is compounded because both the amount of information available on the web and the number of inexperienced users searching the web are growing exponentially. In an attempt to deal with this problem, a number of specialized search tools, known as "search engines," have been developed. Several of the more well-known search engines are Google, Yahoo, and MSN Search.

[0006] Given the rising popularity of search engines for the Internet, many search engine providers are contemplating the use of their search engine algorithms and methodologies in alternate computer environments. One such use is the advent of search engine tools for local area networks (LANs) and intranets. This type of deployment allows a company to search the data files stored on their own corporate computers, thereby retrieving desired files from those servers. While relatively primitive search capabilities are provided in many desktop environments, the ability to index, categorize, search and retrieve desired documents is quite limited.

[0007] In general, search engines attempt to return hyperlinks to specific web pages in which a user may be interested. Most search engines base their determination of the user's interest on a collection of search terms (called a search query) entered by the user. The goal of the search engine is to provide the user with multiple links to high quality, relevant results based on the user's search query. Typically, the search engine accomplishes this by matching the terms in the search query against a corpus of pre-stored, pre-indexed web pages. Web pages that contain the user's search terms are called "hits" and are returned to the user.

[0008] In an attempt to increase the relevancy and quality of the web pages returned to the user, a search engine may also attempt to sort the list of hits so that the most relevant and/or highest quality pages are at the top of the list of hits returned to the user. For example, the search engine may assign a rank or score to each hit, where the score is designed to correspond to the relevance or importance of the web page. Determining appropriate scores can be a difficult task. For one thing, the importance of a web page to the user is inherently subjective and depends on the user's interests, knowledge, and attitudes. There is, however, much that can be determined objectively about the relative importance of a web page.

[0009] Conventional methods of determining relevance are typically based on the contents of the web page. More advanced techniques determine the importance of a web page based on more than the content of the web page. For example, one known method, described in the article entitled "The Anatomy of a Large-Scale Hypertextual Search Engine," by Sergey Brin and Lawrence Page, assigns a degree of importance to a web page based on the link structure of the web page. In other words, the Brin and Page algorithm attempts to quantify the importance of a web page based on more than just the content of the web page. While this algorithm is useful in certain situations for certain users, it likely cannot provide the optimal results for all users in all situations, particularly in a LAN environment where much of the data is contained in other than web page format.

[00010] For example, while a data repository on the Internet or an intranet may contain a video clip, the search engine may not be capable of indexing and/or accessing the video clip to identify content, depending on the format and/or content of the video clip and the sophistication of the search engine. A similar problem may be encountered with other forms of content such as word processing documents, graphic image files, MP3 clips, interactive blogs, etc. Once again, if the owners of the currently deployed search engine technology do not adapt their search engines for these now ubiquitous data types, then the results may not include the desired data, even if the data is available.

[00011] Finally, even if all of the available data are accurately indexed and searched to provide a given result set, it may not be desirable for the entire set of links, pages, and/or documents to be made available to a given user. This is especially true in the rapidly evolving area of desktop computer searching and local area network (LAN) and intranet search engines. For example, in certain intranet environments, while one user may be authorized to view certain documents, other users on the same intranet should not be provided with access to a particular document, even if their search query indicates that the document should be returned as part of the result. Since current search engine technologies operate outside of the control of most operating systems, it is extremely difficult to customize access to search results based on any type of security model.

[00012] As shown by the discussion herein, without additional improvements in the systems and methods utilized in locating and processing information for users, search results provided by standard search engines will continue to be sub-optimal, at least for certain classes of users and certain types of searches.

[00013] SUMMARY OF THE INVENTION

[00014] A networked information indexing and search apparatus and method provide access, including indexing and search access, to information located on one or more intranets, the Internet, or both. The networked search apparatus, also referred to herein as a network search device or network search appliance, and method comprise configuration, indexing, and searching capabilities to facilitate networked information search and retrieval.

[00015] In at least one embodiment of the invention, a network search device comprises configuration, indexing and searching components. During configuration, network settings, such as a network address, are dynamically established for the network search device. The indexing component of the network search device searches the network, identifies sharable resources available on the network, and maintains a search repository, or database, of search information. In response to a search request, the network search device's searching component uses the search database to search for information on the network. In one embodiment of the invention, search results are scored, or ranked, according to one or more scoring mechanisms.

[00016] In another embodiment of the invention, a method executed by a network search device comprises configuring the network device, including dynamically establishing network settings, such as a network address, corresponding to the network device, creating an index of sharable resources on the network, including searching the network to identify the sharable resources on the network and maintaining a search repository, or database, of search information, and searching for information on the network using the search database, in response to a search request. According to one embodiment of the invention, search results are scored, or ranked, according to one or more scoring mechanisms.

[00017] According to at least one embodiment of the invention, the network search device is configured using a user datagram protocol (UDP) client/server model, wherein messages are transmitted between the search device and a network device (e.g., a network server) to assign Internet Protocol (IP) settings, which include an IP address, for the search appliance. A bootstrap client executes on the network server, which polls the network via a message broadcast to each of the network search devices physically connected to the network, or network segment. In response, each network search device provides identification information, e.g., its Medium Access Control (MAC) address and hostname. The bootstrap client on the network server uses the network search device's identification information to communicate with the network search device to set an IP state of the search appliance, and to reset the search appliance.
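
By way of illustration only, the poll, identify, and assign exchange described above might be sketched as follows. The JSON message layout, the port number, and all function names are assumptions made for this sketch; the application does not specify a wire format.

```python
import json
import socket

DISCOVERY_PORT = 50000  # hypothetical port; the application does not name one


def build_poll() -> bytes:
    """Message the bootstrap client broadcasts to the network segment."""
    return json.dumps({"type": "poll"}).encode("utf-8")


def build_identity(mac: str, hostname: str) -> bytes:
    """Reply from a search device: its MAC address and hostname."""
    return json.dumps({"type": "identity", "mac": mac, "hostname": hostname}).encode("utf-8")


def build_ip_assignment(mac: str, ip: str, netmask: str, gateway: str) -> bytes:
    """Message from the bootstrap client assigning IP settings to one device."""
    return json.dumps(
        {"type": "set_ip", "mac": mac, "ip": ip, "netmask": netmask, "gateway": gateway}
    ).encode("utf-8")


def handle_assignment(own_mac: str, datagram: bytes):
    """Device side: accept settings only when addressed to this device's MAC."""
    msg = json.loads(datagram.decode("utf-8"))
    if msg.get("type") != "set_ip" or msg.get("mac") != own_mac:
        return None
    return {key: msg[key] for key in ("ip", "netmask", "gateway")}


def broadcast_poll() -> None:
    """Send the poll as a UDP broadcast (not exercised in the usage below)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_poll(), ("255.255.255.255", DISCOVERY_PORT))
```

Because the assignment is keyed on the MAC address returned in the identity reply, a device without an IP address can still be targeted unambiguously on a shared segment.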

[00018] The network search device searches, also referred to herein as crawling or web crawling, the network for sharable resources, or shares, and maintains/updates a repository of information associated with each share to facilitate indexing and/or search. For example, a sharable resource may be a hard disk drive, or other storage media, fixed or removable, or one or more file system directories, files, documents, pages etc. stored thereon, with "sharable" access rights. According to one embodiment of the invention, a database stores information corresponding to these sharable resources, which is used for indexing and search.

[00019] In accordance with at least one embodiment, a system and method of identifying sharable resources is provided. A list of network servers is generated, and each listed server is queried to identify sharable resources. Information identifying the sharable resources located can then be indexed, searched, and displayed. The system/method performs an iterative search of the network for sharable resources, taking into account different environments of the network.

[00020] Embodiments of the invention can be used to search in a wide array of network environments to identify sharable resources. Such network environments include without limitation NetBIOS workgroups, Windows NT domains, and Windows 2000/2003 domains in backward-compatible modes with Windows Name Service-enabled network environments. In addition, embodiments have the ability to interoperate with various advanced Windows networking and security features.

[00021] A list of shared resources can be identified by generating a list of servers. Each server on the list is then queried to obtain a list of shared resources, a "share list". The "share list" can be used to identify sharable resources (e.g., a disc drive storing shared files), resolve unresolved IP addresses, and/or identify new servers to be queried. The list of servers can be generated using a network browser service to browse the network, an active directory service to access a directory of network objects, including servers, and predetermined configuration information.
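
The iterative character of this discovery, in which querying one server may reveal further servers to query, can be sketched as a simple work queue. The `query_server` callable is hypothetical and stands in for whatever browse or share-enumeration mechanism an embodiment uses; it is assumed to return the server's shares together with any other servers it reports.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical query signature: server name -> (shares, other servers it knows of)
QueryFn = Callable[[str], Tuple[List[str], List[str]]]


def discover_shares(seed_servers: Iterable[str], query_server: QueryFn) -> Dict[str, List[str]]:
    """Query every reachable server once and collect its share list."""
    pending = deque()
    seen = set()
    for server in seed_servers:
        if server not in seen:
            seen.add(server)
            pending.append(server)

    share_lists: Dict[str, List[str]] = {}
    while pending:
        server = pending.popleft()
        shares, neighbours = query_server(server)
        share_lists[server] = shares
        # Servers first learned of here are queued for a later pass.
        for other in neighbours:
            if other not in seen:
                seen.add(other)
                pending.append(other)
    return share_lists
```

The `seen` set ensures that the loop terminates even when servers list one another in their browse lists.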

[00022] In accordance with at least one aspect, an option to specify/obtain network configuration information is provided. In accordance with this aspect, a user, e.g., an administrative user, can use a graphical user interface to specify network services (e.g., NetBIOS peer-to-peer services, and/or a Windows Internet Name Service (WINS)). In addition, other tools (e.g., Dynamic Host Configuration Protocol, or DHCP) can be used to retrieve network configuration information. This feature can be used to find configuration information that can be used to provide support for NetBIOS networks that span network segments.

[00023] In accordance with one or more embodiments, a network browser tool is used to identify sharable resources. A network browser can be implemented using a NetBIOS-over-TCP/IP protocol set. A collection of candidate servers on the current network can be found by broadcasting a message to a given port (e.g., port 139 and/or 445) associated with possible addresses on the network. Servers that respond to the message are determined to be candidates for browsing. A set of servers identified using a network browser can be queried using, for example, a tool such as SAMBA's nmblookup to identify a corresponding NetBIOS name which can be mapped to a corresponding IP address. In addition, the browser tool can be used to identify active directory services, ADS, LDAP server, by broadcasting a message to a known "directory services" port (e.g., port 389). An LDAP server can then be queried to identify the names of "Domain member" computers using LDAP.

[00024] Active directory services can be used to identify available shares. For example, by joining an active directory domain, it is possible to enter into a "trust" relationship with a domain controller. This can sometimes be necessary to obtain the lists of available shares from domain member servers.

[00025] An active directory can be used to find domain member servers. Obtaining the names of domain member servers from the directory rather than searching for them on the network can help to streamline things with regard to certain networks (e.g., class A/B networks), on which using a broadcast technique (e.g., sending a query and waiting for a response) might take considerable time, especially in a case that several thousand or more addresses need to be queried.

[00026] An active directory can be used to find available shared folders. Obtaining share names and locations from the directory can be advantageous over a direct query to a server, especially when some resources may be located on a server that is not available (e.g., not running) at the time of an initial network survey.

[00027] In addition, a global catalog server can be used to find available shared folders. For example, a domain's "global catalog servers" can be used in order to identify shared resources in an entire forest of domains rather than just a "current" domain.

[00028] The configuration information can be used to identify a network that uses WINS; e.g., a WINS server, which maps NetBIOS names to IP addresses, can be used to identify servers. A WINS server can be used to identify servers as a supplement to, or in place of, a broadcast approach (e.g., broadcasting a network message and waiting for a response). If a server does not support WINS, it is carried forward in the search with its IP address as its name. A server in this category is queried via an SMB protocol (with NetBIOS session wrapping or raw) to obtain its browse list, if available, and its list of shared disk resources.

[00029] Domain Name Service (DNS) and a reverse lookup (i.e., using a known IP address to identify a server name) can be used to resolve an IP address identified by a network browser. For example, a DNS reverse lookup can be used to identify a server name given an IP address identified during a browse of the network, and/or an IP address that failed to respond to a broadcast. If the DNS reverse lookup successfully returns a name, it can be identified, e.g., in a browse list, by name rather than by IP address. This feature can be used to support "local network segment" indexing for many Windows 200x Active Directory domains.
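
As a minimal sketch of the reverse-lookup fallback described above, the following uses the standard `socket.gethostbyaddr` call and falls back to the IP address when no name is returned. The function names and the injectable `resolver` parameter are assumptions made for illustration.

```python
import socket
from typing import Callable, Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the DNS name for an IP address, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return None


def display_name(ip: str, resolver: Callable[[str], Optional[str]] = reverse_lookup) -> str:
    """Name to show in a browse list: the resolved name if any, else the IP itself."""
    name = resolver(ip)
    return name if name else ip
```

Passing the resolver as a parameter keeps the naming rule separate from the network call, so the fallback behavior can be exercised without a live DNS server.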

[00030] In accordance with one or more embodiments, the final list of servers and shares can be provided to the administrative GUI for presentation to the user.

[00031] The database includes domain, uri, and page tables used to store information corresponding to pages within documents stored as files at a location, or domain, on the network. The domain table includes a name corresponding to each domain. The uri table includes a universal resource indicator, or uri, for each document, together with other document information (e.g., last modification date and index time). The page table has an entry for each page (e.g., web page, email, page within a word processing document, etc.).

[00032] The database further includes a lexicon, or dictionary, of "original" words, which is dynamically updated to include new words. In addition, the database includes parts of speech of each word. One or more, preferably every, stem words constructed from an original word is stored in the lexicon, with each stem word being related in the database to the original word from which it was constructed. A rank table stores entries, each of which records the frequency of occurrence of a stem word within a document/page, as it is currently known (i.e., at the time of the last index and/or modification). A word table identifies locations of original words within a document/page.
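
One possible relational layout for the tables named above is sketched below using SQLite. The table names follow the description; every column name, key, and the separate `stem` table are assumptions, since the application does not give a schema.

```python
import sqlite3

# Column names and keys are illustrative assumptions, not taken from the application.
SCHEMA = """
CREATE TABLE domain  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE uri     (id INTEGER PRIMARY KEY,
                      domain_id INTEGER NOT NULL REFERENCES domain(id),
                      uri TEXT NOT NULL,
                      last_modified TEXT,
                      index_time TEXT);
CREATE TABLE page    (id INTEGER PRIMARY KEY,
                      uri_id INTEGER NOT NULL REFERENCES uri(id),
                      page_number INTEGER NOT NULL);
CREATE TABLE lexicon (id INTEGER PRIMARY KEY,
                      word TEXT UNIQUE NOT NULL,
                      part_of_speech TEXT);
CREATE TABLE stem    (id INTEGER PRIMARY KEY,
                      stem TEXT NOT NULL,
                      original_id INTEGER NOT NULL REFERENCES lexicon(id));
CREATE TABLE "rank"  (stem_id INTEGER NOT NULL REFERENCES stem(id),
                      page_id INTEGER NOT NULL REFERENCES page(id),
                      frequency INTEGER NOT NULL,
                      PRIMARY KEY (stem_id, page_id));
CREATE TABLE word    (word_id INTEGER NOT NULL REFERENCES lexicon(id),
                      page_id INTEGER NOT NULL REFERENCES page(id),
                      position INTEGER NOT NULL);
"""


def create_index_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create an empty index database with the tables sketched above."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

In this layout the rank table answers "how often does this stem occur on this page", while the word table records where each original word occurs.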

[00033] In at least one embodiment of the invention, the database model is such that new records can be added to one or more database tables using a file import mechanism, instead of a database insert command (e.g., structured query language, SQL, insert command). Existing records are updated using an SQL update command. For example, using a file import mechanism, data used to populate records in one or more of the uri, page, rank and word tables is buffered, and thereafter written to the database (e.g., at the end of indexing and/or as the data buffers become full).
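
The buffering behavior can be illustrated with the sketch below, which accumulates rows in memory, spills them to a delimited file when the buffer fills, and loads that file in a single transaction. The application does not name the import mechanism; because Python's `sqlite3` module has no file import command, the load step here is approximated with `executemany`, and the class name, capacity, and CSV format are assumptions.

```python
import csv
import os
import sqlite3
import tempfile


class ImportBuffer:
    """Buffer new rows for one table and load them in bulk from a file."""

    def __init__(self, conn: sqlite3.Connection, table: str, columns, capacity: int = 1000):
        self.conn = conn
        self.table = table
        self.columns = list(columns)
        self.capacity = capacity
        self.rows = []

    def add(self, row) -> None:
        """Queue one row; flush automatically when the buffer is full."""
        self.rows.append(tuple(row))
        if len(self.rows) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        """Write buffered rows to a temporary CSV file, then load the file."""
        if not self.rows:
            return
        fd, path = tempfile.mkstemp(suffix=".csv")
        try:
            with os.fdopen(fd, "w", newline="") as handle:
                csv.writer(handle).writerows(self.rows)
            placeholders = ", ".join("?" for _ in self.columns)
            sql = f"INSERT INTO {self.table} ({', '.join(self.columns)}) VALUES ({placeholders})"
            with open(path, newline="") as handle:
                with self.conn:  # one transaction per imported file
                    self.conn.executemany(sql, csv.reader(handle))
        finally:
            os.remove(path)
        self.rows.clear()
```

A caller would invoke `flush()` once more at the end of indexing to write any rows remaining in a partly filled buffer.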

[00034] In one or more embodiments of the invention, an N-ary trie is used to buffer the lexicon and provides efficient word lookup. The value of "N" is based on the particular character set used to represent the words in the lexicon. For example, "N" can represent the number of characters in an alphabet, together with a number of digits and punctuation marks. In one or more embodiments of the invention, prior to performing an indexing operation, the contents of the lexicon table are written to the N-ary trie buffer structure. Updates made during an indexing operation, such as new words found in new or updated documents/pages, are first written to the N-ary trie buffer structure, and then written to the database using the file import mechanism.
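
A minimal sketch of such an N-ary trie follows. The character set (lowercase letters, digits, apostrophe, and hyphen, giving N = 38) and the `new_words` list that records entries awaiting write-back are assumptions chosen for illustration.

```python
import string
from typing import List, Optional

# Assumed character set: 26 letters + 10 digits + 2 punctuation marks, so N = 38.
ALPHABET = string.ascii_lowercase + string.digits + "'-"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
N = len(ALPHABET)


class TrieNode:
    __slots__ = ("children", "word_id")

    def __init__(self) -> None:
        self.children: List[Optional["TrieNode"]] = [None] * N
        self.word_id: Optional[int] = None


class LexiconTrie:
    """In-memory buffer of the lexicon with one child slot per character."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.new_words = []  # (word_id, word) pairs added since the last load

    def _insert(self, word: str, word_id: int) -> None:
        # Raises KeyError for characters outside ALPHABET; callers normalize first.
        node = self.root
        for ch in word:
            i = INDEX[ch]
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.word_id = word_id

    def load(self, rows) -> None:
        """Populate the trie from existing lexicon rows before indexing."""
        for word_id, word in rows:
            self._insert(word, word_id)

    def lookup(self, word: str) -> Optional[int]:
        """Return the word's id, or None if the word is not in the lexicon."""
        node = self.root
        for ch in word:
            i = INDEX.get(ch)
            if i is None:
                return None
            node = node.children[i]
            if node is None:
                return None
        return node.word_id

    def add(self, word: str, word_id: int) -> None:
        """Add a word found during indexing and remember it for write-back."""
        if self.lookup(word) is None:
            self._insert(word, word_id)
            self.new_words.append((word_id, word))
```

Lookup cost depends only on the length of the word, not on the size of the lexicon, and the `new_words` list holds exactly the rows that a later bulk import would write to the database.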

[00035] In one or more embodiments of the invention, a scoring mechanism, which may include one or more "weighting" methodologies, is used to provide enhanced search results. More particularly, a scoring mechanism is used to rank results from a search, to determine a relevance score for each item (e.g., document, page, etc.) identified from a keyword search. Even more particularly and in accordance with one or more embodiments of the invention, the scoring mechanism is used to rank an item's relevance based on both a frequency of occurrence of a keyword found in a document and a correlation between multiple keywords found in the document. Advantageously, for example, in a case that aggregation of frequency of occurrence corresponding to each keyword found in a search result item identified in the search are comparable for all search result items, the scoring mechanism can be used to determine correlations between multiple keywords found within a given search result item, to assist in differentiating the relevance of a search result item relative to the other search result items uncovered in the search.

[00036] In one embodiment of the invention, the scoring algorithm scales products of frequencies of occurrence, using different combinations of frequencies of occurrence associated with the keyword terms, beginning with a first order and increasing to an order equal to the number of keywords in the search, to determine relevance corresponding to a search result item having multiple keywords. According to one embodiment, the relevance can be determined for each search result item having multiple keywords. In an alternate embodiment, a threshold number, which identifies a number of multiple keywords, is used to determine the relevance score assigned to a search result item. More particularly, if a search result item contains fewer than the threshold number of multiple keywords, its relevance score is set to zero. However, in a case that the search result item contains at least the threshold number of keywords, the scoring algorithm is used to determine a relevance score.
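
One way to read the scoring described above is as a weighted sum, over each order from one up to the number of keywords, of the products of keyword frequencies taken that many at a time. The sketch below follows that reading. The scaling rule (a fixed factor raised to the order minus one), the default factor of 10, and the function name are assumptions; the application does not state the scale values.

```python
from itertools import combinations
from math import prod
from typing import Dict, Sequence


def relevance(frequencies: Dict[str, int], keywords: Sequence[str],
              threshold: int = 1, scale: float = 10.0) -> float:
    """Score one search result item from its per-keyword frequencies."""
    present = [frequencies[k] for k in keywords if frequencies.get(k, 0) > 0]
    # Alternate embodiment: too few distinct keywords present gives a zero score.
    if len(present) < threshold:
        return 0.0
    score = 0.0
    for order in range(1, len(keywords) + 1):
        weight = scale ** (order - 1)  # assumed scaling per order
        # Sum of products over every combination of `order` keyword frequencies.
        score += weight * sum(prod(combo) for combo in combinations(present, order))
    return score


# Two items with the same total keyword count (4) are separated by co-occurrence.
only_one = relevance({"network": 4}, ["network", "share"])              # 4.0
both = relevance({"network": 2, "share": 2}, ["network", "share"])      # 4 + 10 * 4 = 44.0
```

The first-order term reproduces a plain frequency count, while the higher-order terms contribute only when several keywords occur together in the same item, which is what separates the two items in the example.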

BRIEF DESCRIPTION OF THE DRAWINGS

[00037] Embodiments of the present invention will hereinafter be described in conjunction with the appended drawings wherein like designations denote like elements and:

[00038] FIG. 1 illustrates a block diagram of a representation of a network of computing devices and peripherals in which one or more embodiments of the present invention can be used;

[00039] FIG. 2 provides an illustrative example of a block diagram of an internal architecture of a search appliance in accordance with one or more embodiments of the present invention;

[00040] FIG. 3 illustrates a flowchart of process steps to create and update an index in accordance with one or more embodiments of the present invention;

[00041] FIG. 4 provides an illustrative example of a block diagram of a search appliance used in indexing and searching in accordance with one or more embodiments of the present invention;

[00042] FIG. 5 illustrates a flowchart of process steps to score and rank search results in accordance with one or more embodiments of the present invention;

[00043] FIG. 6, which includes FIG. 6A to FIG. 6O, provides illustrative examples of screens from a user interface of a search appliance in accordance with one or more embodiments of the invention; and

[00044] FIG. 7, which includes FIG. 7A to FIG. 7Y, provides illustrative examples of screens from a user interface used in configuration operations for, and/or associated with, a search appliance in accordance with one or more embodiments of the present invention.

[00045] FIG. 8, which comprises FIGs. 8A and 8B, provides an example of pseudo code of a script for use in discovering shared resources in accordance with one or more embodiments.

DETAILED DESCRIPTION

[00046] A networked information indexing and search apparatus and method provide access, including indexing and search access, to information located on one or more intranets, the Internet, or both. The networked search apparatus, also referred to herein as a network search device or network search appliance, and method comprise configuration, indexing, and searching capabilities to facilitate networked information search and retrieval.

[00047] Referring now to FIG. 1, a block diagram of a representation 100 of a network of computing devices and peripherals in which one or more embodiments of the present invention can be used is provided. According to one or more embodiments of the invention, computers 150, 160, and 170, at least one instance of search appliance 180, and at least one data server 190 are coupled via a network 120. Additionally, an optional printer 110 and an optional fax machine 140 are shown. Using the present invention, individuals, business entities and the like, for example, can efficiently and effectively access and manage the storing, indexing, accessing, and retrieving of electronic data as described herein in conjunction with the various embodiments of the present invention.

[00048] Optional printer 110 and an optional fax machine 140 are standard peripheral devices that may be used for transmitting or outputting paper-based documents, notes, search results, reports, etc. in conjunction with the queries and transactions processed by computer-based system 100. It should be noted that optional printer 110 and optional fax machine 140 are merely representative of the many types of peripherals that may be utilized in conjunction with the present invention, and that other peripheral devices can be used with one or more embodiments of the present invention and no such device is excluded by its omission in FIG. 1.

[00049] Network 120 is any suitable computer communication link or communication mechanism, including a hardwired connection, an internal or external bus, a connection for telephone access via a modem or high-speed T1 line, radio, infrared or other wireless communications, private or proprietary local area networks (LANs) and wide area networks (WANs), as well as standard computer network communications over the Internet or an internal network (e.g., "intranet") via a wired or wireless connection, or any other suitable connection between computers and computer components known to those skilled in the art, whether currently known or developed in the future. It should be noted that portions of network 120 may suitably include a dial-up phone connection, broadcast cable transmission line, Digital Subscriber Line (DSL), ISDN line, or similar public utility-like access link.

[00050J In one or more embodiments of the present invention, at least a portion of network 120 comprises a standard wired or wireless Internet connection between the various components of computer-based system 100. Network 120 provides for communication between the various components coupled to network 120, which allows for information to be transmitted between devices coupled thereto. Using embodiments of the invention, a user of computer system, e.g., computer 150, 160 and 170, connected to network 120, for example, can gain access, based on access privileges corresponding to the user, to data and information accessible via network 120. Regardless of physical nature and topology, network 120 serves to link the physical components of computer-based system 100 together, regardless of their physical proximity. This is especially important because it is contemplated that, in one or more embodiments of the present invention, data server 190 and computers 150, 160, and 170 may be geographically remote and physically separated from each other.

[00051] Computers 150, 160 and 170 may be any type of computer known to those skilled in the art that is capable of being configured for use with computer-based system 100 as described herein. This includes laptop computers, desktop computers, tablet computers, pen-based computers and the like. Computers 150, 160, and 170 are most preferably commercially available computers such as a Linux-based computer, IBM compatible

computers, or Macintosh computers. Howeveϊ, those skilled in the art will appreciate that the methods and apparatus of the present invention apply equally to any computer or computer system, regardless of whether the computer is a traditional "mainframe" computer, a complicated multi-user computing apparatus or a single user device such as a personal computer or workstation.

[00052J Additionally, handheld and palmtop devices are also specifically included within the description of devices that may be deployed as computers 150, 160 and 170. It should be noted that no specific operating system or hardware platform is excluded and it is anticipated that many different hardware and software platforms may be configured to be deployed as computers 150, 160 and 170. Various hardware components and software components (not shown this HG.) known to those skilled in the art may be used in conjunction with computers 150, 160 and 170.

[00053] Data server 190, together with computers 150, 160 and 170, is preferably configured to store and retrieve data, some or all of which is sharable via network 120. Various hardware components (not shown this FIG.) such as external monitors, keyboards, mice, tablets, hard disk drives, recordable CD-ROM/DVD drives, jukeboxes, fax servers, magnetic tapes, and other devices known to those skilled in the art may be used in conjunction with data server 190, and computers 150, 160 and 170.

[00054] Data server 190 may also be configured with various additional software components (not shown this FIG.) such as database servers, web servers, firewalls, security software, and the like. While only a single data server 190 is shown connected to network 120 of FIG. 1, embodiments of the present invention contemplate and embrace a virtually unlimited number of data servers 190. The various data servers may vary in size, complexity and capability, but will all generally be capable of storing and retrieving information via network 120, in response to user requests.

[00055] In general, data server 190 represents a network accessible data server that is configured to store data files for later retrieval by the users of computers 150, 160 and 170 via network 120. A typical transaction may be represented by a request to store information or access information directly stored on data server 190 or on some other computer or computer system that is logically connected to data server 190. The request to store or retrieve information may include requests involving any type of digitized data, whether voice, text, graphics, etc. and the information may be stored in any format known to those skilled in the art.

[00056] In general, search appliance 180 represents a network accessible computing system configured to act as a network-based indexing and search apparatus capable of indexing data, receiving search queries and processing the search queries to return one or more data files accessible via network 120, and any other appropriately designated computers, that are responsive to the search queries. A typical transaction may be represented by a request for all files containing certain keywords or phrases from the data store contained on data server 190 or stored on some other computer or computer system that is logically connected to data server 190. The request to retrieve data may include search requests involving any type of digitized data, whether voice, text, graphics, etc. and the information may be stored in any format known to those skilled in the art.

[00057] In accordance with one or more embodiments of the present invention, search appliance 180 is configurable automatically via a UDP client/server model, or using a user interface comprising displayable web pages using a standard web browser.

[00058] In accordance with the UDP client/server model and prior to configuration on network 120, the search appliance 180 is physically connected to network 120. After the physical connection has been made, as is described in more detail below, search appliance 180 transmits a message containing identification information via User Datagram Protocol (UDP) and network 120 to configure search appliance 180. Once configured on network 120, search appliance 180 can be used to identify sharable resources available on the network, and maintain a search repository, or database, of search information. In response to a search request, the search appliance 180 uses the search database to search information on the network. In one embodiment of the invention, search results are scored, or ranked, according to one or more scoring mechanisms.

[00059] The UDP client/server model used in one or more embodiments of the invention addresses an issue present when installing a network appliance on a network, such as network 120. That is, when configuring a network appliance, such as search appliance 180, on network 120, it is necessary to configure the device for network communications, e.g., TCP/IP Ethernet communication. For example, in a TCP/IP network environment, an IP address and subnet mask should be established for search appliance 180 in order to operate over TCP/IP within the network in which it is deployed.

[00060] It is possible to use a manual configuration approach, e.g., manually setting network parameters for search appliance 180. The manual configuration approach assumes a fairly sophisticated knowledge of network configuration needs. It would therefore be beneficial to be able to configure search appliance 180 for network 120 automatically.

[00061] Another approach, which can be used with embodiments of the present invention, to configure a network device such as the search appliance 180 involves the use of BOOTP, or the superseding and encompassing DHCP, to obtain IP settings. However, not all networks can be expected to provide such a service.

[00062] With reference to the automatic configuration using the UDP client/server model, which can be used with one or more embodiments of the invention, it is possible to configure search appliance 180, e.g., identify valid IP settings, for communication on network 120 automatically and with minimal user intervention. In other words, when a network device, such as search appliance 180, is initially connected to a physical network, such as network 120, without a priori knowledge of the subnet within which it resides, this approach provides an ability to establish initial communication between search appliance 180 and data server 190.

[00063] The UDP client/server model contemplates the use of a set of connectionless UDP broadcast messages that can be used to communicate between a network device, e.g., network data server 190, and search appliance 180, without the need for search appliance 180 to be configured with TCP/IP settings, e.g., a TCP/IP address. It should be apparent that although the client/server model is described with reference to UDP, other protocols may be used. In addition, although a communication protocol defining a set of messages used to communicate with search appliance 180 is described, it should be apparent that other message types can be used to communicate with search appliance 180 via UDP, or another network protocol.
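As an illustration only, the following Python sketch shows how a client might send one connectionless UDP limited-broadcast datagram and collect whatever replies arrive. The port numbers, the function name `broadcast_and_collect`, and the choice to stop listening after a quiet period are assumptions made for this sketch; the text does not specify them.

```python
import socket

# Hypothetical values: the text does not give port numbers.
SERVER_PORT = 47800   # port the appliances are assumed to listen on
CLIENT_PORT = 47801   # port the client is assumed to listen on for replies


def broadcast_and_collect(payload: bytes, timeout: float = 2.0,
                          address: str = "255.255.255.255",
                          max_size: int = 512) -> list[tuple[bytes, str]]:
    """Send one datagram to `address` and gather replies.

    Listening stops once no datagram has arrived for `timeout` seconds.
    UDP gives no delivery guarantee, so an empty list is a normal outcome.
    """
    replies = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Required before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.bind(("", CLIENT_PORT))
        sock.settimeout(timeout)
        sock.sendto(payload, (address, SERVER_PORT))
        while True:
            try:
                data, (sender_ip, _port) = sock.recvfrom(max_size)
            except socket.timeout:
                break
            replies.append((data, sender_ip))
    return replies
```

Because the exchange is stateless, a caller that receives no replies would simply send the datagram again rather than treat the silence as an error.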

[00064] An illustrative set of message types used in one or more embodiments of the invention for configuring search appliance 180 is described and set forth below. The communication protocol defines a structure for messages used in implementing the UDP client/server model. In addition, examples are provided of use cases to illustrate end-user network setup using the UDP client/server model.

[00065] It is also worth noting that implementation of a simple protocol atop UDP within a client/server model is a convenient solution to network device configuration, whether or not a BOOTP, or a DHCP, server is present in the network segment.

[00066] According to the UDP client/server model of the present invention, messages can be passed between UDP client and server. More particularly, message types are presented in terms of commands issued by the UDP client, e.g., a networked device such as data server 190, to one or more UDP servers, e.g., search appliance 180. A typical command consists of a message sent by a UDP client to one or more UDP servers listening on a dedicated port. To reply, a response message can be in the form of a message sent by one or more UDP servers back to the UDP client, which in turn listens on its own dedicated port. The command types are different from remote procedure calls at least with respect to the transmission of messages in the form of UDP limited broadcasts, which are connectionless, and thus, without state. In particular, there is no guarantee that an intended recipient of a message will actually receive the message. Messages are broadcast to all devices on the network segment. Examples of messages/commands that can be used with the UDP client/server model of one or more embodiments of the invention are as follows:

Table - UDP Client/Server Model Message Type Examples.

[00067] In the examples given in the UDP Client/Server Message Type Table, the first command, the POL message, is issued by a UDP client, e.g., data server 190, to identify all of the UDP servers, e.g., instances of search appliance 180, in a network, or network segment. Each UDP server that receives a POL message replies with a PLR message. Using identification information provided with the PLR message, additional messages can be sent to specific ones of search appliance 180 to cause search appliance 180 to perform an operation specified by the message. For example, another message that can be issued by a UDP client is a GET message, which requests IP information from a specific UDP server, i.e., a specific instance of search appliance 180. The intended UDP server replies with a GTR message, which contains the requested information.

[00068] Another message issued by a UDP client, a SET message, requests the recipient UDP server to set its IP state. The intended UDP server replies with a STR message, which indicates the result, e.g., success or failure, of the requested operation. A RES message can be issued by a UDP client to instruct a specific instance of the UDP server to initiate a reset operation to reset its state, which is accompanied by a restart of the appliance. Each of these types of messages is discussed in more detail below.

[00069] According to one example of a syntax used for the message types described herein, each message is no greater than 512 bytes in length. Of these seven types of messages, as discussed generally above, four of the messages are sent by the UDP client to the UDP server to initiate an operation to be performed by the UDP server. The remaining types of messages identified above are sent by a UDP server to the UDP client in reply. Each message body identifies the sender via a MAC address field. The POL message sent by the UDP client is intended for all UDP servers that might be listening. The remaining message types are intended for a specific recipient, as is identified by its MAC address in the message body. One example of the structure and syntax used for the seven message types is shown below.
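The constraints stated above (seven three-letter message types, a 512-byte limit, a sender MAC address in every body, and a recipient MAC address in every message except POL) can be sketched in Python as follows. The space-separated ASCII layout, the `*` placeholder for "all servers", the `fields` tuple, and the names `Message`, `encode`, and `decode` are hypothetical; they are not the syntax defined by the specification and serve only to make the constraints concrete.

```python
from dataclasses import dataclass

MAX_MESSAGE_BYTES = 512                          # limit stated in the text
CLIENT_COMMANDS = {"POL", "GET", "SET", "RES"}   # sent by the UDP client
SERVER_REPLIES = {"PLR", "GTR", "STR"}           # sent back by a UDP server
EXPECTED_REPLY = {"POL": "PLR", "GET": "GTR", "SET": "STR"}


@dataclass(frozen=True)
class Message:
    kind: str                      # one of the seven message types
    sender_mac: str                # every message body identifies its sender
    target_mac: str = ""           # empty only for POL (all listening servers)
    fields: tuple[str, ...] = ()   # hypothetical payload, e.g. IP and mask


def encode(msg: Message) -> bytes:
    """Serialize to an assumed space-separated ASCII layout."""
    if msg.kind not in CLIENT_COMMANDS | SERVER_REPLIES:
        raise ValueError(f"unknown message type: {msg.kind}")
    if msg.kind != "POL" and not msg.target_mac:
        raise ValueError(f"{msg.kind} must name a specific recipient")
    parts = [msg.kind, msg.sender_mac, msg.target_mac or "*", *msg.fields]
    data = " ".join(parts).encode("ascii")
    if len(data) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds 512 bytes")
    return data


def decode(data: bytes) -> Message:
    """Parse the assumed layout back into a Message."""
    if len(data) > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds 512 bytes")
    kind, sender, target, *fields = data.decode("ascii").split(" ")
    return Message(kind, sender, "" if target == "*" else target, tuple(fields))
```

For example, `encode(Message("POL", "aa:bb:cc:00:00:01"))` yields `b"POL aa:bb:cc:00:00:01 *"` under this assumed layout, and `EXPECTED_REPLY` records which reply type the text pairs with each command.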

[00070] It is contemplated, in accordance with one or more embodiments of the present invention, that each instance of search appliance 180 continuously runs a UDP server and is configured in the factory to accept an IP address leased to it by a DHCP server running in its network. If a DHCP server does not exist in the network, then TCP/IP configuration of search appliance 180 occurs through commands received by the UDP server executing in search appliance 180, using the UDP client/server model described above.
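A minimal sketch of how the UDP server running on an appliance might react to the four client commands is given below. The `ApplianceState` class, the `handle` function, the reply payloads, and the decision to send no reply to RES are assumptions for illustration; the text names no reply type for RES and does not define the payload contents.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class ApplianceState:
    mac: str
    ip: str = ""        # empty until DHCP or a SET command supplies a value
    netmask: str = ""


def handle(state: ApplianceState, kind: str, target_mac: str = "",
           args: tuple[str, ...] = ()):
    """Return (reply_kind, reply_args), or None when no reply is sent."""
    if kind == "POL":                      # addressed to every listening server
        return ("PLR", (state.mac,))
    if target_mac != state.mac:            # meant for a different appliance
        return None
    if kind == "GET":
        return ("GTR", (state.ip, state.netmask))
    if kind == "SET":
        try:
            ip, mask = args
            ipaddress.ip_interface(f"{ip}/{mask}")  # rejects malformed values
        except ValueError:
            return ("STR", ("failure",))
        state.ip, state.netmask = ip, mask
        return ("STR", ("success",))
    if kind == "RES":
        state.ip = state.netmask = ""      # the restart itself is not modeled
        return None
    return None
```

Since every message is broadcast to the whole segment, the MAC comparison is what keeps an appliance from acting on a command intended for another instance.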

[00071] The UDP client/server model described herein for use with one or more embodiments of the invention is provided to the end user for uses including the following: (i) discovering all search appliances 180 connected to the network, e.g., network 120, (ii) obtaining the IP address and subnet mask of a specified search appliance 180 so discovered, and (iii) setting the IP address and subnet mask of a specified search appliance 180 so discovered. Some example scenarios encountered by the end user, and the actions that can be taken, are categorized below.

[00072] In one such scenario, search appliance 180 boots in a network containing a DHCP server. In such a case, search appliance 180 obtains a valid IP address from the DHCP server, and network setup of the search appliance 180 can be completed without a need for the UDP client/server model described herein. The following are among the alternatives available to the user in a case that the network contains a DHCP server:

1. The end user need not take any action.

2. If so desired, however, the end user may run the UDP client/server bootstrap client on the network server to discover a search appliance 180 connected to the network, for example, to:

a. obtain the IP settings as provided by the DHCP server, or

b. change the IP settings to another static IP address.

[00073] In another scenario, search appliance 180 boots in a network that does not contain a DHCP server. In such a case, search appliance 180 waits for its IP address and subnet mask to be set, e.g., using the SET command of the UDP client/server model from the UDP server. In this case, the end user configures the appliance within the network by running the program code which implements the UDP bootstrap client on the network device, e.g., data server 190. The UDP bootstrap client communicates with instances of search appliance 180, as described above, to allow the user to discover each instance of search appliance 180 and issue the command to set its IP address and subnet mask, to configure search appliance 180 for network communications.

[00074] At any time after successful completion of network setup, the end user may run the UDP bootstrap client to discover one or more instances of search appliance 180 to, for example:

1. obtain an IP address and subnet mask of one or more instances of search appliance 180,

2. reset an IP address and subnet mask of one or more instances of search appliance 180 to static values, or

3. reset one or more instances of search appliance 180 to a factory configuration.

[00075] Referring again to FIG. 1, it should be noted that while FIG. 1 shows only a few computers 150, 160, and 170 connected to network 120, it is anticipated that dozens or hundreds or even thousands of similarly configured computers 150, 160, and 170 can be "indexed" and searched using instances of search appliance 180. In one or more embodiments of the present invention, multiple computers 150, 160, and 170 will all be configured to communicate with search appliance 180 and one or more data servers 190 and with each other via network 120.

[00076] Using search appliance 180, a user of a computer, such as one of computers 150, 160, and 170, can initiate a search request to locate and retrieve desired data files from data server 190, for example, with the search request being received and processed by search appliance 180. In response to receipt of such a request, search appliance 180 will, if appropriate, provide access to the requested data files to the requester. As discussed above, using search appliance 180, a user of one of computers 150, 160, and 170, for example, may request and retrieve information in this fashion from not only data server 190, but from any other computer or computer system coupled to network 120, indexed using search appliance 180. Using search appliance 180, it is possible to submit a search request, review the results of a search, and index volumes of data located on a local shared resource, at a remote location connected to network 120, and across an intranet and the Internet. In addition to some of the typical applications for which embodiments of the present invention can be used, it is contemplated that the present invention may be used for other searching applications, including, for example, electronic discovery and computer forensics.

[00077] Referring now to FIG. 2, a block diagram illustrates one example of an internal architecture of search appliance 180 in accordance with one or more embodiments of the invention. Search appliance 180 may also be configured with various additional software components (not shown this FIG.) such as servers, firewalls, comprehensive security software, and the like. Given the relative advances in the state-of-the-art computer systems available today, it is anticipated that functions of search appliance 180 may be provided by many standard, readily available computing devices and systems.

[00078] Search appliance 180 suitably comprises at least one Central Processing Unit (CPU) or processor 210, a main memory 220, a memory controller 230, an auxiliary storage interface 240, and a terminal interface 250, all of which are interconnected via a system bus 260. Note that various modifications, additions, or deletions may be made to search appliance 180 illustrated in FIG. 2 within the scope of the present invention such as the addition of cache memory or other peripheral devices. FIG. 2 is not intended to be an exhaustive example, but is presented to simply illustrate some of the salient features of search appliance 180.

[00079] Processor 210 performs computation and control functions of search appliance 180, and comprises a suitable central processing unit (CPU). Processor 210 may comprise a single integrated circuit, such as a microprocessor, or may comprise any suitable number of integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of a processor. Processor 210 suitably executes one or more software programs contained within main memory 220.

[00080] Auxiliary storage interface 240 allows search appliance 180 to store and retrieve information from auxiliary storage devices, such as external storage mechanism 270, magnetic disk drives (e.g., hard disks or floppy diskettes) or optical storage devices (e.g., CD-ROM). One such suitable storage device is a direct access storage device (DASD) 280. As shown in FIG. 2, DASD 280 may be a floppy disk drive that may read programs and data from a floppy disk 290.

[00081] It is important to note that while the present invention has been (and will continue to be) described in the context of a fully functional computer system, those skilled in the art will appreciate that the various software mechanisms of the present invention are capable of being distributed in conjunction with signal bearing media as one or more program products in a variety of forms, and that embodiments of the present invention apply equally regardless of the particular type or location of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include: recordable type media such as floppy disks (e.g., disk 290) and CD ROMS, and transmission type media such as digital and analog communication links, including wireless communication links.

[00082] Memory controller 230, through use of an auxiliary processor (not shown) separate from processor 210, is responsible for moving requested information from main memory 220 and/or through auxiliary storage interface 240 to processor 210. While for the purposes of explanation, memory controller 230 is shown as a separate entity, those skilled in the art understand that, in practice, portions of the function provided by memory controller 230 may actually reside in the circuitry associated with processor 210, main memory 220, and/or auxiliary storage interface 240.

[00083] Terminal interface 250 allows users, system administrators and computer programmers to communicate with search appliance 180, normally through separate workstations or through stand-alone computer systems such as computer systems 170 of FIG. 1. Although search appliance 180 depicted in FIG. 2 contains only a single main processor 210 and a single system bus 260, it should be understood that the present invention applies equally to computer systems having multiple processors and multiple system buses. Similarly, although the system bus 260 of one or more embodiments of the present invention is a typical hardwired, multi-drop bus, any connection means that supports bi-directional communication in a computer-related environment could be used.

[00084] Main memory 220 preferably contains an operating system 221, a user interface 222, a database management system 223, together with program code to implement an index mechanism 224, a search mechanism 225, a report mechanism 226, a scoring mechanism 227, and preferably a security mechanism 228. The term "memory" as used herein refers to any storage location in the virtual memory space of search appliance 180. It should be understood that main memory 220 may not necessarily contain all parts of all components shown. For example, portions of operating system 221 may be loaded into an instruction cache (not shown) for processor 210 to execute, while other files may well be stored on magnetic or optical disk storage devices (not shown). In addition, although user interface 222, database 223, index mechanism 224, search mechanism 225, report mechanism 226, scoring mechanism 227, and security mechanism 228 are shown to reside in the same memory location as operating system 221, it is to be understood that main memory 220 may consist of multiple disparate memory locations.

[00085] Database management system 223 is preferably a relational database management system, together with various data model, or schema, definitions, and data stored according to the data model, such as is described in more detail herein. The data stored using database management system 223 may change from query to query, depending on updates made to the stored data using database management system 223. It should also be noted that any and all of the individual components shown in main memory 220 may be combined in various forms and distributed as a stand-alone program product. Finally, it should be noted that search appliance 180 can include additional components, not shown in this FIG.

[00086] For example, while not required, embodiments of the present invention include a security mechanism 228 for verifying and validating user access to the data files located by search appliance 180. Security mechanism 228 may be incorporated into operating system 221. Once again, depending on the type and quantity of information stored in database 223, security mechanism 228 may also be configured to provide different levels of security and/or encryption for computers 150, 160, and 170 and data server 190 of FIG. 1.

[00087] Additionally, the level and type of security measures applied by security mechanism 228 may be determined by the nature of a given search request and/or response to the search request, including the identity of the requestor. In some embodiments of the present invention, security mechanism 228 may be contained in or implemented in conjunction with certain hardware components (not shown this FIG.) such as hardware-based firewalls, routers, switches, dongles, and the like.

[00088] Operating system 221 includes the software that is used to operate and control search appliance 180. In general, processor 210 typically executes operating system 221. Operating system 221 may be a single program or, alternatively, a collection of multiple programs that act in concert to perform the functions of an operating system. Any operating system known to those skilled in the art may be considered for inclusion with the various embodiments of the present invention.

[00089] Although user interface 222 may take another form, it preferably comprises web pages, which can be displayed, using a browsing software application such as those identified herein, on a monitor locally coupled to search appliance 180 and/or displayed on a monitor coupled to a computer connected to search appliance 180 via network 120, such as computer systems 150, 160 and 170. User interface 222 may be used to configure the various components shown in memory 220, including index mechanism 224, search mechanism 225, report mechanism 226, scoring mechanism 227, and security mechanism 228.

[00090] Database management system 223 is representative of any suitable database known to those skilled in the art. As discussed above, in one or more embodiments of the invention, database management system 223 is a relational database. As such, database management system 223 uses a Structured Query Language (SQL) to manipulate (e.g., create, update, query, etc.) data stored in the database. While database management system 223 is shown residing in main memory 220, it should be noted that database management system 223 may also be physically stored in a location other than main memory 220. For example, database management system 223 may be stored on external storage device 270 or DASD 280 and coupled to search appliance 180 via auxiliary storage I/F 240. In one or more embodiments of the present invention, database 223 will contain keywords for the content contained or accessible via a corporate intranet or the Internet. Database management system 223 can consist of multiple disparate databases stored on many different computers or computer systems.

[00091] Although not shown in FIG. 2, search appliance 180 includes a network interface for connecting to network 120, together with the network protocols needed to communicate via network 120. For example, in one or more embodiments of the invention, search appliance 180 includes the suite of protocols typically referred to as the Transmission Control Protocol/Internet Protocol, or TCP/IP.

[00092] Index mechanism 224 is a user configurable indexing tool for categorizing various types of information and creating an index to be used in conjunction with searching and retrieving information over network 120, such as from data server 190. Index mechanism 224 may be configured manually with various levels of user intervention or programmatically, depending on the specific type of data to be indexed. Index mechanism will perform an initial index and will be configured to re-index the data files contained in database 223 at user-specified intervals, thereby ensuring that the contents of database 223 are capable of being searched in an effective and efficient manner.
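To make the indexing and re-indexing behavior concrete, the following Python sketch stores per-document keyword counts in a relational table and replaces a document's rows whenever it is indexed again. The table name `postings`, the function names, and the simple tokenizer (lowercased alphanumeric runs, with no stemming) are assumptions for this sketch and are not taken from the specification.

```python
import re
import sqlite3
from collections import Counter


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a keyword index in a SQLite database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " keyword TEXT NOT NULL, doc TEXT NOT NULL, freq INTEGER NOT NULL,"
        " PRIMARY KEY (keyword, doc))"
    )
    return conn


def index_document(conn: sqlite3.Connection, doc: str, text: str) -> None:
    """Index or re-index one document by replacing its keyword counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    with conn:  # delete and insert commit together as one transaction
        conn.execute("DELETE FROM postings WHERE doc = ?", (doc,))
        conn.executemany(
            "INSERT INTO postings (keyword, doc, freq) VALUES (?, ?, ?)",
            [(word, doc, n) for word, n in counts.items()],
        )


def lookup(conn: sqlite3.Connection, keyword: str) -> list[tuple[str, int]]:
    """Return (document, frequency) pairs for a keyword, most frequent first."""
    rows = conn.execute(
        "SELECT doc, freq FROM postings WHERE keyword = ?"
        " ORDER BY freq DESC, doc",
        (keyword.lower(),),
    )
    return rows.fetchall()
```

A scheduler calling `index_document` at the user-specified interval would keep the stored counts in step with changed files; the frequencies returned by `lookup` are the kind of input a scoring step would consume.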

[00093] Search mechanism 225 can include a web-based software application accessible via a graphical user interface such as user interface 222 for the purpose of requesting and retrieving information from database 223. In one or more embodiments of the present invention, search mechanism 225 can include a Natural Language Processor (NLP) based search engine that, in conjunction with the other components of search appliance 180, such as indexing mechanism 224, index 229, scoring mechanism 227 and report mechanism 226, for example, provides a robust search tool for locating and retrieving desired content.

[00094] In general, a user of computers 150, 160, and 170 of FIG. 1 will access search mechanism 225 via a standard web browser such as Safari, FireFox, Netscape, Internet Explorer, etc. By using search mechanism 225, the user will be able to request information. This requested information, if available, will be provided by accessing database 223. Search mechanism 225 will serve as the interface to the information stored in database 223. It is anticipated that various reports related to the information contained in database 223 will be generated by report mechanism 226, which preferably includes a browser-based user interface for displaying search results.

[00095] Report mechanism 226 preferably provides for output, either via a hard copy or display on a monitor, a variety of reports, including reports of the results from accessing database 223 via search mechanism 225. These reports will typically include the results of the various searches performed by a computer user, such as a user of computer system 170 of FIG. 1. These various reports will be formatted and presented to the user based on the specific type of request made by the user and the type of information to be returned to the user.

[00096] Scoring mechanism 227 is provided and configured to score and rank the results obtained by search mechanism 225 in response to a user's search query. While those skilled in the art will recognize that various scoring methodologies may be employed, scoring mechanism 227 is specifically designed to provide an easily implemented yet highly effective methodology for presenting search results in a way most likely to rank the most relevant results first. In one or more embodiments of the present invention, scoring mechanism 227 is user configurable, allowing the user to determine which features and scoring factors (weighting methods) to apply when search results are returned in response to a given search query.

[00097] In one embodiment of the invention, scoring mechanism 227 comprises a ranking of documents returned from a search query in order based on the total number of occurrences of the $N$ unique stem words contained in the original search query. In the case where $M$ document results are returned, the $m$-th result is ranked according to the formula shown below:

$$S_m = \sum_{i=1}^{N} x_{i,m} \qquad (1)$$

where $x_{i,m}$ is the frequency of occurrence of the $i$-th stem word within the $m$-th document. Note, however, that this "frequency weighting" formula does not provide any special consideration for occurrences of more than one stem word in a document. Using this ranking scheme, the sum of the frequencies of all the stem words is measured.

[00098] For example, consider a search query involving two unique stem words where two results are returned with the same score, as shown in Table 1 below.

TABLE 1

[00099] As seen in Table 1, the first result contains 10 occurrences of the first stem word while the second contains 5 occurrences of each stem word. If the only measure of relevancy is the total number of "hits" for the stem words, then both documents would be scored the same and would have the same relevance in the search result. However, in this case, it would probably be appropriate to consider the second result to be more relevant than the first result since the second result contains both stem words identified in the original search query. Accordingly, it is desirable to modify the original scoring formula to account for the occurrences of both stem words occurring in the same document, thereby increasing the probability that the most relevant documents will be identified in response to a given search query.
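The tie described above can be reproduced with a short Python sketch of plain frequency weighting, in which a document's score is the sum of its stem-word counts. The function name `frequency_score` is hypothetical, and the zero count for the second stem word in the first result is inferred from the description rather than read from the table.

```python
def frequency_score(counts):
    """Frequency weighting: the sum of the stem-word frequencies in a document."""
    return sum(counts)


# Per-document counts for the two stem words, as described in the text.
results = {
    1: (10, 0),   # ten occurrences of the first stem word only (zero inferred)
    2: (5, 5),    # five occurrences of each stem word
}

scores = {m: frequency_score(c) for m, c in results.items()}
print(scores)  # both documents receive the same score of 10
```

Both results score 10, so the measure cannot distinguish the document containing both stem words from the one containing only the first.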

[000100] One way to account for multiple keywords, or stem words, in a single document result is to introduce the products of frequencies, e.g., $x_i x_j$, into the scoring formula. Essentially, this is a way of quantifying correlation. In other words, by introducing products of frequencies $x_i x_j$ it is possible to account for the correlation between multiple stem words appearing in the same document. This may be termed "enhanced frequency weighting."

[000101] To accomplish this, the original formula is expanded using combinatorial analysis, by introducing combinations of the products of frequencies, in ever higher-order products, to an order equal to the number of stem words in a given multi-keyword search query. Additionally, in order to maintain scale, each product created in this fashion is most preferably scaled to the size of the original term, and thus, to each term that precedes it in the expansion. This is accomplished by dividing each product by the appropriate multiplicative power of the original scoring formula, e.g., $\sum x$ of Eq. (1) above. The result is the original scoring formula corrected by higher-order correlations between stem words within the document. The general formula for a query involving $N$ unique stem words is then as shown below:

with the original scoring formula denoted, as before, by Eq. (1).

The modified formula may also be written in a form that is easier to apply within a computational setting, as shown below:

[000102] Given the formula represented by Eq. (2) and Eq. (3), several examples can be presented to demonstrate the effect of the relevancy scoring algorithm for different numbers of keywords in a given search query. When $N = 1$, for example, the scoring formula produces the results set forth in Table 1 since all that is being used to determine relevancy is the frequency, or number, of occurrences of the stem word in each file. In other words, the score reduces to the frequency of the single stem word.

[000103] However, when $N = 2$, corresponding to a two-keyword search, the scoring formula becomes:

[000104] In this case, the results of the search are as shown below in Table 2:

Table 2

[000105] As seen in Table 2, the document result corresponding to $m = 2$ is scored with higher relevancy because the correction term added to the original scoring algorithm accounts for the fact that this document contains both stem words in the search query.

Given this progression, it would be relevant to consider the example of N = 3 and the results depicted in Table 3 below. In the case of N = 3, the scoring formula becomes:

[000106] The second term of this formula corrects for the simultaneous occurrences of pairs of the three words within the document, while the third term corrects for the simultaneous occurrence of all three words in the document. As shown in Table 3 below, the document containing all three stem words, corresponding to m = 2, is ranked higher than others of the same overall count. Additionally, the document of m = 3 is ranked second since it contains two out of the three stem words.

[000107] Those skilled in the art will recognize that a document containing only one stem word will not always be scored or ranked lower than a document containing multiple stem words. For example, if the document corresponding to m = 4 had a third-word count of 30 instead of 15, it would be deemed more relevant than the document of m = 2. Nevertheless, when total frequency counts for keywords are comparable, the scoring formula of the present invention produces increased relevancy when multiple keywords from a search query are found in a given document.

Table 3

[000108] Using Eq. (3), if the document corresponding to m = 4 had a third-word count of 30 instead of 15, it would outrank the document of m = 2. Thus, while Eq. (3) accounts for multiple keywords appearing in the same document, under certain circumstances it may overemphasize the relevance of lesser matches that happen to have large total counts of occurrences.

[000109] To address the above, the scoring formula of Eq. (3) can be modified for those cases where N > 1. More particularly, it is possible to introduce an adjustable cutoff number λ ≤ N, where λ represents the minimum allowable number of unique stem words that can appear in a document. The score corresponding to a document is set to zero if the number of unique stem words appearing in the document is less than λ. This allows document results with correspondingly high total frequencies but little correlation between unique stem words to be eliminated from the search results. Specifically, with q_m equal to the number of distinct stem words appearing in the m-th document, the scoring formula of Eq. (3) is modified as follows:

[000110] As an example, consider again the N = 3 case of Table 3. With λ chosen such that λ = 2, the computed scores are depicted in Table 4. The formula of Eq. (6) still applies, but only for the results for which q_m = 2 and q_m = 3. All other results depicted in the table have scores of zero.
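The cutoff rule described above can be sketched as follows. This is a minimal illustration only: the scoring formula itself is not reproduced here, so a hypothetical `base_score` callable stands in for it, and the names `apply_cutoff`, `counts`, and `cutoff` are assumptions introduced for the example.

```python
from typing import Callable, Dict, Sequence


def apply_cutoff(
    counts: Sequence[int],
    cutoff: int,
    base_score: Callable[[Sequence[int]], float],
) -> float:
    """Return base_score(counts), or zero if too few distinct stem words matched.

    counts[i] is the frequency of the i-th query stem word in one document.
    cutoff is the adjustable minimum number of unique stem words (lambda).
    base_score is a stand-in for the scoring formula (an assumption here).
    """
    # Number of distinct query stem words that actually occur in the document.
    distinct_matches = sum(1 for c in counts if c > 0)
    if distinct_matches < cutoff:
        return 0.0
    return float(base_score(counts))


# Hypothetical per-document counts for a three-keyword query.
documents: Dict[str, Sequence[int]] = {
    "doc_a": [30, 0, 0],  # large total count, but only one stem word present
    "doc_b": [5, 5, 5],   # all three stem words present
    "doc_c": [8, 7, 0],   # two of three stem words present
}
# The built-in sum is used as a placeholder scorer, not as the patent's formula.
scores = {name: apply_cutoff(c, cutoff=2, base_score=sum) for name, c in documents.items()}
```

With a cutoff of 2, `doc_a` is eliminated despite its high total frequency, while the documents containing two or three distinct stem words keep their scores.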

[000111] In this fashion, it is possible to identify which files provided in response to a given search query are most relevant. While the improvement in the ranking of the results from enhanced frequency weighting is significant, additional improvements can be made.

[000112] For example, "proximity weighting" can be added to further enhance the relevancy of the search results. In proximity weighting, when using more than one keyword in a search query, additional emphasis can be given to those search results where the keywords are in close physical proximity to each other in the document. This allows the result set to consider those instances to be more relevant than those results where the keywords are not located in close proximity to each other.
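One way proximity weighting might be computed is sketched below. The text does not specify a formula, so the measure used here (the width of the smallest window of word positions that contains every keyword) and the boost `1 / (1 + span)` are assumptions chosen for illustration; `smallest_span` and `proximity_boost` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Optional


def smallest_span(positions: Dict[str, List[int]]) -> Optional[int]:
    """Width, in word positions, of the smallest window holding every keyword.

    positions maps each query keyword to its word offsets in one document.
    Returns None if any keyword is absent from the document.
    """
    if not positions or any(not plist for plist in positions.values()):
        return None
    # Flatten to (position, keyword) pairs sorted by position.
    tagged = sorted((pos, kw) for kw, plist in positions.items() for pos in plist)
    needed = len(positions)
    in_window: Counter = Counter()
    best: Optional[int] = None
    left = 0
    for right_pos, right_kw in tagged:
        in_window[right_kw] += 1
        # Shrink from the left while the window still covers all keywords.
        while len(in_window) == needed:
            left_pos, left_kw = tagged[left]
            width = right_pos - left_pos
            if best is None or width < best:
                best = width
            in_window[left_kw] -= 1
            if in_window[left_kw] == 0:
                del in_window[left_kw]
            left += 1
    return best


def proximity_boost(positions: Dict[str, List[int]]) -> float:
    """Illustrative boost in (0, 1]; closer keywords give a larger value."""
    span = smallest_span(positions)
    return 0.0 if span is None else 1.0 / (1.0 + span)
```

A document in which the keywords sit two words apart would receive a larger boost than one in which they are separated by dozens of words; how that boost is folded into the overall score is left open here.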

[000113] "Category weighting" allows a user to specify particular document types as being more relevant for a particular search request. For example, if two documents (one an email document and one a word processing document) are found to be responsive to a given search request and both documents contain the same number and frequency of keywords, category weighting may be used to break the tie. If the user has specified that the most important document category is email, then the email document will be deemed more relevant and will be displayed higher in the search result listing than the word processing document.
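The tie-breaking behavior of category weighting can be sketched as a sort with a secondary key. The `Result` type, the category labels, and the file names below are hypothetical; only the ordering rule (score first, user-preferred category second) is taken from the description above.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    path: str
    category: str
    score: float


def rank_with_categories(results: List[Result], preferred: List[str]) -> List[Result]:
    """Order by score (descending); break ties by the user's category preference."""
    priority = {cat: i for i, cat in enumerate(preferred)}
    unlisted = len(preferred)  # categories the user did not mention sort last
    return sorted(results, key=lambda r: (-r.score, priority.get(r.category, unlisted)))


hits = [
    Result("report.doc", "word_processing", 12.0),
    Result("memo.eml", "email", 12.0),
    Result("notes.txt", "text", 15.0),
]
ordered = rank_with_categories(hits, preferred=["email", "word_processing"])
```

Here the two documents with equal scores are separated by category, with the email document listed ahead of the word processing document, while the higher-scoring document still ranks first.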

[000114] Finally, "location weighting" can be used to further identify the most relevant results provided in response to a search query. In location weighting, when keywords are found in the most prominent locations of a document, that document is given a higher score or ranking in the overall search results. The most prominent location may vary for a given document or documents but examples include the title of a document for word processing documents, the subject line for an email message, etc. Those skilled in the art will recognize that other prominent locations could be identified and incorporated into various embodiments of the present invention.

[000115] In one or more embodiments of the present invention, the user will be able to select any or all of the various features of scoring mechanism 227 including standard frequency weighting, enhanced frequency weighting, proximity weighting, category weighting, and location weighting.

[000116] While not required, in one or more embodiments of the present invention, search appliance 180 of FIG. 2 will typically include a security mechanism 228. Security mechanism 228 is configured to provide a security model for providing enhanced search results, based on the identity and role of the searcher. In one or more embodiments of the present invention, security mechanism 228 employs a log-in model where each user must have a user ID and a password to authenticate their identity on the network and to access search mechanism 225. Security mechanism 228 is described in more detail below.

[000117] Index 229 represents the index that is constructed by index mechanism 224, based on the content stored in shares accessible via network 120. Index 229 is used by search mechanism 225 to locate content relevant to a given search query presented by a user of a computer, such as one of computers 150, 160, and 170. Index 229 will be periodically rebuilt at a configurable interval in order to accurately reflect any changes made to the content in shares accessible via network 120.

[000118] Although index 229 is shown separately from database management system 223, it should be appreciated that index 229 can be created and maintained using database management system 223. A discussion of one example of a data model used for indexing and searching is provided below.

[000119] Those skilled in the art will recognize that although index mechanism 224, search mechanism 225, report mechanism 226, scoring mechanism 227, and security mechanism 228 are shown as separate entities in FIG. 2, these mechanisms may be combined into a single software program, application, or program product.

[000120] Referring now to FIG. 3, a process 300 of maintaining and updating the index for the data files used in conjunction with a search appliance in accordance with one or more embodiments of the present invention is depicted. As shown in FIG. 3, the initial indexing of the data files to be searched is accomplished by first mounting all appropriate target volumes (step 310).

[000121] As part of step 310, network 120 is searched to identify sharable resources, or shares. More particularly, search appliance 180 searches the network for sharable resources, or shares (a process also referred to herein as crawling or web crawling), and maintains/updates a repository of information, using database management system 223, associated with each share to facilitate indexing and/or search. It is important to note that search appliance 180 is capable of performing network searches, including all files stored on a server or network of servers determined to be shared, not mere HTTP (index.htm) searches. For example, a sharable resource may be a hard disk drive, or other storage media, fixed or removable, or one or more folders, files, documents, pages, etc. stored thereon, with "sharable" access rights. In addition, sharable resources can include web pages typically displayed via a web browser.

[000122] Next, the initial index will be built using database management system 223 and index mechanism 224 (step 320). The original indexing may be accomplished by any means known to those skilled in the art. As part of the indexing methodology, the creation date and/or last modified date for each data file is captured and stored. In conjunction with the construction of the index, a keyword database is constructed (step 330) using the keywords or terms contained in the data files stored on data server 190. This keyword database will be later accessed by search mechanism 225 when a search query is submitted by a user. The database model used to store indexing and shared resource information is discussed in more detail below.

[000123] Once the initial index and keyword database are constructed, according to one or more embodiments of the invention, the index is re-built to identify changes in sharable resources, e.g., resources for which the sharable characteristics have changed, and/or to identify changes in content to be reflected in the index. Typically, a period of time is identified after which the index is re-built. If the time period has not elapsed (step 340 = "NO"), then the waiting period will continue (step 350). However, once the period of time has elapsed (step 340 = "YES"), the target volumes will once again be mounted (step 350) and the index will be re-built (step 360).

[000124] When re-building the index, the previously captured creation date and/or last modified date will be examined and compared with a modification date associated with each file that is to be indexed. If there has been no change in the relevant date, then the file need not be re-indexed and the keywords associated with that file need not be modified in the keyword database. However, if an existing file has been modified, as determined by comparing the previously captured date with the new file modification date, the new modification date will be captured, the document will be re-indexed, and the keywords associated with that document will be updated in the keyword database.

[000125] Additionally, if a new file has been added, e.g., to data server 190, then it will be added to the index and the appropriate keywords will be added to the keyword database. However, if a given file no longer exists, e.g., on data server 190, then all references to that file in the index and all keywords associated with that file stored in the keyword database will be removed. In this fashion, the keyword database is re-built (step 370).
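The re-build decisions described above (skip unchanged files, re-index modified files, add new files, and remove references to deleted files) can be sketched as a comparison of two mappings. The function name `plan_rebuild` and the use of plain dictionaries keyed by file path are assumptions made for this example.

```python
from typing import Dict, Set, Tuple


def plan_rebuild(
    indexed: Dict[str, float], current: Dict[str, float]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare stored modification dates with those now found on the shares.

    indexed maps file path -> modification date captured at the last indexing.
    current maps file path -> modification date observed during this re-build.
    Returns (files to add, files to re-index, files to remove).
    """
    added = set(current.keys() - indexed.keys())
    removed = set(indexed.keys() - current.keys())
    modified = {
        path
        for path in current.keys() & indexed.keys()
        if current[path] != indexed[path]
    }
    return added, modified, removed


# Hypothetical snapshot: one unchanged, one modified, one deleted, one new file.
previous = {"a.doc": 100.0, "b.doc": 100.0, "c.doc": 100.0}
observed = {"a.doc": 100.0, "b.doc": 250.0, "d.doc": 300.0}
to_add, to_reindex, to_remove = plan_rebuild(previous, observed)
```

Only the files in `to_add` and `to_reindex` need their keywords written to the keyword database, and the entries for files in `to_remove` are deleted from both the index and the keyword database.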

[000126] Referring now to FIG. 4, the use of security mechanism 228 to provide customized search results in accordance with one or more embodiments of the present invention is depicted. Security mechanism 228 is preferably configured to provide various levels of security functionality. In one or more embodiments of the present invention, both indexed content and query results are protected from unauthorized access by security mechanism 228. The approach to securing data from unauthorized access may be implemented at the enterprise level and also deployed at the desktop, as appropriate or desired. In one or more embodiments of the present invention, security mechanism 228 comprises an internal database, used by security mechanism 228 to track a variety of user and context sensitive information in order to ensure access to information only by approved system users.

[000127] After indexing is performed to identify shared resources and the keyword database is created as an index for database 440, where the fully-indexed content is stored, in accordance with embodiments of the present invention, the security of the indexed content is implemented in conjunction with the security desired for database 440. As previously explained, database 440 may comprise data from multiple disparate data stores and the security assigned to the data in database 440 may vary from dataset to dataset. In the case of FIG. 4, database 440 is comprised of three separate data stores identified as domain 1, domain 2, and domain 3. Those skilled in the art will recognize that the use of three separate data stores and domains is for illustration only and more or fewer data stores and/or domains may be used in conjunction with various embodiments of the present invention.

[000128] Security for search results returned by search mechanism 225 and reported via report mechanism 226 may be implemented via the role-based administration of web services. More particularly, a system of one or more federated servers is constructed in which a password-protected, server-shared database is used to define relational tables that store various types of administrative information and correspondences. In embodiments of the present invention, users, groups, domains, user roles, and domain groups are defined security components and used by security mechanism 228 to allow or deny access to various types of data stored in database 440 or potentially accessible via search mechanism 225, depending on the status of the various security components.

[000129] As shown in FIG. 4, the users are placed in different groups, with each group identified as having access to particular domains and/or data files. In this fashion, security mechanism 228 can be used to provide customized search results and protect sensitive data files. User 1, User 2, and User 3 are assigned to user group 410. User 3 and User 4 are assigned to user group 420. Similarly, User 4 and User 5 are assigned to user group 430.

[000130] In the example shown in FIG. 4, each of user 2, user 3, and user 4 submit the same search query to database 440. However, because these users are assigned to different user groups, the results that are provided in response to their respective queries are substantially different. In response to the search request from user 2, security mechanism 228 allows dataset 450 to be returned to user 2. In response to the same search request received from user 3, dataset 460 is returned. Finally, in response to the same search request submitted by user 5, security mechanism 228 allows dataset 470 to be returned.

[000131] Since user 3 belongs to both user group 410 and 420, a decision must be made as to which groups' access rights will be granted to user 3. This can be accomplished so as to ensure that the desired security levels are maintained. For example, in one security model, the more restrictive access rights of the two user groups will be applied. In a different security model, the more liberal access rights of the two user groups may be applied. Those skilled in the art will recognize that the assignment of users to various user groups may be accomplished in any way necessary to achieve the desired security results. Additionally, each user group may be as small as a single user.
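The two security models mentioned above can be sketched with set operations, under the assumption that each group's access rights are expressed as a set of permitted domains: the more restrictive model takes the intersection of the groups' rights and the more liberal model takes the union. The group-to-domain mapping shown is hypothetical and is not taken from FIG. 4.

```python
from typing import Dict, Iterable, Set


def effective_domains(
    user_groups: Iterable[str],
    group_domains: Dict[str, Set[str]],
    restrictive: bool,
) -> Set[str]:
    """Combine the access rights of every group a user belongs to.

    restrictive=True keeps only domains permitted by all of the user's groups;
    restrictive=False keeps any domain permitted by at least one group.
    """
    rights = [group_domains[g] for g in user_groups]
    if not rights:
        return set()
    if restrictive:
        return set.intersection(*rights)
    return set.union(*rights)


# Hypothetical rights for the two groups to which user 3 belongs.
group_domains = {
    "group_410": {"domain_1", "domain_2"},
    "group_420": {"domain_2", "domain_3"},
}
strict = effective_domains(["group_410", "group_420"], group_domains, restrictive=True)
liberal = effective_domains(["group_410", "group_420"], group_domains, restrictive=False)
```

Under these assumed rights, the restrictive model would limit user 3 to the single domain shared by both groups, whereas the liberal model would open all three domains.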

[000132] Taken together, the various system user security components define all registered users of the system and provide a framework or methodology for determining which users may access which information. The information relative to each user is stored in the database tables associated with the database for security mechanism 228. The various fields typically include at least the unique username and a password for each user of search appliance 180 of FIG. 1.

[000133] Group permissions are similarly stored in a database table which includes fields such as a name for each permission group, where a permission group is a customized text string descriptive of a role or function of the enterprise, such as "sales," "support," or "admin." A user may inherit security-related permissions and restrictions, based on the specific group permissions for the group to which the user is assigned.

[000134] Searchable domains are stored in a database table whose fields define the location, such as a website URI text string, of each domain from which content may be extracted by indexing operations conducted by index mechanism 224 at the request of a user. In general, a user may be restricted to searching only those domains that are identified in the searchable domains tables for that user and/or for the specific group to which that user belongs.

[000135] User roles are stored in a database table whose fields serve to relate system users to group permissions, thus defining one or more roles a user plays within the enterprise. Specifically, a field exists in which a primary key of the system users table may appear in multiple records, each time uniquely corresponding to a second field containing a primary key of an entry of the group permissions table.

[000136] Domain groups are similarly stored in a database table whose fields serve to relate searchable domains to group permissions, thus associating a domain with one or more group permissions of the enterprise. Specifically, a field exists in which a primary key of the searchable domains table may appear in multiple records, each time uniquely corresponding to a second field containing a primary key of an entry of the group permissions table.

[000137] The above database tables and their relationships are sufficient to provide a role-based security protocol for protecting the results returned from a given user search request. More particularly, using the same security components and sequence/numbering scheme identified above, a specific security protocol can be implemented. User authentication is provided via a match of input username and password to those stored in the system users table, identifying the user as the individual claimed. The text string names of groups of the enterprise are obtained from the group permissions table. Domains of content within or without the enterprise are obtained from the searchable domains table. The user roles table indicates the groups to which the authenticated user belongs. The domain groups table indicates, for a given searchable domain, what groups of users may access that domain's content, and thus, via the user roles table and the matching of group permissions primary keys, what searchable domains the authenticated user has privilege to see.
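The lookup of permitted domains through the user roles and domain groups tables can be sketched with an in-memory SQLite database. All table names, column names, user names, and locations below are hypothetical stand-ins for the five tables described above; password handling is omitted, and the query assumes the user has already been authenticated.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE system_users (user_id INTEGER PRIMARY KEY, username TEXT UNIQUE);
    CREATE TABLE group_permissions (group_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE searchable_domains (domain_id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE user_roles (user_id INTEGER, group_id INTEGER);
    CREATE TABLE domain_groups (domain_id INTEGER, group_id INTEGER);

    INSERT INTO system_users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO group_permissions VALUES (1, 'sales'), (2, 'support');
    INSERT INTO searchable_domains VALUES
        (1, 'http://intranet.example/sales'),
        (2, 'http://intranet.example/support');
    INSERT INTO user_roles VALUES (1, 1), (2, 1), (2, 2);
    INSERT INTO domain_groups VALUES (1, 1), (2, 2);
    """
)


def permitted_domains(db: sqlite3.Connection, username: str) -> List[str]:
    """Domains visible to an authenticated user via matching group keys."""
    rows = db.execute(
        """
        SELECT DISTINCT d.location
        FROM system_users AS u
        JOIN user_roles AS ur ON ur.user_id = u.user_id
        JOIN domain_groups AS dg ON dg.group_id = ur.group_id
        JOIN searchable_domains AS d ON d.domain_id = dg.domain_id
        WHERE u.username = ?
        ORDER BY d.location
        """,
        (username,),
    ).fetchall()
    return [row[0] for row in rows]
```

The resulting list could then be used to filter a search query so that only content from those domains is returned to the user.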

[000138] Thus, the above administrative information can be applied to filter the query of a search request, so as to return only information from those domains the authenticated user is permitted to see, based on that individual's role within the enterprise. It should be noted that the level of granularity of search restriction is generally that of a searchable domain since group permissions are assigned to searchable domains. In other words, the access granted users is not usually granted at the level of individual documents, as in a typical file system. However, in one or more embodiments of the present invention, an administrator may define searchable domains with a granularity that can vary from finely grained (as a single file), to medium grained (as a set of sub directories), or coarsely grained (as an entire website). Thus, the granularity of group permissions is variable, depending on how the searchable domains are defined. Since documents of a common level of sensitivity are typically grouped together, domains are generally defined correspondingly.

[000139] Referring now to FIG. 2 and FIG. 5, a method 500 for scoring and ranking search results in accordance with one or more embodiments of the present invention is shown. When a search request is received from a user (step 510), search mechanism 225, in conjunction with database 223 and index mechanism 224, can be deployed to perform the requested search and retrieve the results (step 520).

[000140] Once the results have been obtained, scoring mechanism 227 may be deployed to further enhance the search results. As shown in FIG. 5, any or all of the various weighting mechanisms previously described may be used to enhance the search results. For example, a user may determine that the desired search results can be enhanced by applying frequency weighting (step 530), proximity weighting (step 540), category weighting (step 550), and/or location weighting (step 560). Since the application of these various weighting factors is user configurable, it is possible for each user to configure scoring mechanism 227 for maximum benefit.

[000141] Once the desired user-selected weighting factors have been applied to the search results, the search results can be ordered (step 570) and presented to the user (step 580). In this fashion, the search results can be enhanced and customized for each individual user of search appliance 180.
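The flow of method 500 (apply the user-selected weighting factors, then order the results) can be sketched as below. How the individual factors are combined is not stated in the text; simple addition is assumed here purely for illustration, and the per-factor values are taken as precomputed numbers. The names `WEIGHTINGS` and `rank_hits` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical representation of one hit: precomputed value per weighting factor.
Hit = Dict[str, float]

# One entry per weighting step of FIG. 5; each returns that factor's contribution.
WEIGHTINGS: Dict[str, Callable[[Hit], float]] = {
    "frequency": lambda hit: hit.get("frequency", 0.0),
    "proximity": lambda hit: hit.get("proximity", 0.0),
    "category": lambda hit: hit.get("category", 0.0),
    "location": lambda hit: hit.get("location", 0.0),
}


def rank_hits(hits: Dict[str, Hit], selected: Iterable[str]) -> List[Tuple[str, float]]:
    """Apply only the weightings the user enabled, then order by total score."""
    chosen = [WEIGHTINGS[name] for name in selected]
    scored = [(doc_id, sum(weight(hit) for weight in chosen)) for doc_id, hit in hits.items()]
    # Additive combination is an assumption of this sketch.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


hits = {
    "doc_1": {"frequency": 10.0, "proximity": 0.5, "location": 0.0},
    "doc_2": {"frequency": 9.0, "proximity": 0.2, "location": 3.0},
}
by_frequency = rank_hits(hits, selected=["frequency"])
with_location = rank_hits(hits, selected=["frequency", "location"])
```

With these assumed values, enabling location weighting moves `doc_2` ahead of `doc_1`, which illustrates how different user configurations can yield different orderings of the same result set.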

[000142] In one or more embodiments of the present invention, a search model is used to facilitate searching performed in response to a query consisting of one or more keywords, for example. The search model includes a data model used for searching, indexing and ranking operations, techniques such as word stemming and parts-of-speech tagging, and a lexicon that can learn new words encountered while performing initial and incremental indexing. In addition, the search model can use a pipeline architecture, as is described in more detail below. The search model can also include scoring, or ranking, of search result items, e.g., documents, such as that performed using scoring mechanism 227 to rank the results of a query used with one or more embodiments of the present invention.

[000143] Word stemming can be used to remove common morphological and inflectional endings from words, so as to normalize terms. One example of such a word stemming mechanism is the Martin Porter Stemming Algorithm, a fuller discussion of which is found at http://www.tartarus.org/~martin/PorterStemmer/, which discussion is incorporated herein by reference. One example of parts-of-speech tagging is the University of Pennsylvania (Penn) Treebank Tagset. For example, see the discussion found at http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html, which discussion is incorporated herein by reference.

[000144] With regard to the search model, an illustrative description of a design of data structures used, the layout of the supporting database, and incremental indexing is provided. More particularly, the layout of the database and how it is used to maintain long-term storage of the index constructed from document content is discussed. In addition, a design of data structures that exist in memory to provide a short-term working store for the indexing procedure is discussed. A discussion of an indexing procedure is provided, and a principal use case of the search query, showing how a keyword search model is applied to return results to the end user, based on prior indexing, is provided.

[000145] An illustrative example of a database schema used in one or more embodiments of the invention is shown below:

Database Schema Example

[000146] In one or more embodiments of the invention, relationships between database tables, which are used to form the inner joins of search queries, are not explicitly stated. Use of primary or foreign keys is limited in order to allow the insert of new records via a file import mechanism rather than through the use of the SQL INSERT statement. It is worth noting that most, if not all, database vendors do not permit a file import if the table to which data is being imported defines an auto incrementing field and/or explicit foreign key relationships.

[000147] A file import mechanism is used in embodiments of the present invention to achieve efficiencies. More particularly, in view of the numbers of records to be created in generating a search model index, use of an SQL INSERT to insert records in database tables in a relational database is particularly time consuming and impractical. Accordingly, in embodiments of the invention, data that is to be inserted into the database is first written to temporary files, or buffers, and then imported into the database. One example of an exception to this approach involves the domain table, which defines an auto incremented index field, and the key table, which maintains counts of indices. Since relatively few records are involved, the file import mechanism need not be used in creating records in the domain and key tables.
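The buffer-then-import approach can be sketched as two stages. Python's `sqlite3` module offers no file-import call of the kind provided by many database vendors, so in this sketch a single `executemany` inside one transaction stands in for the import step. The table name `stem_rank` and its columns are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile
from typing import Iterable, Tuple


def buffered_import(db: sqlite3.Connection, rows: Iterable[Tuple[int, int, int]]) -> int:
    """Spill records to a temporary file, then load the file in one batch."""
    # Stage 1: write the buffered records to a temporary CSV file.
    with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as tmp:
        csv.writer(tmp).writerows(rows)
        path = tmp.name
    try:
        # Stage 2: load the whole file in a single transaction
        # (an assumed stand-in for a vendor bulk-import facility).
        with open(path, newline="") as handle, db:
            db.executemany(
                "INSERT INTO stem_rank (page_id, skey, frequency) VALUES (?, ?, ?)",
                csv.reader(handle),
            )
    finally:
        os.remove(path)
    return db.execute("SELECT COUNT(*) FROM stem_rank").fetchone()[0]


conn = sqlite3.connect(":memory:")
# No auto-increment field and no foreign keys, in keeping with the text above.
conn.execute("CREATE TABLE stem_rank (page_id INTEGER, skey INTEGER, frequency INTEGER)")
loaded = buffered_import(conn, [(1, 10, 3), (1, 11, 1), (2, 10, 7)])
```

Batching the records in this way avoids issuing one statement and one commit per record, which is the inefficiency the file import mechanism is intended to address.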

[000148] In one or more embodiments of the invention, the domain, uri, and page tables are used to store information about the document pages that are visited during indexing. A domain refers to a location where documents are stored, such as a website or file directory. According to this model, every domain that is indexed is recorded as an entry in the domain table. A document is referred to by its Uniform Resource Identifier, or URI, which is associated with a specific domain. Every document that is indexed is recorded as an entry in the uri table. For every page visited there is a record entered into the page table that corresponds to a specific document and domain. For example, when an e-mail archive that resides in a file system directory is indexed, each e-mail of the archive is recorded as an entry of the page table.

[000149] To further illustrate with reference to the data model set forth above, the lexicon and rank tables are used in indexing the information accessible via network 120. More particularly, the lexicon table, which contains the learning dictionary of the keyword search model, contains an entry for every original, case-insensitive word known to the indexing algorithm, including the parts of speech of each word. The pos field is a comma-delimited list of tags constructed, for example, from the Penn Treebank tag set. In addition, the lexicon table contains an entry for every stem word that can be constructed from the set of known original words. Every entry in the lexicon table is associated with a unique index, denoted by the lkey field. In addition, the skey field is a specific lkey index corresponding to a stem word. The skey field is used to establish a relationship between every original word and its corresponding stem word, within the same table. That is, for example, every stem word entry in lexicon is self-referential, such that the values of lkey and skey of a stem word entry are identical. An entry in the rank table records the frequency of occurrence of a stem word within a document page, as it is known within the lexicon table.
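The self-referential relationship between lkey and skey can be sketched with a small in-memory lexicon. The `LexiconEntry` type, the sample words, and their tags are hypothetical; only the rule that a stem word entry has identical lkey and skey values is taken from the description above.

```python
from typing import Dict, NamedTuple


class LexiconEntry(NamedTuple):
    lkey: int  # unique index of this entry
    word: str  # original, case-insensitive word
    pos: str   # comma-delimited part-of-speech tags
    skey: int  # lkey of the entry holding this word's stem


# Hypothetical entries: "index" is a stem word, so its lkey equals its skey.
lexicon: Dict[int, LexiconEntry] = {
    1: LexiconEntry(1, "index", "NN,VB", 1),
    2: LexiconEntry(2, "indexing", "VBG", 1),
    3: LexiconEntry(3, "indexed", "VBD", 1),
}


def is_stem(entry: LexiconEntry) -> bool:
    """A stem word entry refers to itself."""
    return entry.lkey == entry.skey


def stem_of(table: Dict[int, LexiconEntry], lkey: int) -> str:
    """Follow skey from an original word to its stem word in the same table."""
    return table[table[lkey].skey].word
```

Because the relationship lives within a single table, resolving any original word to its stem requires only one additional lookup by key.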

[000150] The word table records the positions of original words encountered during indexing, so that they may be highlighted in subsequent search result presentations. The original words need only be referred to by their corresponding stem words, hence the appearance of the field skey within the definition of the word table.

[000151] As discussed above, buffering and a file import mechanism can be used in one or more embodiments of the present invention. A data structure is used to provide a buffer for data before it is written to the database. The data that is buffered corresponds to the fields in the uri, page, rank, and word tables. The data structure and the buffering process are discussed in more detail below; however, buffered data is preferably written at the end of indexing, or when memory availability reaches a predefined threshold, requiring a flush of data to free the memory. New records are written to the tables from the buffered data via a file import mechanism, and existing records can be updated via an SQL UPDATE command.

[000152] Another type of data structure used in indexing is an N-ary Trie tree, where N is the number of (upper case) characters in the alphabet, plus digits and punctuation marks. This tree structure can be used to hold in memory the contents of the entire lexicon and to provide fast lookups (e.g., a word lookup). Initially and prior to commencing indexing, the tree structure is populated using the contents of the lexicon table. If new words are encountered during indexing, they are added to the tree. At the end of indexing, the contents of the tree are written back to the lexicon table. Preferably, the tree's contents are written back to the lexicon table using a file import mechanism, as discussed. For example, entries in the tree which represent new words found during indexing are imported to the lexicon table via a temporary buffer, or file, using a file import mechanism.

[000153] The N-ary Trie tree structure is ideally suited for use with large dictionaries of words because text-string lookup within the Trie structure is quite fast. Each node of the tree contains an array of size N, where each element of the array is potentially a child node. Memory considerations are relatively minimal for implementations within pointer-accessible programming environments that implement the N-size array as a pointer array. In one illustrative implementation, N=69, which is sufficiently small to limit excessive memory allocation.

[000154] To illustrate, an example of a 3-ary Trie tree is provided below, which is constructed from an alphabet consisting of the upper case letters A, B, and C. The elements (circles) of the 3-size arrays (rectangles) depicted below follow this same sequence. That is, the first element corresponds to A, the second to B, and the third to C. Shaded circles represent allocated nodes. The squares represent the allocation of data at a node, such as the parts of speech of a word. The example of the 3-ary Trie tree depicts the storage of data for the words AB, ABC, C, and CC.

An Example of a 3-ary Trie
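The 3-ary Trie described above can be sketched as follows. Each node holds a fixed-size array of child references, one per letter of the alphabet, and an optional data slot. The class names and the part-of-speech strings stored at each word are hypothetical; the alphabet and the four stored words follow the example in the text.

```python
from typing import List, Optional

ALPHABET = "ABC"  # N = 3 for this example
SLOT = {ch: i for i, ch in enumerate(ALPHABET)}


class TrieNode:
    """One node: an N-size array of possible children plus optional data."""

    def __init__(self) -> None:
        self.children: List[Optional["TrieNode"]] = [None] * len(ALPHABET)
        self.data: Optional[str] = None  # e.g. parts of speech of a word


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, data: str) -> None:
        node = self.root
        for ch in word.upper():
            i = SLOT[ch]
            if node.children[i] is None:
                node.children[i] = TrieNode()  # allocate a node on demand
            node = node.children[i]
        node.data = data

    def lookup(self, word: str) -> Optional[str]:
        node = self.root
        for ch in word.upper():
            i = SLOT.get(ch)
            if i is None:
                return None  # character outside the alphabet
            node = node.children[i]
            if node is None:
                return None  # path not allocated
        return node.data  # None if the path exists but no word ends here


trie = Trie()
# Hypothetical part-of-speech data for the four words of the example.
for word, tags in [("AB", "NN"), ("ABC", "NN,VB"), ("C", "NN"), ("CC", "JJ")]:
    trie.insert(word, tags)
```

A lookup visits one node per character of the word, so its cost depends on the length of the word rather than on the number of words stored. Note that the node for "A" is allocated but holds no data, since "A" itself is not a stored word.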

[000155] In one or more embodiments of the present invention, indexing can be performed using a pipeline thread architecture. More particularly, the sequential nature of indexing can be broken up into segments and assigned to the multiplexing stages of the pipeline, so as to enhance throughput. For example, web crawling can be assigned to the first stage of the pipeline, and the second stage can be used to perform initial format parsing of documents. Additional stages may be needed for further passes through documents (such as to apply sophisticated image recognition algorithms). In one of the final stages of the pipeline, indexed content can be written to the working store.
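A pipeline thread architecture of this kind can be sketched with one thread per stage and a queue between adjacent stages. The stage functions `crawl` and `parse` are placeholders that do no real fetching or format parsing, and all names here are assumptions made for the example.

```python
import queue
import threading
from typing import Callable, Dict, Iterable, List, Tuple

_STOP = object()  # sentinel passed down the pipeline to shut each stage down


def start_stage(work: Callable, inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
    """Run one pipeline stage in its own thread."""

    def run() -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                outbox.put(_STOP)  # propagate shutdown to the next stage
                return
            outbox.put(work(item))

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def crawl(uri: str) -> Tuple[str, str]:
    """First stage placeholder: pretend to fetch the document at uri."""
    return uri, f"text of {uri}"


def parse(doc: Tuple[str, str]) -> Tuple[str, List[str]]:
    """Second stage placeholder: split fetched text into lower-case words."""
    uri, text = doc
    return uri, text.lower().split()


def run_pipeline(uris: Iterable[str]) -> Dict[str, List[str]]:
    to_crawl, to_parse, to_store = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [start_stage(crawl, to_crawl, to_parse), start_stage(parse, to_parse, to_store)]
    for uri in uris:
        to_crawl.put(uri)
    to_crawl.put(_STOP)
    working_store: Dict[str, List[str]] = {}
    while True:  # final stage: write indexed content to the working store
        item = to_store.get()
        if item is _STOP:
            break
        uri, words = item
        working_store[uri] = words
    for thread in threads:
        thread.join()
    return working_store
```

While one document is being parsed, the crawling stage can already be fetching the next, which is the throughput benefit the pipeline is meant to provide. Further stages could be added by chaining additional queues.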

[000156] In one example of an application of the pipeline used in embodiments of the present invention, a single multiplexing stage can be assigned to perform all of the tasks of indexing, from web crawling, to format parsing, to indexing of words. In the present paper, we refer to the concatenation of all of these sequential tasks as the indexing procedure. Moreover, we discuss only the salient features of it, those aspects that might be construed as unique or noteworthy.

[000157] In one or more embodiments of the invention, as discussed herein, indexing includes a parsing of documents, or other items found on network 120, to identify new words to be added to the lexicon. In addition, with respect to each document, indexing identifies the words contained within the document, the locations of each of these words, and a frequency of occurrence of the words found in the document.

[000158] Thus, in the course of the indexing, embodiments of the present invention contemplate the ability of the lexicon to learn new words. When indexing begins, the current content of the lexicon is loaded into memory, as discussed herein. This includes any predefined entries whose parts of speech and corresponding stem words have been carefully reviewed, such as by visual inspection. When new words are entered as part of the learning process, their stem words are estimated using the Porter stemming algorithm, for example. Also, each new word is assigned a default part of speech, such as by using the NN tag of the Penn Treebank tag set, for example. It is further contemplated that, in connection with one or more embodiments of the invention, the lexicon of the keyword search model can be initialized, e.g., in a version shipped to the end customer, with predefined entries or no entries at all.

[000159] The following provides a discussion of incremental indexing, which can be used with a keyword search model used in one or more embodiments of the present invention. According to one approach in implementing incremental indexing, two distinct time values, (i) the start time, index_time, of the indexing procedure and (ii) the last modification time, last_mod_time, are maintained for each document visited. These values are stored, respectively, in the index_time and last_mod_time fields of each record of the uri table of the database schema set forth above.

[000160] When indexing commences, document information stored in the uri table is preferably loaded into a data structure in memory to facilitate comparison of last modification times. If the document cannot be found in the data structure, it is added to the data structure, together with its last modification time and the start time of the present indexing. If the document is found in the data structure, then its modification time is compared to the modification time stored in the data structure corresponding to the document. If the two times are equal, then the document is not indexed again. Otherwise, the document is again fully indexed, i.e., every page, and the information pertaining to the document, including its last_mod_time and index_time, is updated in the data structure.
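The comparison described above can be sketched as follows, in Python. The function name incremental_pass, the reindex callback and the dictionary layout are hypothetical. The sketch also assumes that an unchanged document has its index_time restamped with the current start time even though it is not indexed again; the text does not state this explicitly, but it is consistent with identifying obsolete records by an index_time that differs from the present start time.

```python
def incremental_pass(known, visited, index_time, reindex):
    """Update `known` in place for one indexing pass.

    known:   dict uri -> {"last_mod_time": ..., "index_time": ...}
    visited: iterable of (uri, last_mod_time) pairs seen by the crawler
    reindex: callback invoked for each document that must be fully indexed
    """
    for uri, last_mod_time in visited:
        record = known.get(uri)
        if record is not None and record["last_mod_time"] == last_mod_time:
            # Unchanged: skip the indexing work, but stamp the record as seen
            # (an assumption of this sketch, see the note above).
            record["index_time"] = index_time
            continue
        reindex(uri)  # new or modified document: index every page again
        known[uri] = {"last_mod_time": last_mod_time, "index_time": index_time}
    return known
```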

[000161] When an indexing operation is about to complete, once indexed content has been either inserted or updated, a "final scrub" of the database can be performed. This final scrub can remove obsolete records from the database, for example, those entries that correspond to documents identified during the indexing operation as no longer existing (e.g., a document no longer resides within the domains indexed by the current indexing operation) or, for whatever reason, no longer able to be indexed. Documents so identified during an indexing operation can be removed by deleting their corresponding entries from the uri table, along with any explicit or implicit relationships to other tables in the database. Thus, for example, all pages of such documents also will be deleted from the page table. Obsolete records of the uri table are those whose values within the index_time field do not equal the present start time of indexing.
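One possible form of the final scrub is sketched below in Python using the standard sqlite3 module. The function name final_scrub is hypothetical, and the sketch assumes only that the uri table has ukey and index_time columns and that the page table refers to its document through a ukey column; the full schema is not reproduced here.

```python
import sqlite3


def final_scrub(conn, index_time):
    """Delete uri rows not stamped with this pass's start time, and their pages."""
    with conn:  # commit both deletes together, or roll both back on error
        conn.execute(
            "DELETE FROM page WHERE ukey IN "
            "(SELECT ukey FROM uri WHERE index_time <> ?)",
            (index_time,),
        )
        cur = conn.execute("DELETE FROM uri WHERE index_time <> ?", (index_time,))
    return cur.rowcount  # number of obsolete documents removed
```

Pages are deleted before their parent uri rows so that the subquery can still find the obsolete documents.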

[000162] An example of application of a query generated based on a request from an end user is shown below.

SELECT skey, pos FROM lexicon WHERE word='FOO';

[000163] In the example, the query is processed against the search model described above. The example query includes a keyword, "FOO", which is taken from the user request (e.g., the user request might involve a request for documents containing the word "FOO"). The query shown above is an SQL query involving the lexicon table of the keyword search model, which is used to look up each unique keyword in the lexicon table of the model database. The lexicon table of the database contains entries for words and their stems and maintains a relationship between each word and its stem. When a keyword of the query is found in the database using the sample SQL query, the parts of speech, pos, of the word, and a reference to its stem word, skey, are obtained. If the word is principally a noun, i.e., in the Penn Treebank notation, an NN or NNP part of speech, a further SQL query of the database can be performed to obtain the frequencies of occurrence of the stem word within the pages of indexed documents. An example of this latter SQL query follows:

SELECT domain_name, uri, page_num, page_title, page_freq FROM rank, page, uri, domain

WHERE (rank.skey=12345 AND rank.pkey=page.pkey AND page.ukey=uri.ukey AND uri.dkey=domain.dkey);

[000164] The above SQL query is an example of an inner join that exploits the relationships between the document, page, and rank tables, which were introduced earlier. In this way, the relevant pages of documents can be returned to the end user after the scoring operation, such as that performed by scoring mechanism 227 described herein, is applied to sort the results. In accordance with one or more embodiments of the invention, results with a score of zero can be pruned from the list before return to the end user.
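The two-step lookup can be sketched in Python with the standard sqlite3 module, passing the keyword and the stem key as bound parameters rather than splicing them into the SQL text. The function name search_keyword is hypothetical. The sketch assumes that pos holds a single Penn Treebank tag, that page_freq is a column of the rank table, and that a keyword which is missing or is not a noun yields no results; none of these points is fixed by the description, and scoring is omitted.

```python
import sqlite3

NOUN_TAGS = {"NN", "NNP"}


def search_keyword(conn, keyword):
    """Return (domain_name, uri, page_num, page_title, page_freq) rows."""
    row = conn.execute(
        "SELECT skey, pos FROM lexicon WHERE word = ?", (keyword,)
    ).fetchone()
    if row is None or row[1] not in NOUN_TAGS:
        return []  # assumption: unknown or non-noun keywords return nothing
    return conn.execute(
        "SELECT domain_name, uri, page_num, page_title, page_freq "
        "FROM rank "
        "JOIN page ON rank.pkey = page.pkey "
        "JOIN uri ON page.ukey = uri.ukey "
        "JOIN domain ON uri.dkey = domain.dkey "
        "WHERE rank.skey = ?",
        (row[0],),
    ).fetchall()
```

The explicit JOIN clauses express the same inner join as the comma-separated table list with the conditions in the WHERE clause.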

[000165] In one or more embodiments of the present invention, search appliance 180 identifies servers which provide shared resources, or shares. Servers are identified using several methods depending on the characteristics of the target network. In accordance with one or more embodiments, search appliance 180 can browse the network address space (e.g., the network address space of search appliance 180) using network browsing tools and/or use directory services to find shared resources.

[000166] More particularly, the search appliance 180 can locate resources by browsing the network using a browser service. A browser service, or server, provides a list of available resources on a network domain. A master browser maintains the main or master list of computers and shared resources. For example, all workgroups or domains can have one master browser. Thus, a master browser maintains a master list of shared resources, and browser servers maintain a subset of the master list of shared resources. These lists are updated periodically to reflect shared resources added or removed.

[000167] According to at least one embodiment of the present invention, search appliance 180 searches network 120 to identify sharable resources using SAMBA, an open source utility suite which provides information about shared resources. Documentation for the SAMBA utility suite can be found at www.samba.org.

[000168] One such utility provided with the SAMBA utility suite is SMBtree, which can be used to browse the network to identify a list, e.g., in the form of a tree, showing known domains, the servers in those domains, and the shares on the servers. It has been determined by the inventor of the present invention that this utility does not necessarily provide an accurate and complete listing of the domains, servers and/or shares. Accordingly, in accordance with embodiments of the present invention, other SAMBA utilities are used to supplement the SMBtree utility, in order to obtain a more complete identification of shares accessible via the network.

[000169] Another SAMBA utility, a master and browser lookup utility, used to supplement, or in place of, the SMBtree utility, locates all of the browsers, i.e., the master browser and browser servers, on the network, together with their NetBIOS names. Another utility, the SMBclient utility, is then used in embodiments of the present invention to obtain directory information from the servers identified by the former utility. In addition, the SMBtree utility can be used to provide a list of the servers and shares on the servers.

[000170] The search appliance 180 can be configured to find shared resources by consulting a directory service. In one or more embodiments, search appliance 180 uses a directory access protocol (e.g., Lightweight Directory Access Protocol, or "LDAP") to consult directories, such as those directories maintained by Windows Domain Controllers, and Windows Catalog Servers, for example.

[000171] The process can be iteratively performed until no new servers are returned. In one or more embodiments of the invention, the iterative process is implemented as a PERL script.

[000172] FIG. 8, which comprises FIGs. 8A and 8B, provides an example of pseudo code of a script for use in discovering shared resources in accordance with one or more embodiments.

[000173] During an initialization phase, search appliance 180 can examine network configuration information to determine the type of network services that are being used on the network. The network configuration information can be obtained from information entered via a graphical user interface, for example. In addition, search appliance 180 can be configured as a DHCP client, which communicates with a DHCP server to request network configuration information (e.g., IP address information, information regarding available domain name servers, NetBIOS servers and/or Windows™ Name Service-enabled servers, etc.). This additional configuration option, using manual configuration and/or DHCP configuration information retrieval, provides support for NetBIOS networks that span network segments.

[000174] In addition, search appliance 180 can retrieve shared resource information identified in a previous network search, as well as previously-supplied authentication information. In some cases, if not most, authentication information (e.g., username and password) must be supplied to a server to obtain information regarding the server's shared resources, or other information regarding the network.

[000175] In accordance with at least one embodiment and as shown in the example pseudo code, search appliance 180 can use its IP address to identify an address space, e.g., a network block extent, and the IP addresses in the address space. Search appliance 180 can search for devices that accept TCP connections on ports known to correspond to specific file sharing services. For example, a NetBIOS-over-TCP protocol set can be used to attempt to open a connection to a port (e.g., SMB ports 139 and/or 445). An Active Directory Service (ADS) LDAP server can be identified by accessing port 389. Each device that accepts such a connection is identified as an accessible server, and each server identified can be queried directly to identify shared resources (e.g., by obtaining a "share list" from an identified server).
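A minimal Python sketch of the address-space search follows, using only the standard ipaddress and socket modules. The function names hosts_in_block and open_services are hypothetical, and the prefix length is assumed to be known (for example, from the subnet mask in the appliance's network configuration). This is not the pseudo code of FIG. 8.

```python
import ipaddress
import socket

# Ports named in the description: SMB on 139 and 445, LDAP on 389.
SERVICE_PORTS = {139: "smb", 445: "smb", 389: "ldap"}


def hosts_in_block(address, prefix_len):
    """List the usable host addresses in the block containing `address`."""
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return [str(host) for host in network.hosts()]


def open_services(host, ports=SERVICE_PORTS, timeout=0.5):
    """Return the subset of `ports` on which `host` accepts a TCP connection."""
    found = {}
    for port, service in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found[port] = service
        except OSError:
            pass  # closed, filtered, or unreachable
    return found
```

For example, hosts_in_block("192.168.1.77", 30) yields the two usable addresses of the enclosing /30 block, and open_services can then be called on each address in turn.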

[000176] A server name list is generated using the servers identified by a search of the address space. Each LDAP server found (e.g., by attempting to open a connection to port 389) is queried to identify the names of "Domain Member" servers/computers. Each IP address found (e.g., by attempting to open a connection to ports 139 and 445) is used to identify a corresponding server name. For example, the NetBIOS or WINS protocols can be used to retrieve a server name corresponding to an IP address. If a server name corresponding to an IP address cannot be determined, the IP address is used as the server name. An IP address can be resolved, and a corresponding server name identified, using a reverse lookup operation. For example, a Domain Name Service (or DNS), which can typically be used to supply an IP address for a given server/domain name, can be used to identify a server name corresponding to an IP address.
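The fallback from a resolved name to the bare IP address might be sketched as follows in Python. The function names dns_reverse_lookup and server_name_for are hypothetical. Only a DNS reverse lookup through the system resolver is shown; NetBIOS or WINS lookups would be supplied as additional resolver functions and are not implemented here.

```python
import socket


def dns_reverse_lookup(ip):
    """Reverse-resolve `ip` via the system resolver; None if it has no name."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def server_name_for(ip, resolvers=(dns_reverse_lookup,)):
    """Try each resolver in turn; fall back to the IP address itself."""
    for resolve in resolvers:
        name = resolve(ip)
        if name:
            return name
    return ip  # unresolvable: the IP address is used as the server name
```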

[000177] Each named server, or unresolvable IP address, identified, can then be queried to obtain a share list. In some cases, domain or server-level authentication credentials (e.g., login name and password) are needed to obtain a server's list of shared resources, or "share list". Accordingly, available authentication credentials (e.g., from configuration/initialization information) are retrieved for the named servers and unresolvable IP addresses.

[000178] A utility, such as SAMBA's SMBclient, can be used to request a "share list" from a named server, or IP address. For those servers/IP addresses lacking authentication credentials, or in a case that a server/IP address does not require authentication, the SMBclient can be used without authentication credentials. If authentication credentials are needed to retrieve the "share list", the SMBclient can be used with authentication credentials.
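As a sketch only, the two ways of invoking the utility could be wrapped in Python as shown below. The helper names are hypothetical. The smbclient options used (-L to list shares, -N to suppress the password prompt, -U with a user%password value, and -g for pipe-delimited output) are believed to exist in current Samba releases, but the exact output layout assumed by parse_share_list (Type|Name|Comment) should be checked against the installed version.

```python
import subprocess


def share_list_command(server, username=None, password=None):
    """Build an smbclient argv that lists shares, with or without credentials."""
    argv = ["smbclient", "-g", "-L", server]
    if username is None:
        argv.append("-N")  # suppress the password prompt
    else:
        argv += ["-U", f"{username}%{password or ''}"]
    return argv


def parse_share_list(output):
    """Extract disk share names from pipe-delimited `Type|Name|Comment` lines."""
    shares = []
    for line in output.splitlines():
        fields = line.split("|")
        if len(fields) >= 2 and fields[0] == "Disk":
            shares.append(fields[1])
    return shares


def fetch_share_list(server, username=None, password=None):
    """Run smbclient and return the disk shares it reports (empty on failure)."""
    argv = share_list_command(server, username, password)
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return parse_share_list(result.stdout) if result.returncode == 0 else []
```

A password passed on the command line may be visible to other local users, so a deployment would more likely supply credentials through a file.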

[000179] If a "share list" is obtained, it can be examined, and server name information contained in the "share list" can be used to resolve a server name. In addition, in a case that a new server name is identified from the "share list" (e.g., a new server name is listed in the "share list" and/or information contained in the "share list" is used to resolve a previously-unresolvable IP address), authentication credentials are identified (if available), and the server can be queried to retrieve its "share list", as previously discussed. An obtained "share list" can be examined to identify shared resources, or shares, which can be accessed for shared files.

[000180] In addition, the "share list" can be examined to determine whether a previously-undiscovered domain and/or workgroup is identified, which can be added to a domain/workgroup list. Domain-level authentication credentials might be available for a newly-discovered domain, in which case those credentials can be used to obtain a "share list". Furthermore, previously-undiscovered peer servers can be identified and added to the list of servers to be queried for a listing of shared resources.

[000181] An iterative discovery process is used to discover named servers and IP addresses. In accordance with at least one embodiment, the iterative process continues until no new servers can be identified.
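The iteration can be expressed as a worklist that is drained until it is empty, as in the following Python sketch. The function name discover and the shape of the query_share_list callback are hypothetical; the callback stands in for whatever mechanism actually queries a server, and credential handling is omitted.

```python
from collections import deque


def discover(seed_servers, query_share_list):
    """Query servers until no new ones turn up.

    query_share_list(server) returns (shares, referenced_servers), or None
    when the server cannot be queried.
    """
    pending = deque(dict.fromkeys(seed_servers))  # de-duplicated, in order
    seen = set(pending)
    shares_by_server = {}
    while pending:
        server = pending.popleft()
        result = query_share_list(server)
        if result is None:
            continue
        shares, referenced = result
        shares_by_server[server] = list(shares)
        for name in referenced:
            if name not in seen:  # only previously-undiscovered servers
                seen.add(name)
                pending.append(name)
    return shares_by_server
```

Because every server is added to the seen set at most once, the loop terminates even when servers refer to one another.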

[000182] Shares discovered using the above-identified iterative process, for example, can be mounted to provide access to shared files. That is, for example, a mount operation which references a network device, such as a server or storage appliance and/or a file system, storage device, directory, file, etc. of the network device, makes the referenced item available for access. While the SMB protocol/file system implementation of SAMBA can be used to mount shared files discovered using the above-described iterative process, older versions of the SMB protocol do not support digital signatures, or digital signing. This can result in an incompatibility with file systems that use an authentication technique, such as digital signing, in connection with, or as part of, a mounting operation. For example, more recent Microsoft implementations of the CIFS protocol use digital signing for mount authentications.

[000183] Thus, in at least one embodiment of the invention, the CIFS VFS (i.e., Common Internet File System Virtual File System) is used to mount shares discovered using the above-described iterative process. CIFS VFS is an open source initiative in collaboration with Samba, which allows access to such shares as servers and storage appliances. CIFS VFS implements digital signing, and encompasses the SMB protocol, and is compatible with newer Microsoft implementations of the CIFS protocol, of which SMB is a predecessor. CIFS VFS, which implements digital signing and encompasses the SMB protocol, can be used to mount SMB file shares and the newer CIFS file shares, for example, particularly when digital signing is used within mount authentications. The document entitled "Common Internet File System (CIFS) - Technical Reference (Rev. 1.0)", SNIA CIFS Technical Workgroup, dated February 27, 2002, provides additional information regarding CIFS VFS, and is incorporated herein by reference. The technical reference is available at http://www.snia.org/tech_activities/CIFS/CIFS-TR-1p00_FINAL.pdf.
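For illustration, a mount of a discovered share through the Linux CIFS file system type might be assembled as in the Python sketch below. The helper name cifs_mount_command is hypothetical. The read-only option is an assumption of this sketch (indexing only needs to read files), and the credentials and guest options are those documented for mount.cifs; signing-related options vary by kernel version and are deliberately not shown.

```python
def cifs_mount_command(server, share, mount_point, credentials_file=None):
    """Build a `mount -t cifs` argv for one discovered share."""
    unc = f"//{server}/{share}"
    options = ["ro"]  # assumption: read-only access suffices for indexing
    if credentials_file:
        options.append(f"credentials={credentials_file}")
    else:
        options.append("guest")  # no credentials available for this server
    return ["mount", "-t", "cifs", unc, mount_point, "-o", ",".join(options)]
```

The resulting list can be passed to subprocess.run by a process with sufficient privileges to mount file systems.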

[000184] FIG. 6, which includes FIG. 6A to FIG. 6O, provides illustrative examples of screens from a user interface of a search appliance in accordance with one or more embodiments of the invention. More particularly, the screens provide examples of selections/options offered via a user interface used in one or more embodiments of the invention. It should be apparent that the examples provided in these figures are not exhaustive, and that other and/or additional screens and information can be displayed in connection with one or more embodiments of the present invention.

[000185] A user login screen is shown in FIG. 6A, which allows a user to log into and gain access to functionality provided by search appliance 180, in accordance with various embodiments of the present invention. For example, after successfully logging in, a user can be presented with a screen as shown in FIG. 6B, which provides a number of options for indexing configuration. It should be apparent that the options shown in FIG. 6B are examples of indexing configuration options, and are not meant to limit or exclude other options that might be provided with one or more embodiments of the present invention.

[000186] One of the options shown in FIG. 6B is the "Monitor Indexing" option, which provides the ability to view the status of an indexing operation, start an indexing operation or stop an indexing operation. FIG. 6H illustrates a screen which includes information showing the status of an indexing operation in progress. For example, the start, end and elapsed times associated with an indexing operation can be displayed. In addition, information related to a pipelined indexing operation can be monitored using the "Monitor Indexing" option. It is also possible to terminate an indexing operation.

[000187] Selection of the "Schedule Indexing" option in FIG. 6B provides the ability to schedule an indexing operation to automatically begin at the designated time. FIG. 6I shows a sample screen displayed in response to selection of the "Schedule Indexing" option, wherein the day of the week and start time can be specified for an indexing operation.

[000188] With reference to FIG. 6B, the "Define Searchable Locations" option selection provides the ability to define locations that are to be indexed, and thus from where search results may be obtained. FIG. 6D and FIG. 6G illustrate display screens responsive to selection of the "Define Searchable Locations" option.

[000189] Referring again to FIG. 6B, the "Choose Document Types" option allows a user to select the types of documents that are to be indexed in an indexing operation. The scope of a search as well as the search results can be indirectly identified using this option. FIG. 6C provides an example of a screen displayed in response to selection of the "Choose Document Types" option. As illustrated by the sample selections shown in FIG. 6C, examples of document types include electronic mail, generic text, presentation, publication and spreadsheet. In addition, as illustrated, it is also possible to specify document type by the application used to generate the document.

[000190] The "Set Operational Parameters" option shown in FIG. 6B allows a user to set parameters associated with the operation of search appliance 180. FIG. 6J provides an example of a screen displayed in response to selection of the "Set Operational Parameters" option. For example, a maximum number of documents indexed from searchable locations can be specified, as well as a level of messages to be logged during operation of search appliance 180, e.g., during a search or indexing operation.

[000191] FIG. 6K illustrates an example of a help screen displayed in response to selection of a help option. For example, help can be obtained for search appliance 180, and/or contents of a log file can be displayed.

[000192] FIG. 6L provides an example of a screen in which a search is entered according to one or more embodiments of the invention. FIG. 6M and FIG. 6N provide examples of results of a search, using keywords "alan", "larry", "presentation" and "publication", conducted using search appliance 180, in accordance with one or more embodiments of the present invention. As can be seen in FIG. 6N, the contents of a document uncovered in a search can be displayed.

[000193] FIG. 6O shows examples of options which can be used to perform "Users Administration" operations, such as "Add User", "Change User Password", "Change User Permissions", "Remove User", "Add Groups", and "Remove Groups".

[000194] FIG. 7, which includes FIG. 7A to FIG. 7Y, provides illustrative examples of screens from a user interface used in configuration operations for, and/or associated with, search appliance 180 in accordance with one or more embodiments of the present invention. It should be apparent that the examples provided in these figures are not exhaustive, and that other and/or additional screens and information can be displayed in connection with one or more embodiments of the present invention.

[000195] FIG. 7A depicts a login screen, in which a user can enter a username and password to gain access to some or all of the remaining portions of the user interface. For example, after a successful login, the screen shown in FIG. 7B can be displayed to allow the user to select between "Network & Internet Connections", "Network File Sharing & Security" and "Search Appliance File Sharing".

[000196] The "Network & Internet Connections" option can be used to configure search appliance 180 for a specific computer network, in order for the search appliance 180 to communicate with other computers on the network and/or the Internet. FIG. 7C to FIG. 7G provide examples of screens that can be displayed in response to selection of this option. FIG. 7C can be used to specify host and domain names associated with search appliance 180. FIG. 7D provides the option to either manually or automatically discover the IP settings for search appliance 180. As discussed above, in accordance with one or more embodiments of the invention, the IP settings corresponding to an instance of search appliance 180 can be established automatically using a DHCP client/server model.

[000197] In a case that manual configuration of the IP settings of a search appliance 180 is selected, a screen such as that shown in FIG. 7E can be displayed, to allow a user to enter an IP address, subnet mask, and default gateway for search appliance 180. In addition, FIG. 7F can be used to enter IP addresses corresponding to primary and secondary domain name servers which will assist search appliance 180 in obtaining network domain names. FIG. 7G provides an example of a screen displayed at the successful completion of the manual configuration of IP settings for search appliance 180.

[000198] A screen such as that shown in FIG. 7H can be displayed in response to selection of the "Network File Sharing & Security" option given in FIG. 7B. Referring to FIG. 7H, a workgroup and domain for search appliance 180 can be identified. FIG. 7I and FIG. 7J provide the ability to specify enhanced file sharing features for search appliance 180, e.g., use of local master browsing. Search appliance 180 can communicate using encrypted transmissions based on options provided in the screen shown in FIG. 7K. FIG. 7L provides an example of a screen displayed at the successful completion of the network file sharing and security configuration options performed.

[000199] FIG. 7M to FIG. 7R provide examples of screens containing options to "mount" file shares, for purposes of indexing and searching using search appliance 180. FIG. 7O and FIG. 7P illustrate a screen, bottom and top, respectively, which lists shared resources obtained by search appliance 180 browsing network 120. The file system volumes that are to be mounted can be selected using this screen. FIG. 7Q provides a screen containing a listing of file system volumes confirming the selections made using the screen shown in FIG. 7O and FIG. 7P. The screen shown in FIG. 7R provides a status of the mounting operation.

[000200] FIG. 7S provides an example of a maintenance screen, which can be used to determine the status of updates, for example, that have already been or should be installed on search appliance 180. FIG. 7T provides an example of a log displayed in response to selection of the "View Message Log" option of FIG. 6K. FIG. 7U to FIG. 7Y illustrate screens related to various system-level options, e.g., security and restarts, as well as some help topics.

[000201] In summary, the present invention provides an apparatus and method for the broad application of indexing, locating and retrieving desired information in an efficient and effective manner. Lastly, it should be appreciated that the illustrated embodiments are exemplary embodiments only, and are not intended to limit the scope, applicability, or configuration of the present invention in any way. Rather, the foregoing detailed description provides those skilled in the art with a convenient road map for implementing the exemplary embodiments of the present invention. Accordingly, it should be understood that various changes may be made in the function and arrangement of elements described in the various exemplary embodiments without departing from the spirit and scope of the present invention as set forth in the appended claims.