Assistance in Locating Educational Information On the Internet

Contents
Aim
Introduction
Technologies:
Methods Encountered
Repeated Patterns/Paths
Recording Waypoints
Local Page Scouting
Dynamic Query Augmentation - Search Engines
Document Examination
User Profiling Only
Personal Web Neighbourhood
Link Personalisation
Display Content Personalisation
Rec by Semantic Based Web Page Filtering
Similar Navigation Path Processing
BackLinks
Web Page Annotation
Digital Libraries/Repositories & Archiving Systems
Further Sub-Research
CiteSeer:
Document Classification:
User Profiling/Grouping of Users:
Agents
Filtering
Searching/Search Engines
Collaborative Filtering
Recommender Systems
The Web and Teaching
Communication/knowledge sharing within Learning
Adaptive Hypermedia
The Semantic Web
To check out:

Aim

To create a system or set of systems which will:
1) Assist the user in locating information, primarily for educational purposes.
2) Make searches context aware; a 16 year old student can be expected to be searching for less in-depth information than a PhD student.
3) Be as autonomous as possible.
4) Take into account knowledge gained from previous searches done by other related users. No two users should need to go through a searching process in order to access the same information; the results of one should be available to the other.
5) Group users by depth of subject search and by individual subjects.
6) Not necessarily archive documents, but instead index and archive the URLs of documents.
7) Classify articles and documents according to their relevance.
8) Not need to know any personal details of the user other than the course of study, possibly identified by a unique number.

The system may be used in conjunction with an existing archive such as CiteSeer (79) or AltaVista.
Address the challenges of information overload by:
1) Minimising the quantity of information offered to the user.
2) Maximising the quality of the information offered to the user.
3) Maintaining the transparency of the system.

Introduction

People accessing the web fall into two categories: those searching for something specific, and those who are browsing for something new or interesting. The web offers huge amounts of informal and semi-structured information; these are among the main reasons for its success.

Traditionally, a person wishing to learn a subject would get an appropriate book or paper. The process of obtaining the book or paper may mean that, by the time it is read, it is out of date. Using the Web for serious research is almost essential, as it offers the most up to date material. Typically though, research material is dispersed and difficult to locate within the plethora of available information.

Convenience is also a factor. The Web is used by organisations to gain competitive advantage; search engines represent decision support tools. Certain tasks are enjoyable; others deserve to be, and should be, automated. Locating information on the Internet is one such task. Filtering systems help address the problem of information overload.

Target scenario:

Student A is 21 years old, lives in Tokyo, Japan, and is studying for a degree titled ‘Metrics within Business Process Engineering’. As part of her research she spends quite a large amount of time surfing the web looking for relevant research documents. At some point, she finds a research paper which includes figures and graphs for a study an organisation made into several newly proposed metrics. Finding this interesting, she saves a copy to her computer.
A week later, Student B, 22, who lives in London, England, and is also studying for a degree in Business Process Engineering, ventures onto the Internet looking for ideas for a section on Metrics. He enters his keywords into a standard web search engine and finds at the top of the returned list a reference to the document that Student A found useful.

Student C, a 16 year old from Raleigh, NC, USA, is doing a project on Computer System Design and, as part of it, needs to give a brief explanation of what Business Process Engineering is. He does a quick search on the web, and the returned references do not include the article both Students A and B found useful. He finds what he needs through a normal query.

The key concept is that if Student A finds an article of interest whilst surfing for her key subject, then Student B should also find that article being recommended to him by any search engine he cares to use, so long as his submitted keywords relate to the article. We would not want Student B to be looking for video listings and being recommended research articles as a result. The system must be able to differentiate between when a student is studying and when he/she is not. Student C will never be offered a reference to that particular article from a search engine unless by luck of query keyword.

We need to address the question of why we should make it hard for students to discover knowledge. If an article explains Business Process Reengineering to a degree which is suitable for a 16 year old, why can it not be made readily available to that student? Some would argue that the researching is part of the learning process, and in a research degree this is true up to a point. If a 16 year old has the capacity, or the will power, to put only 2 hours' work into an essay in one evening, will he/she benefit by spending that time researching, or would it be more productive for him/her to spend it writing up the actual essay?
We should also examine the purpose of providing assistance to students and researchers; there are a number of scenarios:
1) To make the task of research easier by enabling them to home in on relevant information more quickly?
2) To make the task of research easier by enabling them to home in on known or recommended sources of relevant information more quickly? This implies that some sort of recommending system be in place.
3) To assist them in locating new information, not necessarily widely known about within their peer groups?

Different technologies are more suitable to some scenarios than others. Mailing lists or emails could possibly be used to notify peers of interesting articles found. A lot of what people are searching for may be answered in the FAQs already in existence; this could be a simple solution to simple queries.

Information Retrieval (IR):

The study of IR has been ongoing for over 4000 years. Tables, indexes, repositories, libraries and classification systems are the resulting tools, which were originally created manually. Since the invention of computers, it has been possible to automate these processes. Unfortunately, due to the size of the Internet compared to the processing capacity of computers, repositories of indexed data have had to choose between limited but more up to date information and more expansive but less up to date information. There are methods which help this process, such as the use of signature files to identify the contents of documents; these are more easily searched than the whole document.

Information retrieval on the web can be traced back to the Wide Area Information Server (WAIS), which was one of the first tools to allow indexing and searching. The Gopher browser system was then introduced, after which the Web as we know it came into being. The Web was accessible by browsers such as Mosaic, which allowed the user to link with search engines.
Precision (the proportion of retrieved docs that are relevant) = # of relevant retrieved docs / # of retrieved docs
Recall (the proportion of relevant docs that are retrieved) = # of relevant retrieved docs / # of relevant docs

There is a three way trade off between precision, recall and speed.

The technologies behind more traditional databases do not scale well for Internet usage, given the quantity of simultaneous users and the amount of data. There is also the added difficulty of the numerous types of data needing to be accessed.

Spamming: inserting keywords numerous times into documents to promote retrieval by search engines (SEs). Sometimes the text is set to match the colour of the background so it is invisible.

We need to state the benefits of using a recommender (rec) system over a semantic system, e.g. students can be grouped easily and are in a contained area, whereas the semantic web is all encompassing. A rec system can be up and running much more quickly than a semantic system, which needs the material to be reproduced with the additional semantic info. There may be benefits in the student having the state of mind that he/she is ‘working’ when using the rec system rather than just ‘surfing’ with the Semantic Web (SW). There may also be the added benefit of the student feeling part of a community.

Technologies:

Three areas where processing can take place:
Server Side: Security issues; the need for all servers to run software, which makes this no good. No user activated processing.
Client Side: Browser enhanced / Java. Can tie up the user’s processing power. Security/acceptance issues. Sometimes known as Browsing Associates. Agents.
Proxy based: Few security issues; flexible; expandable; any kind of computation possible, esp. when combined with a CS agent (Doc 58). Sometimes known as stream transducers.

The two primary ways to personalise a search are:
1) Query augmentation: The system may add to the query by using known info, e.g. previous searches, or even by knowing what is not needed.
2) Result processing, such as demographic filtering when searching for restaurants. For example, a search on Java could return programming material to developers, tutorials and overviews to teachers, and coffee to foodies. Filters such as have seen/have not seen allow the user to home in on info.

User context switching is difficult to cater for. A major drawback of post retrieval analysis is the computation time and resources needed (Doc 75).

Methods Encountered

Repeated Patterns/Paths

Create shortcuts for repeated patterns. Suppose a user regularly uses the Java.com main page to browse to the programming area page and then onto the Java manual page. An agent can place a shortcut onto the Java main page. See doc 52.

Web Page Link Weighting

Doc 92 considered the idea that if an author of a web page included links to other web pages, he/she must consider those other web pages to be of relevance. This is a viable idea within its concept boundaries, but if the data we are looking for is not linked to, it will not be found. The paper actually states that ‘On average, 43% of web sites are isolated’; this means that using this method would not access any of these 43% of web sites. The concept will return high quality knowledge in the immediate sphere of interest, but could suffer from being out of date and from missing key concepts not covered by those authors.

Trodden Paths:

The ‘path’ concept, where a well trodden path between documents can be identified, might work in an environment such as CiteSeer. Doc 71 attempted to establish relationships between documents, i.e. how many times a document is referenced by others. This may be based on the relationships/weightings between nodes (URLs) and edges (links). This technology requires a specialist server, either rented or owned, to index crawled page relationships. Topics dominated by commercial sites are not collaborative in their linking.

Recording Waypoints

Users find ‘History’ confusing and tend to go back to a major waypoint to find a recently visited site.
By collecting paths, a graphical notation may be used to illustrate where the user has been. See doc 52.

Local Page Scouting

Letizia uses a user profile to make suggestions about pages in the neighbourhood of the page the user is currently viewing. It uses BFS searching.

Dynamic Query Augmentation - Search Engines

Previous methods:
1: Expanding topic keywords by thesauri and co-occurrence lists. What if the keywords were incorrect in the first place?
2: Relevance feedback. This complicates matters by adding search steps.
3: Hand crafted question templates. These are difficult to scale as the questions must be pre-created.

Powerscout uses DQA to make suggestions of suitable web pages based on the user profile and viewed web pages.

Doc 67 proposes GOOSE, a search engine which augments queries by using common sense statements provided by the Open Mind database. Good point: the search engine offers a selection of context options, i.e. start the search with ‘I want to research’, ‘I want to find people who’ or ‘I want help solving this problem’. I think this will struggle with more complex queries.

Document Examination

This is where docs/pages viewed by the user are parsed for keywords, which are given weightings. Those with high weightings are queried against a search engine and the result displayed to the user.

Doc 68: WebSeer uses this method to index images into its own search engine. Info gained from a web page’s text includes:
1) The image file name and the directory it is held in. The directory may be named after the image, e.g. java/images/waterfalls.
2) The image caption.
3) Alt text. This field is used to store the alternative text shown when a viewer requests a non graphics version of the web page.
4) HTML titles. The description of the page is displayed in the web browser title bar.
5) Hyperlinks. The text of a hyperlink usually gives clues to the picture.
6) Other text. Other text on the web page may give hints.

The agent may use a best-first search; the agent takes the first good link in a doc and browses to it.
It parses that page and again goes to the best link on that page, etc.

User Profiling Only

Outride used UPO to make suggestions of suitable web pages based on user profile, bookmarks, history, click trails and time spent.

Personal Web Neighbourhood

A PWN is created by examining the user’s local documents, URLs and bookmarks and keeping a copy of the web pages which they point to (up to a 3d depth). The PWN is held locally and can be accessed by peers who have the same interests. It could be stored Server Side. See doc 53.

Link Personalisation

This is where the links available in a web page change according to the user’s profile/interests. Amazon uses this to suggest suitable books to a user.

Display Content Personalisation

This is when a web browser or an Agent filters what it thinks is suitable or unsuitable for a web user into or out of view, or makes additions to the content. This may depend on chosen settings and filters. There are system overheads, and changes depend on the user’s input and so are not so dynamic.

An example of this is WebWatcher (86, 84), which adds certain available commands such as search, highlights links which it thinks suitable, and adds graphics (a pair of eyes) to the view.
Pros: Allows the agent to converse with users in a discreet manner.
Cons: Needs to go through a proxy, so may be delayed by processing.

Another Agent which uses this method is Syskill & Webert (Doc 90), which learns from user feedback to create a separate profile for each topic the user is interested in. The result is that the Agent suggests suitable pages and can even process a query to Lycos based on the profile.

Recommendation Systems (RS):

Main paradigms:

Collaborative Approach:

Collaborative filtering (CF) works by matching users based on their profiles and grouping them into neighbourhoods. This neighbourhood is used to make suggestions to users as to the suitability of papers/docs/web pages. The media is rated in some cases by certain users, in others by any users. These could be peer groups.
But often, individuals’ information needs and tastes differ, making recommending difficult. Using the relationships between users’ profiles is an approach which matches real life, and can take into account professional, political, personal and social relationships. These have been studied in depth by researchers such as Scott (Scott, J., Social Network Analysis: A Handbook, 2nd Edition, Sage Publications, London, 2000).

The job of classifying and indexing documents was historically done by editors and publishers, but due to the massive increase in submitted documents this has become increasingly difficult. Researchers have developed tools to aid in the selection of the information which they deem valuable.

Information Retrieval Systems allow users to express queries to select documents that match a topic of interest. These systems can rank query results by a number of heuristics: relative term frequency, adjacency of query terms, and position of query terms.

Information Filtering Systems are like those above but are optimised for longer term information needs from many incoming documents. Some systems use user profiling and incorporate Agents.

Collaborative Filtering recommends documents through recommendations from others within similar communities. GroupLens is one such system. The results of the process manifest in the relative placement of documents within a search. Letizia and WebWatcher do this autonomously but have problems with resetting the search when the user changes subject. Amazon uses collaborative filtering to recommend books to customers (Doc 75).

Collaborative Filtering is a widely used technique in recommender systems. Users are grouped into a ‘neighbourhood’ based on the similarity of each user’s past preferences. This can be used to generate recommendations by suggesting items to the user that he/she has not viewed but others in the neighbourhood have rated highly.
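The neighbourhood idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of GroupLens or any other cited system: the ratings data, the 1-5 rating scale, the choice of cosine similarity and the names `neighbourhood` and `recommend` are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical ratings (assumed 1-5 scale): user -> {document: rating}
ratings = {
    "student_a": {"doc1": 5, "doc2": 4, "doc3": 1},
    "student_b": {"doc1": 4, "doc2": 5, "doc4": 5},
    "student_c": {"doc3": 5, "doc5": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[d] * b[d] for d in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def neighbourhood(user, ratings, k=2):
    """Return up to k (similarity, other_user) pairs, most similar first."""
    sims = [(cosine_similarity(ratings[user], r), other)
            for other, r in ratings.items() if other != user]
    sims = [(s, o) for s, o in sims if s > 0]  # drop users with nothing in common
    return sorted(sims, reverse=True)[:k]

def recommend(user, ratings, k=2, min_rating=4):
    """Documents the user has not rated that neighbours rated highly."""
    seen = set(ratings[user])
    scores = {}
    for sim, other in neighbourhood(user, ratings, k):
        for doc, rating in ratings[other].items():
            if doc not in seen and rating >= min_rating:
                # Weight each neighbour's rating by how similar they are.
                scores[doc] = scores.get(doc, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("student_a", ratings))
```

With so few ratings, a user who shares no rated documents with anyone has an empty neighbourhood and receives no recommendations at all.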
There are problems with these systems:

Sparsity: There may be many users and many documents, but few recommendations or ratings. Without ratings, there cannot be any recommendations: a vicious circle. One start-up method is to import data from a pre-existing data source.

Collaborative filtering models real life, where there exist relationships between people for reasons such as professional, political or shared interests.

Setting up: When a new neighbourhood is set up, there will be few documents and fewer ratings. It is very difficult to get up to speed.

There are a number of methods which can help alleviate these problems:

Partitioning: GroupLens showed that by partitioning the ratings database into groups, higher accuracy and density were achieved. Sparsity was still a problem.

Dimensionality Reduction: By using clustering, decomposition and factor analysis, it may be possible to reduce the amount of unused data.

Implicit Ratings: By observing user behaviour, it is possible to augment the document ratings. An example of this is the Fab system.

These problems were considered by Sarwar98 et al. (Doc 105) as part of the GroupLens project. They proposed that Citation indexing uses a set of heuristics to process documents. Woodruff (Enhancing a digital book) uses one such algorithm.

McNee et al. (Doc 87) propose CF using a ‘citation web’: how document citations link docs together in a matrix. Various methods are used to create the index links between citations: recommendations, co-citation matching, and weightings through probability tests of docs in CF groups. Initial tests showed above average satisfaction from users but only average benefit from the system.

Collaborative filtering focuses on the identification of other users with similar tastes and the use of their opinions to recommend items (Doc 99). These systems will not be very useful if not widely used. They may involve explicit input. There is difficulty in matching the profiles of 2 users to get a relevant return.
Rec by Content Filtering:

Recs are made by comparing the content of a web page with what the user is perceived to be interested in. Users are clustered into probable classes or profiles [Surflen 57] and, depending upon their class, comparisons are made with available web pages. Text docs are recommended based on a comparison between their content and the user profile. Data structures are used alongside word weighting techniques such as TF*IDF to rank documents. One of the downsides is that only a shallow analysis of the data is possible, especially for movies and pictures.

Rec by Semantic Based Web Page Filtering

Similar to content filtering in the way it works.

A View of Semantic Web Languages:

A Semantic Web Language (SWL) must describe meaning in a machine readable way. Therefore the language needs to be able to specify a vocabulary and to be able to define it. The language must be based on HTML or XML for integration purposes. The language must be extensible and adaptable, just as the web, humans and humans’ understanding of the web change. The schemas used to define an ontology should be standardised yet expandable; this allows differing levels of parsing and different SWLs. An ontology is a way in which a domain is described.

The benefit of an SWL is in the searching of web pages. Queries against search engines can be much more clearly defined by using the embedded extra info found in the web page.

One of the major hurdles is users’ unwillingness to share info (Orlikowski: Learning from Notes, ACM 1992).

These systems don’t scale well:
1) The amount of info on the WWW is increasing at a fast rate.
2) A user may have many interests, not just one. It is difficult for a system to differentiate when a user changes tack in browsing.

Fab (Doc 101) combines Collaborative and Content based filtering. Feedback from the user may be explicit or implicit. With implicit feedback, the key is in establishing an accurate profiling engine.
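The content-based side of such a hybrid, as outlined under ‘Rec by Content Filtering’ above, can be sketched as ranking documents against a user profile using TF*IDF weights. This is a minimal sketch under stated assumptions: the sample documents, the hand-written profile weights, the whitespace tokeniser, the log-scaled IDF and the use of cosine similarity are all illustrative choices, not taken from Fab or any other cited system.

```python
import math
from collections import Counter

def tokenize(text):
    """Very crude tokeniser (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Map doc id -> {term: tf * idf}, using idf = log(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors[doc_id] = {t: (c / total) * math.log(n_docs / df[t])
                           for t, c in counts.items()}
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_for_profile(profile, docs):
    """Return (doc_id, score) pairs, best match to the profile first."""
    vectors = tfidf_vectors(docs)
    scored = [(doc_id, cosine(profile, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents and a hypothetical profile of weighted interest terms.
docs = {
    "metrics_paper": "business process metrics study with figures and graphs",
    "java_tutorial": "java programming tutorial for beginners",
    "bpe_overview": "brief overview of business process engineering",
}
profile = {"business": 1.0, "process": 1.0, "metrics": 2.0}

for doc_id, score in rank_for_profile(profile, docs):
    print(doc_id, round(score, 3))
```

In a real system the profile weights would be learned from explicit or implicit feedback rather than written by hand, and the shallow word-level analysis seen here is exactly the limitation noted for non-text media.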
An RS can be split into 2 stages:
1) The collection of data to be placed into an organising database.
2) The extraction of that data as the result of a query.

Similar Navigation Path Processing

If user A browses through a set of documents, this is recorded via a proxy. When user B browses a path which is similar, it may be suitable to propose to user B the remainder of the path browsed by A.
User A: D -> E -> F -> G
User B: X -> Y -> Z -> D -> E -> F -> therefore G is proposed?
See Doc 69 for Broadway.

BackLinks

See Doc 62. When you consider that a web page (a) may only be accessible via a link or links from another web page (b) which was created after web page (a), it may be that web page (b) has more up to date info upon that subject. This phenomenon may increase over time as the web matures. One such language is Xanadu.

One of the key factors in Backlinks is that they are humanly created and may enjoy an advantage over automated links. The main problem is that Internet standards are not set up to cope with this at the moment and probably never will be, so the solution will lie in the creation of repositories such as the OAI [184]. Hitchcock et al. [HITC02] stated that 'The real value in collected reference data is not in producing links that point to works in the past, the authorised links, but in creating links that transport the user forward in time.'

Web Page Annotation

Doc 64 proposes an extension to HTML to allow authors to augment the ontological info held within a web page. An associated crawler was developed to locate and index this info. This would allow a standard search on ‘Java’ to be augmented with ‘PhD research’ or ‘Uni lecturer’ etc.

Doc 124 proposed Ontobroker, an HTML extension which could be used to give additional information about the document, its author and the subject area. It came with tools to assist the searching of the web for these ontological extensions.
The annotations would not affect the view of the web page, only the source text, which would hold extra searchable information.

Digital Libraries/Repositories & Archiving Systems

Humans have always needed to keep up to date on important matters; this almost always involves a large amount of work. Historically, periodicals were used, but as the quantity and diversity of both periodicals and the information printed in them have increased, it has become increasingly difficult to keep up to date with them. This prompted the introduction of digital libraries, but it is still necessary for researchers to spend time looking for interesting publications.

The easiest solution is for all existing documents and all future documents to be placed in a central repository. There are a number of problems with this:
1) Incentives are needed so that authors readily contact the repository to notify it of new papers.
2) Copyright infringement issues.
3) Knowledge hiding: some papers may contain knowledge which the author may not want in the public domain.
4) Storage and saleability issues.

But as Cameron (82) pointed out, it would:
1) Ensure that publications in any form are visible, BUT not necessarily accessible.
2) Allow fairer competition between publication venues -> good point.

There are a number of useful repositories currently in existence:
1) The Berkeley Personal Libraries provide cataloguing and full search capabilities and offer a privacy mechanism. They can be accessed at http://sunsite.berkeley.edu/.
2) The Greenstone system is a public, extensible, open source project intended to grow in functionality as people contribute to it. Docs must be converted into GML, a form of HTML, in order to store them.
3) Haystack is a repository which can be searched by a personalised mechanism so that the resulting set reflects the user’s interests.
4) CiteSeer (79) does not require the author to do anything to list his paper.
This is done automatically by the CiteSeer agents, which may miss some papers. It also provides an API to utilise the database capabilities (80).
5) The OAI - The Open Archives Initiative - http://www.openarchives.org/ - more
6) CiteULike [193]: not necessarily a repository, but a Web based archiving and linking site which enables users to organise their citations and join common groups to explore other similar users’ citations.

As well as some proposed:
1) Cameron’s Universal Citation Database. See Doc 82. Authors must submit in a specific format. The works should be organised by publication. -> this makes sense.

Sources for indexing literature face three main costs:
1) Entering the data about the docs onto a database: purchase of the doc, preparing and key entering data.
2) The addition of the extra data associated with the indexing of the doc, used to aid search and evaluation.
3) Marketing and distribution.

There are a number of archiving/indexing systems available which may (or may not) provide the user with a method by which to create their own repository. Harvester (81) is one such tool, which provides the user with an integrated set of customisable tools for gathering info from diverse repositories, building topic-specific content indexes, flexible searching and caching facilities.

Further Sub-Research

CiteSeer:

CiteSeer is a web based system which performs Autonomous Citation Indexing of scientific publications on the web. It provides the facility to browse by citation links. It also incorporates the technology to ‘remember’ where researchers browse, and uses this information to create a profile of the user’s interests which will be used for tracking. This means that researchers will benefit from previous researchers who have interests similar to theirs; this helps to avoid duplication of researching time. Papers are linked by heterogeneous measures including the ‘trodden path’ method.
CiteSeer uses two general measures for determining paper relevance:

1) Constraint matching: Allows a user to describe an interesting paper by specifying constraints such as keywords, age and classification.
Keyword Matching: This is facilitated by the extraction of keywords from certain areas of the articles.
Citation Links: This works in both forward and backward directions to allow researchers to traverse paths between related articles.
Use of Metadata: Since metadata is a descriptive tag associated with a document, it may provide useful information about the relevance of a scientific publication. One such piece of metadata is the URL from which a document was sourced. A researcher can place a ‘watch’ on a URL and be notified if any new article appears on that site.

2) Feature relatedness: Allows a user to specify a set of papers that are interesting, and CiteSeer tries to find papers that are related to the set. There may be relevant papers which are not located by method 1. The researcher can specify a paper or papers relevant to his interests. The relatedness can be calculated by i) identifying features of the documents that represent useful semantic information and ii) creating functions of these features having a range space in which distances represent meaningful semantic distances.
Text Relatedness: Using TFIDF to compare two bodies of papers determines how related they are. Currently this is tuned by hand.
Citation Relatedness: If 2 articles cite some of the same articles, then it can be assumed that the two citing articles are related in content.

CiteSeer uses SEs, web-crawling and mail list monitoring to continuously search for new publications. These can be offered to the user the next time he/she starts a session. They can also be emailed to the user.

CiteSeer consists of 3 basic components:
1) A focused crawler/harvester
2) The archive and indexes
3) A query interface

The harvester crawls the Web for relevant documents in PDF and Postscript formats.
These are indexed using Autonomous Citation Indexing (ACI) and key information is extracted.

Document Classification:

Document Classification (DC) is an important aspect of any RecS. Without accurate classification of documents, it would be impossible to organise them into any order of usefulness. One of the problems encountered by researchers in this field, such as Balabanovic [BALA97], is the quantity of documents needing to be classified before any automated classification system can be compared against them. An alternative to this was offered by Balabanovic, where the system started with zero documents and learned to provide better and better recommendations as time went on.

Document classification is also important in repositories. The Citebase [184] Web interface is an experiment in organising research paper references into a useable format. By indexing the references of all articles held in the Open Archives Initiative (OAI) catalogue, which can be found at www.openarchives.org, Citebase has created a searchable archive of references, allowing the user to see both backward and forward references. The OAI is an initiative promoting more open access to scholarly communications, allowing authors to self-archive their own work in this repository.

The notion of an open-access database containing research papers is not new, but has historically had little success. As the momentum for support grows, it looks increasingly likely to happen; as Hitchcock et al. [HITC02] stated, 'Authors are well aware of the potential benefits of open access, but how can they be persuaded to act in pursuit of these benefits?'.

User Profiling/Grouping of Users:

Profiles change and users have multiple interests:
1) Keywords may be averaged, which has the disadvantage of altering the weighting of words too broadly.
2) Interests can be explicitly labelled by the user. This may be found to be annoying to the user. Implicit learning may be a better approach.
Information filtering focuses on the analysis of item content and the development of a personal user interest profile (Doc 99).

Grouping:

Are groups to be made persistent or ephemeral? Is the group in for the long haul or just for one project?

Are groups to be private or public?

How many groups can a user be a member of? What is the saturation point where a member is in too many groups? Should a user not be grouped but instead have a neighbourhood, given that users have more than one hobby? Why should a user have to explicitly choose to be in a cycling group? If the user shows interest in cycling articles, should this include cycling as part of his/her neighbourhood?

Who creates groups? In most cases, an administrator would decide and define groups.

Who can join them? In most cases, anyone may join. There may be rules attached. Membership should be simple.

Who manages groups? In most cases, an administrator would manage groups, but the process may be mainly automated.

How is privacy handled? Membership of any group results in at least a partial loss of privacy, purely by its nature. Most groups have differing levels of membership policy with regard to privacy. These range from one extreme to another. It is worth noting that, in general, the higher the privacy, the more difficult it is to join/create groups. A certain amount of information must be known about a user for a recommender system to work. Whether this info is passed on to other members depends upon the profile/style of the ‘group’. In some cases, spam for example, a user may not be aware that they are members of a group!

Doc 108 (O’Connor) has some interesting results of research regarding the creation of user groups:
What is the nature of a group?
How are groups formed?
How are recommendations computed for groups?
What GUIs are best?
What are the privacy issues?
Level of awareness between users.
Bayesian classifier, Nearest Neighbour, PEBLS, Decision Trees, Neural Nets, TF-IDF
TF-IDF is used when comparing the similarity of two documents, i.e. when we are trying to establish how related Document A is to Document B. One of the main methods is TF-IDF: term frequency x inverse document frequency. Referred to in Doc 79. The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. (Wikipedia.)
WIKIPEDIA: There are many different formulas used to calculate tf-idf. Term frequency (TF) = # of times the word appears in a document / the # of total words in the document. If a document contains 100 total words and the word cow appears 3 times, then the term frequency of the word cow in the document is 0.03 (3/100). Document Frequency (DF) = the # of documents containing cow / the total # of documents in the collection. So if cow appears in 1,000 documents out of a total of 10,000,000 then the document frequency is 0.0001 (1,000/10,000,000). The final tf-idf score is then calculated by dividing the term frequency by the document frequency. For our example, the tf-idf score for cow in the collection would be 300 (0.03/0.0001). A common alternative to this formula is to take the log of the inverse document frequency, i.e. TF x log(1/DF). Doc 93 stated ‘TF-IDF is an efficient and simple algorithm for matching words in a query to documents that are relevant to that query.’
Agents
Doc 76 sets a good scenario describing what is expected of an Agent: A user wishes to arrange a trip from a town in the UK to a city on the East coast of the US. Traditionally, the user would have needed to interact with hoteliers, car rental firms, flight agents, rail agents etc.
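As an aside before the agent scenario continues, the tf-idf arithmetic worked through earlier in this passage can be sketched in Python. This is a minimal illustration only, not the method of any of the cited systems: the function names, the whitespace tokenisation and the three-sentence toy corpus are all assumptions made for the example. It shows both the simple ratio variant from the cow example and the log variant.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """Occurrences of the term divided by the total words in the document."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)


def document_frequency(term, corpus):
    """Fraction of the documents in the corpus that contain the term."""
    if not corpus:
        return 0.0
    containing = sum(1 for tokens in corpus if term in tokens)
    return containing / len(corpus)


def tf_idf_ratio(tf, df):
    """Ratio variant from the worked example: TF divided by DF."""
    return tf / df if df > 0 else 0.0


def tf_idf_log(tf, df):
    """Log variant: TF multiplied by log(1/DF), which damps very rare terms."""
    return tf * math.log(1.0 / df) if df > 0 else 0.0


# The 'cow' example: 3 occurrences in 100 words,
# found in 1,000 of 10,000,000 documents.
cow_tf = 3 / 100
cow_df = 1_000 / 10_000_000
cow_ratio_score = tf_idf_ratio(cow_tf, cow_df)  # approximately 300
cow_log_score = tf_idf_log(cow_tf, cow_df)      # 0.03 * ln(10,000)

# Hypothetical toy corpus, tokenised on whitespace (an assumption).
corpus = [
    "the cow jumped over the moon".split(),
    "the farmer milked the cow".split(),
    "the moon is bright tonight".split(),
]
first_doc = corpus[0]
toy_scores = {
    term: tf_idf_log(term_frequency(term, first_doc),
                     document_frequency(term, corpus))
    for term in set(first_doc)
}
```

In the toy corpus the word "the" appears in every document, so its log-variant score is zero, while "jumped", which appears in only one document, scores higher than "cow", which appears in two. This is the offsetting effect of corpus frequency described in the text.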
With so much info available online, but owned by different stakeholders, the agent is assigned the task of dealing with each organisation (or the organisation's Agent), much like a secretary. For this to work, each organisation would need to be operating a standard agent to communicate with the user's agent. Another problem is how our agent will know where to get the best deals. Humans learn from each other; the agent would need a mentor to go to, who would direct it to the better sites and away from redundant ones. Until this technology is standardised, agents are very limited in their capabilities. This offers an opening for research. Agents are solutions to specific problems, e.g. distributed autonomous experts (medical). Server-side Agents may specialise in specific topics (Doc 81): users of a specific topic will be allocated to an Agent so that it can concentrate on the vocabulary and peculiarities relating to that one topic. There are a number of companies operating under the umbrella of supplying Agents and their services; Jango, Autonomy, Verity and AgentSoft are some of them. It remains to be seen if they can make a profit over a longer period. Agents are notoriously independent of one another in respect of creation and compatibility. The introduction of KQML and CORBA (see below) is a beginning in addressing this problem. Standardisation is the key to the success of Agents: unless they communicate with one another, they are less useful and their take-up will be limited. Some agents, such as information agents, promise so much yet offer so little. They are mainly context specific; i.e. they will specialise in a niche problem. A downside to agents is the extra traffic they place on the Internet. Menczer [MENC] stated that 'More active robots would improve accuracy at enormous costs in network load' and went on to say that there is a need to complement index-based SEs with intelligent browsing agents.
Classification of Agents:
1) Service Agents: provide a service to the web community.
2) User Agents: provide a service to the individual. Known as Personal and Information Agents. The big barrier to these lies in the lack of research into the agent's cognitive ability. Maes (Doc59) suggested the agent learning approach, but 13 years later, we are not much further forward.
We may also classify them as (defined in Doc 76): 1) Multi-agent systems 2) Autonomous interface agents/information agents
We can classify them by scope. There is also a distinction between mobile and stationary agents; Java would be well suited to building mobile agents. 1) Limited scope agent: limited to the local PC 2) Global scope: can traverse networks.
We may also classify them by intelligence (as Casasola in (85)): 1) Looking over the shoulder: may take a long time to be of use due to the learning curve. 2) Search systems which exhibit an above average degree of learning capability. 3) Personal search agents which collect info in an unsupervised way.
There is an issue of balance and reality here: if the task is so complex, how can we expect an agent to understand it if the user struggles to? If the task is mundane, is it necessary to have an agent model the user to such a degree? There is also the issue of security, trust, privacy etc. Doc 74 suggests that most current recommendation systems exploit the knowledge of several users. This approach has several shortcomings: lack of expert knowledge; proactive systems may end in an overflow of mildly interesting suggestions; and sometimes explicit feedback is required, which is not always convenient.
Questions about Agents (see Doc72): 1) Will we fail to incorporate substantial intelligence into the agent? 2) Does the cost of deployment outweigh the benefit? 3) Will users find it acceptable that an agent which is not necessarily their own requests info from them, even dates of travel for booking holidays? 4) What is the difference between an agent and a program?
A good definition of an agent is found in Doc 91: An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, in pursuit of its own agenda and so as to affect what it senses in the future.
-> Therefore an agent must have the capacity to: 1) Sense its environment 2) Act upon its environment with a view to altering it (self-starting) 3) Learn through collaboration and feedback: flexible (alters dynamically), communicative, adaptive 4) Own an agenda (goal oriented)
-> Agents are programs. Doc 91 states that ‘All software Agents are programs but not all programs are Agents’
-> Should an agent be independent of the web browser?
Franklin, Stan, Graesser, Art, Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents, Professional Journal, 1996
Input: It must be possible to program or give input to an Agent so that it is aware of its tasks. Etzioni (123) stated that ‘Goal orientation is only useful if it is easier for a user to specify the request than to carry out the activity for herself.’ There may be a command line interface, or the Agent may need to be programmed to complete a task. If we consider the concept that Agents should act like programmed procedures and be responsible for one process only, then there may not be a need for much input other than keywords in a UI.
In the creation of Agents, we have several key aspects to consider:
1) Agent organisation. The Agent may need wrappers or plug-ins to enable it to cooperate with the domain it is employed within; an Agent for the travel industry will need to interface with booking systems, and an Agent for research will need to be able to communicate with a repository such as CiteSeer.
2) Agent knowledge, learning capabilities and scope. Each Agent must have an ontology domain; it must be able to comprehend its brief and must have clear communication between its learning capability and its knowledge.
It should be able to improve itself over time; this implies that a certain amount of learning must be able to take place.
3) Communication language and protocol. Where Agents are designed to communicate, they must share a common language and protocol.
4) Query processing and UI. An agent must allow the user to supply it with queries, set boundaries and create options. This might be in the form of command lines or a GUI.
Shopping Agents
Two main aims: 1) Locating products from different providers 2) Extracting and comparing information. An example is ShopBot, which learns through offline processing, creating vendor descriptions. Preliminary research shows that ShopBot improves the ability to shop by up to 4 times.
Web Crawlers
Although not always classed as Agents, they may be classified as such. Web Crawlers (WC), sometimes called robots, are software tools which are designed to visit Web sites and extract information as they go. In most cases, the WC does not leave the machine it started on; i.e. it does not physically travel, unlike a virus. WCs are usually designed for a set purpose, such as to archive or index web pages. They may be used by organisations to test their own web pages for broken links. Most WCs adhere to the Robots Exclusion Standard, which is a voluntary protocol. There is currently no official standard; compliance is purely advisory. The concept was suggested by robot writers in 1994. Web page authors can state where a robot may not go on their site by including a robots.txt file which states the ‘disallowed’ areas. Any robot traversing the site should check the robots.txt file first. A good web crawler should have (a) a good crawling strategy, i.e. which page to crawl next, and (b) an efficient crawling architecture: reliable, manageable and fast at downloading pages.
Agents for Matching Query Subjects against Human Experts
Ask an expert! By scanning BBs and forums, experts in certain fields can be identified. E.g.
ContactFinder (Krulwich) scans messages and retrieves subjects and the human specialists in each field who seem to resolve issues. -> I don’t know what the specialists’ reaction to this would be!
Filtering Agents
May be used to filter mail or news articles in a newsgroup. Sometimes in the form of altered HTML documents. The Agent may do more than just filter; this may be just one of its tasks. An example is SIFT (Stanford Information Filtering Tool), which is used as a filter for newsfeeds and filters them according to user profile.
Agent Technologies
KQML (Knowledge Query and Manipulation Language): KQML is concerned with the run-time interaction between agents.
CORBA (Common Object Request Broker Architecture): CORBA allows apps to communicate regardless of their location and developer.
Agents may come into their own with the Semantic Web (SW), especially in the field of educational information retrieval. [Ande04 - 177] suggested that future agents may be able to assist teachers in keeping up to date with their particular area of interest. He also presents an illustration of agent technology in which agents will be able to track students’ progress and even mark online coursework on the teacher’s behalf. This will only be possible if the agent can make use of SW information.
Filtering
The term ‘signal to noise ratio’ is used in chat rooms and Usenet discussions to refer to meaningful discourse (signal) versus worthless blather (noise). It also describes the results obtained from a search engine, where the signal (relevant information) is very small compared to the noise (useless information). Source: http://www.answers.com/topic/signal-to-noise-ratio
Indexing: This can be described as a method of describing or identifying a document in terms of its subject content.
Term Weighting: Typically, Boolean queries result in a true or false value. Weightings allow the value to be represented by a real number. TW is a trade-off between complexity and computational time.
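The difference between a Boolean result and a real-valued weighting can be sketched in Python. This is a minimal sketch under stated assumptions: the three toy documents, the query and the function names are hypothetical, and the weighting shown is a plain relative term frequency chosen for simplicity rather than a scheme taken from any of the cited systems.

```python
from collections import Counter

# Hypothetical toy collection, tokenised on whitespace (an assumption).
documents = {
    "doc_a": "agents filter news and agents rank news".split(),
    "doc_b": "search engines index web pages".split(),
    "doc_c": "agents search the web".split(),
}


def boolean_match(query_terms, tokens):
    """Boolean AND query: True only if every query term occurs in the document."""
    return all(term in tokens for term in query_terms)


def weighted_score(query_terms, tokens):
    """Real-valued score: query-term occurrences divided by document length."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[term] for term in query_terms) / len(tokens)


query = ["agents", "search"]

# Boolean retrieval gives a yes/no answer per document.
boolean_results = {name: boolean_match(query, tokens)
                   for name, tokens in documents.items()}

# Weighted retrieval gives a number per document, which allows ranking.
weighted_results = {name: weighted_score(query, tokens)
                    for name, tokens in documents.items()}
ranking = sorted(weighted_results, key=weighted_results.get, reverse=True)
```

Only doc_c satisfies the Boolean query, so the other two documents are simply rejected. The weighted scores still place doc_a ahead of doc_b, which illustrates the extra information a weighting provides at the cost of counting terms rather than merely testing for their presence.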
The most common weightings are: Binary; Term frequency; Log entropy.
Similarity: A metric associated with the similarity between a query and a document, or between 2 documents.
Hierarchical clustering methods: Single linkage; Complete linkage; Average linkage.
Hierarchical Clustering: This involves the grouping of similar articles or objects. Documents may be clustered according to the contained terms, co-occurring citations or co-occurring references. It is important to implement an appropriate method of clustering based on relevant attributes, complexity and the processing power available. There are other similar methods such as hashing and sequential filing.
Searching/Search Engines
The main properties of a meta SE are: 1) A single unified interface for the user to enter a query. 2) When a query is submitted, it sends that query to one or more other search engines in parallel. 3) It may filter the resulting list. One such meta-search engine is MetaCrawler, which can be found at www.metacrawler.com.
The main limitations of SEs are (Doc123 Etzioni): 1) Keywords can be awkward; it is difficult to home in on the required info. 2) The indices are not personalised; most queries get many false hits. 3) The number of false hits grows with the size of the Internet. 4) Indexing agents cannot completely index the web due to its size. -> There is a limit to how many searches a user will do before giving up; therefore, for a SE to be of use, it must locate the info before this point.
Doc 75 suggested that with collaborative searching, compared with a standard search scenario: 1) Where a limited number of previous search scenarios were available, search performance was degraded. 2) Where more search scenarios were available, search performance showed a significant increase.
The searching paradigm associated with our system could be augmented by ‘Citation Searching’ tools such as CiFi (83). One of the problems with searching is the ambiguity of keywords within a search.
Liu et al. [Liu03] approached this issue by augmenting the query with other contextual keywords. This has the problem, though, of the additional keywords becoming more important in the search than the original topic. This can be addressed by what is in effect a multiple-stage query: locate pages relating to the main keyword, and then search within that result set for pages containing the secondary keywords. A second problem with search engines is the manner in which they index ‘snapshots’ of the web. The web does not scale well for indexing. Because of its dynamic nature, search engine indexes are out of date as soon as the data is collected. Menczer [Menc2002/Doc 103] attempted to address this issue with the creation of his MySpiders agents, which are capable of doing real-time querying against the web. One example of a CS SE is Yarrow [Chen2000 - 118], a meta-search system which learns from the user with very little feedback. It is capable of augmenting queries before sending them to up to 8 of the popular search engines. Its interface is through a web browser. Search engines return many thousands of documents when queried, far exceeding the amount needed or indeed comprehensible by human beings. The tools available for filtering the results based on Web page content are still at a basic stage. It is in this area that the benefits of the SW will be best realised.
Adaptive searching
With the help of an Agent, preferred words can be altered or more emphasis put on chosen ones. One example is Amalthaea (mentioned in 85), a multi-agent system for info retrieval. The main drawback is that users need to evaluate the retrieved docs.
Problems with AltaVista (or other web crawler types of) searching: 1) Indexing of WWW pages may be out of date due to the size of the web. 2) The size of the search engine's index will become huge and possibly too slow. 3) If common words are used as keywords, potentially many useless docs may be returned.
4) It is often difficult to determine from the information returned by a search whether the pages actually contain the specific citation. 5) Databases such as CiteSeer are not accessed and indexed.
Meta Search Engines
Provide the user with an interface to multiple underlying search engines and then merge the results through post-processing. Popular ones include: Dogpile: www.dogpile.com; Vivisimo: www.vivisimo.com. The post-retrieval processing varies according to the engine used. Pirolli in Doc 95 proposed that navigation could benefit from web pages being classified and indexed according to other metrics such as textual content, connectivity structure, access stats, file details etc. These collections could then be indexed and visualised to provide better searching and browsing. -> Limited possibilities, yet a novel concept.
Collaborative Filtering
Collaborative filtering (CF) works in such a way that, by identifying documents a user has found useful in the past, it is possible to make recommendations of other similar documents. CF is difficult if the user changes interests regularly, or if the area of interest is either rarely documented or extremely highly documented. More static but highly used domains are more successfully recommended. One such example is the video rental market. PolyLens [COSL01], a tool for recommending films according to user profile, is one such system.
Collaborative Filtering Algorithms
Standard CF algorithms work by querying a dataset as a rating matrix. The columns may represent ‘items’ in the environment, such as a keyword. The rows are ‘users’. CF works by attempting to fill in the gaps in the matrix by comparing similarities and patterns. Options: 1) MovieLens used movies as the items (columns), and users as the rows. 2) Paper authors as the users and citations as items (Doc 107). This benefited from not having a startup problem. The weakness is that authors may cover more than one subject, so linking an author with a subject would be difficult.
3) A paper would represent a user, and a citation an item. The rating would be some measure of co-citation. The basic co-citation metric is a count of the number of times the two were found as refs in papers (ref to the original paper, and ref to the citation).
Recommender Algorithms:
Collaborative Filtering:
Co-citation matching: for each citation in the matrix, the algorithm counts the number of times other citations were co-cited with it. Highest is best.
User-item: This is the original k-nearest neighbour algorithm. NN computes the distance between users based on their preference history. Predictions of how much a user will like an item are estimated by taking the weighted average of the opinions of a set of nearest neighbours for that item. It compares papers to create a neighbourhood of the most similar papers to the target paper. The algorithm counts the number of times neighbours make a citation, with the count weighted by the similarity of each neighbour to the target paper. Highest is best.
Item-item: compares citations (columns) to create a neighbourhood.
Naïve Bayesian Classifier: calculates probabilities that any given citation in the dataset is related to the input set.
Non Collaborative Filtering:
Localised Citation Graph Search: uses keyword similarities between the target paper’s extract and the titles of papers nearby in the citation graph. For a target paper, it makes a list of papers that cite the target, papers co-cited with the target, and papers cited by items in the set. Uses TF/IDF.
Keyword Search: the title of the paper is sent to a search engine, and the results are limited to those which also appear in the matrix.
Models: Models are built, usually offline initially, and their aim is to be small, fast and accurate. Same as above! Techniques: 1) Bayesian networks based on a training set, with a decision tree at each node and edges representing user information. 2) Dimensionality Reduction.
Low dimensional space is created where latent relationships between users or items can be discovered. 3) Clustering techniques identify groups of users who appear to have similar preferences. 4) Horting: A graph-based technique; nodes are users, and edges between the nodes indicate the degree of similarity between two users.
Roles: permissions, levels of accessibility, moderators, adjudicators, experts who rate, etc.
Recommender Systems
Recommender Systems (RS) have been researched in depth in the past by notable names such as Balabonovich [BALA97] and Bollacker [BOLL99]. They have explored how the similarities between users or between articles can be exploited. One of the fundamental problems with recommender systems arises where, as a result of its processing, the system recommends inappropriate items. This has a negative effect on the user’s opinion of the system.
The Web and Teaching
It is necessary to move from a point of distributed intelligence and learning to one of collaboration and sharing. To do this on a large scale, there are two challenges: (1) defined and accepted standards for authoring the semantic web and for retrieving the data, and (2) interoperability amongst educational establishments. For both of these to be achieved, the correct tools need to be obtainable. One of the main obstacles to being able to fully utilise the Web for teaching is that it is still highly unstructured; educational material is highly distributed and relatively unordered. This makes it difficult to identify, extract and link sources of learning materials. Most Computer Supported Collaborative Learning (CSCL) focuses on the students and the methods by which the syllabi are presented to them. These efforts have had little widespread effect on educational practices (see Doc 113). The presentation tools used on the Internet must be used in conjunction with a development environment which the teachers can use to prepare the presentation of info.
Stahl (113) described how these tools can be integrated and made better use of.
Collaborative Learning:
CL can achieve much: social interaction encourages deeper learning, which in turn helps the student to create their own knowledge; it encourages engagement between students; and it makes learning less disciplinary and more open. Technology is seen as the key to CL as well as the key to future learning. Research in general is collaborative in its nature. However, the tools needed to successfully implement CSCL are lagging behind the needs of teaching. Where CL takes place, both the student and the teacher need to be able to assess progress in a definitive environment. There are a number of problems associated with CSCL:
1) Locating: There is no systematic guide showing where a student or teacher should look on the Web for information. There are a number of repositories which specialise in certain subjects, and there are a number of repositories (such as CiteSeer) which offer articles on numerous subjects. Neither of these can differentiate between what a 16 year old student would seek and what a 22 year old student would seek.
2) Access: Most Web offerings are either in document form, or are only brief descriptions of activities. They do not provide the resources needed to take part in an activity. Neither do they provide a productive context in which the student may work.
3) Adapting: Students respond to being able to alter an activity to suit their methods of work, or to reorder work.
4) Sharing: Students like to share their work. CL should allow a student to display, propose and justify their work, and to view others’ work. Teachers should be able to mark work centrally. Created work (where of sufficient standard) should be made available as part of the curriculum for future use.
Stahl (122) summarised by saying ‘We need a vision of how networked computers can facilitate the discussion of all with all that does not require the coordination of a manager or teacher and can support the collaborative building of knowledge that is not restricted to the skills, memories and efforts of individuals.’
There are a number of general Web Based Education Systems (WBES), such as Blackboard and WebCT, which allow tutors to create Web-enhanced courses. There are also a number of more specialised systems such as ElmArt, Pat, AIMS and AHA textbooks. More global efforts include EdNa and ARIADNE (for refs see Doc 147). The ARIADNE Foundation (http://www.ariadne-eu.org) creates tools and methods for the creation and management of pedagogical artefacts. They use the LOM (see semantic) specification as mentioned earlier (or later!). Another project is The Advanced Distributed Learning Initiative (ADL), which is sponsored by the Office of the Secretary of Defence. Its aim is the interoperability of learning tools and material on a global scale. The initiative is unique in so much as it allows the use of materials from various sources, such as ARIADNE as discussed.
Does the fact that children have grown up with computer games affect how they should be taught? Article 168 examines this concept. 94% of teenagers believe that the Internet helps them with technology. Instant messaging is seen as a natural form of communication. The paper suggests that it may take 60 hours to read War and Peace, a similar length of time to playing a game. The game allows the user to become immersed, and therefore knowledge retention is enhanced. This has led to researchers considering the use of games within education. Whilst the mental image of game playing is of a solitary activity, the reality is that it is highly social. ‘Games encourage collaboration among players and thus provide a context for peer to peer teaching and for the emergence of communities’ [Squire 2003].
The notion that people learn through reward or punishment has been overtaken by the constructivist theory, where people learn by assembling facts, experience and practice. In a classroom, students are estimated to ask 0.1 questions per hour (168). Consider this in comparison to a Web blog about a game: users post and read many articles, which is a form of questioning. After all, games have been used for a long time in early school learning. Games have many pedagogical aspects:
Activates prior learning: games require facts to achieve success.
Context: knowing which techniques/objects to apply, and what results are expected.
Feedback and Assessment: scoring, levels and awards are part of the game.
Transfer: transfer of learning from other parts of life, like school.
Experimental: for each action there is a reaction (feedback).
Social: distributed communities.
Squire suggests that games such as Civilization be considered for their suitability for educational purposes. Might a student learn more about the ways of the Middle Ages by playing a historically accurate game than by reading a text book? Civilization involves maps, economics, asset management and personnel skills.
One of the key aspects of TEL is to ensure that information overload is kept to a minimum. This caveat was discussed by [Mont06 - 173] in his presentation on the differences between what can be perceived as expected from an Adaptive Hypermedia (AH) system and what is currently available. Any system which aids learning must itself be simple to understand, with minimal effort involved in being productive quickly. AH systems are about composing a system environment around what the user deems personally suitable. He stated that the vision of a Web which works hand in hand with education is one of futuristic expectation and enthusiasm. There is much in the way of promises being proffered about this subject, but research is still lagging behind expectations.
It may be quite a few years before solid advancements are seen. There are 2 main issues involved:
1 Technological - The effectiveness of tools in being able to store and retrieve information. The effectiveness of agents in being able to assist humans successfully in learning and information retrieval. There are many Agents currently available which offer to automate certain tasks, but many of these are placeholders for something more effective. Is an Agent which can be left to browse the Web and report back the most suitable 10,000 Web pages more useful than one that has the ability to return a limited number of human-recommended pages? Advancements in technology sometimes hinder progress as well as help. The technology of the Web has been fairly static for a number of years: HTML has been extended, DSL has much larger capacity and wireless technology has prospered, but the underlying technology, i.e. TCP/IP, HTTP, networks and Web browsers, has not changed much. Research into the educational Web has also been stalled for a number of years, and it is accepted that these work hand in hand.
2 Conceptual - Sometimes, concept is more important than technology; the concept of the semantic web is relatively simple to understand, but its impact may far outweigh any technical advancement which may emerge in the short term. The concept of an open access repository of learning material which is readily available is easy to comprehend, but such a repository is still not available.
There are currently a number of Web-based educational applications which successfully support the exchange of activities, material and collaboration among members of European higher educational establishments. One such application is The Universal Project. This can be found at http://www.ist-universal.org.
The Universal Project is an open repository for learning resources on the Web.
Quantifying the benefits of computer aided learning environments - One important aspect of the use of computer aided learning environments is the use of metrics to measure how effective each one is in comparison with the others. This is a difficult evaluation, as it is a new area and many of the developments are still ongoing. It is also difficult because the systems may have differing aims and be designed for differing target groups. The role of evaluation may be best addressed by partitioning a system into functions, such as 1) User interface 2) Expandability 3) Variety of domain subjects and content 4) Adaptability to target groups
Virtual Research Environments (VRE) - VREs comprise digital infrastructure and services which enable research to take place. They are normally associated with an institutional supporting structure and at minimum offer administration and management facilities for the sharing and reuse of tools, data and results. Such sharing assumes the implementation of standards for data representation, indexing and access. In most cases, the VREs are owned and managed by the research communities; isolation would be very unrealistic. One such example of a VRE under development is the Building a VRE for the Humanities Project (BVREH), which is being undertaken within the Humanities Division at the University of Oxford. The main aim of this project is not to create a system in its entirety but, through surveying, to establish a set of priorities for VRE developments. Fraser [FRAS05 - 186] states that with regard to the BVREH Project ‘initial results suggest that the overall priorities …are central hosting and the curation of digital components of research projects, and the potential for a VRE to facilitate communications’.
Head [HEAD07] states that there are 3 stages essential to quality research: 1) Plotting the course for research 2) Crafting the quality research paper 3) Preparing the paper, and adhering to grading criteria and citation standards. Head also goes on to state that, in the students' opinions, their chances of succeeding in a research project would be improved by: 1) The opportunity of turning in draft papers which are reviewed and returned. 2) Individual sessions with librarians for help with narrowing down topics. 3) One-to-one coaching with professors focusing on how to overcome obstacles. In her study, she found that the majority of students were not as reliant upon SEs as prior studies had suggested. Only 1 in 10 students reported using SEs for conducting research. This could be a result of the overwhelming amount of information returned by SEs. This is supported by her statement ‘At first, a majority of students in our discussion groups reported using Yahoo or Google as their first step in their research process. However, further discussion with the participants revealed their search engine searches often proved useless.'
Shaw [SHAW03] stated that there are 6 steps to online research: 1) Questioning: the student must understand the assignment. 2) Planning: duration, where to look, # of sources, who will they be working with? 3) Gathering: use as many sources as possible. Make the student aware of the issue of reliability of information on the Web. 4) Sorting and sifting: order information into categories. 5) Synthesising: map all the information into one report. Possibly use a concept mapping method. 6) Evaluating: does the complete paper meet requirements? Be prepared to re-write parts. These 6 points confirm Head’s recommendations, but break them down slightly more.
Communication/ knowledge sharing within Learning
Despite the technology of the Web and tools such as agents and CSCW, the main method of communication in education will remain human to human interaction.
Email has become an important aspect of most educational establishments, even from primary school level. Education already makes use of stored interactions such as video, PowerPoint presentations and blogs. The use of CSCW encourages the organising and recording of interactions and enables the creation of workgroups who can interact with one another in real time. The addition of semantic information to these stored interactions can only help in the indexing, storage and retrieval of such sources. Most research materials held on the Web are shielded behind toll-gates which restrict their circulation. Subscription to these journal portals is increasingly a burden on universities. One solution to this is for authors to self-archive their works. Previously, the best way of doing this was to place the articles on the authors’ own web-sites; this is not a particularly effective method. A system of OAI-compliant archiving offers a way to improve scholarly communication and to increase the likelihood of realising the sought-after notion of accessible peer-reviewed material. The SHERPA [185] project seeks to place these repositories at the institutional level by creating a national repository. Search facilities will be created by other projects. The main challenge is a cultural one: convincing academics to be involved in the initiative. There is a widely held view that freely available research material on the Web is of low quality; for this reason, SHERPA aims to place a priority on collecting refereed material.
Adaptive Hypermedia
Adaptive Hypermedia Systems (AHS) allow the media being displayed to adapt to the needs of the user. A number of attributes may have an impact, such as (taken from [LAWL05 - 179]):
1) Cultural Background - Language, dialect, measures and weights can all be altered to suit.
2) User Preferences - The user may be allowed to make changes to the interface through menus.
3) Communications styles - The way in which instructions are passed to a user differs; some may like instructions to be explicit, others may prefer to be given more vague instructions. Disabled people may need specific delivery measures.
4) Cognitive and learning styles - Users differ in the manner in which they prefer to learn; for example, graphs versus figures.
5) Prior Knowledge and Expertise - There is a link between learning objects and the knowledge that is required to understand them. The level of prior knowledge may dictate the content, and the speed at which learning objects are made available to the user.
There are a number of methods of adapting the information made available in a system. These methods are listed below:
1) Adaptive Navigation – This is where the displayed hypertext links adapt to the user. An example of this is Amazon, the online book store. Different users will see links to different books on the Web page which may match their browsing history.
2) Adaptive Presentation – The displayed content can be altered in many ways: layout, font, colours, content etc. An example of this is the Syskill & Webert Agent (Doc 90), which attempted to learn from the user’s feedback to create a separate profile for each topic the user is interested in. The Agent would then adapt its suggested links to more closely suit the user.
3) Historical Adaptation – This attempts to give the user some context of time as they progress through a series of modules. It may display landmarks to give the user a notion of where they are within the larger picture of an ongoing project.
AH will be driven by the inclusion of metadata content in Websites and documents, although this is somewhat hindered by the lack of uniformity in metadata standards. There are a number of repositories which currently store and encourage the re-use of educational information, such as ARIADNE, ARROW and Lydia. Some of these are public, others commercial.
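The adaptive navigation method above can be sketched in a few lines. This is only an illustrative sketch, not how Amazon or any named system works: the link structure, the topic labels and the function name rank_links are all assumptions made for the example. Links whose topics overlap most with the user's browsing history are simply moved to the front.

```python
from collections import Counter


def rank_links(links, history_topics):
    """Order candidate links so those matching the user's history come first.

    links: list of (url, set_of_topics) pairs (hypothetical structure).
    history_topics: topics of pages the user visited earlier, repeats allowed.
    """
    interest = Counter(history_topics)  # topic -> number of earlier visits

    def score(link):
        _url, topics = link
        # A link scores one point per earlier visit to each of its topics.
        return sum(interest[t] for t in topics)

    # sorted() is stable, so links with equal scores keep their original order.
    return sorted(links, key=score, reverse=True)


# Hypothetical catalogue and browsing history.
links = [
    ("/books/gardening", {"gardening"}),
    ("/books/rdf-primer", {"semantic-web", "metadata"}),
    ("/books/agents", {"agents", "semantic-web"}),
]
history = ["semantic-web", "agents", "semantic-web"]
ranked = rank_links(links, history)
```

A real system would weight recency and feedback as well; the point here is only that the same page can present a different link order to each user.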
AH System architecture
There are a number of elements needed in creating a complete AH system. Following are some of the key elements:
1) Metadata standards – The entire system needs to converse in a universally accepted language; OWL is one such standard, put forward by the W3C. This enables material to be described in a readable way which can be interpreted by all associated systems.
2) Collection Agents – One of the key features of an AH system is the use of a search Agent to locate educational information on the Web. The use of metadata will enhance this functionality as it will be easier to classify information.
3) Repositories – For educational material to be widely accessible, repositories need to be publicly available, at least to educational establishments. They need to be organised in such a way that the material can be readily located, and the materials organised logically.
4) Content Delivery System – There needs to be a method of providing and delivering material in an adaptive manner. The method should support both teacher and student and offer functionality for the provision of materials, as well as displaying, marking and monitoring progress and feedback.
AH systems rely on possessing some degree of intelligence. For the software to adapt its output, it must be able to demonstrate reasoning in order to decide curriculum sequencing, the selective provision of content and the analysis of student work. Devedzic [DEVE03 – 182] stated that future systems should be able to demonstrate better content- and theory-oriented intelligence and should pay more attention to reusability and interoperability. This can be achieved through the use of semantic ontology languages such as OWL, but it remains the responsibility of developers to make good use of this metadata to meet these needs.
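As a rough illustration of the curriculum sequencing reasoning mentioned above, the sketch below offers a student only those learning objects whose prerequisites have already been mastered. The object names, the prerequisite table and the function next_objects are hypothetical; a real AH system would draw this information from metadata rather than a hand-written dictionary.

```python
def next_objects(prerequisites, mastered):
    """Return the learning objects the student can attempt now, sorted by name.

    prerequisites: dict mapping each learning object to the set of objects
                   that must be mastered first (hypothetical data).
    mastered: set of objects the student has already completed.
    """
    return sorted(
        obj
        for obj, needs in prerequisites.items()
        # Offer an object only if it is new and all its prerequisites are met.
        if obj not in mastered and needs <= mastered
    )


# Hypothetical prerequisite table for a short course.
prerequisites = {
    "html-basics": set(),
    "xml-basics": {"html-basics"},
    "rdf-intro": {"xml-basics"},
    "owl-intro": {"rdf-intro"},
}
available = next_objects(prerequisites, mastered={"html-basics"})
```

Calling the function again after each completed object yields a simple adaptive path through the material.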
The Semantic Web
Although Web authors’ choices worldwide are largely uncoordinated, they are anything but uncorrelated; there is self-organisation amongst domains, blogs and forums, there are language standards, and there is a very successful set of links between all the pages. For the size of the Web, it is actually quite well organised. This is not to say that it could not be more organised, and given time, this will happen. The Semantic Web (SW) is part of this process of improvement. The notion of a SW was first articulated in 2001 by Berners-Lee et al. [doc 151]. A SW is one where individual Web pages contain two types of information: that which the viewer can see, and that which is transparent to the viewer but can be manipulated by computers. To enable Web pages to contain this extra information, they must be created in a language other than basic HTML. This language must be both backwards compatible with existing Web pages and extensible enough to allow the additional semantic information to be added to the Web page. The Semantic Web can be envisioned as a Web universe, parallel to standard documents, which contains semantic information. The goal of the SW was described by Flake et al. [FLAK03 - 180] as “to have humans make the web more digestible for computers”. One of the stumbling blocks to the expansion of the SW is the lack of formal ontologies. The W3C has made concerted efforts to formalise methods of specifying, developing and deploying languages which can be used as foundations for semantic processing. An example of this is the Resource Description Framework (see below). RDF is assisted by the creation of triple stores, which are repositories for RDF contents, and by extraction languages such as GRDDL (www.w3.org/2004/01/rdxh/spec) for accessing the RDF information.
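To make the idea of a triple store more concrete, the following is a minimal in-memory sketch. It is not modelled on any particular product; the class name TripleStore, its methods and the sample statements are assumptions made for illustration. Each statement is held as a (subject, predicate, object) tuple and queried by pattern, with None acting as a wildcard.

```python
class TripleStore:
    """A toy repository of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the stored triples matching the pattern, in sorted order.

        A value of None in any position matches anything.
        """
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


# Hypothetical statements about two Web pages.
store = TripleStore()
store.add("ex:index.html", "dc:language", "en")
store.add("ex:index.html", "dc:creator", "exstaff:85740")
store.add("ex:about.html", "dc:language", "fr")

english_pages = store.match(predicate="dc:language", obj="en")
about_index = store.match(subject="ex:index.html")
```

Production triple stores add indexing and a query language, but the pattern-matching idea is the same.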
Search Engines: There are a number of search engines which exploit the benefits of the SW: (1) the SHOE search (doc 64, http://www.cs.umd.edu/projects/plus/SHOE/search), which utilises the extensions added to HTML by SHOE, (2) Swoogle (Semantic Web Ontology … http://swoogle.umbc.edu/ doc 167), which benefits from the RDF tags added to Web pages to assist searching, and (3) Semantic Web Search (http://www.semanticwebsearch.com/query/), which also uses RDF to locate Web pages and allows automated queries to be run against it. Swoogle helps users locate ontologies and terms, and also serves agents seeking knowledge. Navigating the SW is quite different from navigating the conventional Web. Firstly, new tools are required to extract and display the SW. Secondly, the SW is disjointed in its coverage in comparison to the conventional Web. Thirdly, the SW was never intended to replace the information found in the conventional Web, but to augment it so that the information can be more easily categorised, indexed, located and displayed. The SW will prove to be more useful in certain scenarios; the classification and indexing of scientific literature will probably make the most of the SW, but organisations such as companies wishing merely to have a Web presence may not gain much at all. Conventional Web navigation and ranking models are not suitable for the Semantic Web for 2 reasons:
1) They do not differentiate between Semantic Web Documents (SWDs) and other Web pages.
2) They do not parse and use the internal structure of a SWD.
For more flexibility in document relationships, OWL was created. It uses the links provided by RDF to allow ontologies (networks of relationships) to be distributed across systems - MORE
Conventional SEs cannot utilise the power offered by the SW, so a more specialised SE such as Swoogle (doc 167) must be used. Swoogle automatically finds appropriate ontologies, gathers instance data and characterises the SW by identifying structural relationships.
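The first of the two reasons above, telling a SWD apart from an ordinary Web page, can be illustrated with a small sketch. This is a deliberate simplification and not a description of how Swoogle works: it assumes a SWD is a stand-alone RDF/XML file whose root element is rdf:RDF, and it ignores other serialisations and RDF embedded in HTML. The function name looks_like_swd and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree writes qualified names as "{namespace-uri}local-name".
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def looks_like_swd(text):
    """Return True if the text parses as XML with rdf:RDF as its root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Much ordinary HTML is not well-formed XML, so it fails to parse.
        return False
    return root.tag == RDF_ROOT


# Hypothetical sample documents.
rdf_doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""
xhtml_page = "<html><body><p>Hello</p></body></html>"
sloppy_html = "<html><body><p>Hello<br></p></body></html>"

results = [looks_like_swd(d) for d in (rdf_doc, xhtml_page, sloppy_html)]
```

Once a document has been recognised as a SWD, a specialised engine can go on to parse its internal structure, which addresses the second reason.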
Searches can be made via a simple GUI, and APIs provide support for agents. Currently, Swoogle has indexed around 11,000 Semantic Web Documents (SWDs). There has been a slow uptake of these technologies, and without exposure, success is restricted. Consider a Search Engine with very few indexed pages; how often would it get used? Despite this, it has given the opportunity for new techniques to be developed and tested which may be beneficial in the long term. Metadata available on the Web is still lacking in educational circles (doc 146). There are a number of UK government initiatives in place, such as the Integrated Public Sector Vocabulary and the UK Office of Public Sector Information. The benefits of the SW are now widely accepted amongst key communities, such as education. Will the benefits outweigh the costs? Extra coding time, new skills and maintenance issues are all concerns for organisations that may be reluctant to adopt the new methods. “The vision of the educational semantic web is based on three fundamental affordances. The first is the capacity for effective information storage and retrieval. The second is the capacity for non-human autonomous agents to augment the learning and information retrieval and processing power of human beings. The third affordance is the capacity of the Internet to support, extend and expand communications capabilities of humans in multiple formats across the bounds of time and space.” (Anderson and Whitelock 2004) Sparck Jones [157] argues that semantic processing is not the direction to take and that it will only offer superficial benefits in the end; she argues that using natural language processing to categorise and extract information from the Web offers more potential. Unfortunately, she offers scant practical suggestions as to how this can be implemented. Establishing standards has proven to be more difficult for the SW than it was for the conventional Web.
This is evident in the time that it has taken for organisations such as the W3C to establish standards, and in the lack of organisations involved in creating SW pages and tools.
Human issues: It is probable that most semantic information placed on the Web will be put there by ordinary users, not specialists. Might this lead to incorrect information being used? The idea that the new SW is being fed inappropriate or wrong information is enough to upset many network managers. There is also a tendency in educational circles to believe that the educational transaction experience should be overwhelmingly human and not machine assisted. There are a number of projects attempting to harness the benefits of the Semantic Web, one being the Learning Object Metadata working group. This has been undertaken by the IEEE Learning Technology Standards Committee, who aim to develop a draft standard for learning object metadata. Similar to OWL, it sets about describing characteristics of the learning object such as title, language and size, and groups them into categories such as educational or technical. By defining a workable structure for educational objects, it is hoped that it will be easier for software to locate, manage and use these objects. There is a lot of respect for the notion of a Semantic Web, but there is also evidence to suggest that a percentage of researchers are concerned about the amount of expectation being placed on its shoulders. Wilks [WILK06] argues that the SW may just be a renamed version of traditional Artificial Intelligence knowledge representation, which comes complete with its own problems and challenges.
Resource Description Framework (RDF)
RDF is a language for representing information about resources on the Web. It was intended for representing metadata about Web resources, such as the author, title and modification date of a Web page. RDF is intended to be used in conjunction with applications which can extract the additional information for processing.
URLs are a particular kind of URI. URIs are not limited to identifying things that have network locations; they can also identify things such as people, organisations and books. RDF uses XML in order to make its statements machine readable. The following example is taken from http://www.w3.org/TR/rdf-primer/
Figure F illustrates how, by including several statements within the XML, additional information can be gathered about a Web page. In this example, the Web page http://www.example.org/index.html is the subject, http://www.example.org/terms/creation-date is the predicate, and ‘August 16, 1999’ is the object. We can view ‘http://www.example.org/terms/creation-date’ as being the link which joins the date to the Web page.
Figure F: Several Statements about the Same Resource
An alternative method of displaying this is using triple notation. Each statement consists of the subject, the predicate and the object.
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/language> "en" .
This can also be displayed in a shorthand version, which allows namespaces (variables) to be used, such as:
prefix rdf:, namespace URI: http://www.w3.org/1999/02/22-rdf-syntax-ns#
prefix rdfs:, namespace URI: http://www.w3.org/2000/01/rdf-schema#
prefix dc:, namespace URI: http://purl.org/dc/elements/1.1/
prefix owl:, namespace URI: http://www.w3.org/2002/07/owl#
prefix ex:, namespace URI: http://www.example.org/ (or http://www.example.com/)
prefix xsd:, namespace URI: http://www.w3.org/2001/XMLSchema#
The shorthand version of the triples would then appear as:
ex:index.html dc:creator exstaff:85740
ex:index.html exterms:creation-date "August 16, 1999"
ex:index.html dc:language "en"
Note how the prefix ex: is first defined in the top set and then used in the bottom set. The triples can be incorporated into the Web page using RDF/XML coding. It might look something like the following:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
The first 3 lines set up the namespaces. We then have 2 sets of triples establishing the meta-data links.
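The relationship between the shorthand triples and full URIs can be shown with a short sketch that expands each prefixed name using a namespace table. The function name expand is invented for the example. The prefixes exterms: and exstaff: do not appear in the prefix list above, so their namespace URIs are assumptions here (the exterms: value is consistent with the creation-date predicate quoted in the text).

```python
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://www.example.org/",
    # Assumed for illustration: these two prefixes are not in the prefix list.
    "exterms": "http://www.example.org/terms/",
    "exstaff": "http://www.example.org/staffid/",
}


def expand(term):
    """Expand a prefix:local name to a full URI; leave quoted literals as-is."""
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return NAMESPACES[prefix] + local


# The three shorthand triples from the text.
shorthand = [
    ("ex:index.html", "dc:creator", "exstaff:85740"),
    ("ex:index.html", "exterms:creation-date", '"August 16, 1999"'),
    ("ex:index.html", "dc:language", '"en"'),
]
full = [tuple(expand(term) for term in triple) for triple in shorthand]
```

The design keeps literals (the quoted date and language code) untouched, since only subjects, predicates and resource objects are identified by URIs.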
To check out:
Infoharness: provides access to indexed, relational data through a front end which converts user queries into a form the database can accept.
Harvest: A set of linked ‘gathering’ stations which collect, index and store information. Queries go via these stations initially to seek answers.
RBSE Spider: The spider operates much as any spider, but it is the storage of the results which is better. SQL queries are supported against fields.
Clever: Hyperlink based personalised web search.
Alexa
Ringo
MetaWeb – Client side processing slows this down.
Object-Lens: ref’d in Doc 59. A semi-autonomous agent consisting of a collection of user programmed rules. For example, the user can set up rules for dealing with emails.
Blue Squirrel’s WebSeeker: www.bluesquirrel.com
Copernic 2001: www.copernic.com
Excalibur: www.excalib.com
Infofinder: Ref’d in Doc 85 – requests that the user supply some sample docs before initiating a search.
Amalthaea: Ref’d in Doc 85
Look into Noodletools