International Journal of Computer Engineering in Research Trends

Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)

Ms. Isha Mahajan, Ms. Harjinder Kaur, Dr. Darshan Kumar

Affiliations
Department of Computer Science & Engineering SSIET, Dinanagar - 143531, Distt. Gurdaspur, Punjab (India)

DOI: 10.22362/ijcert/2017/v4/i6/xxxx [UNDER PROCESS]

Abstract

As the World Wide Web grows, the number of web pages has risen into the billions. Search engines exist to make finding specific information on the Web practical; without them, it would be almost impossible to locate anything unless we already knew its URL. Every search engine maintains a central repository, or database, of HTML documents in indexed form, and each user query is answered by searching that database of indexed pages. No search engine's repository can hold every page available on the Web, so it is desirable to store only the most relevant and important pages in order to increase the efficiency of the search engine. This database is maintained by special software called a "crawler": a program that traverses the Web and downloads web pages. Web crawlers are also known as "web spiders," "robots," "Internet bots," "agents," and "automatic indexers." Broad search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Since the Web is a distributed, dynamic, and rapidly growing information resource, a crawler cannot download all pages; it can crawl only a fraction of them. A crawler should therefore ensure that the fraction it crawls consists of the most relevant and important pages, not just random ones. The crawler is an important module of a search engine, and its quality directly affects the quality of search. In our work, we propose to improve the crawling of a web crawler so that it crawls only relevant and important pages from the WWW, which will reduce server overhead.
With our proposed architecture, we also optimize the crawled data by removing the least-used and never-browsed pages. A crawler needs a large amount of memory or database space to store page content; by not storing irrelevant and unimportant pages and by removing pages that are never accessed, we save a great deal of storage space, which in turn speeds up queries to the database. We further propose to sort search results with the Extended Weighted Page Rank based on Visits of Links (VOL) algorithm, which reduces the user's search space by placing the most-visited pages, and the pages on which users spend the most time, at the top of the results list.
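
To illustrate the idea of ranking by visits of links, the following is a minimal Python sketch of a simplified visit-weighted PageRank, in which a page passes its rank to each page it links to in proportion to how often that link was followed. It is not the extended algorithm proposed in this paper: the page-reading-time component and the inlink/outlink weights are omitted, and the function name `visit_weighted_pagerank`, the `link_visits` input format, the damping factor of 0.85, and the fixed iteration count are all assumptions made for the example.

```python
from collections import defaultdict


def visit_weighted_pagerank(link_visits, damping=0.85, iterations=50):
    """Rank pages from click counts on links (simplified, illustrative only).

    link_visits maps (source, target) -> number of recorded visits of that link.
    """
    pages = set()
    total_out = defaultdict(int)   # total visits over all outlinks of a page
    inbound = defaultdict(list)    # target -> list of (source, visits)
    for (src, dst), visits in link_visits.items():
        pages.update((src, dst))
        total_out[src] += visits
        inbound[dst].append((src, visits))

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each source passes on rank in proportion to the share of its
            # outgoing clicks that went to this page.
            share = sum(
                rank[src] * visits / total_out[src]
                for src, visits in inbound[page]
                if total_out[src] > 0
            )
            new_rank[page] = (1 - damping) + damping * share
        rank = new_rank
    return rank


# Hypothetical click log: most visitors of "a" follow the link to "b".
example_visits = {("a", "b"): 9, ("a", "c"): 1, ("b", "a"): 5, ("c", "a"): 5}
ranks = visit_weighted_pagerank(example_visits)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy input, page "b" ends up ranked above page "c" because the link to it is followed far more often, which is the behaviour the abstract describes as placing the most-visited pages at the top of the results list.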

Citation

Isha Mahajan et al., "Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)", International Journal of Computer Engineering In Research Trends, 4(6), pp. 202-230, June 2017.

Keywords: Web Crawler, Extended Weighted Page Rank based on Visits of Links, Weighted Page Rank, Page Rank, Page Rank based on Visits of Links, Search Engine, Crawling, Bot, Information Retrieval Engine, Page Reading Time, User Attention Time, World Wide Web, Inlinks, Outlinks, Web Information Retrieval, Online Search.

DOI Link : Not yet assigned
