package.json
{
"name": "scrape-it",
"description": "A Node.js scraper for humans.",
"keywords": [
"scrape",
"it",
"a",
"scraping",
"module",
"for",
"humans"
],
"license": "MIT",
"version": "6.1.12",
"main": "lib/index.js",
"types": "lib/index.d.ts",
"scripts": {
"test": "node test"
},
"author": "Ionică Bizău <bizauionica@gmail.com> (https://ionicabizau.net)",
"contributors": [
"ComFreek <comfreek@outlook.com> (https://github.com/ComFreek)",
"Jim Buck <jim@jimmyboh.com> (https://github.com/JimmyBoh)",
"Non <aomnonpn@gmail.com> (https://github.com/fadingNA)"
],
"repository": {
"type": "git",
"url": "git+ssh://git@github.com/IonicaBizau/scrape-it.git"
},
"bugs": {
"url": "https://github.com/IonicaBizau/scrape-it/issues"
},
"homepage": "https://github.com/IonicaBizau/scrape-it#readme",
"blah": {
"h_img": "https://i.imgur.com/j3Z0rbN.png",
"cli": "scrape-it-cli",
"description": [
"----",
"",
"<p align=\"center\">",
"Sponsored with :heart: by:",
"<br/><br/>",
"<h3 align='center'><a href='https://serpapi.com/?utm_source=scrape-it'>SerpApi.com</a></h3>",
"<a href='https://serpapi.com/?utm_source=scrape-it'><img title='SerpApi' src='https://i.imgur.com/JXqpoEE.png'/></a>",
"<br/>",
"<br/>",
"",
"<h3 align='center'><a href='http://scrapeless.com/?utm_source=scrape-it'>Scrapeless.com</a></h3>",
"",
"[](https://www.scrapeless.com/en?utm_source=scrape-it)",
"",
"[Scrapeless](http://scrapeless.com/?utm_source=scrape-it) – Easy web scraping toolkit for businesses and developers",
"",
"⚡ [Scraping Browser](https://www.scrapeless.com/en/product/scraping-browser?utm_source=scrape-it):",
"",
" 1. Web browsing capabilities for AI agents and applications",
" - Collect data at scale for agents without being blocked",
" - Simulate user behavior using advanced browser tools",
" - Build agent applications with real-time and historical web data",
" 2. Unlock any scale with unlimited parallel jobs",
" 3. High-performance web unlocking built directly into the browser",
" 4. Compatible with Puppeteer and Playwright",
"",
"⚡ [Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=scrape-it): One-click Google search data monitoring, supporting 15+ SERP scenarios such as academic/Google Store/Maps, $0.1/thousand queries, 0.2s response. Recently, Scrapeless has officially launched [MCP Server](https://github.com/scrapeless-ai/scrapeless-mcp-server), which can help large prediction models easily capture the latest data and ensure the accuracy of the results.",
"",
"⚡ [Scraping API](https://www.scrapeless.com/en/product/scraping-api?utm_source=scrape-it): Easily obtain public content such as TikTok, Shopee, Amazon, Walmart, etc. Covering structured data of 8+ vertical industries such as e-commerce/social media, ready to use. Only billed by the number of successful calls.",
"",
"⚡ [Universal Scraping API](https://www.scrapeless.com/en/product/universal-scraping-api?utm_source=scrape-it): Intelligent IP rotation + real user fingerprint, success rate up to 99%. No more worrying about network blockades and crawling obstacles.",
"",
"⚠️ Exclusive for open source projects: Submit the Repo link to apply for 100,000 free Deep SerpApi queries!<br/>",
"📌 [Try it now](https://app.scrapeless.com/passport/login?utm_source=scrape-it) | [Documentation](https://docs.scrapeless.com/en/scraping-browser/quickstart/introduction/?utm_source=scrape-it)",
"----"
],
"installation": [
{
"h2": "FAQ"
},
{
"p": "Here are some frequent questions and their answers."
},
{
"h3": "1. How to scrape ajax pages?"
},
{
"p": "`scrape-it` has only a simple request module for making requests. That means you cannot directly parse ajax pages with it, but in general you will encounter one of these scenarios:"
},
{
"ol": [
"**The ajax response is in JSON format.** In this case, you can make the request directly, without needing a scraping library.",
"**The ajax response gives you HTML back.** Instead of calling the main website (e.g. example.com), pass the ajax URL (e.g. `example.com/api/that-endpoint`) to `scrape-it` and you will be able to parse the response.",
"**The ajax request is so complicated that you don't want to reverse-engineer it.** In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content and then use the `.scrapeHTML` method from `scrape-it` once the HTML is loaded on the page."
]
},
{
"h3": "2. Crawling"
},
{
"p": "There is no fancy way to crawl pages with `scrape-it`. For simple scenarios, you can parse the list of URLs from the initial page and then, using Promises, scrape each page. Alternatively, you can use a separate crawler to download the website and then use the `.scrapeHTML` method to scrape the local files."
},
{
"h3": "3. Local files"
},
{
"p": "Use the `.scrapeHTML` method to parse the HTML read from local files with `fs.readFile`."
}
]
},
"dependencies": {
"assured": "^1.0.16",
"cheerio": "^1.1.0",
"cheerio-req": "^2.0.1",
"scrape-it-core": "^1.0.2"
},
"devDependencies": {
"@types/cheerio": "^1.0.0",
"@types/node": "^24.0.14",
"lien": "^4.0.0",
"tester": "^2.0.0",
"ts-node": "^10.9.2",
"typescript": "^5.8.3"
},
"files": [
"bin/",
"app/",
"lib/",
"dist/",
"src/",
"scripts/",
"resources/",
"menu/",
"cli.js",
"index.js",
"index.d.ts",
"package-lock.json",
"bloggify.js",
"bloggify.json",
"bloggify/"
]
}