Merge pull request #29 from GaiaNet-AI/add-firecrawl
Add firecrawl
Showing 2 changed files with 48 additions and 0 deletions.
---
sidebar_position: 5
---

# Knowledge base from a URL

In this section, we will discuss how to create a vector collection snapshot from a web URL. First, we will parse the URL into a structured markdown file. Then, we will follow the steps in [Knowledge base from a markdown file](markdown.md) to create embeddings for the content of your URL.

## Parse the URL content to a markdown file

Firecrawl can crawl and convert any website into LLM-ready markdown or structured data. It also supports crawling a URL and all accessible subpages.

> To use Firecrawl, you need to sign up on [Firecrawl](https://firecrawl.dev/) and get an API key.

First, install the dependencies. We assume that you already have Node.js 20+ installed.

```
git clone https://github.com/JYC0413/firecrawl-integration.git
cd firecrawl-integration
npm install
```

Then, export the API key in the terminal.

```
export FIRECRAWL_KEY="your_api_key_here"
```

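A script that depends on this variable can check for it before doing any work. The following is a minimal sketch in Node.js, not code taken from `crawlWebToMd.js`; the function name `readFirecrawlKey` is hypothetical, and treating the placeholder string as "not set" is an assumption made for illustration.

```javascript
// Hypothetical helper, not part of the firecrawl-integration repository.
// Reads FIRECRAWL_KEY from an environment object and fails early if it is missing.
function readFirecrawlKey(env = process.env) {
  // Treat an unset variable the same as an empty one.
  const key = (env.FIRECRAWL_KEY ?? "").trim();

  // Assumption: the placeholder from the export example is never a real key.
  if (key === "" || key === "your_api_key_here") {
    throw new Error('FIRECRAWL_KEY is not set. Run: export FIRECRAWL_KEY="<your key>"');
  }
  return key;
}
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real `process.env`.
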
Next, run the following command to start the service.

```
node crawlWebToMd.js
```

Once the application is running, a prompt appears in the terminal.



You can now type your URL in the terminal. There are two options.

* Multiple pages: enter the link with a `/` at the end. The program will crawl the page and its subpages and convert them into one single markdown file. This option can use up a lot of your Firecrawl API quota.
* One single page: enter the link without a `/` at the end. The program will crawl and convert only the current page into one single markdown file.

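The trailing-slash rule described above can be sketched as a small function. This is only an illustration of the rule as stated in this tutorial, not the actual logic of `crawlWebToMd.js`; the function name `chooseCrawlMode` and the mode labels are hypothetical.

```javascript
// Hypothetical helper illustrating the trailing "/" rule; not taken from crawlWebToMd.js.
function chooseCrawlMode(input) {
  const raw = input.trim();

  // Validate the input first; the URL constructor throws a TypeError on malformed URLs.
  new URL(raw);

  // Check the raw text the user typed, not the parsed URL: the URL class
  // normalizes "https://example.com" to "https://example.com/", which would
  // hide the difference between the two options.
  return raw.endsWith("/") ? "multiple-pages" : "single-page";
}
```

Under this sketch, `https://example.com/docs/` selects the multi-page crawl and `https://example.com/docs` selects the single-page conversion, so it is worth checking the end of the link before pressing Enter.
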
The output markdown file is named `output.md` and is saved in the `firecrawl-integration` folder.

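Before moving on, it can be useful to confirm that `output.md` exists and actually contains content. The sketch below is an optional sanity check and not part of the integration; the function name `describeOutput` is hypothetical, and the heading count is only a rough indicator of the file's structure.

```javascript
const fs = require("node:fs");

// Hypothetical helper: summarizes a markdown file so an empty or missing
// crawl result is caught before the embedding step.
function describeOutput(path = "output.md") {
  // readFileSync throws if the file does not exist.
  const text = fs.readFileSync(path, "utf8");
  if (text.trim() === "") {
    throw new Error(`${path} is empty`);
  }

  const lines = text.split(/\r?\n/);
  // Rough count of ATX headings ("#" to "######" followed by whitespace).
  // Lines starting with "#" inside fenced code blocks are counted too.
  const headings = lines.filter((line) => /^#{1,6}\s/.test(line)).length;

  return { characters: text.length, lines: lines.length, headings };
}
```

Calling `describeOutput()` from the `firecrawl-integration` folder returns the character, line, and heading counts, or throws if the crawl produced nothing.
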
## Create embeddings from the markdown files

Please follow the tutorial [Knowledge base from a markdown file](markdown.md) to convert your markdown file to a snapshot of embeddings that can be imported into a GaiaNet node.