Loic.dotnetcore.DotnetSpider
1.0.0
.NET CLI
dotnet add package Loic.dotnetcore.DotnetSpider --version 1.0.0
Package Manager
NuGet\Install-Package Loic.dotnetcore.DotnetSpider -Version 1.0.0
PackageReference
<PackageReference Include="Loic.dotnetcore.DotnetSpider" Version="1.0.0" />
Central Package Management (Directory.Packages.props entry, then project file entry)
<PackageVersion Include="Loic.dotnetcore.DotnetSpider" Version="1.0.0" />
<PackageReference Include="Loic.dotnetcore.DotnetSpider" />
Paket CLI
paket add Loic.dotnetcore.DotnetSpider --version 1.0.0
Script & Interactive
#r "nuget: Loic.dotnetcore.DotnetSpider, 1.0.0"
File-based Apps
#:package Loic.dotnetcore.DotnetSpider@1.0.0
Cake (as an addin, then as a tool)
#addin nuget:?package=Loic.dotnetcore.DotnetSpider&version=1.0.0
#tool nuget:?package=Loic.dotnetcore.DotnetSpider&version=1.0.0
DotnetSpider
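Disclaimer: This framework is intended to help developers simplify their development workflow and improve productivity. Do not use it for anything that violates the law. The framework's authors bear no responsibility for anything users do with it.
DotnetSpider is a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.
If you want to get the latest beta packages, add the MyGet feed: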
<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
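The feed entry above belongs inside the <packageSources> element of a NuGet.Config file. As a minimal sketch, assuming you want a solution-local config (the file location, and listing nuget.org next to the MyGet feed, are assumptions rather than something the project prescribes), the file could be created like this:

```shell
# Write a solution-local NuGet.Config listing nuget.org plus the MyGet beta feed.
# Assumption: run from the solution directory; this overwrites any NuGet.Config already there.
cat > NuGet.Config <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
    <add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
  </packageSources>
</configuration>
EOF
```

Alternatively, dotnet nuget add source https://www.myget.org/F/zlzforever/api/v3/index.json --name myget.org registers the same feed without editing the file by hand.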
DESIGN
DEVELOPMENT ENVIRONMENT
Visual Studio 2017 (15.3 or later) or JetBrains Rider
Docker
MySQL
docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
Redis (optional)
docker run --name redis -d -p 6379:6379 --restart always redis
SQL Server
docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
PostgreSQL (optional)
docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
MongoDB (optional)
docker run --name mongo -d -p 27017:27017 --restart always mongo
RabbitMQ
docker run -d --restart always --name rabbitmq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password rabbitmq:3-management
Docker remote API for Mac
docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
HBase
docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
MORE DOCUMENTS
https://github.com/dotnetcore/DotnetSpider/wiki
SAMPLES
Please see the DotnetSpider.Sample project in the solution.
BASE USAGE
ADDITIONAL USAGE: Configurable Entity Spider
public class EntitySpider : Spider
{
    public static async Task RunAsync()
    {
        var builder = Builder.CreateDefaultBuilder<EntitySpider>();
        builder.UseSerilog();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build().RunAsync();
    }

    public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
        options, services, logger)
    {
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken)
    {
        AddDataFlow(new DataParser<CnblogsEntry>());
        AddDataFlow(GetDefaultStorage());
        // "网站" means "website"; "博客园" is the Chinese name of Cnblogs.
        await AddRequestsAsync(
            new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
            new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));
    }

    protected override (string Id, string Name) GetIdAndName()
    {
        return (ObjectId.NewId.ToString(), "博客园");
    }

    // "类别" means "category".
    [Schema("cnblogs", "news")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
    [FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
    public class CnblogsEntry : EntityBase<CnblogsEntry>
    {
        protected override void Configure()
        {
            HasIndex(x => x.Title);
            HasIndex(x => new {x.WebSite, x.Guid}, true);
        }

        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
        public string WebSite { get; set; }

        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']")]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
        public DateTime CreationTime { get; set; }
    }
}
Distributed spider
Puppeteer downloader
Coming soon
NOTICE
When you use the Redis scheduler, please update your Redis config:
timeout 0
tcp-keepalive 60
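Both directives are standard redis.conf settings: timeout 0 stops the server from closing idle client connections, and tcp-keepalive 60 sends TCP keepalive probes to clients every 60 seconds. A minimal sketch, assuming you keep a dedicated config file (the file name and location here are illustrative, not something the project prescribes):

```shell
# Create a minimal redis.conf containing only the two settings named above.
# Assumption: written to the current directory; merge into your existing config if you have one.
cat > redis.conf <<'EOF'
# Never close idle client connections.
timeout 0
# Send TCP keepalive probes to clients every 60 seconds.
tcp-keepalive 60
EOF
```

Start the server with redis-server ./redis.conf, or apply the same values to an already running instance with redis-cli CONFIG SET timeout 0 and redis-cli CONFIG SET tcp-keepalive 60.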
Dependencies
Package | License |
---|---|
Bert.RateLimiters | Apache 2.0 |
MessagePack | MIT |
Newtonsoft.Json | MIT |
Dapper | Apache 2.0 |
HtmlAgilityPack | MIT |
ZCJ.HashedWheelTimer | MIT |
murmurhash | Apache 2.0 |
Serilog.AspNetCore | Apache 2.0 |
Serilog.Sinks.Console | Apache 2.0 |
Serilog.Sinks.RollingFile | Apache 2.0 |
Serilog.Sinks.PeriodicBatching | Apache 2.0 |
MongoDB.Driver | Apache 2.0 |
MySqlConnector | MIT |
AutoMapper.Extensions.Microsoft.DependencyInjection | MIT |
Docker.DotNet | MIT |
BuildBundlerMinifier | Apache 2.0 |
Pomelo.EntityFrameworkCore.MySql | MIT |
Quartz.AspNetCore | Apache 2.0 |
Quartz.AspNetCore.MySqlConnector | Apache 2.0 |
Npgsql | PostgreSQL License |
RabbitMQ.Client | Apache 2.0 |
Polly | BSD 3-Clause |
Buy me a coffee
AREAS FOR IMPROVEMENT
QQ Group: 477731655
Email: zlzforever@163.com
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net5.0 is compatible. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
net5.0: No dependencies.
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last Updated |
---|---|---|
1.0.0 | 0 | 7/31/2021 |