
A project for collecting gank.io (干货集中营) data, extracting post content, and improving the search API


Ganks for gank.io

This project is adapted from another project of mine, Ganks-for-andoirdweekly.net.

This project fetches the daily newsletters published by gank.io, which shares technical ganks (干货) every weekday.

It not only parses the post items in one daily issue, but also extracts the main content of each post item's web page for you. Sounds good?

This project also provides a web search API built with Lucene on top of these ganks. The web application is currently deployed on the Heroku platform; see the site.

Since I'm currently on Heroku's free plan, the site is only available 18 hours out of every 24. Good luck!

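Summary: Ganks for gank.io mainly uses the Gank API to fetch the list of ganks. In addition, it uses open source tools such as dragnet to extract the content of each gank's target web page, and finally uses open source tools such as Lucene and Spark to provide an efficient, simple search API for ganks, which is deployed on the Heroku platform. See the site preview.
If you are interested in my development work, remember to follow me on GitHub or follow my blog.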

The website included

The included website is deployed to Heroku; see the preview site.

The simple website included in this project is a single page showing statistics about the ganks. Besides that, it provides a web search API based on Lucene and these ganks.
You can find more interesting uses for the result data. :-)

1.Run mvn exec:java -Dexec.mainClass="web.WebServer"

2.Open http://0.0.0.0:4567/ in your browser, and you will see the statistics web page.
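The search API in this project is backed by Lucene. As a rough illustration of the idea only, the sketch below builds a tiny in-memory inverted index over the title and content fields of post items and answers keyword queries. It is not Lucene and not the project's actual code; the function names and the two sample items are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(items):
    """Map each token to the positions of the items that contain it."""
    index = defaultdict(set)
    for pos, item in enumerate(items):
        text = item.get("title", "") + " " + item.get("content", "")
        for token in tokenize(text):
            index[token].add(pos)
    return index


def search(index, items, query):
    """Return the items that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [items[i] for i in sorted(hits)]


# Hypothetical, shortened items in the shape of the result data.
items = [
    {"title": "使用 block 实现策略模式",
     "content": "Using Blocks to Realize the Strategy Pattern in Swift"},
    {"title": "MaryPopup",
     "content": "Expand your view with no problem"},
]
index = build_index(items)
for hit in search(index, items, "strategy pattern"):
    print(hit["title"])
```

A real Lucene index adds analyzers, relevance scoring, and on-disk storage; this sketch only shows the token-to-document lookup that makes keyword search fast.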

Set up

Please make sure you have Java, Maven, and Python installed.

How to use

Option 0

Simply use file src/main/resources/gankio.json as the result data, but it will not be auto-updated.

By the way, for the apps using this project, I will try my best to update the result data as frequently as possible.

Option 1

Simply run update.sh to get the result data.

Option 2

Use Maven to compile and run this project

1.Run mvn compile to compile this project;

2.Run mvn exec:java -Dexec.mainClass="data.GankDataHanlder" to start loading these posts and generating the final result data.

3.Then you will see the result data in file src/main/resources/gankio.json in JSON format.

Option 3

Build and run this project in your favorite IDE

1.Run src/main/java/data/GankDataHanlder.java;

2.Then you will see the result data in file src/main/resources/gankio.json in JSON format.

Two main models

The models are not changed from Ganks-for-andoirdweekly.net in order to make future integration easy.

  1. GankIssue represents a daily issue, e.g., gank.io daily Issue 2016-05-13

  2. GankItem represents a daily post item, e.g., MaryPopup - Expand your view with no problem
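For readers consuming the data outside Java, the two models could be mirrored roughly as follows. This is a minimal sketch: the field names are taken from the JSON sample in this README, the from_dict helper is hypothetical, and the actual Java classes may define more or different fields.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class GankItem:
    """One post item inside a daily issue (fields assumed from the JSON sample)."""
    id: str
    title: str
    type: str
    url: str
    source: str = ""
    content: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class GankIssue:
    """One daily issue and its post items (fields assumed from the JSON sample)."""
    id: str
    num: int
    title: str
    url: str
    items: List[GankItem] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "GankIssue":
        # Ignore keys this sketch does not model, so extra fields do not break parsing.
        known = {f.name for f in fields(GankItem)}
        items = [
            GankItem(**{k: v for k, v in raw.items() if k in known})
            for raw in data.get("items", [])
        ]
        return cls(id=data["id"], num=data["num"], title=data["title"],
                   url=data["url"], items=items)


issue = GankIssue.from_dict({
    "id": "2016-05-13",
    "num": 237,
    "title": "Gank.io #237 (2016-05-13)",
    "url": "http://gank.io/2016/05/13",
    "items": [{
        "id": "56cc6d23421aa95caa707a1c",
        "title": "使用 block 实现策略模式",
        "type": "iOS",
        "url": "http://christiantietze.de/posts/2015/07/strategy-blocks/",
        "source": "Gank.io #237 (2016-05-13)",
        "tags": [],
    }],
})
print(issue.title, len(issue.items))
```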

The result data

The root of the JSON data is an array containing all the daily issues posted on gank.io.

The items field in each daily issue holds the post items of that issue, and each item has its url, summary, content, etc.

[
    {
        "id":"2016-05-13",
        "items":[
            {
                "content":"Using Blocks to Realize the Strategy Pattern Jul 10th, 2015 There’s this saying that the Strategy pattern can be realized in Swift using blocks. Without blocks, a Strategy object implements usually one required method of an interface (or protocol) to encapsulate a variation of behavior. This behavior can be switched at runtime. It’s like a\u00A0plug-in. Well, blocks can do the same. They can become attributes of an object and be switched out. They also capture context if necessary, which may sometimes be a bonus. The only drawback is that blocks can’t encapsulate state of their own except the captured\u00A0context. Here’s a real-world example from a recent project. It’s a work break timer. It deals with two types of timers, realized via dispatch queues: one for work, and one for breaks. The break timer should restart when it’s prolonged, the work timer should continue to tick if it’s started, or start itself if it wasn’t\u00A0already. Here’s a Strategy-based version of the difference in prolongation\u00A0behavior: protocol TimerProlongationStrategy { func prolong ( timer : TimerType ) } struct StartOnceTimerProlongationStrategy : TimerProlongationStrategy { func prolong ( timer : TimerType ) { if timer . isActive { return } timer . start () } } struct ResetTimerProlongationStrategy : TimerProlongationStrategy { func prolong ( timer : TimerType ) { if timer . isActive { timer . prolong () return } timer . stop () timer . start () } } That’s very verbose, but it’s straightforward to\u00A0use: class TimerCoordinator { var workTimer : Timer ! var breakTimer : Timer ! init ( workDuration : Minutes , breakDuration : Minutes ) { self . workTimer = Timer ( duration : workDuration . seconds , scheduler : self , prolongationStrategy : StartOnceTimerProlongationStrategy (), block : finishWork ) self . breakTimer = Timer ( duration : breakDuration . seconds , scheduler : self , prolongationStrategy : ResetTimerProlongationStrategy (), block : finishBreak ) } } Instead of setting up two Timer types, I can use one type and delegate variation to the prolongationStrategy \u00A0attribute. With blocks put in place of Strategy objects, it would look like\u00A0this: class TimerCoordinator { var workTimer : Timer ! var breakTimer : Timer ! init ( workDuration : Minutes , breakDuration : Minutes ) { self . workTimer = Timer ( duration : workDuration . seconds , scheduler : self , prolongationStrategy : { timer in if timer . isActive { return } timer . start () }, block : finishWork ) self . breakTimer = Timer ( duration : breakDuration . seconds , scheduler : self , prolongationStrategy : { timer in if timer . isActive { timer . prolong () return } timer . stop () timer . start () }, block : finishBreak ) } } That does read even worse than the version\u00A0before! But notice that I’ve referenced finishWork and finishBreak respectively as the last argument of the initializer. Instead of a () -> Void block, I pass in the reference to a method of\u00A0 TimerCoordinator . Strategies don’t have to be realized as in-line blocks or objects. They can be realized as methods or free functions,\u00A0too. Using functions (because methods don’t make much sense for this use case), the full code will look like\u00A0this: func startOnceTimerProlongationStrategy ( timer : Timer ) { if timer . isActive { return } timer . start () } func resetTimerProlongationStrategy ( timer : Timer ) { if timer . isActive { timer . prolong () return } timer . stop () timer . start () } class TimerCoordinator { var workTimer : Timer ! var breakTimer : Timer ! init ( workDuration : Minutes , breakDuration : Minutes ) { self . workTimer = Timer ( duration : workDuration . seconds , scheduler : self , prolongationStrategy : startOnceTimerProlongationStrategy , block : finishWork ) self . breakTimer = Timer ( duration : breakDuration . seconds , scheduler : self , prolongationStrategy : resetTimerProlongationStrategy , block : finishBreak ) } } This gets around inline blocks which are hard to read and doesn’t introduce unnecessary\u00A0objects. Blocks are nice as they are, but functions as first-class citizens of Swift are even nicer because handles to them can be passed instead of\u00A0blocks. Using functions for this will work only if you don’t need to have stateful Strategy instances. In my case, the Strategy objects were simple wrappers around real functions, so it worked\u00A0nicely.",
                "id":"56cc6d23421aa95caa707a1c",
                "source":"Gank.io #237 (2016-05-13)",
                "tags":[],
                "title":"使用 block 实现策略模式",
                "type":"iOS",
                "url":"http://christiantietze.de/posts/2015/07/strategy-blocks/"
            },
        ...

        ],
        "num":237,
        "title":"Gank.io #237 (2016-05-13)",
        "url":"http://gank.io/2016/05/13"
    },
    ...
]
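As one possible way to consume the result data, the sketch below parses JSON of the shape shown above and counts post items per type. The sample string is a shortened, hypothetical stand-in for src/main/resources/gankio.json, and the function names are illustrative rather than part of this project.

```python
import json
from collections import Counter

# Shortened, hypothetical sample with the same shape as the result data.
SAMPLE = """
[
    {
        "id": "2016-05-13",
        "items": [
            {
                "content": "Using Blocks to Realize the Strategy Pattern",
                "id": "56cc6d23421aa95caa707a1c",
                "source": "Gank.io #237 (2016-05-13)",
                "tags": [],
                "title": "使用 block 实现策略模式",
                "type": "iOS",
                "url": "http://christiantietze.de/posts/2015/07/strategy-blocks/"
            }
        ],
        "num": 237,
        "title": "Gank.io #237 (2016-05-13)",
        "url": "http://gank.io/2016/05/13"
    }
]
"""


def load_issues(path):
    """Load the list of issues from a result file such as gankio.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_items_by_type(issues):
    """Count post items per "type" value across all issues."""
    counts = Counter()
    for issue in issues:
        for item in issue.get("items", []):
            counts[item.get("type", "unknown")] += 1
    return counts


issues = json.loads(SAMPLE)
counts = count_items_by_type(issues)
for item_type, total in counts.most_common():
    print(item_type, total)
```

To run it against the real file, replace json.loads(SAMPLE) with load_issues("src/main/resources/gankio.json").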

Apps using it

Girl

The libraries used

Many well-known open source libraries are used in this project, including crawler4j, fastjson, jsoup, velocity, spark and so on. Among them, dragnet is the most important one: it is a Python library used to extract the main content of a web page.

License

The MIT License (MIT)

Copyright (c) 2016 Hujiawei

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.